Kafka

The Kafka origin reads data from one or more topics in an Apache Kafka cluster. All messages in a batch must use the same schema. The origin supports Apache Kafka 0.10 and later. When using a Cloudera distribution of Apache Kafka, use CDH Kafka 3.0 or later.

The Kafka origin can read messages from a list of Kafka topics or from topics that match a pattern defined in a Java-based regular expression. When reading topics in the first batch, the origin can start from the first message, the last message, or a particular position in a partition. In subsequent batches, the origin starts from the last-saved offset.
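The topic selection and starting-position rules above can be sketched as follows. This is a minimal illustration, not the origin's implementation: the function names, the `start_mode` values, and the example topic names are hypothetical. Python's `re` module stands in for Java regular expressions, which behave the same way for simple patterns like the one shown.

```python
import re


def select_topics(available, topic_list=None, topic_pattern=None):
    """Return the topics to read: an explicit list, or every topic whose
    full name matches the pattern (hypothetical helper)."""
    if topic_list is not None:
        return [t for t in topic_list if t in available]
    pattern = re.compile(topic_pattern)
    return sorted(t for t in available if pattern.fullmatch(t))


def starting_offset(saved_offset, start_mode, first, last, specific=None):
    """Pick the offset at which to start reading a partition.

    saved_offset is None in the first batch; in later batches the
    last-saved offset always wins over the configured start mode.
    """
    if saved_offset is not None:
        return saved_offset
    if start_mode == "first":
        return first
    if start_mode == "last":
        return last
    if start_mode == "specific":
        return specific
    raise ValueError(f"unknown start mode: {start_mode}")
```

For example, with the assumed topics `orders-eu`, `orders-us`, and `payments`, the pattern `orders-.*` selects only the two `orders-` topics, and a saved offset of 57 overrides any configured start mode.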

When configuring the Kafka origin, you specify the Kafka brokers that the origin can initially connect to, the topics the origin reads, and where to start reading each topic. You can configure the origin to connect securely to Kafka. You specify the maximum number of messages to read from any partition in each batch. You can configure the origin to include Kafka message keys in records. You can also specify additional Kafka configuration properties to pass to Kafka.
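As a rough sketch of how a per-partition message limit and the message key option shape a batch, consider the following. The function `build_batch`, its parameters, and the record layout are assumptions made for illustration and do not reflect the origin's actual record structure.

```python
def build_batch(pending, max_per_partition, include_keys=False):
    """Build one batch from pending messages.

    pending maps a partition number to a list of (key, value) tuples.
    At most max_per_partition messages are taken from each partition.
    """
    batch = []
    for partition in sorted(pending):
        for key, value in pending[partition][:max_per_partition]:
            record = {"partition": partition, "value": value}
            if include_keys:
                # Only attach the Kafka message key when requested.
                record["key"] = key
            batch.append(record)
    return batch
```

With a limit of 2, a partition holding three pending messages contributes only its first two to the batch; the third waits for a later batch.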

You can also use a connection to configure the origin.

You select the format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
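The effect of skipping offset tracking can be illustrated with a small simulation. The `OffsetStore` class and `run_pipeline` function are hypothetical names, and the sketch assumes a single partition that is read from its first message when no offset has been saved.

```python
class OffsetStore:
    """Remembers where the last pipeline run stopped, unless tracking is off."""

    def __init__(self, track_offsets=True):
        self.track_offsets = track_offsets
        self.saved = {}

    def start_position(self, partition):
        # With no saved offset, start from the first message (assumption).
        return self.saved.get(partition, 0)

    def commit(self, partition, next_offset):
        # When tracking is skipped, nothing is saved between runs.
        if self.track_offsets:
            self.saved[partition] = next_offset


def run_pipeline(store, log, partition=0):
    """Read everything from the start position to the end of the log."""
    start = store.start_position(partition)
    messages = log[start:]
    store.commit(partition, len(log))
    return messages
```

With tracking enabled, a second run over the same log returns nothing new. With tracking skipped, every run returns the full log again.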