Kudu

The Kudu origin reads all available data from a Kudu table, including tables created by Impala.

The origin can only be used in a batch pipeline and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. The origin can read all columns in a table or only specified columns.

When you configure the Kudu origin, you specify the connection information for one or more Kudu masters. You configure the table to read, and optionally define the columns to read from the table. When needed, you can specify a maximum batch size for the origin.
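Because the origin runs on Spark, the same read can be expressed with the open-source kudu-spark connector, which is one way to see what the configuration maps to. The sketch below builds the connector's read options; the master addresses, table name, and column names are hypothetical placeholders, not values from this document.

```python
# Hedged sketch: read options for the kudu-spark DataFrame reader.
# The kudu-spark connector accepts a comma-separated list of Kudu
# master addresses and the table name to read.
read_options = {
    # one or more Kudu masters, as host:port pairs (hypothetical hosts)
    "kudu.master": "kudu-master-1:7051,kudu-master-2:7051",
    # tables created by Impala carry the "impala::<database>." prefix
    "kudu.table": "impala::default.my_table",
}

# With a live cluster and the kudu-spark package on the classpath,
# these options feed Spark's DataFrame reader, and a column subset
# can be projected after the load:
#
#   df = (spark.read.format("kudu")
#                   .options(**read_options)
#                   .load()
#                   .select("id", "name"))

print(sorted(read_options))
```

Note that selecting a subset of columns in the origin configuration serves the same purpose as the `select` projection above: only the named columns are returned.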

You can also use a connection to configure the origin.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently.

Note: Due to a Kudu limitation on Spark, pipeline validation does not validate Kudu stage configuration.