Kudu
The Kudu origin reads all available data from a Kudu table. The origin can only be used in a batch pipeline and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. The origin can read all of the columns in a table or only specified columns.
When you configure the Kudu origin, you specify the connection information for one or more Kudu masters. You configure the table to read, and optionally define the columns to read from the table. When needed, you can specify a maximum batch size for the origin.
You can also use a connection to configure the origin.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently.
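For context, the read that the origin performs is conceptually similar to a standalone Spark job that uses the Apache Kudu Spark connector (kudu-spark). The sketch below is not the origin's implementation; the master addresses and table name are placeholders, and the "kudu" format requires the kudu-spark connector on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kudu-read-sketch").getOrCreate()

    // Conceptual sketch only: a plain Spark read with the kudu-spark connector,
    // roughly analogous to the read that the Kudu origin configures for you.
    // The master addresses and table name are placeholder values.
    val df = spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master-1.example.com:7051,kudu-master-2.example.com:7051")
      .option("kudu.table", "users")
      .load()

    // Because the origin does not track offsets, every pipeline run re-reads
    // all rows currently in the table, just as this load() does.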
Configuring a Kudu Origin
Configure a Kudu origin to read data from a Kudu table.
- On the Properties panel, on the General tab, configure the following properties:
  Name: Stage name.
  Description: Optional description.
  Load Data Only Once: Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run. Select this property for lookup origins. When configuring lookup origins, do not limit the batch size; all lookup data should be read in a single batch.
  Cache Data: Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
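As a loose Spark analogue of the Cache Data behavior, and not the product's implementation, persisting a DataFrame lets several downstream computations reuse one in-memory copy of a batch instead of recomputing it; the data and column names below are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-data-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data standing in for one batch read from Kudu.
    val batch = Seq(("US", 1200.0), ("DE", 80.0), ("US", 45.0)).toDF("country", "amount")

    // Loose analogue of "Cache Data": persist the batch so that multiple
    // downstream consumers reuse it instead of recomputing the read.
    val cached = batch.cache()

    val byCountry   = cached.groupBy("country").count()
    val largeOrders = cached.filter($"amount" > 1000)

    byCountry.show()
    largeOrders.show()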
- On the Kudu tab, configure the following properties:
  Connection: Connection that defines the information required to connect to an external system. You can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides the other properties so that you cannot directly enter connection details in the pipeline.
  Kudu Masters: Comma-separated list of Kudu masters used to access the Kudu table. For each Kudu master, specify the host and port in the following format: <host>:<port>
  Kudu Table: Name of the Kudu table to read. To read from a Kudu table created by Impala, use the following format: impala::default.<table name>
  Columns to Read: Columns to read from the table. If you specify no columns, the origin reads all of the columns in the table. Specified columns must exist in the table. Click the Add icon to specify an additional column. You can use simple or bulk edit mode to configure the columns.
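To make these formats concrete, the hedged sketch below shows a masters list that uses Kudu's default master port (7051), the impala:: naming for an Impala-created table, and a column projection comparable to Columns to Read; every host, database, table, and column name is a placeholder.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kudu-formats-sketch").getOrCreate()

    // Kudu Masters: comma-separated <host>:<port> entries; 7051 is Kudu's
    // default master RPC port. The hosts here are placeholders.
    val masters = "kudu-master-1.example.com:7051,kudu-master-2.example.com:7051"

    // Kudu Table: a table created by Impala uses the impala::<database>.<table> form.
    val impalaTable = "impala::default.web_logs"   // placeholder table name

    // Columns to Read: comparable to projecting only the needed columns.
    // "timestamp", "url", and "status" are hypothetical column names.
    val projected = spark.read
      .format("kudu")                    // requires the kudu-spark connector on the classpath
      .option("kudu.master", masters)
      .option("kudu.table", impalaTable)
      .load()
      .select("timestamp", "url", "status")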
- On the Advanced tab, optionally configure the following properties:
  Max Batch Size: Maximum number of records to read in a batch. -1 uses the batch size configured for the Spark cluster.
  Maximum Number of Worker Threads: Maximum number of threads to use to perform processing for the stage. The default is the Kudu default: twice the number of available cores on each processing node in the Spark cluster. Use this property to limit the number of threads that can be used. To use the Kudu default, set to 0.
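For reference, the twice-the-cores default comes from the underlying Kudu client. The sketch below assumes the Kudu Java client's KuduClientBuilder and its workerCount setting; it only illustrates where such a limit applies and is not how the origin itself is implemented. The master address is a placeholder.

    import org.apache.kudu.client.KuduClient

    // The Kudu default is twice the number of available cores on the node.
    val kuduDefaultWorkers = 2 * Runtime.getRuntime.availableProcessors()
    println(s"Kudu default worker threads on this node: $kuduDefaultWorkers")

    // Assumption: KuduClientBuilder.workerCount caps the client's worker threads,
    // which is the kind of limit that the Maximum Number of Worker Threads
    // property expresses for the stage.
    val client = new KuduClient.KuduClientBuilder("kudu-master-1.example.com:7051")
      .workerCount(4)
      .build()

    client.close()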