Kudu

The Kudu origin reads all available data from a Kudu table. You can also use this origin to read a Kudu table created by Impala.

The origin can only be used in a batch pipeline and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. The origin can read all of the columns in a table, or only the columns that you specify.

When you configure the Kudu origin, you specify the connection information for one or more Kudu masters. You configure the table to read, and optionally define the columns to read from the table. When needed, you can specify a maximum batch size for the origin.
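For orientation, the following sketch performs an equivalent read with the open source kudu-spark connector. It is illustrative only, not the stage's implementation; the host names, table, and column names are placeholders, and 7051 is the default Kudu master RPC port.

    import org.apache.spark.sql.SparkSession

    // Illustrative kudu-spark read using the same inputs as the origin:
    // masters, table, and an optional column list.
    val spark = SparkSession.builder().appName("kudu-origin-sketch").getOrCreate()

    val df = spark.read
      .format("kudu")
      .option("kudu.master", "kudu-master-1.example.com:7051,kudu-master-2.example.com:7051")
      .option("kudu.table", "impala::default.web_logs") // table created by Impala
      .load()
      .select("id", "name") // read only these columns; omit to read all columns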

You can also use a connection to configure the origin.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently.
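As a rough analogue, and an assumption about behavior rather than a statement of the stage's internals, reusing a batch across multiple downstream stages resembles persisting a Spark DataFrame so that it is computed once and read many times:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.storage.StorageLevel

    // Rough analogue of batch caching (assumption, not the stage's internals):
    // persist so the batch is computed once and reused by each consumer.
    def cacheForReuse(df: DataFrame): DataFrame =
      df.persist(StorageLevel.MEMORY_AND_DISK)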

Note: Due to a Kudu limitation on Spark, pipeline validation does not validate Kudu stage configuration.

Configuring a Kudu Origin

Configure a Kudu origin to read data from a Kudu table.

  1. On the Properties panel, on the General tab, configure the following properties:
    - Name: Stage name.
    - Description: Optional description.
    - Load Data Only Once: Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run.
      Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.
    - Cache Data: Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.
      Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Kudu tab, configure the following properties:
    - Connection: Connection that defines the information required to connect to an external system.
      To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
    - Kudu Masters: Comma-separated list of Kudu masters used to access the Kudu table.
      For each Kudu master, specify the host and port in the following format: <host>:<port>
    - Kudu Table: Name of the Kudu table to read. To read from a Kudu table created by Impala, use the following format:
      impala::default.<table name>
    - Columns to Read: Columns to read from the table. If you specify no columns, the origin reads all the columns in the table.
      Specified columns must exist in the table that the origin reads.
      Click the Add icon to specify an additional column. You can use simple or bulk edit mode to configure the columns.

  3. On the Advanced tab, optionally configure the following properties:
    - Max Batch Size: Maximum number of records to read in a batch.
      -1 uses the batch size configured for the Spark cluster.
    - Maximum Number of Worker Threads: Maximum number of threads to use to perform processing for the stage.
      Default is the Kudu default: twice the number of available cores on each processing node in the Spark cluster.
      Use this property to limit the number of threads that can be used. To use the Kudu default, set to 0. A sketch of the corresponding Kudu client setting follows these steps.
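The worker thread default mirrors the Java Kudu client, whose builder exposes the same knob. The following is a minimal sketch under the assumption that this stage property corresponds to the client's workerCount setting; the master address and thread count are placeholders:

    import org.apache.kudu.client.KuduClient

    // Sketch of the client-side knob the stage property appears to map to.
    // Kudu's own default is twice the number of available cores.
    val client = new KuduClient.KuduClientBuilder("kudu-master-1.example.com:7051")
      .workerCount(4) // cap the threads used for stage processing; placeholder value
      .build()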