Kudu

The Kudu destination writes data to a Kudu table. You can also use the destination to write to a Kudu table created by Impala.

The destination writes record fields to table columns by matching names. The Kudu destination can insert or upsert data into the table.
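The difference between the two write operations can be pictured with a small in-memory model. This is a minimal sketch, not the destination's implementation: the dictionary stands in for a Kudu table keyed by primary key, and the names `insert`, `upsert`, and `DuplicateKeyError` are hypothetical. Rejecting a duplicate key on insert mirrors how Kudu itself treats inserts; how the destination reports such rows is not covered here.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert targets a primary key that already exists."""


def insert(table, key, row):
    # Assumption: models Kudu's insert semantics, where an existing
    # primary key is an error rather than an overwrite.
    if key in table:
        raise DuplicateKeyError(key)
    table[key] = dict(row)


def upsert(table, key, row):
    # Inserts the row when the key is new, otherwise replaces the
    # existing row. The sketch always writes complete rows.
    table[key] = dict(row)


table = {}
insert(table, 1, {"name": "a", "qty": 1})
upsert(table, 1, {"name": "a", "qty": 5})  # key 1 exists: row is updated
upsert(table, 2, {"name": "b", "qty": 2})  # key 2 is new: row is inserted
```

In this model, upsert is the safer choice when the incoming data may contain keys that are already in the table.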

When you configure the Kudu destination, you specify the connection information for one or more Kudu masters. You configure the table and write mode to use. When needed, you can specify a maximum batch size for the destination.

You can also use a connection to configure the destination.

Note: Due to a Kudu limitation on Spark, pipeline validation does not validate Kudu stage configuration.

Configuring a Kudu Destination

Configure a Kudu destination to write to a Kudu table.
  1. On the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
  2. On the Kudu tab, configure the following properties:
    Kudu Property - Description
    Connection - Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can enter the details directly in the pipeline. When you select a connection, Control Hub hides the other properties so that you cannot enter connection details directly in the pipeline.

    Kudu Masters - Comma-separated list of Kudu masters used to access the Kudu table.

    For each Kudu master, specify the host and port in the following format: <host>:<port>

    Kudu Table - Name of the table to write to.
    To write to a Kudu table created by Impala, use the following format:
    impala::default.<table name>
    Write Operation - Operation to perform when writing to Kudu:
    • Insert - Inserts all data into the table.
    • Upsert - Inserts new data into the table and updates existing data.
  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property - Description
    Write Batch Size - Maximum number of records to write to Kudu in a batch.

    Set to -1 to use the batch size configured for the Spark cluster.

    Maximum Number of Worker Threads - Maximum number of threads to use to perform processing for the stage.

    Default is the Kudu default: twice the number of available cores on each processing node in the Spark cluster.

    Use this property to limit the number of threads that can be used. To use the Kudu default, set to 0.
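
The value formats for the Kudu Masters and Kudu Table properties can be checked before you paste them into the stage. The following is a minimal sketch under stated assumptions: the helper names `parse_kudu_masters` and `impala_table_name` and the example host names are hypothetical, and trimming whitespace around each entry is a convenience of the sketch, not documented behavior of the destination.

```python
def parse_kudu_masters(value):
    """Split a comma-separated Kudu Masters value into (host, port) pairs."""
    masters = []
    for entry in value.split(","):
        # rpartition splits on the last colon, so the port is what follows it.
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(f"expected <host>:<port>, got {entry!r}")
        masters.append((host, int(port)))
    return masters


def impala_table_name(table_name):
    """Build the Kudu Table value for a table created by Impala."""
    return f"impala::default.{table_name}"


masters = parse_kudu_masters(
    "kudu-master-1.example.com:7051, kudu-master-2.example.com:7051"
)
table = impala_table_name("orders")
```

Here `masters` holds one `(host, port)` pair per Kudu master, and `table` holds `impala::default.orders`. An entry that lacks a host or a numeric port raises `ValueError`.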