Kudu

Available when using an authoring Data Collector version 3.19.0 or later.

To create a Kudu connection, one of the following stage libraries must be installed on the selected authoring Data Collector:
  • Cloudera CDH, streamsets-datacollector-cdh_<version>-lib
  • Cloudera CDP, streamsets-datacollector-cdp_<version>-lib
Tip: To view the complete list of supported stage libraries, expand the list of libraries next to the Test Connection button when you create or edit a connection.

For a description of the Kudu connection properties, see Kudu Connection Properties.

After you create a Kudu connection, you can use the connection in the following stages, listed by engine:

Data Collector 3.19.0 or later

  • Kudu Lookup processor
  • Kudu destination

Transformer 3.16.0 or later

  • Kudu origin
  • Kudu destination

Kudu Connection Properties

When creating a Kudu connection, configure the following properties on the Kudu tab:
Kudu Masters

Comma-separated list of Kudu masters used to access the Kudu table.

For each Kudu master, specify the host and port in the following format: <host>:<port>
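
As an illustration of the expected format, the following Python sketch splits a Kudu Masters value into host and port pairs and rejects malformed entries. The function name `parse_kudu_masters` and the sample host names are hypothetical and are not part of Data Collector or Kudu. Port 7051 appears only as an example value.

```python
def parse_kudu_masters(value: str) -> list[tuple[str, int]]:
    """Split a comma-separated '<host>:<port>' list into (host, port) pairs."""
    masters = []
    for entry in value.split(","):
        entry = entry.strip()
        # Split on the last colon so the port is always the final component.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not (port.isascii() and port.isdigit()):
            raise ValueError(f"expected <host>:<port>, got {entry!r}")
        port_number = int(port)
        if not 1 <= port_number <= 65535:
            raise ValueError(f"port out of range in {entry!r}")
        masters.append((host, port_number))
    return masters


# Hypothetical value, as it might be typed into the Kudu Masters property.
example = "kudu-master-1.example.com:7051,kudu-master-2.example.com:7051"
for host, port in parse_kudu_masters(example):
    print(host, port)
```

An entry without a port, such as `kudu-master-1.example.com`, raises a `ValueError` in this sketch.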

Optionally, configure the following properties on the Advanced tab.

The defaults for these properties should work in most cases:

Maximum Number of Worker Threads

Maximum number of threads to use to perform processing for the stage.

By default, Kudu uses twice the number of available cores on the processing machine. For a Data Collector pipeline, the processing machine is the Data Collector machine. For a Transformer pipeline, the processing machine is each node in the Spark cluster.

Use this property to limit the number of threads that can be used. To use the Kudu default, set to 0. The default value of this property is 2.
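
The following Python sketch shows one reading of how the configured value might map to an effective thread limit. It is an interpretation of the description above, not the actual Data Collector or Kudu implementation. The function name `effective_worker_threads` and its parameters are hypothetical.

```python
import os
from typing import Optional


def effective_worker_threads(configured: int,
                             available_cores: Optional[int] = None) -> int:
    """Return the thread limit implied by the configured property value.

    0 selects the Kudu default (twice the available cores). Any positive
    value is used as the limit as-is.
    """
    if configured < 0:
        raise ValueError("thread count cannot be negative")
    if configured == 0:
        # Fall back to the local core count when none is supplied.
        cores = available_cores if available_cores is not None else (os.cpu_count() or 1)
        return 2 * cores
    return configured


# On a hypothetical 8-core processing machine:
print(effective_worker_threads(0, available_cores=8))  # Kudu default
print(effective_worker_threads(2, available_cores=8))  # property default of 2
```

Under this reading, a value of 0 on an 8-core machine allows up to 16 threads, while the property default of 2 allows at most 2.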

Operation Timeout (milliseconds)

Number of milliseconds to allow for operations such as writes or lookups.

Default is 10000, or 10 seconds.

Note: Used in Data Collector pipelines only.

Admin Operation Timeout (milliseconds)

Number of milliseconds to allow for admin-type operations, such as opening a table or getting a table schema.

Default is 30000, or 30 seconds.

Note: Used in Data Collector pipelines only.
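
Because both timeout properties are expressed in milliseconds, a small conversion can help when comparing them with values stated in seconds. The following Python sketch uses the documented defaults. The constant and function names are hypothetical and exist only for illustration.

```python
# Documented defaults for the two timeout properties, in milliseconds.
DEFAULT_OPERATION_TIMEOUT_MS = 10_000
DEFAULT_ADMIN_OPERATION_TIMEOUT_MS = 30_000


def timeout_seconds(milliseconds: int) -> float:
    """Convert a timeout property value from milliseconds to seconds."""
    if milliseconds < 0:
        raise ValueError("timeout cannot be negative")
    return milliseconds / 1000


print(timeout_seconds(DEFAULT_OPERATION_TIMEOUT_MS))
print(timeout_seconds(DEFAULT_ADMIN_OPERATION_TIMEOUT_MS))
```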