Kudu

Available when using an authoring Data Collector version 4.0.0 or later.

To create a Kudu connection, the Cloudera CDP stage library, streamsets-datacollector-cdp_<version>-lib, must be installed on the selected authoring Data Collector.

Tip: To view the complete list of supported stage libraries, expand the list of libraries next to the Test Connection button when you create or edit a connection.

For a description of the Kudu connection properties, see Kudu Connection Properties.

After you create a Kudu connection, you can use the connection in the following stages:


Engine	Stages
Data Collector 4.0.0 or later	Kudu Lookup processor, Kudu destination
Transformer 4.0.0 or later	Kudu origin, Kudu destination

Kudu Connection Properties

When creating a Kudu connection, configure the following properties on the Kudu tab:


Kudu Property	Description
Kudu Primary Nodes (6.1 and later) or Kudu Masters (6.0)	Comma-separated list of Kudu primary nodes used to access the Kudu table. For each Kudu primary node, specify the host and port in the following format: `<host>:<port>`
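
As an illustration of the expected format, the following minimal sketch splits a comma-separated node list into host and port pairs and rejects entries that do not match `<host>:<port>`. The function name `parse_kudu_nodes` and the example host names are hypothetical and are not part of Data Collector or Transformer. Port 7051 is used because it is the usual Kudu master RPC port, but your cluster might use a different one.

```python
def parse_kudu_nodes(value):
    """Split a comma-separated list of <host>:<port> entries into (host, port) pairs.

    Hypothetical helper for checking a value before entering it in the
    connection property. It is not part of any StreamSets API.
    """
    nodes = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            raise ValueError("empty entry in node list")
        # Split on the last colon so that the port is always the final part.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdecimal():
            raise ValueError(f"expected <host>:<port>, got {entry!r}")
        port_number = int(port)
        if not 1 <= port_number <= 65535:
            raise ValueError(f"port out of range in {entry!r}")
        nodes.append((host, port_number))
    return nodes


# Example with two hypothetical primary nodes.
print(parse_kudu_nodes("kudu-master-1.example.com:7051, kudu-master-2.example.com:7051"))
```

An entry without a port, such as `kudu-master-1.example.com`, raises a `ValueError` in this sketch.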

Optionally, configure the following properties on the Advanced tab.

The defaults for these properties should work in most cases:


Advanced Property	Description
Maximum Number of Worker Threads	Maximum number of threads to use to perform processing for the stage. Use this property to limit the number of threads that can be used. Default is 2. To use the Kudu default, which is twice the number of available cores on the processing machine, set to 0. For a Data Collector pipeline, the processing machine is the Data Collector machine. For a Transformer pipeline, the processing machine is each node in the Spark cluster.
Operation Timeout (milliseconds)	Number of milliseconds to allow for operations such as writes or lookups. Default is 10000, or 10 seconds. Note: Used in Data Collector pipelines only.
Admin Operation Timeout (milliseconds)	Number of milliseconds to allow for admin-type operations, such as opening a table or getting a table schema. Default is 30000, or 30 seconds. Note: Used in Data Collector pipelines only.
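
The worker thread rule described above can be sketched as follows. This is only an illustration of the documented behavior, not the actual stage implementation. The function name `effective_worker_threads` is hypothetical, and the sketch assumes that the number of available cores is what `os.cpu_count()` reports on the processing machine.

```python
import os


def effective_worker_threads(configured, available_cores=None):
    """Return the thread limit implied by Maximum Number of Worker Threads.

    Hypothetical helper: a value of 0 selects the Kudu default, which is
    twice the number of available cores. Any positive value is used as is.
    """
    if configured < 0:
        raise ValueError("worker thread count cannot be negative")
    if configured == 0:
        # Assumption: fall back to 1 core if the count cannot be determined.
        cores = available_cores if available_cores is not None else (os.cpu_count() or 1)
        return 2 * cores
    return configured


# With the documented property default of 2, the limit is 2 threads.
print(effective_worker_threads(2))
# With 0 on a machine with 8 available cores, the Kudu default applies.
print(effective_worker_threads(0, available_cores=8))
```

In a Transformer pipeline, the same calculation applies to each node in the Spark cluster, so the result can differ from node to node when the property is set to 0.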