Register Data Collector with Control Hub

You must register a Data Collector to work with StreamSets Control Hub. When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.

The method you use to register a Data Collector depends on the Data Collector installation type:
Tarball installation
You can register the Data Collector from the command line interface or from Control Hub.
RPM installation
You must register the Data Collector from Control Hub.
Cloudera Manager installation
You must register the Data Collector from Cloudera Manager.

You can optionally configure each registered Data Collector to use an authenticated HTTP or HTTPS proxy server to access Control Hub.

Before you register a Data Collector, perform the necessary prerequisite.

Registration Prerequisite

Before you register Data Collectors, complete the following prerequisite:

Verify that the statistics stage library is installed.

Control Hub requires the statistics stage library on each registered Data Collector so that it can run system pipelines on that Data Collector. By default, all Data Collector installations include the statistics stage library.

To verify that a Data Collector has the statistics stage library installed, run the following command from the $SDC_DIST directory: bin/streamsets stagelibs -list.
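Because the listing can be long, it may help to filter it. The following is a minimal sketch rather than part of the product: filter_stage_libs is a hypothetical helper, and the default pattern "stats" assumes that the statistics stage library ID contains that string. Adjust the pattern if the name shown in your listing differs, and read the printed lines to confirm the installation status.

```shell
# Hypothetical helper: prints the lines of a stage library listing, read from
# standard input, that match the given pattern (case-insensitive).
# The default pattern "stats" is an assumption about the library ID.
filter_stage_libs() {
  pattern="${1:-stats}"
  grep -i -- "$pattern"
}

# Usage, from the $SDC_DIST directory:
#   bin/streamsets stagelibs -list | filter_stage_libs
```

The helper returns a non-zero exit status when no line matches, so it can also be used in a script condition.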

Registering from the Command Line Interface

For a Data Collector tarball installation, you can register the Data Collector with Control Hub using the Data Collector command line interface. The Data Collector must be running before you can use the command line interface.

For a Data Collector RPM installation, you must use Control Hub to register the Data Collector. For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.

Note: To automate Data Collector registration with a tool such as Ansible, Chef, or Puppet, configure the tool to use the streamsets sch register command. Run the command on the local machine where Data Collector is installed. The Data Collector does not need to be running. See the command line help for the list of available options. If you choose to skip updating the dpm.properties configuration file, you must configure the automation tool to update the file.

Start Data Collector, and then use the system enableDPM command to register the Data Collector.

Use the command from the $SDC_DIST directory as follows:
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
system enableDPM \
(--dpmUrl <dpmBaseURL>) \
(--dpmUser <dpmUserID>) \
(--dpmPassword <dpmUserPassword>) \
[(--labels <labels>)]

When using the system enableDPM command, the following basic options are required:

Basic Option Description
-U <sdcURL> or --url <sdcURL> Required. URL of the Data Collector. The default URL is http://localhost:18630.
-D <dpmURL> or --dpmURL <dpmURL> Required. Enter the appropriate URL:
  • For Control Hub cloud, enter https://cloud.streamsets.com.
  • For Control Hub on-premises, enter the URL provided by your system administrator. For example, https://<hostname>:18631.

The following table describes the enableDPM options:

Enable DPM Option Description
--dpmUrl <dpmBaseURL> Required. Enter the appropriate URL:
  • For Control Hub cloud, enter https://cloud.streamsets.com.
  • For Control Hub on-premises, enter the URL provided by your system administrator. For example, https://<hostname>:18631.
--dpmUser <dpmUserID> Required. Enter your Control Hub user ID using the following format:
<ID>@<organization ID>
--dpmPassword <dpmUserPassword> Required. Enter the password for your Control Hub user account.
--labels <labels> Optional. Assign a label to this Data Collector. You can enter multiple labels separated by commas. Labels that you assign here are defined in the Control Hub configuration file, $SDC_CONF/dpm.properties. To remove these labels after you register the Data Collector, you must modify the configuration file.

Use labels to group Data Collectors registered with Control Hub. If you know how you want to group your Data Collectors, you can assign labels now. Or you can assign labels in Control Hub after you register the Data Collector.

Default is "all", which you can use to run a job on all registered Data Collectors.

For example, the following command registers a Data Collector with Control Hub and assigns three labels to the Data Collector:
bin/streamsets cli -U http://localhost:18630 -D https://cloud.streamsets.com system enableDPM --dpmUrl https://cloud.streamsets.com --dpmUser alison@MyOrg --dpmPassword MyPassword --labels Finance,Accounting,Development
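When you register several Data Collectors, it can be useful to assemble the command from variables and review it before running it. The following is a minimal sketch under stated assumptions: print_enable_dpm is a hypothetical helper, not part of Data Collector. It only prints the command, leaves the password as a placeholder, checks that the user ID follows the <ID>@<organization ID> format, and joins labels with commas.

```shell
# Hypothetical helper: prints the enableDPM command for review instead of
# running it. All arguments are values that you supply.
#   $1 = Data Collector URL, $2 = Control Hub URL,
#   $3 = Control Hub user ID, $4... = labels (optional)
print_enable_dpm() {
  sdc_url="$1"; dpm_url="$2"; dpm_user="$3"
  shift 3

  # The user ID must use the <ID>@<organization ID> format.
  case "$dpm_user" in
    ?*@?*) ;;
    *) echo "user ID must be <ID>@<organization ID>" >&2; return 1 ;;
  esac

  # Join any remaining arguments into a comma-separated label list.
  labels=""
  for label in "$@"; do
    labels="${labels:+$labels,}$label"
  done

  # The password is left as a placeholder so that it is never printed.
  printf 'bin/streamsets cli -U %s -D %s system enableDPM --dpmUrl %s --dpmUser %s --dpmPassword <dpmUserPassword>' \
    "$sdc_url" "$dpm_url" "$dpm_url" "$dpm_user"
  if [ -n "$labels" ]; then printf ' --labels %s' "$labels"; fi
  printf '\n'
}

# Usage:
#   print_enable_dpm http://localhost:18630 https://cloud.streamsets.com alison@MyOrg Finance Accounting
```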
Important: Open the Data Collector configuration file, $SDC_CONF/sdc.properties, and verify that the http.authentication.login.module property is set to file. Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.
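The property check described above can also be scripted. The following is a minimal sketch: get_property is a hypothetical helper that reads a value from a properties file, and the usage line assumes that the SDC_CONF environment variable is set in your shell.

```shell
# Hypothetical helper: prints the value of a property from a Java-style
# properties file, ignoring commented lines. If the key appears more than
# once, the last value is printed.
get_property() {
  file="$1"; key="$2"
  # Escape dots so that they match literally in the sed expression.
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  sed -n "s/^[[:space:]]*${key_re}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Usage:
#   [ "$(get_property "$SDC_CONF/sdc.properties" http.authentication.login.module)" = "file" ] \
#     || echo "http.authentication.login.module is not set to file"
```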

Restart the Data Collector to apply the changes.

Registering from Cloudera Manager

If you installed Data Collector through Cloudera Manager, you must use Cloudera Manager to register the Data Collector with Control Hub.

  1. In Cloudera Manager, select the StreamSets service, then click Configuration.
  2. Enter "Control Hub" in the search field to display the Control Hub configuration properties.
  3. Configure the following properties:
    Property Description
    Enable Control Hub Select to enable Control Hub.
    Control Hub URL Required. Enter the appropriate URL:
    • For Control Hub cloud, enter https://cloud.streamsets.com.
    • For Control Hub on-premises, enter the URL provided by your system administrator. For example, https://<hostname>:18631.
    Control Hub User ID Enter your Control Hub user ID using the following format:
    <ID>@<organization ID>
    Control Hub Password Enter the password for your Control Hub user account.
    Control Hub Labels Assign a label to this Data Collector. Labels that you assign here are defined in the Control Hub configuration file, $SDC_CONF/dpm.properties. To remove these labels after you register the Data Collector, you must modify the labels through Cloudera Manager.

    Use labels to group Data Collectors registered with Control Hub. If you know how you want to group your Data Collectors, you can assign labels now. Or you can assign labels in Control Hub after you register the Data Collector.

    Default is "all", which you can use to run a job on all registered Data Collectors.
  4. Set the HTTP Authentication Login Module property to file.
    Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.
  5. Click Save Changes.
  6. Click Actions > Restart to restart the Data Collector.

Using an HTTP or HTTPS Proxy Server

You can configure each registered Data Collector to use an authenticated HTTP or HTTPS proxy server for outbound requests made to Control Hub. Define the proxy properties in the SDC_JAVA_OPTS environment variable.

Modify environment variables using the method required by your installation type.

Add the following Java options to the SDC_JAVA_OPTS environment variable:

  • https.proxyUser
  • https.proxyPassword
  • https.proxyHost
  • https.proxyPort

If the proxy server uses HTTP instead of HTTPS, use http.<property name> for each property.

For example, to configure a Data Collector to use an HTTPS proxy server on host 138.0.0.1 and port 3138, define SDC_JAVA_OPTS as follows:

export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Xmx1024m -Xms1024m -Dhttps.proxyUser=MyName -Dhttps.proxyPassword=MyPsswrd -Dhttps.proxyHost=138.0.0.1 -Dhttps.proxyPort=3138 -server" 
Note: As of JDK 8 update 111, Oracle JDK disables Basic authentication for HTTPS tunneling through an HTTP proxy by default. If Data Collector runs on a machine with Java 8u111 or later, consider using an HTTPS proxy server. Alternatively, as a workaround, consider adding the following Java property to the SDC_JAVA_OPTS environment variable, setting the property to an empty string:
-Djdk.http.auth.tunneling.disabledSchemes=''

However, use this workaround with caution since it exposes credentials by sending them through an unencrypted proxy.
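The HTTP and HTTPS variants of the proxy options described in this section differ only in their prefix, so they can be generated from a single template. The following is a minimal sketch: proxy_java_opts is a hypothetical helper, and the host, port, and credentials are values that you supply.

```shell
# Hypothetical helper: prints the proxy-related Java options for the given
# scheme ("http" or "https").
#   $1 = scheme, $2 = proxy host, $3 = proxy port, $4 = user, $5 = password
proxy_java_opts() {
  scheme="$1"; host="$2"; port="$3"; user="$4"; password="$5"
  case "$scheme" in
    http|https) ;;
    *) echo "scheme must be http or https" >&2; return 1 ;;
  esac
  printf '%s\n' "-D${scheme}.proxyUser=${user} -D${scheme}.proxyPassword=${password} -D${scheme}.proxyHost=${host} -D${scheme}.proxyPort=${port}"
}

# Usage: append the generated options to the existing value.
#   export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} $(proxy_java_opts https 138.0.0.1 3138 MyName MyPsswrd)"
```

Passing the password as an argument can expose it in shell history, so consider reading it from a protected file or variable instead.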

Using a Publicly Accessible URL

If you register a Data Collector that is installed on a cloud-computing platform such as Amazon Elastic Compute Cloud (EC2), configure the Data Collector to use a publicly accessible URL.

When you register a Data Collector with Control Hub, the Data Collector sends its URL to Control Hub in the format http://<hostname>:<http.port>, where <hostname> is the value defined in the http.bindHost property in the Data Collector configuration file, $SDC_CONF/sdc.properties. If the host name is not defined in http.bindHost, Data Collector runs the following command to determine the host name: hostname -f

For most cloud-computing platforms, the hostname -f command returns the private IP address of the machine on the cloud platform. Control Hub includes the private IP address in the Data Collector URL displayed in Control Hub. However, when you click the Data Collector URL, you cannot access the Data Collector because you must use a public IP address to access a cloud machine.
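To preview the URL that a machine is likely to report, the host name selection described above can be approximated in a few lines. The following is only a sketch: reported_sdc_url and its arguments are hypothetical, and Data Collector performs the actual logic internally.

```shell
# Sketch of the host name selection described above.
#   $1 = value of http.bindHost (may be empty), $2 = value of http.port
reported_sdc_url() {
  bind_host="$1"; port="$2"
  if [ -n "$bind_host" ]; then
    host="$bind_host"
  else
    # Fall back to the fully qualified host name of this machine.
    host="$(hostname -f)"
  fi
  printf 'http://%s:%s\n' "$host" "$port"
}

# Usage: pass an empty host to see the fallback value on this machine.
#   reported_sdc_url "" 18630
```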

To access a Data Collector installed on a cloud-computing platform from Control Hub, uncomment the sdc.base.http.url property in the Data Collector configuration file, $SDC_CONF/sdc.properties, and then configure it to use the publicly accessible URL to that Data Collector.

After modifying the configuration file, restart Data Collector for the changes to take effect.
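The configuration change can be scripted when you manage many cloud machines. The following is a minimal sketch: set_base_url is a hypothetical helper, and it assumes that the file already contains an sdc.base.http.url line, commented or not, and that the URL contains neither the "|" nor the "&" character.

```shell
# Hypothetical helper: sets sdc.base.http.url in a properties file, whether
# or not the line is currently commented out. A copy of the original file is
# kept with a .bak suffix.
#   $1 = path to sdc.properties, $2 = publicly accessible URL
set_base_url() {
  file="$1"; url="$2"
  cp "$file" "$file.bak" || return 1
  sed "s|^[[:space:]]*#*[[:space:]]*sdc\.base\.http\.url=.*|sdc.base.http.url=${url}|" \
    "$file.bak" > "$file"
}

# Usage (assumes SDC_CONF is set; the URL is a placeholder):
#   set_base_url "$SDC_CONF/sdc.properties" "http://<public hostname>:18630"
```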