Manually Administered Data Collectors

Manually administering authoring and execution Data Collectors involves installing Data Collectors on-premises or on a protected cloud computing platform and then registering them to work with Control Hub.

For instructions on installing Data Collectors, see Installation in the Data Collector documentation.

When you register a Data Collector, you generate an authentication token for that Data Collector. Each Data Collector also has a universally unique ID (UUID), generated upon initial start and stored in the sdc.id file in the $SDC_DATA directory.

To authenticate a registered Data Collector, Control Hub verifies that both the authentication token and UUID exist and are unchanged. Data Collector includes its authentication token and UUID in requests to Control Hub. Because all communication between Control Hub and any registered Data Collector uses HTTPS, the authentication token and UUID are kept confidential.
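As a quick check on a Data Collector machine, the two values that Control Hub verifies can be located with a small helper. This is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes that the token is stored in $SDC_CONF/application-token.txt, as the registration steps on this page describe.

```shell
# Hypothetical helper: report whether the UUID file and the token file exist
# and are non-empty. It never prints the token, which should stay secret.
check_identity_files() {
    data_dir=$1
    conf_dir=$2
    status=0
    for f in "$data_dir/sdc.id" "$conf_dir/application-token.txt"; do
        if [ -s "$f" ]; then
            echo "present: $f"
        else
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}

# Usage, assuming both variables are set for your installation:
# check_identity_files "$SDC_DATA" "$SDC_CONF"
```

A missing token file is expected on a Data Collector that has not been registered yet.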

To manually administer Data Collectors, click Administration > Data Collectors in the Navigation panel. You can complete the following administrative tasks:
  • Register Data Collectors with Control Hub.
  • For Data Collectors installed on a cloud computing platform, configure the Data Collector to use a publicly accessible URL.
  • Delete unregistered authentication tokens if you generated tokens but did not record the tokens.
  • Regenerate an authentication token for a registered Data Collector.
  • Unregister Data Collectors from Control Hub.

Register Data Collectors

To register a Data Collector with Control Hub, you generate an authentication token and modify the Data Collector configuration files.

The method you use to register a Data Collector depends on the Data Collector installation type:
Tarball installation
You can register the Data Collector from the Data Collector UI, from the command line interface, or from Control Hub.
RPM installation
You must register the Data Collector from Control Hub.
Cloudera Manager installation
You must register the Data Collector from Cloudera Manager.

A registered Data Collector communicates with Control Hub at regular intervals. If a Data Collector cannot connect to Control Hub because of a network or system outage, the Data Collector uses Control Hub disconnected mode.

Before you register a Data Collector, complete the prerequisites: verify that the statistics stage library is installed, and create new Control Hub users and groups. After registration, be sure to transfer Data Collector permissions to the new Control Hub users and groups.

Registration Prerequisites

Before you register Data Collectors, complete the following prerequisites:

Verify that the statistics stage library is installed.

Control Hub requires that the statistics stage library be installed on each registered Data Collector. Control Hub requires the library to run system pipelines on the Data Collector. By default, all Data Collector installations include the statistics stage library.

To verify that a Data Collector has the statistics stage library installed, click the Package Manager icon to display the list of installed stage libraries. If the statistics library was uninstalled, see the instructions for installing additional stage libraries in the Data Collector documentation.
Create a Control Hub user account and group for each Data Collector user and group.

If your organization currently uses Data Collector, you must create a Control Hub user account and group for each Data Collector user account and group, assigning the corresponding Data Collector roles to each. After a Data Collector is registered with Control Hub, the Data Collector uses Control Hub user authentication. Only Control Hub user accounts can log in to registered Data Collectors.

If your organization is not using Data Collector, you can skip this prerequisite. You can create new Control Hub user accounts and groups at any time.

Registered Data Collectors belong to the organization of the user who completes the registration. All user accounts that belong to the same organization and that have the appropriate roles can log in to the registered Data Collectors.

For more information, see Create User Accounts for Your Organization.
Tip: If Data Collector uses file-based authentication and if you register the Data Collector from the Data Collector UI, you can skip this step and create Control Hub user accounts during the registration process, as described in Registering from Data Collector.

Registering from Control Hub

When you register from Control Hub, you generate an authentication token. Then, you edit Data Collector configuration files to register the token with the Data Collector and to enable communication with Control Hub.

Note: For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.

When you register from Control Hub, you can generate multiple authentication tokens at one time.

  1. Log in to Control Hub using your Control Hub user account.
  2. In the Navigation panel, click Administration > Data Collectors.
    Control Hub displays the Data Collectors that have already been registered.
  3. Click the Generate Authentication Tokens icon.
  4. Enter the number of tokens to generate.
  5. Click Generate.
    The Authentication Tokens window displays each generated token.
  6. Record the generated tokens.
    You can copy the tokens from the window. Or, you can click Download to download all generated tokens to a JSON file named authTokens.json.
    Note: If you close the window before recording the tokens, you cannot retrieve the tokens. You can delete unregistered authentication tokens, as described in Deleting Unregistered Tokens.
  7. Click Close in the Authentication Tokens window.
  8. Complete the following steps for each Data Collector that you want to register:
    1. Open the $SDC_CONF/application-token.txt file for the Data Collector, and copy a token into the file.
      Each Data Collector must use a unique authentication token.
    2. Open the $SDC_CONF/dpm.properties file for the Data Collector and edit the following properties:
      Property       Description
      dpm.enabled    Set to true.
      dpm.base.url   Set to https://cloud.streamsets.com.

    3. Open the $SDC_CONF/sdc.properties file for the Data Collector and verify that the http.authentication.login.module property is set to file.
      Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.
    4. Restart the Data Collector.

    When you log in to the registered Data Collector using a Control Hub user account, the StreamSets Control Hub Enabled icon displays.

    To enable Control Hub users to work with pipelines, use Data Collector to transfer permissions to the Control Hub users and groups.
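Sub-steps 1 and 2 of step 8 can be scripted when you register many Data Collectors. The following is a minimal sketch under stated assumptions, not an official tool: both function names are hypothetical, and it assumes that dpm.properties uses plain key=value lines with no spaces around the equals sign. Checking the login module and restarting the Data Collector remain separate steps.

```shell
# Hypothetical helper: set key=value in a properties file, replacing an
# existing uncommented line for the key or appending one if none exists.
set_property() {
    file=$1
    key=$2
    value=$3
    [ -f "$file" ] || return 1
    tmp="$file.tmp.$$"
    awk -v k="$key" -v v="$value" '
        index($0, k "=") == 1 { print k "=" v; done = 1; next }
        { print }
        END { if (!done) print k "=" v }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Hypothetical helper: apply sub-steps 1 and 2 to one configuration directory.
register_with_token() {
    conf_dir=$1
    token=$2
    # Sub-step 1: each Data Collector must use its own unique token.
    printf '%s' "$token" > "$conf_dir/application-token.txt"
    # Sub-step 2: enable Control Hub and point to its base URL.
    set_property "$conf_dir/dpm.properties" dpm.enabled true &&
    set_property "$conf_dir/dpm.properties" dpm.base.url https://cloud.streamsets.com
}

# Usage, assuming SDC_CONF is set and TOKEN holds one generated token:
# register_with_token "$SDC_CONF" "$TOKEN"
```

Writing to a temporary file and then moving it into place avoids relying on in-place editing options that differ between sed implementations.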

Registering from Data Collector

When you register from the Data Collector UI, Data Collector generates the authentication token and modifies the configuration files for you. You can also create Control Hub user accounts and groups during the registration process.

Note: For a Data Collector RPM installation, you must use Control Hub to register the Data Collector. For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.
  1. Log in to Data Collector using your Data Collector user account.
  2. Click Administration > Enable Control Hub.
  3. Enter the following information in the Enable Control Hub window:
    Property                         Description
    Control Hub Base URL             URL to access Control Hub. Set to https://cloud.streamsets.com.
    Control Hub User ID              Enter your Control Hub user ID using the following format: <ID>@<organization ID>
    Control Hub User Password        Enter the password for your Control Hub user account.
    Labels for this Data Collector   Assign a label to this Data Collector. Labels that you assign here are defined in the Control Hub configuration file, $SDC_CONF/dpm.properties. To remove these labels after you register the Data Collector, you must modify the configuration file. Default is all, which you can use to run a job on all registered Data Collectors. For more information about labels, see Labels Overview.

  4. Click Enable Control Hub.
  5. Optionally, you can choose to create a Control Hub user account and group for each Data Collector user account and group.
    The Create Control Hub Groups and Users window maps all existing Data Collector user accounts and groups to Control Hub user accounts and groups. You can remove users or groups and can edit the IDs, names, and email addresses as needed.

    When you have finished reviewing the users and groups to create, click Create. Each new user is assigned a default set of Control Hub roles. Groups are not assigned any roles. After you log in to Control Hub, change those role assignments as needed to secure the integrity of your organization and data.

  6. Restart Data Collector.
  7. Open the Data Collector configuration file, $SDC_CONF/sdc.properties, and verify that the http.authentication.login.module property is set to file.
    Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.

    When you log in to the registered Data Collector using a Control Hub user account, the StreamSets Control Hub Enabled icon displays.

    To enable Control Hub users to work with pipelines, use Data Collector to transfer permissions to the Control Hub users and groups.
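The login module check in step 7 can be done from a shell instead of opening the file. This is a small sketch with a hypothetical function name; it assumes that sdc.properties sets the property on a single uncommented line.

```shell
# Hypothetical helper: succeed only if the file sets
# http.authentication.login.module to file on an uncommented line.
login_module_is_file() {
    grep -Eq '^http\.authentication\.login\.module[ ]*=[ ]*file[ ]*$' "$1"
}

# Usage, assuming SDC_CONF is set for your installation:
# login_module_is_file "$SDC_CONF/sdc.properties" ||
#     echo "Set http.authentication.login.module to file"
```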

Registering from the Command Line Interface

When you register from the Data Collector command line interface, Data Collector generates the authentication token and modifies the configuration files for you. The Data Collector must be running before you can use the command line interface.

For a Data Collector RPM installation, you must use Control Hub to register the Data Collector. For a Data Collector installation with Cloudera Manager, you must use Cloudera Manager to register the Data Collector.

Note: To use an automation tool such as Ansible, Chef, or Puppet to automate the registering of Data Collectors, configure the tool to use the streamsets sch register command. The command must be run on the local machine where Data Collector is installed. The Data Collector does not need to be running. See the command line help for the list of available options. If you choose to skip updating the dpm.properties configuration file, you must configure the automation tool to update the file.

Start Data Collector, and then use the system enableDPM command to register the Data Collector.

Use the command from the $SDC_DIST directory as follows:

bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
system enableDPM \
(--dpmUrl <dpmBaseURL>) \
(--dpmUser <dpmUserID>) \
(--dpmPassword <dpmUserPassword>) \
[(--labels <labels>)]

When using the system enableDPM command, the following basic options are required:

Basic Option                       Description
-U <sdcURL> or --url <sdcURL>      Required. URL of the Data Collector. The default URL is http://localhost:18630.
-D <dpmURL> or --dpmURL <dpmURL>   Required. URL to access Control Hub. Set to https://cloud.streamsets.com.

The following table describes the enableDPM options:

Enable DPM Option                 Description
--dpmUrl <dpmBaseURL>             URL to access Control Hub. Set to https://cloud.streamsets.com.
--dpmUser <dpmUserID>             Required. Enter your Control Hub user ID using the following format: <ID>@<organization ID>
--dpmPassword <dpmUserPassword>   Required. Enter the password for your Control Hub user account.
--labels <labels>                 Optional. Assign a label to this Data Collector. You can enter multiple labels separated by commas. Labels that you assign here are defined in the Control Hub configuration file, $SDC_CONF/dpm.properties. To remove these labels after you register the Data Collector, you must modify the configuration file. Default is "all", which you can use to run a job on all registered Data Collectors. For more information about labels, see Labels Overview.

For example, the following command registers a Data Collector with Control Hub and assigns three labels to the Data Collector:
bin/streamsets cli -U http://localhost:18630 -D https://cloud.streamsets.com system enableDPM --dpmUrl https://cloud.streamsets.com --dpmUser alison@MyOrg --dpmPassword MyPassword --labels Finance,Accounting,Development
Important: Open the Data Collector configuration file, $SDC_CONF/sdc.properties, and verify that the http.authentication.login.module property is set to file. Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.

Restart the Data Collector to apply the changes.

When you log in to the registered Data Collector using a Control Hub user account, the StreamSets Control Hub Enabled icon displays.

To enable Control Hub users to work with pipelines, use Data Collector to transfer permissions to the Control Hub users and groups.
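If you prefer not to type the password into the command itself, the enableDPM command described in this section can be wrapped so that its values come from environment variables. This is only a sketch: the function name and the variable names (SDC_URL, DPM_URL, DPM_USER, DPM_PASSWORD, DPM_LABELS, RUN) are hypothetical, and the options are limited to those documented above.

```shell
# Hypothetical wrapper around the documented enableDPM command. Run it from
# the $SDC_DIST directory. Set RUN=echo to print the command instead of
# running it (the printed command includes the password).
enable_dpm() {
    sdc_url=${SDC_URL:-http://localhost:18630}
    dpm_url=${DPM_URL:-https://cloud.streamsets.com}
    # The :? expansions stop the shell with a message if a value is missing.
    ${RUN:-} bin/streamsets cli -U "$sdc_url" -D "$dpm_url" \
        system enableDPM --dpmUrl "$dpm_url" \
        --dpmUser "${DPM_USER:?set DPM_USER to <ID>@<organization ID>}" \
        --dpmPassword "${DPM_PASSWORD:?set DPM_PASSWORD}" \
        --labels "${DPM_LABELS:-all}"
}

# Dry run, assuming the password was exported earlier in the session:
# RUN=echo DPM_USER=alison@MyOrg DPM_LABELS=Finance,Accounting enable_dpm
```

The password still reaches the command as an argument, so this only keeps it out of scripts and shell history, not out of the process list.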

Registering from Cloudera Manager

If you installed Data Collector through Cloudera Manager, you must use Cloudera Manager to register the Data Collector with Control Hub.

  1. In Cloudera Manager, select the StreamSets service, then click Configuration.
  2. Enter "Control Hub" in the search field to display the Control Hub configuration properties.
  3. Configure the following properties:
    Property               Description
    Enable Control Hub     Select to enable Control Hub.
    Control Hub URL        URL to access Control Hub. Set to https://cloud.streamsets.com.
    Control Hub User ID    Enter your Control Hub user ID using the following format: <ID>@<organization ID>
    Control Hub Password   Enter the password for your Control Hub user account.
    Control Hub Labels     Assign a label to this Data Collector. Labels that you assign here are defined in the Control Hub configuration file, $SDC_CONF/dpm.properties. To remove these labels after you register the Data Collector, you must modify the labels through Cloudera Manager. Default is "all", which you can use to run a job on all registered Data Collectors. For more information about labels, see Labels Overview.

  4. Set the HTTP Authentication Login Module property to file.
    Control Hub requires that each registered Data Collector be configured for file-based authentication. After a Data Collector is registered with Control Hub, the Data Collector uses the authentication method enabled for Control Hub.
  5. Click Save Changes.
  6. Click Actions > Restart to restart the Data Collector.

    When you log in to the registered Data Collector using a Control Hub user account, the StreamSets Control Hub Enabled icon displays.

    To enable Control Hub users to work with pipelines, use Data Collector to transfer permissions to the Control Hub users and groups.

Using a Publicly Accessible URL

If you register a Data Collector that is installed on a cloud computing platform such as Amazon Elastic Compute Cloud (EC2), configure the Data Collector to use a publicly accessible URL.

When you register a Data Collector with Control Hub, the Data Collector sends its URL to Control Hub in the format http://<hostname>:<http.port>, where <hostname> is the value defined in the http.bindHost property in the Data Collector configuration file, $SDC_CONF/sdc.properties. If the host name is not defined in http.bindHost, Data Collector runs the following command to determine the host name: hostname -f

For most cloud computing platforms, the hostname -f command returns the private IP address of the machine on the cloud platform. Control Hub includes the private IP address in the Data Collector URL displayed in Control Hub. However, when you click the Data Collector URL, you cannot access the Data Collector because you must use a public IP address to access a cloud machine.
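To see which URL a Data Collector would report under the rule described above, the lookup can be reproduced by hand. The sketch below only mirrors the documented behavior and is not Data Collector's actual implementation. The function names are hypothetical, and it assumes that http.bindHost and http.port appear as plain key=value lines in sdc.properties.

```shell
# Hypothetical helper: print the value of the last uncommented key=value
# line for the given key, or nothing if the key is not defined.
get_property() {
    awk -v k="$2" '
        index($0, k "=") == 1 { v = substr($0, length(k) + 2) }
        END { if (v != "") print v }
    ' "$1"
}

# Illustration of the documented rule: use http.bindHost when it is defined,
# otherwise fall back to the output of hostname -f.
reported_url() {
    props=$1
    host=$(get_property "$props" http.bindHost)
    port=$(get_property "$props" http.port)
    [ -n "$host" ] || host=$(hostname -f)
    echo "http://$host:$port"
}

# Usage, assuming SDC_CONF is set for your installation:
# reported_url "$SDC_CONF/sdc.properties"
```

If the printed host is a private address, Control Hub users outside that network cannot follow the link.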

To access a Data Collector installed on a cloud computing platform from Control Hub, uncomment the sdc.base.http.url property in the Data Collector configuration file, $SDC_CONF/sdc.properties, and then configure it to use the publicly accessible URL to that Data Collector.

After modifying the configuration file, restart Data Collector for the changes to take effect.
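The edit to sdc.properties can also be scripted. This is a minimal sketch, assuming that the property appears in the file as a commented or uncommented sdc.base.http.url= line. The function name and the example URL are hypothetical.

```shell
# Hypothetical helper: replace any commented or uncommented
# sdc.base.http.url line with the public URL, or append the property
# if the file has no such line.
set_base_url() {
    file=$1
    url=$2
    tmp="$file.tmp.$$"
    awk -v u="$url" '
        /^#?[ \t]*sdc\.base\.http\.url=/ {
            if (!done) print "sdc.base.http.url=" u
            done = 1
            next
        }
        { print }
        END { if (!done) print "sdc.base.http.url=" u }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage, with a placeholder URL that you replace with your own:
# set_base_url "$SDC_CONF/sdc.properties" "http://sdc.example.com:18630"
```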

Deleting Unregistered Tokens

Delete unregistered authentication tokens if you used Control Hub to generate the tokens but did not copy or download them from the Authentication Tokens window.

Control Hub lists the number of unregistered authentication tokens in the Data Collector Administration view. To view the number and delete the tokens:

  1. In the Navigation panel, click Administration > Data Collectors.
  2. Click the Toggle Filter Column icon to view the number of unregistered authentication tokens.
  3. Click the More icon, and then click Delete Unregistered Authentication Tokens.

Regenerate a Token

You can regenerate an authentication token for a Data Collector. You might need to regenerate a token to replace a token that has been compromised or to follow your organization's security policy.

For a Cloudera Manager installation, you must regenerate authentication tokens from Cloudera Manager. For all other installations, you regenerate authentication tokens from Control Hub.

Regenerating from Control Hub

When you regenerate an authentication token for a Data Collector, you replace the previous authentication token with a new one. You must copy the new token into the $SDC_CONF/application-token.txt file for the Data Collector.

  1. In the Navigation panel, click Administration > Data Collectors.
  2. Select a registered Data Collector to display its details.
  3. Click Regenerate Authentication Token.
    The Authentication Tokens window displays the regenerated token.
  4. Record the regenerated token.
    You can copy the token from the window. Or, you can click Download to download the token to a JSON file named authTokens.json.
    Note: If you close the window before recording the token, you cannot retrieve the token. You can delete unregistered authentication tokens, as described in Deleting Unregistered Tokens.
  5. To register the Data Collector with the newly generated token, copy the token to the $SDC_CONF/application-token.txt file and restart the Data Collector.

Regenerating from Cloudera Manager

If you installed Data Collector through Cloudera Manager, you must regenerate an authentication token from Cloudera Manager.

When you regenerate an authentication token, you replace the previous authentication token with a new one. You can regenerate a token for a single Data Collector instance or for all Data Collector instances included in the StreamSets service.

  1. In Cloudera Manager, select the StreamSets service, and then click Actions > Stop.
  2. Complete one of the following actions, based on whether you are regenerating tokens for all Data Collector instances or for a single instance:
    • All instances - Click Actions > Regenerate Control Hub Tokens.
    • Single instance - Click Instances, select a Data Collector instance, and then click Actions > Regenerate Control Hub Token.
  3. Click Actions > Restart to restart the Data Collector.

Unregister Data Collectors

You can unregister a Data Collector from Control Hub when you no longer want to use that Data Collector installation with Control Hub.

When you restart an unregistered Data Collector, previously configured Data Collector user accounts become immediately available unless changed in the interim. Use your Data Collector user account to log in.

Pipeline permissions, however, are not automatically reverted. To ensure that users can access pipelines, use Data Collector to transfer pipeline permissions from the obsolete Control Hub users and groups back to Data Collector users and groups. Or, you can edit pipeline permissions individually.

The method you use to unregister a Data Collector depends on the Data Collector installation type:
Tarball installation
You can unregister the Data Collector from the Data Collector UI, from the command line interface, or from Control Hub.
RPM installation
You must unregister the Data Collector from Control Hub.
Cloudera Manager installation
You must unregister the Data Collector from both Control Hub and Cloudera Manager.

Unregistering from Control Hub

When you unregister a Data Collector from Control Hub, Control Hub deactivates the authentication token. Then, you modify Data Collector configuration files to remove the token from the Data Collector and to disable communication with Control Hub.

Note: For a Data Collector installation with Cloudera Manager, you must use both Control Hub and Cloudera Manager to unregister the Data Collector.
  1. In Control Hub, stop all jobs running on the Data Collector.
  2. Log in to the Data Collector, and shut it down.
  3. In Control Hub, click Administration > Data Collectors in the Navigation panel.
  4. Hover over the Data Collector that you shut down, and then click the Delete icon.
  5. In the confirmation dialog box, click Delete and Unregister.
  6. On the machine where the Data Collector is installed, open the $SDC_CONF/application-token.txt file, and remove the authentication token from the file.
  7. Open the $SDC_CONF/dpm.properties file for the Data Collector, and set the dpm.enabled property to false.

After restarting Data Collector, use your Data Collector user account to log in.

To ensure that users can access pipelines, use Data Collector to transfer pipeline permissions from the obsolete Control Hub users and groups back to Data Collector users and groups. Or, you can edit Data Collector pipeline permissions individually.
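Steps 6 and 7 above can be scripted on the Data Collector machine. This is a minimal sketch with a hypothetical function name; it assumes that dpm.properties contains an uncommented dpm.enabled= line.

```shell
# Hypothetical helper: apply steps 6 and 7 to one configuration directory.
unregister_config() {
    conf_dir=$1
    # Step 6: remove the token by truncating the file.
    : > "$conf_dir/application-token.txt"
    # Step 7: disable communication with Control Hub.
    props="$conf_dir/dpm.properties"
    tmp="$props.tmp.$$"
    sed 's/^dpm\.enabled=.*$/dpm.enabled=false/' "$props" > "$tmp" &&
        mv "$tmp" "$props"
}

# Usage, after the Data Collector has been shut down and deleted in Control Hub:
# unregister_config "$SDC_CONF"
```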

Unregistering from Data Collector

You can unregister a Data Collector from Control Hub using the Data Collector UI. When you unregister from the Data Collector UI, Data Collector deactivates the authentication token and modifies the configuration files for you.

Note: For a Data Collector RPM installation, you must use Control Hub to unregister the Data Collector. For a Data Collector installation with Cloudera Manager, you must use both Control Hub and Cloudera Manager to unregister the Data Collector.

  1. In Control Hub, stop all jobs running on the Data Collector.
  2. Log in to the Data Collector and click Administration > Disable Control Hub.
    The Disable Control Hub Confirmation dialog box appears.
  3. To disable Data Collector from working with Control Hub, click Yes.
  4. Restart Data Collector.

After restarting Data Collector, use your Data Collector user account to log in.

To ensure that users can access pipelines, use Data Collector to transfer pipeline permissions from the obsolete Control Hub users and groups back to Data Collector users and groups. Or, you can edit Data Collector pipeline permissions individually.

Unregistering from the Command Line Interface

You can unregister a Data Collector from Control Hub using the Data Collector command line interface. When you unregister from the Data Collector command line interface, Data Collector deactivates the authentication token and modifies the configuration files for you.

For a Data Collector RPM installation, you must use Control Hub to unregister the Data Collector. For a Data Collector installation with Cloudera Manager, you must use both Control Hub and Cloudera Manager to unregister the Data Collector.

Note: To use an automation tool such as Ansible, Chef, or Puppet to automate the unregistering of Data Collectors, configure the tool to use the streamsets sch unregister command. The command must be run on the local machine where Data Collector is installed. The Data Collector does not need to be running. See the command line help for the list of available options. If you choose to skip updating the dpm.properties configuration file, you must configure the automation tool to update the file.

Start the Data Collector, and then use the system disableDPM command to unregister the Data Collector.

Use the command from the $SDC_DIST directory as follows:
bin/streamsets cli \
(-U <sdcURL> | --url <sdcURL>) \
[(-a <sdcAuthType> | --auth-type <sdcAuthType>)] \
[(-D <dpmURL> | --dpmURL <dpmURL>)] \
[(-u <sdcUser> | --user <sdcUser>)] \
[(-p <sdcPassword> | --password <sdcPassword>)] \
system disableDPM

When using the system disableDPM command, the following basic options are required:

Basic Option                                    Description
-U <sdcURL> or --url <sdcURL>                   Required. URL of the Data Collector. The default URL is http://localhost:18630.
-a <sdcAuthType> or --auth-type <sdcAuthType>   Required. Authentication type used by the Data Collector. Set to dpm. If you omit this option, Data Collector uses the Form authentication type, which causes the disableDPM command to fail.
-D <dpmURL> or --dpmURL <dpmURL>                Required. URL to access Control Hub. Set to https://cloud.streamsets.com.
-u <sdcUser> or --user <sdcUser>                Required. Enter your Control Hub user ID using the following format: <ID>@<organization ID> If you omit this option, Data Collector uses the admin user account, which causes the disableDPM command to fail.
-p <sdcPassword> or --password <sdcPassword>    Required. Enter the password for your Control Hub user account.
For example, the following command unregisters a Data Collector from Control Hub:
bin/streamsets cli -U http://localhost:18630 -a dpm -D https://cloud.streamsets.com -u alison@MyOrg -p MyPassword system disableDPM

Restart the Data Collector to apply the changes.

After restarting Data Collector, use your Data Collector user account to log in.

To ensure that users can access pipelines, use Data Collector to transfer pipeline permissions from the obsolete Control Hub users and groups back to Data Collector users and groups. Or, you can edit Data Collector pipeline permissions individually.

Unregistering from Control Hub and Cloudera Manager

If you installed a Data Collector through Cloudera Manager, you must use both Control Hub and Cloudera Manager to unregister the Data Collector.

You use Control Hub to deactivate the authentication token. Then, you use Cloudera Manager to modify Data Collector configuration properties and files.

  1. In Control Hub, stop all jobs running on the Data Collector.
  2. In Cloudera Manager, shut down the Data Collector.
  3. In Control Hub, click Administration > Data Collectors in the Navigation panel.
  4. Hover over the Data Collector that you shut down, and then click the Delete icon.
  5. In the confirmation dialog box, click Delete and Unregister.
  6. In Cloudera Manager, select the StreamSets service, then click Configuration.
  7. Enter "Control Hub" in the search field to display the Control Hub configuration properties.
  8. Clear the Enable Control Hub property.
  9. Find the location of the authentication token file in the Control Hub Token Location property, and then delete the file.

After restarting Data Collector, use your Data Collector user account to log in.

To ensure that users can access pipelines, use Data Collector to transfer pipeline permissions from the obsolete Control Hub users and groups back to Data Collector users and groups. Or, you can edit Data Collector pipeline permissions individually.