Data Collector Configuration

You can set Data Collector configuration properties, such as the host name and port number, when you configure the deployment.

You can protect sensitive data in Data Collector configuration properties by storing the data in an external location and then using functions provided with the StreamSets expression language to retrieve the data. You can also reference information in an environment variable.

You can define runtime properties in Data Collector configuration properties or in a separate file. For more information, see Runtime Properties.

Kerberos Authentication

You can use Kerberos authentication to connect to external systems as well as YARN clusters.

By default, Data Collector uses the user account that started it to connect to external systems. When you enable Kerberos, it can use the Kerberos principal to connect to external systems.

You can configure Kerberos authentication for the following stages:
  • Hadoop FS Standalone origin
  • Kafka Multitopic Consumer origin
  • MapR FS Standalone origin
  • HBase Lookup processor
  • Hive Metadata processor
  • Kudu Lookup processor
  • Cassandra destination, when the DataStax Enterprise Java driver is installed
  • Hadoop FS destination
  • HBase destination
  • Hive Metastore destination
  • Kafka Producer destination
  • Kudu destination
  • MapR DB destination
  • MapR FS destination
  • Solr destination
  • HDFS File Metadata executor
  • MapR FS File Metadata executor
  • MapReduce executor
  • Spark executor

To enable Data Collector to use Kerberos authentication, use the required procedure for your installation type.

Enabling Kerberos for Tarball

To enable Kerberos authentication for a tarball installation, perform the following steps:

  1. On Linux, install the following Kerberos client packages on the Data Collector machine:
    • krb5-workstation
    • krb5-client
  2. Copy the Kerberos configuration file, krb5.conf, to the Data Collector machine. The default location is /etc/krb5.conf.

    The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of host names onto Kerberos realms.

  3. Configure Data Collector to use Kerberos based on the stage types. If enabling Kerberos for both Kafka and non-Kafka stages, use both methods.
    • Non-Kafka stages - To enable Kerberos for non-Kafka stages, configure Data Collector to use Kerberos by modifying the Data Collector configuration properties. Data Collector uses the same Kerberos principal for each stage.

      In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration. To enable Kerberos and define the principal and keytab, configure the following Kerberos properties:

      • kerberos.client.enabled
      • kerberos.client.principal
      • kerberos.client.keytab
    • Kafka stages - To enable Kerberos for Kafka stages, configure the Kerberos properties in the Java Authentication and Authorization Service (JAAS) configuration file used by Data Collector when you configure the stage to use Kerberos. You can configure each Kafka stage to use a different Kerberos principal.
  4. Restart Data Collector.
  5. Configure the stage to use Kerberos.
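As an illustration of step 3 for non-Kafka stages, the three Kerberos properties might be set as follows in the Data Collector configuration properties. This is a sketch only: the principal and the keytab file name are placeholders, and the keytab is assumed to sit in the Data Collector configuration directory.

```properties
# Hypothetical values: replace with your own service principal and keytab.
kerberos.client.enabled=true
kerberos.client.principal=sdc/sdc-host.example.com@EXAMPLE.COM
kerberos.client.keytab=sdc.keytab
```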

Sending Email

You can configure email configuration properties to enable Data Collector to send email notifications.

For Data Collector pipelines, Data Collector can send email in the following ways:
  • Email alert - Sends a basic email when an email-enabled alert is triggered, such as when the error record threshold has been reached.
  • Pipeline notification - Sends a basic email when the pipeline state changes to a specified state. For example, you might use pipeline notification to send an email when a pipeline transitions to a Run_Error or Finished state.
  • Email executor - Sends a custom email upon receiving an event from an event-generating stage. Use in an event stream to send a user-defined email. You can include expressions to provide information about the pipeline or event in the email.

    For example, you might use an Email executor to send an email upon receiving a failed-query event from the Hive Query executor, and you can include the failed query in the message.

To enable sending email, in the Data Collector configuration properties, configure the mail.transport.protocol property, and then configure the smtp/smtps properties and the xmail properties. For more information, see Configuring Data Collector.
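For instance, a deployment that sends email through an SMTPS server might combine the properties roughly as follows. The host name and sender address are placeholders, and the file and script names are the example names used elsewhere in this documentation rather than required values.

```properties
# Sketch only: host, port, and addresses are assumptions for illustration.
mail.transport.protocol=smtps
mail.smtps.host=smtp.example.com
mail.smtps.port=465
mail.smtps.auth=true
xmail.username=${file("email_username.txt")}
xmail.password=${exec("email_pwd.sh")}
xmail.from.address=sdc-alerts@example.com
```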

Protecting Sensitive Data in Configuration Properties

You can protect sensitive data in Data Collector configuration properties by storing the data in an external location and then using the file or exec function to retrieve the data.

You configure Data Collector configuration properties in the advanced configuration properties of the deployment.

Some configuration properties, such as the https.keystore.password property, require that you enter a password. Instead of entering the password in clear text, you can store the password outside of the configuration properties and then use the file or exec function to retrieve the sensitive data.

You can use functions to retrieve sensitive data in the following ways:
  • From a file - Store the sensitive data in a separate file and then use the file function in the configuration properties to retrieve the data as follows:

    ${file("<filename>")}

    For example, if you configure the xmail.username property as follows, Data Collector retrieves the user name from the email_username.txt file uploaded as an external resource for the deployment, as described in the Control Hub documentation:

    xmail.username=${file("email_username.txt")}

    Retrieving sensitive data from another file provides some level of security. However, the sensitive data in the additional file is still stored in clear text and can be read by anyone with access to the file. For increased security, use a script or executable to retrieve the sensitive data.

  • Using a script or executable - Develop a script or executable that retrieves the sensitive data from an external location. For example, you can develop a script that decrypts an encrypted file containing a password. Or you can develop a script that calls an external REST API to retrieve a password from a remote vault system.

    Use the exec function in the configuration properties to call the script or executable as follows:

    ${exec("<script name>")}

    For example, if you configure the xmail.password property as follows, Data Collector runs the email_pwd.sh script to retrieve the password:

    xmail.password=${exec("email_pwd.sh")}

When you use either the file or the exec function, Data Collector uses the exact output of the file or script. If the output is a password followed by a newline character, Data Collector uses the value including the newline character, which results in a password that is not valid. Carefully design and test the output of the file or script to ensure that the functions return only the expected sensitive data.

Retrieving Sensitive Data from Files

Use the file function in a configuration property to retrieve sensitive data from a local file.

Each file can store a single piece of information. When Data Collector starts, it retrieves the sensitive data from the referenced files.

  1. Create a text file for each configuration value that you want to safeguard. Include only one configuration value in each file.
    Ensure that the file does not include extra characters, such as a newline character, after the sensitive data. For example, you might run the following command to ensure that the file does not include a newline character:
    echo -n '<password>' > password-file.txt
  2. Save the file in a local directory that Data Collector can access.
    You can save the file in the configuration directory and then simply enter the file name when you use the file function.
  3. In the Data Collector configuration properties, set the relevant value to the file function and the appropriate file path and name.
    You can enter an absolute path to the file or you can simply enter the file name if you stored the file in the Data Collector configuration directory, <installation_dir>/etc. For example:
    ${file("password-file.txt")}
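The behavior of echo -n varies between shells, so a more portable way to write the file in step 1 is printf. The following sketch writes a placeholder value to password-file.txt and checks that the byte count matches the length of the value, which confirms that no trailing newline was added. The value shown is an assumption for illustration only.

```shell
#!/bin/sh
# Sketch only: the value below is a placeholder, not a real password.
secret='example-password'

# printf '%s' writes the value with no trailing newline.
printf '%s' "$secret" > password-file.txt

# The byte count should equal the length of the value (16 characters here).
bytes=$(wc -c < password-file.txt | tr -d ' ')
echo "password-file.txt contains $bytes bytes"
```

If the reported byte count is one more than the length of the value, the file most likely ends with a newline character.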

Retrieving Sensitive Data Using Scripts

Use the exec function in a configuration property to call a script or executable that retrieves sensitive data from an external location.

You must save the script on the local machine where Data Collector runs. When Data Collector starts, it runs the script to retrieve the sensitive data.

  1. Develop a script or executable to retrieve each configuration value that you want to safeguard.
    Ensure that the script or executable does not output extra characters, such as a newline character, after the sensitive data.
  2. Save the script or executable on the local machine where Data Collector runs.
  3. In the Data Collector configuration properties, set the relevant value to the exec function and use the script or executable file name for the argument. Use the required syntax as follows:
    ${exec("<script name>")}
    If you save the script in the Data Collector configuration directory, <installation_dir>/etc, enter just the script name for the argument, for example:
    ${exec("email_pwd.sh")}
    If you save the script outside of the Data Collector configuration directory, enter an absolute path for the script name, for example:
    ${exec("/tmp/email_pwd.sh")}
    Important: Enter only the script or executable file name as the function argument. You cannot include parameters for the script within the argument. For example, ${exec("email_pwd.sh -name my_user")} is not a valid argument. If the script or executable requires parameters, design a wrapper script to call the original script with the corresponding parameters and then call the wrapper script from the exec function.
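The wrapper approach can be sketched as follows. The example builds a hypothetical parameterized script, get_pwd.sh, and a wrapper named email_pwd.sh in a scratch directory, then calls the wrapper without arguments. The script names, the -name parameter handling, and the returned value are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: all file names and the returned value are placeholders.
workdir=$(mktemp -d)

# Original script: expects "-name <user>" and prints a value with no newline.
cat > "$workdir/get_pwd.sh" <<'EOF'
#!/bin/sh
[ "$1" = "-name" ] || exit 1
printf '%s' "password-for-$2"
EOF

# Wrapper: takes no arguments, so it can be named in the exec function.
cat > "$workdir/email_pwd.sh" <<EOF
#!/bin/sh
exec "$workdir/get_pwd.sh" -name my_user
EOF

chmod +x "$workdir/get_pwd.sh" "$workdir/email_pwd.sh"

# Calling the wrapper with no arguments returns the value for my_user.
result=$("$workdir/email_pwd.sh")
echo "wrapper returned: $result"

# Remove the scratch directory.
rm -rf "$workdir"
```

With a wrapper like this saved in the Data Collector configuration directory, the property value stays in the supported form, for example xmail.password=${exec("email_pwd.sh")}.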

Referencing Environment Variables

You can reference an environment variable in the Data Collector configuration properties, as follows:
${env("<environment variable>")} 

You can also use this format to define runtime properties in the Data Collector configuration properties.
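For example, a property might take its value from an environment variable as in the following sketch. SDC_MAIL_FROM is a hypothetical variable name that would need to be set in the environment of the Data Collector process.

```properties
# SDC_MAIL_FROM is a hypothetical environment variable, used for illustration.
xmail.from.address=${env("SDC_MAIL_FROM")}
```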

Running Multiple Concurrent Pipelines

By default, Data Collector can run approximately 22 standalone pipelines concurrently. If you plan to run a larger number of pipelines at the same time, increase the thread pool size.

The runner.thread.pool.size property in the Data Collector configuration properties determines the number of threads in the pool that are available to run standalone pipelines. One running pipeline requires five threads, and pipelines share threads in the pool.

  1. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration.
  2. Calculate the approximate runner thread pool size by multiplying the number of running pipelines by 2.2.
  3. Set the runner.thread.pool.size property to your calculated value.
  4. Save the changes to the deployment and restart all engine instances.
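The calculation in step 2 can be sketched with integer shell arithmetic. The pipeline count of 100 is an example value, and rounding up to the next whole number is an assumption made here, since the property takes a whole number of threads.

```shell
#!/bin/sh
# Example input: the number of pipelines expected to run concurrently.
pipelines=100

# Multiply by 2.2 using integer math (x22, then divide by 10), rounding up.
pool_size=$(( (pipelines * 22 + 9) / 10 ))

# Print the property line to add to the Data Collector configuration properties.
echo "runner.thread.pool.size=$pool_size"
```

For 100 pipelines, this prints runner.thread.pool.size=220.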

Hadoop Impersonation Mode

You can configure how Data Collector impersonates a Hadoop user when performing tasks, such as reading or writing data, in Hadoop systems.

By default, Data Collector impersonates Hadoop users as follows:
  • As the user defined in stage properties - When configured, Data Collector uses the user defined in Hadoop-related stages.
  • As the currently logged in Data Collector user who starts the pipeline - When no user is defined in a Hadoop-related stage, Data Collector uses the user who starts the pipeline.
Note: In both cases, the Hadoop systems must be configured to allow the impersonation.

The system administrator can configure Data Collector to always use the user who starts the pipeline by enabling the stage.conf_hadoop.always.impersonate.current.user property in the Data Collector configuration properties. When enabled, configuring a user within a stage is not allowed.

Configure Data Collector to always impersonate the user who starts the pipeline when you want to prevent stage-level user properties from being used to access data in Hadoop systems.

For example, say you use roles, groups, and pipeline permissions to ensure that only authorized operators can start pipelines. You expect that the operator user accounts are used to access all external systems. But a pipeline developer can specify an HDFS user in a Hadoop stage and bypass your attempts at security. To close this loophole, configure Data Collector to always use the currently logged in Data Collector user to read from or write to Hadoop systems.

To always use the user who starts the pipeline, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration. Uncomment the stage.conf_hadoop.always.impersonate.current.user property and set it to true.

With this property enabled, Data Collector prevents configuring an alternate user in the following Hadoop-related stages:
  • Hadoop FS Standalone origin and Hadoop FS destination
  • MapR FS Standalone origin and MapR FS destination
  • HBase lookup and destination
  • MapR DB destination
  • HDFS File Metadata executor
  • MapR FS File Metadata executor
  • MapReduce executor

Lowercasing User Names

When Data Collector impersonates Hadoop users to perform tasks in Hadoop systems, you can also configure Data Collector to lowercase all user names before passing them to Hadoop.

When the Hadoop system is case sensitive and the user names are lowercase, you might use this property to lowercase mixed-case user names that might be returned, for example, from a case-insensitive LDAP system.

To lowercase user names before passing them to Hadoop, uncomment the stage.conf_hadoop.always.lowercase.user property and set it to true.
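A deployment that always impersonates the user who starts the pipeline and also lowercases user names would have both properties uncommented, roughly as follows. This is a sketch of the two settings described above, not a complete configuration.

```properties
# Always use the user who starts the pipeline for Hadoop-related stages.
stage.conf_hadoop.always.impersonate.current.user=true
# Lowercase user names before passing them to Hadoop.
stage.conf_hadoop.always.lowercase.user=true
```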

Using a Partial Control Hub User ID

You can configure Data Collector to use an abbreviated version of the Control Hub user ID to impersonate a Hadoop user.

By default, when using Hadoop impersonation mode, Data Collector uses the full Control Hub user ID as the Hadoop user name, as follows:
<ID>@<organization ID>

You can configure Data Collector to use only the ID, ignoring "@<organization ID>". For example, Data Collector then uses myname instead of myname@org as the user name.

You might need to use a partial Control Hub user ID when the Hadoop system uses Kerberos, LDAP, or other user authentication methods with user name formats that conflict with the Control Hub format.

To enable using a partial Control Hub user ID, uncomment the dpm.alias.name.enabled property in the Data Collector configuration properties.

Working with HDFS Encryption Zones

Hadoop systems use the Hadoop Key Management Server (KMS) to obtain encryption keys. Data Collector requires a truststore file to verify the identity of the KMS.

To enable access to HDFS encryption zones while using proxy users, configure KMS to allow the same user impersonation as you have configured for HDFS.

To create a truststore file, follow the same steps as documented for the Syslog destination. See Enabling SSL.

To allow Data Collector as a proxy user, add the following properties to the KMS configuration file and configure the values for the properties:
  • hadoop.kms.proxyuser.sdc.groups
  • hadoop.kms.proxyuser.sdc.hosts

For example, the following properties allow users in the Ops group access to the encryption zones:

<property>
<name>hadoop.kms.proxyuser.sdc.groups</name>
<value>Ops</value>
</property>
<property>
<name>hadoop.kms.proxyuser.sdc.hosts</name>
<value>*</value>
</property>

Note that the asterisk (*) indicates no restrictions.

For more information about configuring KMS proxy users, see the KMS documentation for the Hadoop distribution that you are using. For example, for Apache Hadoop, see the Apache Hadoop documentation.

Blacklist and Whitelist for Stage Libraries

By default, almost all installed stage libraries are available for use in Data Collector. You can use blacklist and whitelist properties to limit the stage libraries that can be used.

To limit the stage libraries created by StreamSets, use one of the following properties:
  • system.stagelibs.whitelist
  • system.stagelibs.blacklist

To limit stage libraries created by other parties, use one of the following properties:
  • user.stagelibs.whitelist
  • user.stagelibs.blacklist

Warning: Use only the whitelist or blacklist for each set of libraries. Using both can cause Data Collector to fail to start.

The MapR stage libraries are blacklisted by default. To use one of the MapR stage libraries, run the MapR setup script as described in MapR Prerequisites.
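As a rough sketch, limiting the StreamSets stage libraries to a short whitelist might look like the following. The value format (a comma-separated list of stage library names) and the library names shown are assumptions for illustration; check the names of the stage libraries installed in your environment.

```properties
# Assumed format: comma-separated stage library names (hypothetical selection).
system.stagelibs.whitelist=streamsets-datacollector-basic-lib,streamsets-datacollector-jdbc-lib
```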

Advanced Thread Pool Properties

The Data Collector configuration properties include a runner.thread.pool.size property described in Running Multiple Concurrent Pipelines.

The existing Data Collector configuration properties cover what most users need. When necessary, you can also add and configure advanced thread pool properties.

You can add the following advanced thread pool properties to the Data Collector configuration properties. Note that the default values used by Data Collector are sufficient in most cases.
  • runner_stop.thread.pool.size - Thread pool size used to force stop pipelines. Default is the value set for the runner.thread.pool.size property.
  • event.executor.thread.pool.size - Thread pool size used to react to pipeline events. Default is the value set for the runner.thread.pool.size property.
  • manager.executor.thread.pool.size - Thread pool size used to manage background processes. Default is 4.
  • bundle.executor.thread.pool.size - Thread pool size used to create support bundles. Default is 1.
  • previewer.thread.pool.size - Thread pool size used for data preview. You might increase this setting when previewing multiple pipelines at the same time. Default is 4.

To configure an advanced thread pool property, add the property to the Data Collector configuration properties.
  1. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration.
  2. Add the advanced thread pool properties that you want to configure, then define values for each property.
  3. Save the changes to the deployment and restart all engine instances.
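For example, a deployment where several users preview pipelines at the same time might add lines like the following in step 2. The values are illustrative only and are not recommendations.

```properties
# Illustrative values only; the defaults are 4 for both properties.
previewer.thread.pool.size=8
manager.executor.thread.pool.size=6
```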

Configuring Data Collector

You can customize Data Collector by configuring the deployment. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration.

Important: Instead of entering sensitive data such as passwords in clear text in the configuration properties, you can protect the sensitive data by storing the data in an external location and then using functions to retrieve the data.

Data Collector includes the following general configuration properties:

  • sdc.base.http.url - Data Collector URL that is included in emails sent for metric and data alerts.

    Default is http://<hostname>:<http.port>, where <hostname> is the value defined in the http.bindHost property. If the host name is not defined in http.bindHost, Data Collector runs the following command to determine the host name: hostname -f

    Be sure to uncomment the property if you change the value.

  • http.bindHost - Host name or IP address that Data Collector binds to. You might want to configure a specific host or IP address when the machine that Data Collector is installed on has multiple network cards.

    Default is 0.0.0.0, which means that Data Collector can bind to any host or IP address. Be sure to uncomment the property if you change the value.

  • http.maxThreads - Maximum number of concurrent threads the Data Collector web server uses to serve UI requests.

    Default is 200. Uncomment the property to change the value, but increasing this value is not recommended.

  • http.port - Port number to use for Data Collector. Default is 18630.

  • https.port - Secure port number for Data Collector. For example, 18636. Any number besides -1 enables the secure port number.

    If you use both port properties, the HTTP port bounces to the HTTPS port. Default is -1.

    For more information, see Enabling HTTPS.

  • http2.enable - Enables support of the HTTP/2 protocol for the API. To enable HTTP/2, set this property to true and configure the https.port property.

    Do not use with clients that do not support application layer protocol negotiation (ALPN).

    Default is false.

  • http.enable.forwarded.requests - Enables handling X-Forwarded-For, X-Forwarded-Proto, and X-Forwarded-Port HTTP request headers issued by a reverse proxy such as HAProxy, ELB, or NGINX.

    Set to true when hosting Data Collector behind a reverse proxy or load balancer.

    Default is false.

  • https.keystore.path - Keystore path and file name used by Data Collector. Enter an absolute path or a path relative to the Data Collector resources directory.

    Note: Default is keystore.jks in the Data Collector configuration directory, which provides a self-signed certificate that you can use. However, StreamSets strongly recommends that you generate a certificate signed by a trusted CA, as described in Enabling HTTPS.

  • https.keystore.password - Password to the Data Collector keystore file. To protect the password, store the password in an external location and then use a function to retrieve the password.

    Default uses the file function to retrieve the password from keystore-password.txt in the Data Collector configuration directory.

  • https.require.hsts - Requires Data Collector to include the HTTP Strict Transport Security (HSTS) response header.

    Set to true when Data Collector uses HTTPS to enable HSTS.

    Default is false.

  • http.session.max.inactive.interval - Maximum amount of time that Data Collector can remain inactive before the user is logged out. Use -1 to allow user sessions to remain inactive indefinitely.

    Default is 86,400 seconds (24 hours).

  • http.authentication - HTTP authentication. Use none, basic, digest, or form.

    The HTTP authentication type determines how passwords are transferred from the browser to Data Collector over HTTP. Digest authentication encrypts the passwords. Basic and form authentication do not encrypt the passwords.

    When using basic, digest, or form with file-based authentication, use the associated realm.properties file to define user accounts. The realm.properties files are located in the Data Collector configuration directory.

    Default is form for Data Collector installations downloaded from the Customer Support portal.

  • http.authentication.login.module - Indicates where user account information resides:
    • Set to file to use the realm.properties files.
    • Set to ldap to use an LDAP server.

    Default is file.

  • http.digest.realm - Realm used for HTTP authentication. Use basic-realm, digest-realm, or form-realm. The associated realm.properties file must be located in the Data Collector configuration directory.

    Default is <http.authentication>-realm. Be sure to uncomment the property if you change the value.

  • http.realm.file.permission.check - Checks the permissions for the realm.properties file in use:
    • Set to true to ensure that the file allows access only to the owner.
    • Set to false to skip the permission check.

    Relevant when http.authentication.login.module is set to file.

  • http.authentication.ldap.role.mapping - Maps groups defined by the LDAP server to Data Collector roles. Enter a semicolon-separated list as follows:

    <ldap group>:<SDC role>,<additional SDC role>...;
    <ldap group>:<SDC role>,<additional SDC role>...

    Relevant when http.authentication.login.module is set to ldap.

  • ldap.login.module.name - Name of the JAAS configuration properties in the ldap-login.conf file located in the Data Collector configuration directory.

    Default is ldap.

  • http.access.control.allow.origin - List of domains allowed to access the Data Collector REST API for cross-origin resource sharing (CORS). To restrict access to specific domains, enter a comma-separated list as follows:

    http://www.mysite.com, http://www.myothersite.com

    Default is the asterisk wildcard (*), which means that any domain can access the Data Collector REST API.

  • http.access.control.allow.headers - List of HTTP headers allowed during a cross-domain request.

  • http.access.control.exposed.headers - List of HTTP headers exposed as part of the cross-domain response.

  • http.access.control.allow.methods - List of HTTP methods that can be called during a cross-domain request.
  • kerberos.client.enabled - Enables Kerberos authentication for Data Collector. Must be enabled to allow non-Kafka stages to use Kerberos to access external systems.

    For more information, see Kerberos Authentication.

  • kerberos.client.principal - Kerberos principal to use. Enter a service principal.

  • kerberos.client.keytab - Location of the Kerberos keytab file that contains the credentials for the Kerberos principal.

    Use a fully-qualified directory or a directory relative to the Data Collector configuration directory.

  • preview.maxBatchSize - Maximum number of records used to preview data. Default is 10.

  • preview.maxBatches - Maximum number of batches used to preview data. Default is 10.

  • production.maxBatchSize - Maximum number of records included in a batch when the pipeline runs. Default is 50000.

  • parser.limit - Maximum parser buffer size that origins can use to process data. Limits the size of the data that can be parsed and converted to a record.

    By default, the parser buffer size is 1048576 bytes. To increase the size, uncomment and configure this property. For more information about how this property affects record sizes, see Maximum Record Size.

  • production.maxErrorRecordsPerStage - Maximum number of error records to save in memory for each stage to display in Monitor mode. When the limit is reached, older error records are discarded. Default is 100.

  • production.maxPipelineErrors - Maximum number of pipeline errors to save in memory to display in Monitor mode. When the limit is reached, older errors are discarded. Default is 100.

  • max.logtail.concurrent.requests - Maximum number of external processes allowed to access the Data Collector log file at the same time through REST API calls. Default is 5.

  • max.webSockets.concurrent.requests - Maximum number of WebSocket calls allowed.
  • pipeline.access.control.enabled - Enables pipeline permissions and sharing pipelines. With pipeline permissions enabled, a user must have the appropriate permissions to view or work with a pipeline. Only Admin users and pipeline owners have full access to pipelines.

    When pipeline permissions are disabled, access to pipelines is based on the roles assigned to the user and its groups. For more information about pipeline permissions, see Pipeline Permissions.

    Default is false.

  • ui.header.title - Optional custom header to display in Data Collector next to the StreamSets logo. You can create a header using HTML and include an additional image.

    To use an image, place the file in a directory local to the following directory: $SDC_DIST/sdc-static-web/

    For example, to add custom text, you might use the following HTML:

    <span class="navbar-brand">Dev Data Collector</span>

    Or to use an image in the $SDC_DIST/sdc-static-web/ directory, you can use the following HTML:

    <img src="<filename>.<extension>">

    We recommend using an image no more than 48 pixels high.

  • ui.local.help.base.url - Base URL for the online help installed with Data Collector. Do not change this value.

  • ui.hosted.help.base.url - Base URL for the online help hosted on the StreamSets website. Do not change this value.

  • ui.registration.url - URL used to register Data Collector with StreamSets. Do not change this value.

  • ui.refresh.interval.ms - Interval in milliseconds that Data Collector waits before refreshing the UI. Default is 2000.

  • ui.jvmMetrics.refresh.interval.ms - Interval in milliseconds at which the Data Collector metrics are refreshed. Default is 4000.

  • ui.enable.webSocket - Enables Data Collector to use WebSocket to gather pipeline information.

  • ui.undo.limit - Number of recent actions stored so you can undo them.

  • ui.default.configuration.view - Displays basic properties for pipelines and pipeline stages by default. Users can choose to show the advanced options when configuring a pipeline or stage.

    Uncomment the property and set it to ADVANCED to display advanced options for all new pipelines and new stages added to existing pipelines.
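To show how several of the general properties fit together, the following sketch enables the secure port with the default keystore and file-based form authentication. The values are the example and default values quoted in the descriptions above, not a recommended production setup.

```properties
# Sketch only: enables HTTPS on the example port with the default keystore.
https.port=18636
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
# File-based form authentication, as in the documented defaults.
http.authentication=form
http.authentication.login.module=file
```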

The Data Collector configuration properties include the following properties for sending email:

  • mail.transport.protocol - Use smtp or smtps. Default is smtp.
  • mail.smtp.host - SMTP host name. Default is localhost.
  • mail.smtp.port - SMTP port number. Default is 25.
  • mail.smtp.auth - Whether the SMTP host uses authentication. Use true or false. Default is false.
  • mail.smtp.starttls.enable - Whether the SMTP host uses STARTTLS encryption. Use true or false. Default is false.
  • mail.smtps.host - SMTPS host name. Default is localhost.
  • mail.smtps.port - SMTPS port number. Default is 465.
  • mail.smtps.auth - Whether the SMTPS host uses authentication. Use true or false. Default is false.
  • xmail.username - User name for the email account to send email.
  • xmail.password - Password for the email account. To protect the password, store the password in an external location and then use a function to retrieve the password.

    Default uses the file function to retrieve the password from email-password.txt in the Data Collector configuration directory, <installation_dir>/etc.

  • xmail.from.address - Email address to use to send email.

The Data Collector configuration properties include the following advanced property:

  • runtime.conf.location - Location of runtime properties. Use to declare where runtime properties are defined:
    • embedded - Runtime properties are defined in the Data Collector configuration properties.
    • <file path> - Absolute directory and file name where runtime properties are defined. For example: /sdc/streamsets-datacollector-5.8.0/externalResources/resources/test-runtime.properties

      The runtime properties file must be added as an external resource for the deployment, as described in the Control Hub documentation.

The Data Collector configuration properties include properties with a java.security. prefix, which you can use to configure Java security properties. Any Java security properties that you modify in the configuration properties change the JVM configuration. Do not modify the Java security properties when running multiple Data Collector instances within the same JVM.

The Data Collector configuration properties include the following Java security property:

  • java.security.networkaddress.cache.ttl - Number of seconds to cache Domain Name Service (DNS) lookups.

    Note: This property has been deprecated and may be removed in a future release.

    Default is 0, which configures the JVM to use the DNS time to live value. For more information, see the networkaddress.cache.ttl property in the Oracle documentation.

The Data Collector configuration properties include Security Manager properties that allow you to enable the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in Data Collector configuration, data, and resource directories.

By default, Data Collector uses the Java Security Manager, which allows stages to access files in all Data Collector directories.

The Data Collector configuration properties include the following Security Manager properties:

  • security_manager.sdc_manager.enable - Enables the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in protected Data Collector directories.

    Uncomment the property to enable.

  • security_manager.sdc_dirs.exceptions - Files in protected directories that can be accessed by all stage libraries when the Data Collector Security Manager is enabled.

    Generally, you should not need to change this property.

  • security_manager.sdc_dirs.exceptions.<stage_library_name> - Files in protected directories that can be accessed by the specified stage library when the Data Collector Security Manager is enabled.

    Generally, you should not need to change this property.

The Data Collector configuration properties include the following stage-specific properties:
Stage-Specific Properties Description
stage.conf_hadoop.always.impersonate.current.user Ensures that Hadoop-related stages use the currently logged in Data Collector user to perform tasks, such as writing data, in Hadoop systems. With this property enabled, Data Collector prevents configuring an alternate user in Hadoop-related stages.

To use this property, uncomment the property and set it to true.

For more information and a list of affected stages, see Hadoop Impersonation Mode.

stage.conf_hadoop.always.lowercase.user Converts the user name to lowercase before passing it to Hadoop.

Use to lowercase user names from case-insensitive systems, such as a case-insensitive LDAP installation, before passing the user names to Hadoop systems.

To use this property, uncomment the property and set it to true.

stage.conf_com.streamsets.pipeline.stage.hive.impersonate.current.user Enables the Hive Metadata processor, the Hive Metastore destination, and the Hive Query executor to impersonate the current user when connecting to Hive.

Default is false.

Set to true to automatically impersonate the current user, without specifying a proxy user in the JDBC URL.

stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load Lists JDBC drivers that Data Collector automatically loads for all pipelines.

To use this property, uncomment the property and set it to a comma-separated list of JDBC drivers.

stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Enables Data Collector to attempt to disable SSL for all JDBC connections.

Many newer JDBC systems enable SSL by default. When you have JDBC pipelines that do not use SSL, you can use this property to handle JDBC systems with SSL enabled. However, some JDBC vendors do not allow disabling SSL.

To use this property, uncomment the property and set it to true.

stage.conf_kafka.keytab.location Storage location for Kerberos keytabs that are specified in Kafka stages. Keytabs are stored only for the duration of the pipeline run.

Generally, you should not need to change this property.

stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.addrecordstoqueue Enables the Oracle CDC Client origin to reduce memory usage when the origin is configured to buffer data locally, in memory.

This property is enabled by default.

Do not disable this property unless recommended by the StreamSets support team.

stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Enables Data Collector to report memory consumption when the Oracle CDC Client origin uses local buffers. Reporting reduces pipeline performance, so enable the property only as a temporary troubleshooting measure.

This property is disabled by default.

stage.conf_com.streamsets.pipeline.stage.executor.shell.shell Defines the relative or absolute path to the command line interpreter to use to execute scripts, such as /bin/bash.

Default is sh.

Used by Shell executors.

stage.conf_com.streamsets.pipeline.stage.executor.shell.sudo Defines the relative or absolute path to the sudo command to use when executing scripts.

Default is sudo.

Used by Shell executors.

stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode

Uses the Data Collector user who starts the pipeline to execute shell scripts defined in Shell executors. When not enabled, the operating system user who started Data Collector is used to execute shell scripts.

To enable the secure use of shell scripts through the Shell executor, we highly recommend uncommenting this property.

Requires the user who starts the pipeline to have a matching user account in the operating system. For more information about the security ramifications, see Data Collector Shell Impersonation Mode.

Used by Shell executors.

The Data Collector configuration properties include the following observer properties, used to process data rules and alerts:
Observer Properties Description
observer.queue.size Maximum queue size for data rule evaluation requests. Each data rule generates an evaluation request for every batch that passes through the stream. When the number of requests outstrips the queue size, requests are dropped.

Default is 100.

observer.sampled.records.cache.size Maximum number of records to be cached for display for each rule. The exact number of records is specified in the data rule.

Default is 100. You can reduce this number as needed.

observer.queue.offer.max.wait.time.ms Maximum number of milliseconds to wait before dropping a data rule evaluation request when the observer queue is full.

The Data Collector configuration properties include the following miscellaneous properties:

Miscellaneous Property Description
max.stage.private.classloaders Maximum number of stage libraries Data Collector allows.

Default is 50.

runner.thread.pool.size Pre-multiplier size of the thread pool. One running pipeline requires five threads, and pipelines share threads in the pool. To calculate the approximate runner thread pool size, multiply the number of running pipelines by 2.2.

Increasing this value does not increase the parallelization of an individual pipeline.

Default is 50, which is sufficient to run approximately 22 standalone pipelines at the same time.

For information about advanced thread pool properties, see Advanced Thread Pool Properties.

runner.boot.pipeline.restart Automatically restarts all running pipelines on a Data Collector restart.

To disable the automatic restart of pipelines, uncomment this property. Disable only for troubleshooting or in a development environment.

pipeline.max.runners.count Maximum number of pipeline runners to use for a multithreaded pipeline.

Default is 50.

package.manager.repository.links Enables specifying alternate locations for the Package Manager repositories. Use this property to install non-StreamSets stage libraries or to install stage libraries from local or alternate repositories.

To use alternate Package Manager repositories, uncomment the property and specify a comma-separated list of URLs.

bundle.upload.enabled Enables uploading manually-generated support bundles to the StreamSets Support team.

When disabled, you can still generate, download, and email support bundles.

To disable uploads of manually-generated bundles, uncomment this property.

bundle.upload.on_error Enables the automatic generation and upload of support bundles to the StreamSets Support team when pipelines transition to an error state.

Use of this property is not recommended.
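
The sizing guidance for runner.thread.pool.size in the table above can be checked with a quick calculation. The following is a minimal sketch that assumes the 2.2 multiplier applies uniformly to every running pipeline. The function names are hypothetical and are not part of Data Collector.

```python
# Multiplier from the runner.thread.pool.size description: 2.2 threads per
# running pipeline, expressed as the integer ratio 22/10 to avoid
# floating-point rounding surprises.
MULTIPLIER_NUM = 22
MULTIPLIER_DEN = 10


def approx_runner_pool_size(running_pipelines: int) -> int:
    """Approximate pool size for a number of running pipelines, rounded up.

    Hypothetical helper for illustration only.
    """
    if running_pipelines < 0:
        raise ValueError("running_pipelines must be non-negative")
    # Integer ceiling of running_pipelines * 2.2
    return (running_pipelines * MULTIPLIER_NUM + MULTIPLIER_DEN - 1) // MULTIPLIER_DEN


def max_pipelines_for_pool(pool_size: int) -> int:
    """Approximate number of pipelines that a given pool size can support."""
    if pool_size < 0:
        raise ValueError("pool_size must be non-negative")
    # Integer floor of pool_size / 2.2
    return pool_size * MULTIPLIER_DEN // MULTIPLIER_NUM


print(approx_runner_pool_size(22))  # threads needed for 22 pipelines
print(max_pipelines_for_pool(50))   # pipelines supported by the default pool
```

With the default pool size of 50, the sketch yields 22 pipelines, which matches the figure stated for the default.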

The configuration properties include stage and stage library aliases to enable backward compatibility for pipelines created with earlier versions of Data Collector, such as:
stage.alias.streamsets-datacollector-basic-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget=streamsets-datacollector-jdbc-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget

library.alias.streamsets-datacollector-apache-kafka_0_8_1_1-lib=streamsets-datacollector-apache-kafka_0_8_1-lib
Generally, you should not need to change or remove these aliases.
You can optionally add stage libraries to the blacklist or whitelist properties to limit the stage libraries Data Collector uses and include additional configuration properties:
Blacklist / Whitelist Property Description
system.stagelibs.whitelist

system.stagelibs.blacklist

Use one list to limit the StreamSets stage libraries that can be used in Data Collector. Do not use both.
user.stagelibs.whitelist

user.stagelibs.blacklist

Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both.
The Data Collector configuration properties include the following classpath validation properties:
Classpath Validation Property Description
stagelibs.classpath.validation.enable Allows you to disable classpath validation when necessary.

By default, Data Collector performs classpath validation each time it starts. It writes the results to the Data Collector log.

Though generally unnecessary, you can disable classpath validation by uncommenting this property and setting it to false.

stagelibs.classpath.validation.terminate Prevents Data Collector from starting when it discovers an invalid classpath.

To enable this behavior, uncomment this property and set it to true.

The Data Collector configuration properties include the following Health Inspector property:
Health Inspector Property Description
health_inspector.network.host Host name that the Data Collector Health Inspector uses for the ping and traceroute commands.

The Data Collector configuration properties include the following property, which specifies additional configuration properties to include in the Data Collector configuration:

Additional Files Property Description
config.includes Additional configuration properties to include in the Data Collector configuration.

You can enter multiple file names separated by commas. The files are loaded into the Data Collector configuration in the listed order. If the same configuration property is defined in multiple files, the value defined in the last loaded file takes precedence.

By default, credential store, Java, log4j, and security policy properties are included in the Data Collector advanced configuration properties.
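
The precedence rule for config.includes can be pictured as merging the files one after another. The following is a minimal sketch of that rule, not the Data Collector loader itself. The function name and the contents of the two files are hypothetical, and the value 200 is an illustrative override.

```python
def merge_included_properties(files_in_order):
    """Merge property mappings in the listed order.

    Each item stands for the properties of one included file. A property
    defined in a later file replaces the value from an earlier file.
    Hypothetical helper for illustration only.
    """
    merged = {}
    for props in files_in_order:
        merged.update(props)
    return merged


# Hypothetical contents of two included files, in the order they are listed.
first_file = {"observer.queue.size": "100", "pipeline.max.runners.count": "50"}
second_file = {"observer.queue.size": "200"}

effective = merge_included_properties([first_file, second_file])
print(effective["observer.queue.size"])         # taken from the last loaded file
print(effective["pipeline.max.runners.count"])  # defined only in the first file
```

Because the order decides the result, list the file with the values you want to win last.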

The Data Collector configuration properties include record sampling properties that indicate the size of the sample set chosen from a total population of records. Data Collector uses the sampling properties when you run a pipeline that writes to a destination system using the SDC Record data format and then run another pipeline that reads from that same system using the SDC Record data format. Data Collector uses record sampling to calculate the time that a record stays in the intermediate destination.

By default, Data Collector uses 1 out of 10,000 records for sampling. If you modify the sampling size, simplify the fraction for better performance. For example, configure the sampling size as 1/40 records instead of 250/10000 records. The following properties specify the sampling size:

Record Sampling Property Description
sdc.record.sampling.sample.size Size of the sample set.

Default is 1.

sdc.record.sampling.population.size Size of the total population of records from which the sample set is chosen.

Default is 10,000.
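
To simplify a sampling fraction before setting the two properties, you can reduce it to lowest terms. The following is a minimal sketch that uses the Python standard library. The function name is hypothetical and is not part of Data Collector.

```python
from fractions import Fraction


def simplified_sampling(sample_size: int, population_size: int):
    """Reduce a sampling fraction to lowest terms.

    Returns a (sample_size, population_size) pair suitable for
    sdc.record.sampling.sample.size and sdc.record.sampling.population.size.
    Hypothetical helper for illustration only.
    """
    if sample_size <= 0 or population_size <= 0:
        raise ValueError("sizes must be positive")
    ratio = Fraction(sample_size, population_size)
    return ratio.numerator, ratio.denominator


sample, population = simplified_sampling(250, 10000)
print(f"sdc.record.sampling.sample.size={sample}")
print(f"sdc.record.sampling.population.size={population}")
```

For the 250/10000 example, the sketch produces a sample size of 1 and a population size of 40.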

The Data Collector configuration properties include properties that define how Data Collector caches pipeline states. Data Collector can cache the state of pipelines for faster retrieval of those states on the Home page. If Data Collector does not cache pipeline states, it must retrieve pipeline states from the pipeline data files stored in the $SDC_DATA directory. You can configure the following properties that specify how Data Collector caches pipeline states:

Pipeline State Cache Property Description
store.pipeline.state.cache.maximum.size Maximum number of pipeline states that Data Collector caches. When the maximum number is reached, Data Collector evicts the oldest states from the cache.

Default is 100.

store.pipeline.state.cache.expire.after.access Amount of time in minutes that a pipeline state can remain in the cache after the entry's creation, the most recent replacement of its value, or its last access.

Default is 10 minutes.

The Data Collector configuration properties include the following properties that define how Data Collector works with Control Hub:
General Property Description
dpm.enabled Specifies whether the Data Collector is enabled to work with Control Hub.

Default is false.

dpm.base.url URL to access Control Hub.

dpm.registration.retry.attempts Maximum number of times that Data Collector attempts to register with Control Hub before failing the registration.

Default is 5.

dpm.security.validationTokenFrequency.secs Frequency in seconds that Data Collector validates authentication and user tokens with Control Hub.

Default is 60.

dpm.appAuthToken File located within the Data Collector configuration directory that includes the authentication token for this Data Collector instance.

Generally, you should not need to change this value.

dpm.remote.control.job.labels Labels to assign to this Data Collector. Use labels to group Data Collectors registered with Control Hub. To assign multiple labels, enter a comma-separated list of labels.

Default is "all", which you can use to run a job on all registered Data Collectors.

dpm.remote.control.ping.frequency Frequency in milliseconds that Data Collector notifies Control Hub that it is running.

Default is 5,000.

dpm.remote.control.events.recipient Name of the internal Control Hub application to which Data Collector sends pipeline status updates.

Do not change this value.

dpm.remote.control.process.events.recipients Names of the internal Control Hub applications to which Data Collector sends performance updates - including CPU load and memory usage.

Do not change this value.

dpm.remote.control.status.events.interval Frequency in milliseconds that Data Collector informs Control Hub of the following information:
  • Status of all local and published pipelines that are running on this Data Collector.
  • Performance information for this Data Collector - including CPU load and memory usage.

Default is 60,000.

dpm.remote.deployment.id For provisioned Data Collectors, the ID of the deployment that provisioned the Data Collector.

For manually administered Data Collectors, the value is blank.

Do not change this value.

http.meta.redirect.to.sso Enables the redirect of Data Collector user logins to Control Hub using the HTML meta refresh method. Set to true only if the registered Data Collector is installed as an application on Microsoft Azure HDInsight.

Default is false, which means that Data Collector uses HTTP redirect headers to redirect logins. Use the default for all other Data Collector installation types.

dpm.alias.name.enabled

Enables using an abbreviated Control Hub user ID when Hadoop impersonation mode or shell impersonation mode is used.

By default, when using Hadoop impersonation mode or shell impersonation mode, a Data Collector registered with Control Hub uses the full Control Hub user ID as the user name, as follows:
<ID>@<organization ID>

Enable this property to use only the ID, ignoring "@<organization ID>". For example, using myname instead of myname@org as the user name.

To use a partial Control Hub user ID, uncomment the property and set it to true.

When using Hadoop impersonation mode, the Hadoop system, Data Collector, and the pipeline stages must be properly configured. For more information, see Hadoop Impersonation Mode.

When using shell impersonation mode, Data Collector and the operating system that runs the shell script must be properly configured. For more information, see Data Collector Shell Impersonation Mode.

dpm.runHistory.enabled Enables storing information about previous pipeline runs in the data/runHistory folder in the engine installation directory.

Default is true. Generally, you should not need to change this value.
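
The effect of dpm.alias.name.enabled on the user name used for impersonation, described earlier in this table, can be illustrated with a small sketch. This is an assumption-based illustration of the documented behavior, not Data Collector code. The function name is hypothetical.

```python
def impersonation_user_name(control_hub_user_id: str, alias_enabled: bool) -> str:
    """Return the user name used for Hadoop or shell impersonation.

    control_hub_user_id has the form <ID>@<organization ID>. When the alias
    option is enabled, only the <ID> part is used.
    Hypothetical helper for illustration only.
    """
    if alias_enabled:
        # Keep everything before the first "@"
        return control_hub_user_id.split("@", 1)[0]
    return control_hub_user_id


print(impersonation_user_name("myname@org", alias_enabled=False))
print(impersonation_user_name("myname@org", alias_enabled=True))
```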