Configuring Data Collector

You can customize Data Collector by editing the Data Collector configuration file, `$SDC_CONF/sdc.properties`, in a text editor. To enable the changes, restart Data Collector. When Data Collector runs as part of a Control Hub deployment, customize it by configuring the deployment instead: in Control Hub, edit the deployment, click Advanced Configuration in the Configure Engine section, and then click Data Collector Configuration.

Important: Instead of entering sensitive data such as passwords in clear text in the configuration properties, you can protect the sensitive data by storing it in an external location and then using functions to retrieve it.
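
As a minimal sketch of the approach described above, the entry below reads a password from a file rather than storing it in clear text. It assumes the `${file("...")}` call syntax for the `file` function and a file named keystore-password.txt in the configuration directory; check the exact function syntax against your Data Collector version.

```properties
# Retrieve the keystore password from a file in the configuration directory
# instead of typing it in clear text. The ${file("...")} syntax is an assumption.
https.keystore.password=${file("keystore-password.txt")}
```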

Data Collector includes the following general configuration properties:


General Property	Description
sdc.base.http.url	Data Collector URL that is included in emails sent for metric and data alerts. Default is `http://<hostname>:<http.port>` where `<hostname>` is the value defined in the http.bindHost property. If the host name is not defined in http.bindHost, Data Collector runs the following command to determine the host name: `hostname -f`. Be sure to uncomment the property if you change the value.
http.bindHost	Host name or IP address that Data Collector binds to. You might want to configure a specific host or IP address when the machine that Data Collector is installed on has multiple network cards. Default is 0.0.0.0, which means that Data Collector can bind to any host or IP address. Be sure to uncomment the property if you change the value.
http.maxThreads	Maximum number of concurrent threads the Data Collector web server uses to serve UI requests. Default is 200. Uncomment the property to change the value, but increasing this value is not recommended.
http.port	Port number to use for Data Collector. Default is 18630.
https.port	Secure port number for Data Collector. For example, 18636. Any number besides -1 enables the secure port number. If you use both port properties, the HTTP port bounces to the HTTPS port. Default is -1. For more information, see Enabling HTTPS.
http2.enable	Enables support of the HTTP/2 protocol for the API. To enable HTTP/2, set this property to `true` and configure the https.port property, above. Do not use with clients that do not support application layer protocol negotiation (ALPN). Default is `false`.
http.enable.forwarded.requests	Enables handling X-Forwarded-For, X-Forwarded-Proto, X-Forwarded-Port HTTP request headers issued by a reverse proxy such as HAProxy, ELB, or NGINX. Set to `true` when hosting Data Collector behind a reverse proxy or load balancer. Default is `false`.
https.keystore.path	Keystore path and file name used by Data Collector. Enter an absolute path or a path relative to the Data Collector resources directory, `$SDC_RESOURCES`. Note: Default is `keystore.jks` in the Data Collector configuration directory, `$SDC_CONF`, which provides a self-signed certificate that you can use. However, it is best practice to generate a certificate signed by a trusted CA, as described in Enabling HTTPS.
https.keystore.password	Password to the Data Collector keystore file. To protect the password, store it in an external location and then use a function to retrieve it. Default uses the `file` function to retrieve the password from keystore-password.txt in the Data Collector configuration directory, `$SDC_CONF`.
https.require.hsts	Requires Data Collector to include the HTTP Strict Transport Security (HSTS) response header. Set to `true` when Data Collector uses HTTPS to enable HSTS. Default is `false`.
http.session.max.inactive.interval	Maximum amount of time that Data Collector can remain inactive before the user is logged out. Use -1 to allow user sessions to remain inactive indefinitely. Default is 86,400 seconds (24 hours).
http.authentication	HTTP authentication. Use `none`, `basic`, `digest`, or `form`. The HTTP authentication type determines how passwords are transferred from the browser to Data Collector over HTTP. Digest authentication encrypts the passwords. Basic and form authentication do not encrypt the passwords. When using `basic`, `digest`, or `form` with file-based authentication, use the associated realm.properties file to define user accounts. The realm.properties files are located in the Data Collector configuration directory, `$SDC_CONF`. Default is `form` for Data Collector installations downloaded from the Customer Support portal.
http.authentication.login.module	Indicates where user account information resides: Set to `file` to use the realm.properties files. Set to `ldap` to use an LDAP server. Default is `file`.
http.digest.realm	Realm used for HTTP authentication. Use basic-realm, digest-realm, or form-realm. The associated realm.properties file must be located in the Data Collector configuration directory, `$SDC_CONF`. Default is `<http.authentication>-realm`. Be sure to uncomment the property if you change the value.
http.realm.file.permission.check	Checks the permissions for the realm.properties file in use: Set to `true` to ensure that the file allows access only to the owner. Set to `false` to skip the permission check. Relevant when http.authentication.login.module is set to `file`.
http.authentication.ldap.role.mapping	Maps groups defined by the LDAP server to Data Collector roles. Enter a semicolon-separated list as follows: `<ldap group>:<SDC role>,<additional SDC role>...; <ldap group>:<SDC role>,<additional SDC role>...` Relevant when http.authentication.login.module is set to `ldap`.
ldap.login.module.name	Name of the JAAS configuration properties in the ldap-login.conf file located in the Data Collector configuration directory, `$SDC_CONF`. Default is `ldap`.
http.access.control.allow.origin	List of domains allowed to access the Data Collector REST API for cross-origin resource sharing (CORS). To restrict access to specific domains, enter a comma-separated list as follows: `http://www.mysite.com, http://www.myothersite.com` Default is the asterisk wildcard (*) which means that any domain can access the Data Collector REST API.
http.access.control.allow.headers	List of HTTP headers allowed during a cross-domain request.
http.access.control.exposed.headers	List of HTTP headers exposed as part of the cross-domain response.
http.access.control.allow.methods	List of HTTP methods that can be called during a cross-domain request.
kerberos.client.enabled	Enables Kerberos authentication for Data Collector. Must be enabled to allow non-Kafka stages to use Kerberos to access external systems. For more information, see Kerberos Authentication.
kerberos.client.principal	Kerberos principal to use. Enter a service principal.
kerberos.client.keytab	Location of the Kerberos keytab file that contains the credentials for the Kerberos principal. Use a fully-qualified directory or a directory relative to the Data Collector configuration directory, `$SDC_CONF`.
preview.maxBatchSize	Maximum number of records used to preview data. Default is 10.
preview.maxBatches	Maximum number of batches used to preview data. Default is 10.
production.maxBatchSize	Maximum number of records included in a batch when the pipeline runs. Default is 50000.
parser.limit	Maximum parser buffer size that origins can use to process data. Limits the size of the data that can be parsed and converted to a record. By default, the parser buffer size is 1048576 bytes. To increase the size, uncomment and configure this property. For more information about how this property affects record sizes, see Maximum Record Size.
production.maxErrorRecordsPerStage	Maximum number of error records to save in memory for each stage to display in Monitor mode. When the limit is reached, older error records are discarded. Default is 100.
production.maxPipelineErrors	Maximum number of pipeline errors to save in memory to display in Monitor mode. When the limit is reached, older errors are discarded. Default is 100.
max.logtail.concurrent.requests	Maximum number of external processes allowed to access the Data Collector log file at the same time through REST API calls. Default is 5.
max.webSockets.concurrent.requests	Maximum number of WebSocket calls allowed.
pipeline.access.control.enabled	Enables pipeline permissions and sharing pipelines. With pipeline permissions enabled, a user must have the appropriate permissions to view or work with a pipeline. Only Admin users and pipeline owners have full access to pipelines. When pipeline permissions are disabled, access to pipelines is based on the roles assigned to the user and its groups. For more information about permissions, see Pipeline Permissions. Default is `false`.
ui.header.title	Optional custom header to display in Data Collector next to the IBM StreamSets logo. You can create a header using HTML and include an additional image. To use an image, place the file in a directory local to the following directory: `$SDC_DIST/sdc-static-web/`. For example, to add custom text, you might use the following HTML: `<span class="navbar-brand">Dev Data Collector</span>` Or to use an image in the `$SDC_DIST/sdc-static-web/` directory, you can use the following HTML: `<img src="<filename>.<extension>">` We recommend using an image no more than 48 pixels high.
ui.local.help.base.url	Base URL for the online help installed with Data Collector. Do not change this value.
ui.hosted.help.base.url	Base URL for the online help. Do not change this value.
ui.registration.url	URL used to register Data Collector. Do not change this value.
ui.refresh.interval.ms	Interval in milliseconds that Data Collector waits before refreshing the UI. Default is 2000.
ui.jvmMetrics.refresh.interval.ms	Interval in milliseconds that the Data Collector metrics are refreshed. Default is 4000.
ui.enable.webSocket	Enables Data Collector to use WebSocket to gather information.
ui.undo.limit	Number of recent actions stored so you can undo them.
ui.default.configuration.view	Displays basic properties for pipelines and stages by default. Users can choose to show the advanced options when configuring a pipeline or stage. Uncomment the property and set it to `ADVANCED` to display advanced options for all new pipelines and new stages added to existing pipelines.
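
The following sketch shows how several of the general properties above might be combined to serve Data Collector over HTTPS behind a reverse proxy and to authenticate users against LDAP. The port comes from the example in the table. The LDAP group names `Engineering` and `Analysts` are hypothetical, the role names `admin`, `creator`, and `manager` are assumed to be valid Data Collector roles, and the `${file("...")}` call syntax is an assumption.

```properties
# Any value other than -1 enables the secure port. When both ports are
# set, the HTTP port bounces to the HTTPS port.
https.port=18636
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
# Send the HSTS response header now that HTTPS is in use.
https.require.hsts=true
# Handle X-Forwarded-* headers issued by a reverse proxy or load balancer.
http.enable.forwarded.requests=true

# Look up user accounts in LDAP instead of the realm.properties files.
http.authentication=form
http.authentication.login.module=ldap
# Hypothetical LDAP groups mapped to Data Collector roles, in the form
# <ldap group>:<SDC role>,<additional SDC role>; <ldap group>:<SDC role>
http.authentication.ldap.role.mapping=Engineering:admin;Analysts:creator,manager
```

Comments sit on their own lines because a properties file treats text after a value as part of the value.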

The Data Collector configuration properties include the following properties for sending email:


Email Property	Description
mail.transport.protocol	Use smtp or smtps. Default is `smtp`.
mail.smtp.host	SMTP host name. Default is `localhost`.
mail.smtp.port	SMTP port number. Default is 25.
mail.smtp.auth	Whether the SMTP host uses authentication. Use `true` or `false`. Default is `false`.
mail.smtp.starttls.enable	Whether the SMTP host uses STARTTLS encryption. Use `true` or `false`. Default is `false`.
mail.smtps.host	SMTPS host name. Default is `localhost`.
mail.smtps.port	SMTPS port number. Default is 25.
mail.smtps.auth	Whether the SMTPS host uses authentication. Use `true` or `false`. Default is `false`.
xmail.username	User name for the email account to send email.
xmail.password	Password for the email account. To protect the password, store it in an external location and then use a function to retrieve it. Default uses the `file` function to retrieve the password from email-password.txt in the Data Collector configuration directory, `$SDC_CONF`.
xmail.from.address	Email address to use to send email.
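
As an illustration of the email properties, the fragment below configures an authenticated SMTP connection with STARTTLS. The host `smtp.example.com`, the account `alerts@example.com`, and port 587 are placeholder values chosen for this sketch, and the `${file("...")}` call syntax is an assumption.

```properties
mail.transport.protocol=smtp
# Placeholder host and port; replace with the values for your mail server.
mail.smtp.host=smtp.example.com
mail.smtp.port=587
mail.smtp.auth=true
mail.smtp.starttls.enable=true
# Placeholder account used to send alert emails.
xmail.username=alerts@example.com
# Read the password from a file instead of storing it in clear text.
xmail.password=${file("email-password.txt")}
xmail.from.address=alerts@example.com
```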

The Data Collector configuration properties include the following advanced properties:


Advanced Property	Description
runtime.conf.location	Location of runtime properties. Use to declare where runtime properties are defined: `embedded` - Runtime properties are defined in the Data Collector configuration properties. `<file path>` - Directory and file name where runtime properties are defined. When you edit the configuration file directly, you can specify the file relative to the `$SDC_CONF` directory or in an absolute directory outside the `$SDC_CONF` directory. When you configure a Control Hub deployment, specify the absolute directory and file name, for example `/sdc/streamsets-datacollector-5.8.0/externalResources/resources/test-runtime.properties`. In that case, the runtime properties file must be added as an external resource for the deployment, as described in the Control Hub documentation.
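
A short sketch of the two forms that runtime.conf.location can take. The file name `runtime.properties` is hypothetical and is resolved relative to the `$SDC_CONF` directory in this example.

```properties
# Option 1: runtime properties are defined alongside the other
# Data Collector configuration properties.
runtime.conf.location=embedded

# Option 2 (use instead of option 1): runtime properties are defined in a
# separate file. The file name below is hypothetical.
#runtime.conf.location=runtime.properties
```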

The Data Collector configuration properties include properties with a java.security. prefix, which you can use to configure Java security properties. Any Java security properties that you modify in the configuration properties change the JVM configuration. Do not modify the Java security properties when running multiple Data Collector instances within the same JVM.

The Data Collector configuration properties include the following Java security property:


Java Security Property	Description
java.security.networkaddress.cache.ttl	Note: This property has been deprecated and may be removed in a future release. If needed, you can configure the `networkaddress.cache.ttl` property in the `$SDC_DIST/etc/sdc-java-security.properties` file to cache Domain Name Service (DNS) lookups. Number of seconds to cache Domain Name Service (DNS) lookups. Default is 0, which configures the JVM to use the DNS time to live value. For more information, see the networkaddress.cache.ttl property in the Oracle documentation.

The Data Collector configuration properties include Security Manager properties that allow you to enable the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in Data Collector configuration, data, and resource directories.

By default, Data Collector uses the Java Security Manager, which allows stages to access files in all Data Collector directories.

The Data Collector configuration properties include the following Security Manager properties:


Security Manager Property	Description
security_manager.sdc_manager.enable	Enables the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in protected Data Collector directories. Uncomment the property to enable.
security_manager.sdc_dirs.exceptions	Files in protected directories that can be accessed by all stage libraries when the Data Collector Security Manager is enabled. Generally, you should not need to change this property.
security_manager.sdc_dirs.exceptions.<stage_library_name>	Files in protected directories that can be accessed by the specified stage library when the Data Collector Security Manager is enabled. Generally, you should not need to change this property.

The Data Collector configuration properties include the following stage-specific properties:


Stage-Specific Properties	Description
stage.conf_hadoop.always.impersonate.current.user	Ensures that Hadoop-related stages use the currently logged in Data Collector user to perform tasks, such as writing data, in Hadoop systems. With this property enabled, Data Collector prevents configuring an alternate user in Hadoop-related stages. To use this property, uncomment the property and set it to `true`. For more information and a list of affected stages, see Hadoop Impersonation Mode.
stage.conf_hadoop.always.lowercase.user	Converts the user name to lowercase before passing it to Hadoop. Use to lowercase user names from case insensitive systems, such as a case-insensitive LDAP installation, before passing the user names to Hadoop systems. To use this property, uncomment the property and set it to `true`.
stage.conf_com.streamsets.pipeline.stage.hive.impersonate.current.user	Enables the Hive Metadata processor, the Hive Metastore destination, and the Hive Query executor to impersonate the current user when connecting to Hive. Default is `false`. Set to `true` to automatically impersonate the current user, without specifying a proxy user in the JDBC URL.
stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load	Lists JDBC drivers that Data Collector automatically loads for all pipelines. To use this property, uncomment the property and set it to a comma-separated list of JDBC drivers.
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL	Enables Data Collector to attempt to disable SSL for all JDBC connections. Many newer JDBC systems enable SSL by default. When you have JDBC pipelines that do not use SSL, you can use this property to handle JDBC systems with SSL enabled. However, some JDBC vendors do not allow disabling SSL. To use this property, uncomment the property and set it to `true`.
stage.conf_kafka.keytab.location	Storage location for Kerberos keytabs that are specified in Kafka stages. Keytabs are stored only for the duration of the pipeline run. Generally, you should not need to change this property.
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.addrecordstoqueue	Enables the Oracle CDC Client origin to reduce memory usage when the origin is configured to buffer data locally, in memory. This property is enabled by default. Do not disable this property unless recommended by customer support.
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize	Enables Data Collector to report memory consumption when the Oracle CDC Client origin uses local buffers. Reporting reduces performance, so enable the property only as a temporary troubleshooting measure. This property is disabled by default.
stage.conf_com.streamsets.pipeline.stage.executor.shell.shell	Defines the relative or absolute path to the command line interpreter to use to execute scripts, such as `/bin/bash`. Default is `sh`. Used by Shell executors.
stage.conf_com.streamsets.pipeline.stage.executor.shell.sudo	Defines the relative or absolute path to the sudo command to use when executing scripts. Default is `sudo`. Used by Shell executors.
stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode	Uses the Data Collector user who starts the pipeline to execute shell scripts defined in Shell executors. When not enabled, the operating system user who started Data Collector is used to execute shell scripts. To enable the secure use of shell scripts through the Shell executor, we highly recommend uncommenting this property. Requires the user who starts the pipeline to have a matching user account in the operating system. For more information about the security ramifications, see Data Collector Shell Impersonation Mode. Used by Shell executors.
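
As a sketch of how the stage-specific properties are typically used, the fragment below uncomments and sets three of them. The values follow the descriptions in the table; whether they suit a given installation depends on how Hadoop and the operating system are set up.

```properties
# Run Hadoop-related stages as the currently logged in Data Collector user.
stage.conf_hadoop.always.impersonate.current.user=true
# Lowercase user names before passing them to Hadoop, for example when
# they come from a case-insensitive LDAP installation.
stage.conf_hadoop.always.lowercase.user=true
# Use bash instead of the default sh interpreter for Shell executors.
stage.conf_com.streamsets.pipeline.stage.executor.shell.shell=/bin/bash
```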

The Data Collector configuration properties include the following observer properties, used to process data rules and alerts:


Observer Properties	Description
observer.queue.size	Maximum queue size for data rule evaluation requests. Each data rule generates an evaluation request for every batch that passes through the stream. When the number of requests outstrips the queue size, requests are dropped. Default is 100.
observer.sampled.records.cache.size	Maximum number of records to be cached for display for each rule. The exact number of records is specified in the data rule. Default is 100. You can reduce this number as needed.
observer.queue.offer.max.wait.time.ms	Maximum number of milliseconds to wait before dropping a data rule evaluation request when the observer queue is full.

The Data Collector configuration properties include the following miscellaneous properties:


Miscellaneous Property	Description
max.stage.private.classloaders	Maximum number of stage libraries Data Collector allows. Default is 50.
runner.thread.pool.size	Pre-multiplier size of the thread pool. One running pipeline requires five threads, and pipelines share threads in the pool. To calculate the approximate runner thread pool size, multiply the number of running pipelines by 2.2. Increasing this value does not increase the parallelization of an individual pipeline. Default is 50, which is sufficient to run approximately 22 standalone pipelines at the same time. For information about advanced thread pool properties, see the advanced thread pool documentation.
runner.boot.pipeline.restart	Automatically restarts all running pipelines on a Data Collector restart. To disable the automatic restart of pipelines, uncomment this property. Disable only for troubleshooting or in a development environment.
pipeline.max.runners.count	Maximum number of pipeline runners to use for a multithreaded pipeline. Default is 50.
package.manager.repository.links	Enables specifying alternate locations for the Package Manager repositories. Use this property to install non-IBM StreamSets stage libraries or to install stage libraries from local or alternate repositories. To use alternate Package Manager repositories, uncomment the property and specify a comma-separated list of URLs.
bundle.upload.enabled	Enables uploading manually-generated support bundles to customer support. When disabled, you can still generate, download, and email support bundles. To disable uploads of manually-generated bundles, uncomment this property.
bundle.upload.on_error	Enables the automatic generation and upload of support bundles to customer support when pipelines transition to an error state. Use of this property is not recommended.
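
The thread pool sizing rule above can be worked through with an illustrative number. Assuming 40 pipelines run at the same time, the approximate pool size is 40 x 2.2 = 88. The pipeline count and the runner limit below are example values, not recommendations.

```properties
# 40 concurrently running pipelines x 2.2 = 88 threads.
runner.thread.pool.size=88
# Illustrative cap on pipeline runners for each multithreaded pipeline.
pipeline.max.runners.count=20
```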

The configuration properties include stage and stage library aliases to enable backward compatibility for pipelines created with earlier versions of Data Collector, such as:

stage.alias.streamsets-datacollector-basic-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget=streamsets-datacollector-jdbc-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget

library.alias.streamsets-datacollector-apache-kafka_0_8_1_1-lib=streamsets-datacollector-apache-kafka_0_8_1-lib

Generally, you should not need to change or remove these aliases.

You can optionally add stage libraries to the following properties to limit the stage libraries that Data Collector uses and to include additional configuration properties. The property names differ depending on the Data Collector version:

Use the following properties for Data Collector 6.1 and later:


Blocklist / Allowlist Property	Description
system.stagelibs.allowlist system.stagelibs.blocklist	Use one list to limit the IBM StreamSets stage libraries that can be used in Data Collector. Do not use both.
user.stagelibs.allowlist user.stagelibs.blocklist	Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both.
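
A minimal sketch of an allowlist for Data Collector 6.1 and later, assuming the property takes a comma-separated list of stage library names. The two library names are taken from the alias examples earlier in this section. The blocklist is left unset because only one of the two lists should be used.

```properties
# Only the listed IBM StreamSets stage libraries can be used.
system.stagelibs.allowlist=streamsets-datacollector-basic-lib,streamsets-datacollector-jdbc-lib
# Do not also set system.stagelibs.blocklist.
```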

Use the following properties for Data Collector 6.0:


Blacklist / Whitelist Property	Description
system.stagelibs.whitelist system.stagelibs.blacklist	Use one list to limit the IBM StreamSets stage libraries that can be used in Data Collector. Do not use both.
user.stagelibs.whitelist user.stagelibs.blacklist	Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both.

The Data Collector configuration properties include the following classpath validation properties:


Classpath Validation Property	Description
stagelibs.classpath.validation.enable	Allows you to disable classpath validation when necessary. By default, Data Collector performs classpath validation each time it starts. It writes the results to the Data Collector log. Though generally unnecessary, you can disable classpath validation by uncommenting this property and setting it to `false`.
stagelibs.classpath.validation.terminate	Prevents Data Collector from starting when it discovers an invalid classpath. To enable this behavior, uncomment this property and set it to `true`.

The Data Collector configuration properties include the following Health Inspector property:


Health Inspector Property	Description
health_inspector.network.host	Host name that the Data Collector Health Inspector uses for the `ping` and `traceroute` commands.

The Data Collector configuration properties include the following property that specifies additional configuration files to include in the Data Collector configuration:


Additional Files Property	Description
config.includes	Additional configuration files to include in the Data Collector configuration. The files must be stored in a directory relative to the `$SDC_CONF` directory. You can enter multiple file names separated by commas. The files are loaded into the Data Collector configuration in the listed order. If the same configuration property is defined in multiple files, the value defined in the last loaded file takes precedence. By default, the dpm.properties, vault.properties, and credential-stores.properties files are included in the Data Collector configuration. In a Control Hub deployment, credential store, Java, log4j, and security policy properties are included in the Data Collector advanced configuration properties by default.
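
To illustrate the load order, the sketch below appends a hypothetical file, `site-overrides.properties`, to the default list. Because it is listed last, any property it defines would take precedence over the same property in the earlier files.

```properties
# Files are loaded left to right from the $SDC_CONF directory.
# site-overrides.properties is a hypothetical file name; values defined
# there override the same properties in the files listed before it.
config.includes=dpm.properties,vault.properties,credential-stores.properties,site-overrides.properties
```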

The Data Collector configuration properties include record sampling properties that indicate the size of the sample set chosen from a total population of records. Data Collector uses the sampling properties when you run a pipeline that writes to a destination system using the SDC Record data format and then run another pipeline that reads from that same system using the SDC Record data format. Data Collector uses record sampling to calculate the time that a record stays in the intermediate destination.

By default, Data Collector uses 1 out of 10,000 records for sampling. If you modify the sampling size, simplify the fraction for better performance. For example, configure the sampling size as 1/40 records instead of 250/10000 records. The following properties specify the sampling size:


Record Sampling Property	Description
sdc.record.sampling.sample.size	Size of the sample set. Default is 1.
sdc.record.sampling.population.size	Size of the total number of records. Default is 10,000.
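
The simplified fraction from the example above, 1 out of 40 records rather than 250 out of 10000, might be written as follows. The sampling rate itself is only an illustration.

```properties
# Sample 1 record out of every 40 (equivalent to 250/10000, simplified).
sdc.record.sampling.sample.size=1
sdc.record.sampling.population.size=40
```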

The Data Collector configuration properties include properties that define how Data Collector caches pipeline states. Data Collector can cache the state of pipelines for faster retrieval of those states on the Home page. If Data Collector does not cache pipeline states, it must retrieve the states from the pipeline data files stored in the $SDC_DATA directory. You can configure the following properties that specify how Data Collector caches pipeline states:


State Cache Property	Description
store.pipeline.state.cache.maximum.size	Maximum number of pipeline states that Data Collector caches. When the maximum number is reached, Data Collector evicts the oldest states from the cache. Default is 100.
store.pipeline.state.cache.expire.after.access	Amount of time in minutes that a pipeline state can remain in the cache after the entry's creation, the most recent replacement of its value, or its last access. Default is 10 minutes.
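
As a sketch, an installation with several hundred pipelines might enlarge the cache and keep entries longer. The values 500 and 30 are illustrative only.

```properties
# Cache up to 500 pipeline states before the oldest entries are evicted.
store.pipeline.state.cache.maximum.size=500
# Keep a state for 30 minutes after its creation, replacement, or last access.
store.pipeline.state.cache.expire.after.access=30
```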

The Data Collector configuration properties include the following properties that define how Data Collector works with Control Hub:


General Property	Description
dpm.enabled	Specifies whether the Data Collector is enabled to work with Control Hub. Default is false.
dpm.base.url	URL to access Control Hub. For Control Hub cloud, set to `https://cloud.streamsets.com`. For Control Hub on-premises, set to the Control Hub URL provided by your system administrator, for example `https://<hostname>:18631`.
dpm.registration.retry.attempts	Maximum number of times that Data Collector attempts to register with Control Hub before failing the registration. Default is 5.
dpm.security.validationTokenFrequency.secs	Frequency in seconds that Data Collector validates authentication and user tokens with Control Hub. Default is 60.
dpm.appAuthToken	File located within `$SDC_CONF`, the Data Collector configuration directory, that includes the authentication token for this Data Collector instance. Generally, you should not need to change this value.
dpm.remote.control.job.labels	Labels to assign to this Data Collector. Use labels to group Data Collectors registered with Control Hub. To assign multiple labels, enter a comma-separated list of labels. Default is "all", which you can use to run a job on all registered Data Collectors.
dpm.remote.control.ping.frequency	Frequency in milliseconds that Data Collector notifies Control Hub that it is running. Default is 5,000.
dpm.remote.control.events.recipient	Name of the internal Control Hub application to which Data Collector sends pipeline status updates. Do not change this value.
dpm.remote.control.process.events.recipients	Names of the internal Control Hub applications to which Data Collector sends performance updates - including CPU load and memory usage. Do not change this value.
dpm.remote.control.status.events.interval	Frequency in milliseconds that Data Collector informs Control Hub of the following information: Status of all local and published pipelines that are running on this Data Collector. Performance information for this Data Collector - including CPU load and memory usage. Default is 60,000.
dpm.remote.deployment.id	For provisioned Data Collectors, the ID of the deployment that provisioned the Data Collector. For manually administered Data Collectors, the value is blank. Do not change this value.
http.meta.redirect.to.sso	Enables the redirect of Data Collector user logins to Control Hub using the HTML meta refresh method. Set to true only if the registered Data Collector is installed as an application on Microsoft Azure HDInsight. Default is false, which means that Data Collector uses HTTP redirect headers to redirect logins. Use the default for all other Data Collector installation types.
dpm.alias.name.enabled	Enables using an abbreviated Control Hub user ID when Hadoop impersonation mode or shell impersonation mode is used. By default, when using Hadoop impersonation mode or shell impersonation mode, a Data Collector registered with Control Hub uses the full Control Hub user ID as the user name, as follows: `<ID>@<organization ID>` Enable this property to use only the ID, ignoring "`@<organization ID>`". For example, using `myname` instead of `myname@org` as the user name. To use a partial Control Hub user ID, uncomment the property and set it to true. When using Hadoop impersonation mode, the Hadoop system, Data Collector, and the stages must be properly configured. For more information, see Hadoop Impersonation Mode. When using shell impersonation mode, Data Collector and the operating system to run the shell script must be properly configured. For more information, see Data Collector Shell Impersonation Mode.
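
A minimal sketch of the Control Hub properties for a Data Collector that works with Control Hub. The URL is the one given in the table above; the labels `production` and `western-region` are hypothetical. Properties marked "Do not change this value" are left at their defaults.

```properties
dpm.enabled=true
dpm.base.url=https://cloud.streamsets.com
# Hypothetical labels used to group this Data Collector for jobs.
dpm.remote.control.job.labels=production,western-region
# Use only <ID> rather than <ID>@<organization ID> as the user name
# in Hadoop or shell impersonation mode.
dpm.alias.name.enabled=true
```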