Configuring Data Collector
You can customize Data Collector by editing the Data Collector configuration file, sdc.properties. Use a text editor to edit the file, $SDC_CONF/sdc.properties. To enable the changes, restart Data Collector.

Alternatively, you can customize Data Collector by configuring the deployment. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Data Collector Configuration.
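Many properties in the tables below take effect only after you uncomment them. As a minimal sketch of why that matters, the following Python snippet reads simple key=value lines and skips lines that start with #. The parse_properties helper is hypothetical and handles only this simple case. It does not cover the full Java properties syntax and is not how Data Collector itself loads the file.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments.

    Illustrative only: the real Java properties format has more rules
    (line continuations, ':' separators, '!' comments).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # a commented-out property is ignored
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


sample = """\
# http.bindHost=0.0.0.0
http.port=18630
https.port=-1
"""

props = parse_properties(sample)
print(props)  # http.bindHost is absent because its line is commented out
```

In this sketch, editing the value on the commented http.bindHost line would change nothing until the leading # is removed.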
General Property | Description |
---|---|
sdc.base.http.url | Data Collector URL that is included in emails sent for metric
and data alerts. Default is
Be sure to uncomment the property if you change the value. |
http.bindHost | Host name or IP address that Data Collector binds to. You might want to configure a specific host or IP
address when the machine that Data Collector is installed on has multiple network cards. Default is 0.0.0.0, which means that Data Collector can bind to any host or IP address. Be sure to uncomment the property if you change the value. |
http.maxThreads | Maximum number of concurrent threads the Data Collector web server uses to serve UI requests. Default is 200. Uncomment the property to change the value, but increasing this value is not recommended. |
http.port | Port number to use for the Data Collector UI. Default is 18630. |
https.port | Secure port number for the Data Collector UI. For example, 18636. Any number besides -1 enables the secure port number. If you use both port properties, the HTTP port redirects to the HTTPS port. Default is -1. For more information, see Enabling HTTPS. |
http2.enable | Enables support of the HTTP/2 protocol for the UI and API. To enable HTTP/2, set this
property to true and configure the
https.port property, above. Do not use with clients that do not support application layer protocol negotiation (ALPN). Default is
|
http.enable.forwarded.requests | Enables handling X-Forwarded-For, X-Forwarded-Proto,
X-Forwarded-Port HTTP request headers issued by a reverse proxy
such as HAProxy, ELB, or NGINX. Set to Default is
|
https.keystore.path | Keystore path and file name used by Data Collector and the gateway node for cluster pipelines. Enter an absolute path or a path relative to the Data Collector resources directory, $SDC_RESOURCES. Note: Default is keystore.jks in the Data Collector configuration directory, $SDC_CONF, which provides a self-signed certificate that you can use. However, StreamSets strongly recommends that you generate a certificate signed by a trusted CA, as described in Enabling HTTPS. |
https.keystore.password | Password to the Data Collector keystore file. To protect the password, store the password in an
external location and then use a function to retrieve the password.
Default uses the |
https.cluster.keystore.path | For cluster pipelines, the absolute path and file name of the keystore file on worker nodes. The file must be in the same location on each worker node. |
https.cluster.keystore.password | For cluster pipelines, the absolute path and name of the file that contains the password to the keystore file on worker nodes. The file must be in the same location on each worker node. |
https.truststore.path | For cluster pipelines, the path and name of the truststore file on the gateway node. Enter an absolute path or a path relative to the Data Collector configuration directory, $SDC_CONF. Default is the truststore
file from the following directory on the gateway node:
If you register Data Collector with Control Hub and use HTTPS, set the path and name of the truststore file with the javax.net.ssl.trustStore option, either in the SDC_JAVA_OPTS environment variable or in the Java configuration properties of the deployment. For a deployment, enter the option as follows in the Java Options property: -Djavax.net.ssl.trustStore=<path to truststore file> |
https.truststore.password | For cluster pipelines, the path and name of the file that contains the password to the truststore file on the gateway node. Enter an absolute path or a path relative to the Data Collector configuration directory, $SDC_CONF. Be sure to uncomment the property if you change the value. |
https.cluster.truststore.path | For cluster pipelines, the absolute path and file name of the
truststore file on the worker nodes. The file must be in the
same location on each worker node. Default is the truststore
file from the following directory on each worker node:
|
https.cluster.truststore.password | For cluster pipelines, the absolute path and name of the file
that contains the password to the truststore file on the worker
nodes. The file must be in the same location on each worker
node. Be sure to uncomment the property if you change the value. |
http.session.max.inactive.interval | Maximum amount of time that the Data Collector UI can remain inactive before the user is logged out. Use -1
to allow user sessions to remain inactive
indefinitely. Default is 86,400 seconds (24 hours). |
http.authentication | HTTP authentication. Use none, basic, digest, or form. The HTTP authentication type determines how passwords are transferred from the browser to Data Collector over HTTP. Digest authentication encrypts the passwords. Basic and form authentication do not encrypt the passwords. When using Default is
|
http.authentication.login.module | Indicates where user account information resides:
Default is |
http.digest.realm | Realm used for HTTP authentication. Use basic-realm, digest-realm, or form-realm. The associated realm.properties file must be located in the Data Collector configuration directory, $SDC_CONF. Default is
|
http.realm.file.permission.check | Checks the permissions for the
realm.properties file in use:
Relevant when http.authentication.login.module is set to
|
http.authentication.ldap.role.mapping | Maps groups defined by the LDAP server to Data Collector roles. Enter a semicolon-separated list as
follows:
Relevant
when http.authentication.login.module is set to
|
ldap.login.module.name | Name of the JAAS configuration properties in the ldap-login.conf file located in the Data Collector configuration directory, $SDC_CONF. Default is
|
http.access.control.allow.origin | List of domains allowed to access the Data Collector REST API for cross-origin resource sharing (CORS). To
restrict access to specific domains, enter a comma-separated
list as
follows:
Default is the asterisk wildcard (*) which means that any domain can access the Data Collector REST API. |
http.access.control.allow.headers | List of HTTP headers allowed during a cross-domain request. |
http.access.control.exposed.headers | List of HTTP headers exposed as part of the cross-domain response. |
http.access.control.allow.methods | List of HTTP methods that can be called during a cross-domain request. |
kerberos.client.enabled | Enables Kerberos authentication for Data Collector. Must be enabled to allow non-Kafka stages to use Kerberos to
access external systems. For more information, see Kerberos Authentication. |
kerberos.client.principal | Kerberos principal to use. Enter a service principal. |
kerberos.client.keytab | Location of the Kerberos keytab file that contains the
credentials for the Kerberos principal. Use a fully-qualified
directory or a directory relative to the |
preview.maxBatchSize | Maximum number of records used to preview data. Default is 10. |
preview.maxBatches | Maximum number of batches used to preview data. Default is 10. |
production.maxBatchSize | Maximum number of records included in a batch when the
pipeline runs. Default is 50000. |
parser.limit | Maximum parser buffer size that origins can use to process
data. Limits the size of the data that can be parsed and
converted to a record. By default, the parser buffer size is 1048576 bytes. To increase the size, uncomment and configure this property. For more information about how this property affects record sizes, see Maximum Record Size. |
production.maxErrorRecordsPerStage | Maximum number of error records to save in memory for each
stage to display in Monitor mode. When the limit is reached,
older error records are discarded. Default is 100. |
production.maxPipelineErrors | Maximum number of pipeline errors to save in memory to
display in monitor mode. When the limit is reached, older errors
are discarded. Default is 100. |
max.logtail.concurrent.requests | Maximum number of external processes allowed to access the
Data Collector log file at the same time through REST API calls. Default is 5. |
max.webSockets.concurrent.requests | Maximum number of WebSocket calls allowed. |
pipeline.access.control.enabled | Enables pipeline permissions and sharing pipelines. With
pipeline permissions enabled, a user must have the appropriate
permissions to view or work with a pipeline. Only Admin users
and pipeline owners have full access to pipelines. When pipeline permissions are disabled, access to pipelines is based on the roles assigned to the user and its groups. For more information about pipeline permissions, see Pipeline Permissions. Default is |
ui.header.title | Optional custom header to display in the Data Collector UI next to the StreamSets logo. You can create a header using
HTML and include an additional image. To use an image, place
the file in a directory local to the following directory:
For example, to add custom text, you might use the following HTML:
Or to use an image in the
$SDC_DIST/sdc-static-web/ directory,
you can use the following HTML:
We recommend using an image no more than 48 pixels high. |
ui.local.help.base.url | Base URL for the online help installed with Data Collector. Do not change this value. |
ui.hosted.help.base.url | Base URL for the online help hosted on the StreamSets website. Do not change this value. |
ui.registration.url | URL used to register Data Collector with StreamSets. Do not change this value. |
ui.refresh.interval.ms | Interval in milliseconds that Data Collector waits before refreshing the Data Collector UI. Default is 2000. |
ui.jvmMetrics.refresh.interval.ms | Interval in milliseconds that the Data Collector metrics are refreshed. Default is 4000. |
ui.enable.webSocket | Enables Data Collector to use WebSocket to gather pipeline information. |
ui.undo.limit | Number of recent actions stored so you can undo them. |
ui.default.configuration.view | Displays basic properties for pipelines and pipeline stages
by default. Users can choose to show the advanced options when configuring a pipeline or
stage. Uncomment the property and set it to
|
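The interaction between http.port and https.port described in the table above can be summarized with a small sketch. The describe_ports function is a hypothetical illustration of the documented rules (any https.port value other than -1 enables the secure port, and with both set the HTTP port redirects to HTTPS). It assumes http.port is configured and is not Data Collector code.

```python
def describe_ports(http_port, https_port):
    """Summarize the effective web server behavior for the two port settings.

    Assumes http_port is configured; https_port == -1 means HTTPS is disabled.
    """
    if https_port == -1:
        return f"HTTP only on port {http_port}"
    return (
        f"HTTPS on port {https_port}; "
        f"HTTP port {http_port} redirects to HTTPS"
    )


print(describe_ports(18630, -1))     # default configuration
print(describe_ports(18630, 18636))  # example secure port from the table
```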
Email Property | Description |
---|---|
mail.transport.protocol | Use smtp or smtps. Default is
|
mail.smtp.host | SMTP host name. Default is
|
mail.smtp.port | SMTP port number. Default is 25. |
mail.smtp.auth | Whether the SMTP host uses authentication. Use true or false. Default is |
mail.smtp.starttls.enable | Whether the SMTP host uses STARTTLS encryption. Use true or false. Default is |
mail.smtps.host | SMTPS host name. Default is
|
mail.smtps.port | SMTPS port number. Default is 25. |
mail.smtps.auth | Whether the SMTPS host uses authentication. Use true or false. Default is |
xmail.username | User name for the email account to send email. |
xmail.password | Password for the email account. To protect the password, store the password in an
external location and then use a function to retrieve the password. Default uses the |
xmail.from.address | Email address to use to send email. |
Advanced Property | Description |
---|---|
runtime.conf.location | Location of runtime properties. Use to declare where runtime
properties are defined:
|
The Data Collector configuration file includes properties with a java.security. prefix, which you can use to configure Java security properties. Any Java security properties that you modify in the configuration file change the JVM configuration. Do not modify the Java security properties when running multiple Data Collector instances within the same JVM.
The Data Collector configuration file includes the following Java security property:
Java Security Property | Description |
---|---|
java.security.networkaddress.cache.ttl | Number of seconds to cache Domain Name Service (DNS)
lookups. Default is 0, which configures the JVM to use the DNS time to live value. For more information, see the networkaddress.cache.ttl property in the Oracle documentation. |
The Data Collector configuration file includes Security Manager properties that allow you to enable the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in the Data Collector configuration, data, and resource directories.
By default, Data Collector uses the Java Security Manager, which allows stages to access files in all Data Collector directories.
The Data Collector configuration file includes the following Security Manager properties:
Security Manager Property | Description |
---|---|
security_manager.sdc_manager.enable | Enables the Data Collector Security Manager for enhanced security. The Data Collector Security Manager does not allow stages to access files in
protected Data Collector directories. Uncomment the property to enable. |
security_manager.sdc_dirs.exceptions | Files in protected directories that can be accessed by all
stage libraries when the Data Collector Security Manager is enabled. Generally, you should not need to change this property. |
security_manager.sdc_dirs.exceptions.<stage_library_name> | Files in protected directories that can be accessed by the
specified stage library when the Data Collector Security Manager is enabled. Generally, you should not need to change this property. |
Stage-Specific Properties | Description |
---|---|
stage.conf_hadoop.always.impersonate.current.user | Ensures that Hadoop-related stages use the currently logged
in Data Collector user to perform tasks, such as writing data, in Hadoop
systems. With this property enabled, Data Collector prevents configuring an alternate user in Hadoop-related
stages. To use this property, uncomment the property and set
it to For more information and a list of affected stages, see Hadoop Impersonation Mode. |
stage.conf_hadoop.always.lowercase.user | Converts the user name to lowercase before passing it to
Hadoop. Use to lowercase user names from case insensitive systems, such as a case-insensitive LDAP installation, before passing the user names to Hadoop systems. To
use this property, uncomment the property and set it to
|
stage.conf_com.streamsets.pipeline.stage.hive.impersonate.current.user | Enables the Hive Metadata processor, the Hive Metastore
destination, and the Hive Query executor to impersonate the
current user when connecting to Hive. Default is
Set to
|
stage.conf_com.streamsets.pipeline.stage.jdbc.drivers.load | Lists JDBC drivers that Data Collector automatically loads for all pipelines. To use this property, uncomment the property and set it to a comma-separated list of JDBC drivers. |
stage.conf_kafka.keytab.location | Storage location for Kerberos keytabs that are specified
in Kafka stages. Keytabs are stored only for the
duration of the pipeline run. Generally, you should not need to change this property. |
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.addrecordstoqueue | Enables the Oracle CDC Client origin to reduce memory usage
when the origin is configured to buffer data locally, in
memory. This property is enabled by default. Do not disable this property unless recommended by the StreamSets support team. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.shell | Defines the relative or absolute path to the command line interpreter to use to execute scripts, such as /bin/bash. Default is
Used by Shell executors. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.sudo | Defines the relative or absolute path to the sudo to use when
executing scripts. Default is
Used by Shell executors. |
stage.conf_com.streamsets.pipeline.stage.executor.shell.impersonation_mode |
Uses the Data Collector user who starts the pipeline to execute shell scripts defined
in Shell executors. When not enabled, the operating system user
who started Data Collector is used to execute shell scripts. To enable the secure use of shell scripts through the Shell executor, we highly recommend uncommenting this property. Requires the user who starts the pipeline to have a matching user account in the operating system. For more information about the security ramifications, see Data Collector Shell Impersonation Mode. Used by Shell executors. |
Antenna Doctor Properties | Description |
---|---|
antennadoctor.enable | Controls whether Antenna Doctor is enabled. Antenna Doctor is enabled by default. To disable it, uncomment this property and set it to false. |
antennadoctor.update.enable | Controls whether Antenna Doctor accesses the internet for periodic updates. To stop the updates, uncomment this property and set it to false. |
Observer Properties | Description |
---|---|
observer.queue.size | Maximum queue size for data rule evaluation requests. Each
data rule generates an evaluation request for every batch that
passes through the stream. When the number of requests outstrips
the queue size, requests are dropped. Default is 100. |
observer.sampled.records.cache.size | Maximum number of records to be cached for display for each
rule. The exact number of records is specified in the data rule.
Default is 100. You can reduce this number as needed. |
observer.queue.offer.max.wait.time.ms | Maximum number of milliseconds to wait before dropping a data rule evaluation request when the observer queue is full. |
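The observer properties describe a bounded queue that drops evaluation requests when it stays full past a wait limit. The following sketch shows that general pattern with Python's standard queue module. The offer helper and the small sizes are hypothetical, chosen only to make the drop visible, and do not reflect the Data Collector implementation.

```python
import queue


def offer(request_queue, request, max_wait_ms):
    """Try to enqueue a request; drop it if the queue stays full too long.

    Returns True if the request was queued, False if it was dropped.
    """
    try:
        request_queue.put(request, timeout=max_wait_ms / 1000)
        return True
    except queue.Full:
        return False


# A tiny queue (size 2 instead of the default 100) so the third offer is dropped.
observer_queue = queue.Queue(maxsize=2)
results = [offer(observer_queue, f"request-{i}", max_wait_ms=10) for i in range(3)]
print(results)
```

Nothing consumes from the queue in this sketch, so once it holds two requests the next offer waits 10 milliseconds and is then dropped.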
The Data Collector configuration file includes the following miscellaneous properties:
Miscellaneous Property | Description |
---|---|
max.stage.private.classloaders | Maximum number of stage libraries Data Collector allows. Default is 50. |
runner.thread.pool.size | Pre-multiplier size of the thread pool. One running pipeline
requires five threads, and pipelines share threads in the pool.
To calculate the approximate runner thread pool size, multiply
the number of running pipelines by 2.2. Increasing this value does not increase the parallelization of an individual pipeline. Default is 50, which is sufficient to run approximately 22 standalone pipelines at the same time. |
runner.boot.pipeline.restart | Automatically restarts all running pipelines on a Data Collector restart. To disable the automatic restart of pipelines, uncomment this property. Disable only for troubleshooting or in a development environment. |
pipeline.max.runners.count | Maximum number of pipeline runners to use for a multithreaded
pipeline. Default is 50. |
package.manager.repository.links | Enables specifying alternate locations for the Package
Manager repositories. Use this property to install non-StreamSets stage libraries or to install stage libraries from local or
alternate repositories. To use alternate Package Manager repositories, uncomment the property and specify a comma-separated list of URLs. |
bundle.upload.enabled | Enables uploading manually-generated support bundles to the StreamSets support team. When disabled, you can still generate, download, and email support bundles. To disable uploads of manually-generated bundles, uncomment this property. |
bundle.upload.on_error | Enables the automatic generation and upload of support bundles to the StreamSets support team when pipelines transition to an error state. Use of this property is not recommended. |
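The sizing rule for runner.thread.pool.size in the table above (multiply the number of running pipelines by 2.2) can be worked through with a short sketch. The approx_runner_pool_size function is hypothetical, and rounding up to a whole number of threads is an assumption, since the source gives only the multiplier.

```python
def approx_runner_pool_size(running_pipelines):
    """Estimate the pool size as running pipelines x 2.2, rounded up.

    Integer arithmetic (x 22, then ceiling-divide by 10) avoids
    floating-point rounding surprises.
    """
    return (running_pipelines * 22 + 9) // 10


# 22 pipelines need about 49 threads, which fits in the default pool of 50.
print(approx_runner_pool_size(22))
# 30 pipelines would need about 66 threads, so the default would be too small.
print(approx_runner_pool_size(30))
```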
The Data Collector configuration file also includes stage and library alias properties, such as the following:
stage.alias.streamsets-datacollector-basic-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget=streamsets-datacollector-jdbc-lib,com_streamsets_pipeline_stage_destination_jdbc_JdbcDTarget
library.alias.streamsets-datacollector-apache-kafka_0_8_1_1-lib=streamsets-datacollector-apache-kafka_0_8_1-lib
Generally, you should not need to change or remove these aliases.
Blacklist / Whitelist Property | Description |
---|---|
system.stagelibs.whitelist, system.stagelibs.blacklist | Use one list to limit the StreamSets stage libraries that can be used in Data Collector. Do not use both. |
user.stagelibs.whitelist, user.stagelibs.blacklist | Use one list to limit the third-party stage libraries that can be used in Data Collector. Do not use both. |
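Because each whitelist and blacklist pair is mutually exclusive, a configuration check can catch the mistake of setting both. The following sketch assumes the properties have already been read into a dictionary. The active_stagelib_list function is hypothetical and is not part of Data Collector.

```python
def active_stagelib_list(props, scope):
    """Return 'whitelist', 'blacklist', or None for the given scope.

    scope is 'system' (StreamSets libraries) or 'user' (third-party libraries).
    Raises ValueError if both lists are set for the same scope.
    """
    whitelist = props.get(f"{scope}.stagelibs.whitelist", "").strip()
    blacklist = props.get(f"{scope}.stagelibs.blacklist", "").strip()
    if whitelist and blacklist:
        raise ValueError(
            f"Set only one of the {scope} whitelist or blacklist, not both"
        )
    if whitelist:
        return "whitelist"
    if blacklist:
        return "blacklist"
    return None


example = {"system.stagelibs.whitelist": "streamsets-datacollector-jdbc-lib"}
print(active_stagelib_list(example, "system"))
print(active_stagelib_list(example, "user"))
```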
Classpath Validation Property | Description |
---|---|
stagelibs.classpath.validation.enable | Allows you to disable classpath validation when necessary.
By default, Data Collector performs classpath validation each time it starts. It writes the results to the Data Collector log. Though generally unnecessary, you can disable
classpath validation by uncommenting this property and
setting it to |
stagelibs.classpath.validation.terminate | Prevents Data Collector from starting when it discovers an invalid classpath. To
enable this behavior, uncomment this property and set it
to |
Health Inspector Property | Description |
---|---|
health_inspector.network.host | Hostname that the Data Collector Health Inspector uses for the ping and
traceroute commands. |
The Data Collector configuration file includes the following property that specifies additional configuration files to include in the Data Collector configuration:
Additional Files Property | Description |
---|---|
config.includes | Additional configuration files to include in the Data Collector configuration. The files must be stored in a directory relative to the $SDC_CONF directory. You can enter multiple file names separated by commas. The files are loaded into the Data Collector configuration in the listed order. If the same configuration property is defined in multiple files, the value defined in the last loaded file takes precedence. By default, the dpm.properties, vault.properties, and credential-stores.properties files are included in the Data Collector configuration. By default, credential store, Java, log4j, and security policy properties are included in the Data Collector advanced configuration properties. |
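The precedence rule for config.includes (files load in the listed order and the last loaded value wins) can be illustrated with a dictionary merge. The merge_included_files function and the example.key and example.other property names are hypothetical. The file names come from the defaults listed above, but their contents here are invented for illustration.

```python
def merge_included_files(files_in_order):
    """Merge property dictionaries in load order; later files override earlier ones."""
    merged = {}
    for _file_name, properties in files_in_order:
        merged.update(properties)
    return merged


included = [
    ("dpm.properties", {"example.key": "first", "example.other": "kept"}),
    ("credential-stores.properties", {"example.key": "second"}),
]
merged = merge_included_files(included)
print(merged)  # example.key takes the value from the last file that defines it
```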
The Data Collector configuration file includes record sampling properties that indicate the size of the sample set chosen from a total population of records. Data Collector uses the sampling properties when you run a pipeline that writes to a destination system using the SDC Record data format and then run another pipeline that reads from that same system using the SDC Record data format. Data Collector uses record sampling to calculate the time that a record stays in the intermediate destination.
By default, Data Collector uses 1 out of 10,000 records for sampling. If you modify the sampling size, simplify the fraction for better performance. For example, configure the sampling size as 1/40 records instead of 250/10000 records. The following properties specify the sampling size:
Record Sampling Property | Description |
---|---|
sdc.record.sampling.sample.size | Size of the sample set. Default is 1. |
sdc.record.sampling.population.size | Size of the total number of records. Default is 10,000. |
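To simplify a sampling fraction before setting these two properties, you can reduce it to lowest terms. The following sketch uses Python's standard fractions module. The simplify_sampling function is a hypothetical helper, not part of Data Collector.

```python
from fractions import Fraction


def simplify_sampling(sample_size, population_size):
    """Reduce sample_size/population_size to lowest terms."""
    reduced = Fraction(sample_size, population_size)
    return reduced.numerator, reduced.denominator


sample, population = simplify_sampling(250, 10000)
print(f"sdc.record.sampling.sample.size={sample}")
print(f"sdc.record.sampling.population.size={population}")
```

For the 250/10000 example, this yields a sample size of 1 and a population size of 40.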
The Data Collector configuration file includes properties that define how Data Collector caches pipeline states. Data Collector can cache the state of pipelines for faster retrieval of those states in the Home page. If Data Collector does not cache pipeline states, it must retrieve pipeline states from the pipeline data files stored in the $SDC_DATA directory. You can configure the following properties that specify how Data Collector caches pipeline states:
Pipeline State Cache Property | Description |
---|---|
store.pipeline.state.cache.maximum.size | Maximum number of pipeline states that Data Collector caches. When the maximum number is reached, Data Collector evicts the oldest states from the cache. Default is 100. |
store.pipeline.state.cache.expire.after.access | Amount of time in minutes that a pipeline state can remain in the
cache after the entry's creation, the most recent replacement of its
value, or its last access. Default is 10 minutes. |
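The two cache properties combine a size limit with expiry after a period without access. The following sketch illustrates that general behavior with an ordered dictionary and an explicit clock value. The StateCache class is hypothetical. Treating the "oldest" state as the least recently touched entry is an assumption, and this is not the Data Collector implementation.

```python
from collections import OrderedDict


class StateCache:
    """Size-bounded cache whose entries expire after a period without access."""

    def __init__(self, maximum_size=100, expire_after_access_min=10):
        self.maximum_size = maximum_size
        self.ttl_seconds = expire_after_access_min * 60
        self._entries = OrderedDict()  # pipeline name -> (state, last touched)

    def put(self, name, state, now):
        # Creating or replacing an entry resets its expiry timer.
        self._entries[name] = (state, now)
        self._entries.move_to_end(name)
        while len(self._entries) > self.maximum_size:
            self._entries.popitem(last=False)  # evict least recently touched

    def get(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        state, touched = entry
        if now - touched > self.ttl_seconds:
            del self._entries[name]  # expired: caller must reload the state
            return None
        # Reading an entry also resets its expiry timer.
        self._entries[name] = (state, now)
        self._entries.move_to_end(name)
        return state


cache = StateCache(maximum_size=2, expire_after_access_min=10)
cache.put("pipeline-a", "RUNNING", now=0)
cache.put("pipeline-b", "RUNNING", now=0)
cache.put("pipeline-c", "STOPPED", now=0)  # evicts pipeline-a
print(cache.get("pipeline-a", now=1))
print(cache.get("pipeline-b", now=60))
```

The now argument is passed in explicitly, in seconds, so the expiry logic can be followed without waiting on a real clock.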
General Property | Description |
---|---|
dpm.enabled | Specifies whether the Data Collector is
enabled to work with Control Hub.
Default is false. |
dpm.base.url | URL to access Control Hub. Set to the Control Hub URL provided by your system administrator. For example,
|
dpm.registration.retry.attempts | Maximum number of times that Data Collector
attempts to register with Control Hub
before failing the registration. Default is 5. |
dpm.security.validationTokenFrequency.secs | Frequency in seconds that Data Collector
validates authentication and user tokens with Control Hub. Default is 60. |
dpm.appAuthToken | File located within $SDC_CONF, the Data Collector configuration directory, that includes the authentication token for this Data Collector instance. Generally, you should not need to change this value. |
dpm.remote.control.job.labels | Labels to assign to this Data Collector. Use
labels to group Data Collectors
registered with Control Hub. To
assign multiple labels, enter a comma-separated list of labels.
Default is "all", which you can use to run a job on all registered Data Collectors. |
dpm.remote.control.ping.frequency | Frequency in milliseconds that Data Collector
notifies Control Hub that
it is running. Default is 5,000. |
dpm.remote.control.events.recipient | Name of the internal Control Hub
application to which Data Collector sends
pipeline status updates. Do not change this value. |
dpm.remote.control.process.events.recipients | Names of the internal Control Hub applications to which Data Collector sends performance updates, including CPU load and memory usage. Do not change this value. |
dpm.remote.control.status.events.interval | Frequency in milliseconds that Data Collector
informs Control Hub of the following information:
Default is 60,000. |
dpm.remote.deployment.id | For provisioned Data Collectors, the
ID of the deployment that provisioned the Data Collector. For manually administered Data Collectors, the value is blank. Do not change this value. |
http.meta.redirect.to.sso | Enables the redirect of Data Collector user logins to Control Hub using the HTML meta refresh method. Set to true only if the registered Data Collector is installed as an application on Microsoft Azure HDInsight. Default is false, which means that Data Collector uses HTTP redirect headers to redirect logins. Use the default for all other Data Collector installation types. |
dpm.alias.name.enabled | Enables using an abbreviated Control Hub user ID when Hadoop impersonation mode or shell impersonation mode is used. By default, when using Hadoop
impersonation mode or shell impersonation mode, a Data Collector
registered with Control Hub
uses the full Control Hub
user ID as the user name, as
follows:
Enable this property to use only the ID, ignoring " To use a partial Control Hub user ID, uncomment the property and set it to true. When using Hadoop impersonation mode, the Hadoop system, Data Collector, and the pipeline stages must be properly configured. For more information, see Hadoop Impersonation Mode. When using shell impersonation mode, Data Collector and the operating system to run the shell script must be properly configured. For more information, see Data Collector Shell Impersonation Mode. |