Data Collector Environment Configuration
You can use environment variables to configure the following:
- Data Collector directories
- User and group used to start Data Collector as a service
- Java configuration options
- Security Manager that restricts the runtime permissions of user libraries
- Path to JAR files to be added to the root classloader
- Heap dump creation and file location
Modifying Environment Variables
- Tarball installation started manually from the command line
- When you start Data Collector manually from the command line on any operating system, edit the $SDC_DIST/libexec/sdc-env.sh file to modify environment variables.
- Tarball or RPM installation started as a service on operating systems that use the SysV init system
- When you start Data Collector as a service on CentOS 6, Oracle Linux 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, edit the $SDC_DIST/libexec/sdcd-env.sh file to modify environment variables.
- Tarball or RPM installation started as a service on operating systems that use the systemd init system
- When you start Data Collector as a service on CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, edit the sdc.service file to modify environment variables.
Note: When you install Data Collector through a cloud service provider marketplace, Data Collector is deployed as an RPM package on an operating system that uses the systemd init system.
- Cloudera Manager installation
- When you install Data Collector through Cloudera Manager, modify environment variables by configuring the StreamSets service through Cloudera Manager.
Data Collector Directories
Data Collector includes environment variables that define the directories used to store files used by Data Collector, such as configuration files, log files, and runtime resources.
The SDC_DIST environment variable defines the Data Collector runtime directory. The runtime directory is the base Data Collector directory that stores the executables and related files. This environment variable is set during installation.
When you start Data Collector manually, the default values of the remaining directory variables are relative to the $SDC_DIST runtime directory. When you start Data Collector as a service, the default values of the remaining directory variables are absolute paths that are outside of the $SDC_DIST runtime directory.
Modify environment variables using the method required by your installation type.
You can configure the following environment variables that define directories:
Environment Variable | Description |
---|---|
SDC_CONF |
Defines the configuration directory for the Data Collector
configuration file, Default directories:
|
SDC_DATA |
Defines the data directory for pipeline configuration and run details. Default directories:
|
SDC_LOG |
Defines the log directory. Default directories:
|
SDC_EXTERNAL_RESOURCES | Defines an optional external resources directory. By default, this
directory contains the following directories:
To define this directory, you must add the environment variable
to the appropriate file. Set the variable to a directory outside of
the Default
directory: |
SDC_RESOURCES | Defines the directory for runtime resource files. To configure this environment
variable, you must uncomment the variable in the appropriate file.
Set the variable to a directory outside of the
Default directories:
|
STREAMSETS_LIBRARIES_EXTRA_DIR | Defines the directory for external
libraries. To configure this environment variable, you
must add the environment variable to the appropriate file. Set the
variable to a directory outside of the Default directory:
This resolves to the following directory unless you define the SDC_EXTERNAL_RESOURCES environment variable:
|
USER_LIBRARIES_DIR | Defines the directory for custom stage
libraries. To configure this environment variable, you
must add it to the appropriate file. Set the variable to a directory
outside of the Default directory:
This resolves to the following directory unless you define the SDC_EXTERNAL_RESOURCES environment variable:
|
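As an illustration of overriding directory variables, the following sketch shows lines that could be added to the environment file used for a manual start. The SDC_LOCAL_BASE variable and the /opt/sdc-local path are hypothetical placeholders, not Data Collector defaults.

```shell
# Hypothetical base directory outside the runtime directory; adjust for your host.
SDC_LOCAL_BASE="${SDC_LOCAL_BASE:-/opt/sdc-local}"

# Point each directory variable at a subdirectory of the base directory.
export SDC_CONF="${SDC_LOCAL_BASE}/conf"
export SDC_DATA="${SDC_LOCAL_BASE}/data"
export SDC_LOG="${SDC_LOCAL_BASE}/log"
export SDC_RESOURCES="${SDC_LOCAL_BASE}/resources"
```

Keeping all directories under one base directory is only a convenience for this example. Each variable can point to an unrelated location.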
User and Group for Service Start
When you run Data Collector as a service, Data Collector runs as the system user account and group defined in environment variables. The default system user and group are named sdc.
You can modify the values of the environment variables to point to another system user or group.
Modify environment variables using the method required by your installation type.
If you change the system user, you must make the new system user the owner of all Data Collector directories.
For example, if you change the system user and group to myuser, use the following command to change the owner of the configuration directory, $SDC_CONF (/etc/sdc in this example), and all files in the directory to myuser:myuser:
chown -R myuser:myuser /etc/sdc
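Because the new system user must own every Data Collector directory, the same command has to be repeated for each one. The following sketch prints the commands instead of running them so that you can review them first. The print_chown_commands helper is hypothetical, and the sketch assumes that the directory variables are set in the current shell.

```shell
# Hypothetical helper: print one chown command per non-empty directory argument.
print_chown_commands() {
  owner="$1"
  shift
  for dir in "$@"; do
    if [ -n "$dir" ]; then
      echo "chown -R ${owner} ${dir}"
    fi
  done
}

# Unset variables expand to empty strings and are skipped.
print_chown_commands "myuser:myuser" \
  "${SDC_CONF:-}" "${SDC_DATA:-}" "${SDC_LOG:-}" "${SDC_RESOURCES:-}"
```

After reviewing the output, run the printed commands with sufficient privileges.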
Java Configuration Options
You define Java configuration options used by Data Collector in the following environment variables:
- SDC_JAVA_OPTS - Includes configuration options for Java.
- SDC_JAVA8_OPTS - Includes configuration options specific to Java 8.
Data Collector loads the value of the version-specific environment variable and adds it to the SDC_JAVA_OPTS environment variable.
When defining Java configuration options, avoid defining duplicate options. If you do define duplicates, the last option passed to the JVM usually takes precedence.
For a Cloudera Manager installation, define Java configuration options by configuring the StreamSets service through Cloudera Manager.
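One way to spot duplicate options before starting Data Collector is to list the option names that occur more than once. The following is a minimal sketch and not part of Data Collector. The find_duplicate_java_opts helper is hypothetical, and it only normalizes common forms such as -Xmx sizes and name=value pairs.

```shell
# Hypothetical helper: print option names that appear more than once
# in a space-separated option string.
# Sizes such as -Xmx2048m are reduced to -Xmx, and name=value pairs to the name.
find_duplicate_java_opts() {
  # $1 is intentionally unquoted so that the shell splits it into words.
  printf '%s\n' $1 |
    sed -E 's/^(-Xm[xsn]|-Xss)[0-9].*/\1/; s/=.*//' |
    sort | uniq -d
}

find_duplicate_java_opts "-Xmx1024m -Xms1024m -server -Xmx2048m"   # prints "-Xmx"
```

The sketch does not detect an option that is both enabled and disabled, such as -XX:+Name and -XX:-Name, because the two forms differ as strings.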
Java Heap Size
Modify the Data Collector Java heap size as necessary, based on the resources available on the host machine. By default, the Java heap size is 1024 MB.
The Java heap size determines the amount of heap memory allocated to Data Collector and limits the amount of memory Data Collector can use when it runs a pipeline. Running a pipeline can use up to 65% of the allocated heap size. For example, with a heap size of 2048 MB, you can configure a pipeline to use up to 65% of the heap, which is approximately 1331 MB of memory.
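The 65% limit can be worked out for any heap size with integer arithmetic. The following sketch uses hypothetical variable names and rounds down to whole megabytes.

```shell
# 65% of the configured heap size in MB, rounded down.
heap_mb=2048
pipeline_max_mb=$(( heap_mb * 65 / 100 ))
echo "${pipeline_max_mb} MB"   # prints "1331 MB"
```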
Use the following Java options to define the heap size:
- Xmx - Defines the maximum heap size.
- Xms - Defines the minimum heap size.
When the UseCompressedOops option is enabled, the heap size is limited to a maximum of 32 GB regardless of the configured size. To allocate more than 32 GB, disable the option by adding the following Java option:
-XX:-UseCompressedOops
Define the heap size based on your installation:
- Tarball or RPM installation
- Define the heap size in the SDC_JAVA_OPTS environment variable. For example, to double the heap size, increase the Xmx and Xms settings as follows:
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Xmx2048m -Xms2048m -server"
Modify environment variables using the method required by your installation type.
- Cloud service provider installation
- Define the heap size percentage in the SDC_HEAP_SIZE_PERCENTAGE environment variable. Default is 50% of the available memory on the virtual machine.
- Cloudera Manager installation
- Define the heap size in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
Remote Debugging
You can enable remote debugging to debug a Data Collector instance running on a remote machine.
- Tarball or RPM installation
- Define debugging options in the SDC_JAVA_OPTS environment variable.
- Cloudera Manager installation
- Define the debugging options in the Java Options property for the StreamSets service in Cloudera Manager.
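For a tarball or RPM installation, debugging options could be appended to SDC_JAVA_OPTS as in the following sketch. It uses the standard JVM debug agent option, -agentlib:jdwp. The SDC_DEBUG_PORT variable and port 5005 are hypothetical choices, not Data Collector settings.

```shell
# Hypothetical debug port; choose any free port that your firewall allows.
SDC_DEBUG_PORT="${SDC_DEBUG_PORT:-5005}"

# server=y makes the JVM listen for a debugger connection.
# suspend=n lets the JVM start without waiting for a debugger to attach.
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=${SDC_DEBUG_PORT}"
```

On Java 9 and later, an address that contains only a port listens on the local host. To accept connections from another machine, include a host in the address, such as *:5005.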
Garbage Collector
- Java 11 or later - Default is the G1 garbage collector.
If you define another garbage collector, test and evaluate Data Collector performance before making the same change in a production environment. Garbage collector performance depends on each particular use case.
- Tarball or RPM installation
- Define the garbage collector in the SDC_JAVA8_OPTS environment variable.
- Cloudera Manager installation
- Define the garbage collector in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
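As a sketch for a tarball or RPM installation running on Java 8, the following line selects the G1 garbage collector. -XX:+UseG1GC is a standard HotSpot flag. Using G1 here is only an example, not a recommendation.

```shell
# Append the collector flag to the Java 8 specific options.
export SDC_JAVA8_OPTS="${SDC_JAVA8_OPTS} -XX:+UseG1GC"
```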
Logging
Data Collector enables garbage collector logging by default to facilitate troubleshooting. Log files are written to $SDC_LOG/gc.log. You can disable logging.
Disable garbage collector logging based on your installation:
- Tarball or RPM installation
- Set the SDC_GC_LOGGING environment variable to false. For example:
export SDC_GC_LOGGING=false
- Cloudera Manager installation
- Set the SDC_GC_LOGGING environment variable to false in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field for the StreamSets service in Cloudera Manager.
Root Classloader
You can edit the SDC_ROOT_CLASSPATH environment variable to define the path to JAR files to be added to the Data Collector root classloader.
Use the variable for components that must be in the root classloader, such as Snappy.
Default is $SDC_DIST/root-lib/'*'.
Modify environment variables using the method required by your installation type.
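The following sketch keeps the default entry and adds a second directory of JAR files. It assumes that the variable accepts a colon-separated Java classpath, which the text above does not confirm. The EXTRA_ROOT_JARS variable and the /opt/sdc-root-extras path are hypothetical.

```shell
# Hypothetical directory holding extra JAR files; not a Data Collector default.
EXTRA_ROOT_JARS="/opt/sdc-root-extras"

# Inside double quotes the shell does not expand the asterisks,
# so they are passed through as Java classpath wildcards.
export SDC_ROOT_CLASSPATH="${SDC_DIST}/root-lib/*:${EXTRA_ROOT_JARS}/*"
```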
Heap Dump Creation
By default, when Data Collector encounters an out of memory error (OOME), it creates a heap dump.
By default, heap dump files are written to the directory defined in the SDC_LOG environment variable and use a naming convention that allows generating multiple heap dump files, as follows: $SDC_LOG/sdc_heapdump_${timestamp}.hprof.
You can change the name of the heap dump files, but we recommend using the ${timestamp} or a similar variable to ensure that each heap dump name is unique.
Note that the Java Virtual Machine, and therefore Data Collector, does not overwrite existing heap dump files. For example, if you use $SDC_LOG/sdc_heapdump.hprof as the file name, after Data Collector creates the first heap dump file, it does not create another until you remove the existing file.
Heap Dump Environment Variable | Description |
---|---|
SDC_HEAPDUMP_ON_OOM | Specifies whether Data Collector generates a heap dump upon encountering an out of memory error. Default is true. |
SDC_HEAPDUMP_PATH | Specifies the file name and location to use for heap dump files. By default, heap dumps are written to $SDC_LOG/sdc_heapdump_${timestamp}.hprof. To specify a different file name or location, uncomment the property and enter the location and file name to use. Tip: To write multiple heap dump files to a directory, use a function or variable to ensure that the file name is unique. If a file of the same name exists in the directory, Data Collector does not create a new heap dump file. |
Modify environment variables using the method required by your installation type.
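One way to keep heap dump file names unique is to build the name from the current time. The following sketch assumes an environment file that is evaluated by a shell at startup and that SDC_LOG is already set. The date +%s form is supported by GNU and BSD date.

```shell
# Keep heap dump creation enabled.
export SDC_HEAPDUMP_ON_OOM=true

# date +%s prints seconds since the Unix epoch.
# The value is computed once, when this line is evaluated.
export SDC_HEAPDUMP_PATH="${SDC_LOG}/sdc_heapdump_$(date +%s).hprof"
```

Because the timestamp is fixed when the line is evaluated, the name is unique per start of Data Collector rather than per heap dump.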