Requirements for Self-Managed Deployments

When working with self-managed deployments, you take full control of procuring the resources needed to run a Data Collector engine. You must set up the machine and complete the installation prerequisites required by the engine.

Before launching a Data Collector engine for a self-managed deployment, set up a machine with the minimum requirements. Then, complete the additional Docker image prerequisites or tarball prerequisites based on the installation type you want to use.

After launching a Data Collector tarball, you can optionally set up the engine to run as a service.

Each machine must meet the following minimum requirements:

Component         Minimum Requirement
Operating system  One of the following operating systems and versions:
                    • Mac OS X
                    • Amazon Linux 2
                    • CentOS 6.x or 7.x
                    • Oracle Linux 6.x - 8.x
                    • Red Hat Enterprise Linux 6.x - 9.x
                    • Ubuntu 14.04 LTS - 22.04 LTS
Cores             2
RAM               1 GB
Disk space        6 GB
Note: StreamSets does not recommend using NFS or NAS to store Data Collector files.
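
As a quick sanity check, you can compare a machine against these minimums. The following sketch assumes a Linux host with the standard nproc, free, and df utilities; adjust the mount point as needed:

```shell
# Compare this machine against the Data Collector minimums above.
# Assumes a Linux host; checks the root (/) filesystem for free space.
echo "Cores:     $(nproc)"                           # minimum: 2
free -g | awk '/^Mem:/ {print "RAM (GB):  " $2}'     # minimum: 1 GB
df -BG / | awk 'NR==2 {print "Disk free: " $4}'      # minimum: 6 GB
```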

Java Version

Data Collector requires that the appropriate Java version be installed on the engine machine.

When you configure a self-managed deployment using an engine tarball file, you must install the appropriate Java version before you run the installation script that installs and launches the engine tarball.

When you configure a self-managed deployment using a Docker image, Control Hub bundles an appropriate Java version into the Docker image.

Data Collector supports the following Java versions:
  • Java 8 (Oracle Java 8 and OpenJDK 8)
  • Java 11 (Oracle Java 11 and OpenJDK 11)
  • Java 17 (Oracle Java 17 and OpenJDK 17)

Some Data Collector functionality depends on the Java version that you use. For more information, see Java Versions and Available Features.

Java Versions and Available Features

The Java version installed on the Data Collector machine determines the Data Collector features that you can use.

All supported Java versions provide almost all Data Collector features. However, due to third-party requirements, some features require a particular Java version. For example, HPE Ezmeral Data Fabric 7.0.x requires JDK 11 or higher, so you must use Java 11 or 17 to work with MapR stages that connect to HPE Ezmeral Data Fabric 7.0.x.

The following table describes the features available with different Java versions:
Java Version                                    Unavailable Features
Java 8 (Oracle Java 8 and OpenJDK 8)              • Stages in the MapR 7.0.x stage library

Java 11 (Oracle Java 11 and OpenJDK 11) and       • Stages in CDH and CDP stage libraries
Java 17 (Oracle Java 17 and OpenJDK 17)           • Stages in HDP stage libraries
                                                  • Stages in MapR stage libraries earlier than 7.0.x

JCE for Oracle JVM

If you use AES-256 encryption with your Oracle JVM and use a version of JDK earlier than 1.8.0_161, then configure the JDK on the Data Collector machine to use the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy.

To configure the JDK to use unlimited cryptography, set the crypto.policy Java Security property in the java.security file included in your JDK installation to a value of unlimited. See the notes in the java.security file for more information.
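
For reference, the relevant line looks like the following. This is a sketch of the edit, assuming a JDK 8 layout where the file is typically at $JAVA_HOME/jre/lib/security/java.security:

```
# In $JAVA_HOME/jre/lib/security/java.security
# The crypto.policy property is supported in JDK 1.8.0_151 and later.
crypto.policy=unlimited
```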

Docker Image Prerequisites

For a Docker image installation of Data Collector, you must install Docker as a prerequisite.
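
A quick way to confirm that Docker is present before configuring the deployment:

```shell
# Check whether Docker is installed on this machine.
if command -v docker >/dev/null 2>&1; then
    docker --version
else
    echo "Docker is not installed; install it before launching the engine."
fi
```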

Tarball Prerequisites

For a tarball installation of Data Collector, you must complete the following prerequisites:
  1. Install one of the supported Java versions.
  2. Configure the open file limit.

Configuring the Open File Limit

Data Collector requires a large number of file descriptors to work correctly with all stages. Most operating systems provide a configuration to limit the number of files a process or a user can open. The default values are usually less than the Data Collector requirement of 32768 file descriptors.

Use the following command to verify the configured limit for the current user:
ulimit -n

Most operating systems enforce two limits on the maximum number of open files: the soft limit and the hard limit. The hard limit is set by the system administrator. The soft limit can be raised by the user, but only up to the hard limit.
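
You can inspect both limits for the current shell session:

```shell
# Inspect the open file limits for the current shell session.
ulimit -Sn   # soft limit: can be raised by the user, up to the hard limit
ulimit -Hn   # hard limit: set by the system administrator
```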

Increasing the open file limit differs for each operating system. Consult your operating system documentation for the preferred method.

Increase the Limit on Linux

To increase the open file limit on Linux, see the following solution: How to set ulimit values.

This solution should work on Red Hat Enterprise Linux, Oracle Linux, CentOS, and Ubuntu. However, refer to the administrator documentation for your operating system for the preferred method.

Increase the Limit on macOS

The method you use to increase the limit on macOS can differ with each version. Refer to the documentation for your operating system version for the preferred method.

To increase the limit for the computer, so that the limits are retained after relaunching the terminal and restarting the computer, create a property list file. The following steps are valid for recent macOS versions:

  1. Use the following command to create a property list file named limit.maxfiles.plist:
    sudo vim /Library/LaunchDaemons/limit.maxfiles.plist
  2. Add the following contents to the file, modifying the maxfiles attribute as needed.

    The maxfiles attribute defines the open file limit. The first value in the file is the soft limit. The second value is the hard limit.

    For example, in the following limit.maxfiles.plist file, both the soft and hard limit are set to 32,768:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
      <dict>
        <key>Label</key>
        <string>limit.maxfiles</string>
        <key>ProgramArguments</key>
        <array>
          <string>launchctl</string>
          <string>limit</string>
          <string>maxfiles</string>
          <string>32768</string>
          <string>32768</string>
        </array>
        <key>RunAtLoad</key>
        <true/>
        <key>ServiceIPC</key>
        <false/>
      </dict>
    </plist>
  3. Use the following commands to load the new settings:
    sudo launchctl unload -w /Library/LaunchDaemons/limit.maxfiles.plist
    sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
  4. Use the following command to check that the system limits were modified:
    launchctl limit maxfiles
  5. Use the following command to set the session limit:
    ulimit -n 32768

Running Data Collector as a Service

When you install and launch a Data Collector tarball, the installation script starts the engine as a manually launched process. Alternatively, you can set up Data Collector to run as a service on supported operating systems that use the systemd init system, including CentOS 7, Oracle Linux 7, Red Hat Enterprise Linux 7, and Ubuntu 16.04 LTS.

Setting up Data Collector to run as a service requires root privileges.
Important: Complete these steps after you launch a Data Collector tarball, as described in the Control Hub documentation.
  1. Use the Control Hub UI to shut down the engine instance, as described in the Control Hub documentation.
  2. On the Data Collector machine, create a system user and group named sdc.

    The sdc user and group are used to start Data Collector as a service.

  3. Move the default Data Collector installation directory to a public folder, such as /opt/sdc/.

    For example, if you installed Data Collector version 5.9.0 on an Ubuntu operating system using the default installation directory and you want to move the installation to /opt/sdc/, use the following command:

    mv /home/ubuntu/.streamsets/install/dc/streamsets-datacollector-5.9.0 /opt/sdc/
  4. Use the following commands to copy the sdc.service and sdc.socket files to the /etc/systemd/system directory:
    cp /opt/sdc/systemd/sdc.service /etc/systemd/system/sdc.service
    cp /opt/sdc/systemd/sdc.socket /etc/systemd/system/sdc.socket
  5. Make the following modifications to the sdc.service file:
    1. Edit the existing environment variables to specify the new Data Collector installation directory.
      For example, if you moved the installation directory to /opt/sdc/, define the existing environment variables as follows:
      Environment=SDC_CONF=/opt/sdc/etc
      Environment=SDC_HOME=/opt/sdc
      Environment=SDC_LOG=/opt/sdc/log
      Environment=SDC_DATA=/opt/sdc/data
      Environment=SDC_RESOURCES=/opt/sdc/externalResources/resources
      ExecStart=/opt/sdc/bin/streamsets dc -verbose
    2. Add the following environment variables to the file:
      Environment="STREAMSETS_DEPLOYMENT_SCH_URL=<sch-url>"
      Environment="STREAMSETS_DEPLOYMENT_ID=<deployment-id>"
      Environment="STREAMSETS_DEPLOYMENT_TOKEN=<deployment-token>"

      To locate the value of each environment variable, retrieve the installation script for the self-managed deployment, as described in the Control Hub documentation.

For example, the following installation script includes the --deployment-id, --deployment-token, and --sch-url options:

       bash -c 'set -eo pipefail; curl -fsS https://na01.hub.streamsets.com/streamsets-engine-install.sh | bash -s -- --deployment-id="61ddd369-2a0d-49bf-954b-e249da0ff84c:a7f82a57-b7e3-11eb-b93c-cddd1f34c1" --deployment-token="eyJ0eXAiOiJKVCJhbGciOiJub25lIn0.eyJzIjoiNjIzO2NjM2NWQxNTY5ZGVlNTRkODc1MjRkNWRkZGUwZGYNTkwYmRkNjVmYzFlODkyZGIzYTcxMzI5ZjNiYWQ0N2VjM2NhZmN5NDQyZjZkMmMwZmI1OTI0MGE5ZWY1ODcwNWM0NGIyYjExNzBlMmNmODlhNGQiLCJ2IjoxLCJpc3MiOiJkZXYiLCJqdGkiOiIxZTdmOGIxNS00ODM4LTRiOTgtYTRhOC1jZmI5NWQyNzVhOTEiLCJvIjoiYTdmODJhNTMy0xMWViLWI5M2MtY2RkMmVkMWYzNGMxIn0." --sch-url="https://na01.hub.streamsets.com" --foreground '
    3. If the engine is configured to use a proxy server, add the following environment variables to the file to define the proxy properties:
      Environment='STREAMSETS_BOOTSTRAP_JAVA_OPTS=-Dhttps.proxyHost=<proxy server> -Dhttps.proxyPort=<port> -Dhttp.proxyHost=<proxy server> -Dhttp.proxyPort=<port>'
      Environment="http_proxy=http://<proxy server>:<port>"
      Environment="https_proxy=http://<proxy server>:<port>"
      Environment="HTTP_PROXYHOST=<proxy server>"
      Environment="HTTP_PROXYPORT=<port>"
      Environment="HTTPS_PROXYHOST=<proxy server>"
      Environment="HTTPS_PROXYPORT=<port>"
      Environment='HTTP_NONPROXYHOSTS=<list of non-proxy hosts>'
      Environment='no_proxy=<list of non-proxy hosts>'
      Environment="HTTP_PROXY_AUTH_TUNNELING_DISABLED_SCHEMES=<scheme>"

      To locate the value of each environment variable, retrieve the installation script for the self-managed deployment, as described in the Control Hub documentation. The proxy server environment variables are defined before the bash command.

  6. Use the following command to reload the systemd manager configuration:
    systemctl daemon-reload
  7. Use the following command to start Data Collector as a service:
    systemctl start sdc
  8. Use the following command to add the Data Collector service to the system startup:
    systemctl enable sdc
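
After the service starts, you can verify it with standard systemd commands. The following sketch assumes the sdc unit created in the steps above and a systemd-based host; the guards make it safe to run elsewhere:

```shell
# Verify the Data Collector service (assumes the sdc unit from the steps above).
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-active sdc || true            # expect "active" once the service is running
    systemctl status sdc --no-pager || true    # summary, recent log lines, and the main PID
else
    echo "systemd is not available on this machine"
fi
```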