Self-Managed Deployments

You can create a self-managed deployment for an active self-managed environment.

When using a self-managed deployment, you take full control of procuring the resources needed to run engine instances. The resources can be local on-premises machines or cloud computing machines. You must set up the machines and complete the installation prerequisites required by the StreamSets engine type. When the machines reside behind a firewall, you must allow the required inbound and outbound traffic to each machine, as described in Firewall Configuration Overview.

When you create a self-managed deployment, you define the engine type, version, and configuration to deploy. You also select the installation type: either a tarball or a Docker image.

After you create and start a self-managed deployment, Control Hub displays the engine installation script that you run to install and launch engine instances on the on-premises or cloud computing machines that you have set up. You can configure the installation script to run engine instances as a foreground or background process.

Using a self-managed deployment is the simplest way to get started with StreamSets. After getting started, you might consider using one of the cloud service provider integrations that StreamSets provides, such as the AWS and GCP environments and deployments. With these integrations, Control Hub automatically provisions the resources needed to run the engine type in your cloud service provider account, and then deploys engine instances to those resources.

Or, you can continue using self-managed deployments when you prefer to take full control of the on-premises or cloud computing machines where you run engine instances. You can increase the number of engine instances for a self-managed deployment by running the engine installation script on additional machines.
Important: If your pipelines require external resources, you must set up an external resource archive that all engine instances can access before increasing the number of instances.

Quick Start Deployment

When you deploy an engine as you build your first Data Collector pipeline, Control Hub presents a simplified process to help you quickly deploy your first Data Collector engine.

Control Hub creates a self-managed deployment for you and names the deployment Data Collector 1 (Quick Start). Control Hub also assigns a quick-start tag and a datacollector1quickstart engine label to the deployment.

You can rename the quick start deployment, remove the default tag or engine label, or edit the deployment just as you edit any other self-managed deployment.

Configuring a Self-Managed Deployment

Configure a self-managed deployment to define the group of engine instances to deploy to a self-managed environment.

To create a new deployment, click Set Up > Deployments in the Navigation panel, and then click the Create Deployment icon.

To edit an existing deployment, click Set Up > Deployments in the Navigation panel, click the deployment name, and then click Edit.

Define the Deployment

Define the deployment essentials, including the deployment name and type, the environment that the deployment belongs to, and the engine type and version to deploy.

After you save the deployment, you cannot change the deployment type, the engine version, or the environment.

  1. Configure the following properties:
    Define Deployment Property Description
    Deployment Name Name of the deployment.

    Use a brief name that informs your team of the deployment use case.

    Deployment Type Select Self-Managed.
    Environment Active self-managed environment where engine instances will be deployed.

    In most cases, you can select the default self-managed environment.

    Engine Type Type of engine to deploy:
    • Data Collector
    • Transformer
    Engine Version Engine version to deploy.
    Deployment Tags Optional tags that identify similar deployments within Control Hub. Use deployment tags to easily search and filter deployments.

    Enter nested tags using the following format:

    <tag1>/<tag2>/<tag3>

  2. If creating the deployment, click one of the following buttons:
    • Cancel - Cancels creating the deployment and exits the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Configure the Engine

Define the configuration of the engine to deploy. You can use the defaults to get started.

  1. Configure the following properties:
    Engine Property Description
    Stage Libraries

    Stage libraries to install on the engine.

    The available stage libraries depend on the selected engine type and version.

    Advanced Configuration

    Access to advanced configuration properties to further customize the engine. As you get started with StreamSets, the default values should work in most cases.

    The available properties depend on the selected engine type.

    Important: When using a Transformer engine that works with a Spark cluster, edit the transformer.base.http.url property on the Transformer Configuration tab. Uncomment the property and set it to the Transformer URL. For more information, see Granting the Spark Cluster Access to Transformer in the Transformer engine documentation.
    External Resource Source Source of the external files and libraries, such as JDBC drivers, required by the engine:
    • None - External resources are not defined in the deployment.

      Select when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.

    • Archive File - External resources are included in an archive file defined in the deployment.

      Select when the deployment launches multiple engine instances and when your pipelines require external resources.

    External Resource Location

    Location of the archive file that contains the external resources used by the engine. The archive file must be in TGZ or ZIP format.

    Enter the location using one of the following formats:
    • File path. For example: /mnt/shared/externalResources.tgz
      Important: To use a file path when using a Docker image for the engine installation type, you must modify the installation script command to mount the file to the engine container.
    • URL. For example: https://<hostname>:<port>/shared/externalResources.tgz
    Tip: Click the download icon to download a sample externalResources.tgz file to view the required directory structure.

    Available when using an archive file as the source for external resources.

    Engine Labels Labels to assign to all engine instances launched for this deployment. Labels determine the group of engine instances that run a job.

    Default is the name of the deployment.

    Max CPU Load (%)

    Maximum percentage of CPU on the host machine that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine.

    All engine instances belonging to the deployment inherit these resource threshold values.

    Default is 80.

    Max Memory (%)

    Maximum percentage of the configured Java heap size that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine.

    Default is 100.

    Max Running Pipeline Count

    Maximum number of pipelines that can be running on each engine instance. When an engine equals this threshold, Control Hub does not start new pipeline instances on the engine.

    Default is 1,000,000.

  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
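When External Resource Location is a file path and the installation type is a Docker image, the archive must be mounted into the engine container. One way to do this is the Docker -v bind-mount option. The following is a minimal sketch that only prints the option to add to the copied docker run command. The helper name archive_mount_option is hypothetical, and mounting the archive at the same path inside the container is an assumption rather than documented behavior.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a `docker run` volume option that bind-mounts
# the archive read-only at the same path inside the container.
# Assumption: the engine reads the archive from the path entered in
# External Resource Location, so host and container paths are kept identical.
archive_mount_option() {
  local archive="$1"
  printf -- '-v %s:%s:ro\n' "$archive" "$archive"
}

# Example with the file path used in the property description.
archive_mount_option /mnt/shared/externalResources.tgz
```

Add the printed option to the copied docker run command before the image name.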

Configure the Install Type

Select the type of engine installation to deploy to a local on-premises or cloud computing machine.

  1. Select the type of engine installation:
    • Tarball
    • Docker image

    The selected type determines the installation prerequisites you must complete when you launch an engine instance for the deployment.

  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Share the Deployment

By default, the deployment can only be seen by you. Share the deployment with other users and groups to grant them access to it.

  1. In the Select Users and Groups field, type a user email address or a group name.
  2. Select users or groups from the list, and then click Add.

    The added users and groups display in the User / Group table.

  3. Modify permissions as needed. By default, each added user or group is granted the following permissions:
    • Read - View the details of the deployment and of all engines managed by the deployment. Restart or shut down individual engines managed by the deployment in the Engines view.
    • Write - Edit, start, stop, and delete the deployment. Delete engines managed by the deployment. Also requires read access on the parent environment.
    • Execute - Start jobs on engines managed by the deployment. Starting jobs also requires execute access on the job and read access on the pipeline.

    For more information, see Deployment Permissions.

  4. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Review and Launch the Deployment

You've successfully finished creating the deployment.

  1. Click Start & Generate Install Script to start the deployment and generate the command to run the engine installation script.
    Note: If you click Exit, Control Hub saves the deployment and exits the wizard, displaying the Deactivated deployment in the Deployments view. You can start the deployment and retrieve the installation script at a later time.
  2. Select whether the installation script runs an engine instance as a foreground or background process.
  3. Click the Copy to Clipboard icon to copy the generated command.
  4. Use the copied command to launch an engine for the deployment.

    Optionally, click Check Engine Status after Running the Script to display the Engine Status window, where you can view the engine status after you run the generated command.

Foreground or Background Process

When creating a self-managed deployment or when retrieving the engine installation script, you can configure the installation script to run an engine instance in the following ways:
Foreground
When the installation script runs an engine instance as a foreground process, you cannot run additional commands from that command prompt while the engine runs. The command prompt must remain open for the engine to continue to run. If you close the command prompt, the engine shuts down.
Background
When the installation script runs an engine instance as a background process, you regain access to the command prompt after the engine starts. You can run additional commands from that command prompt as the engine runs. If you close the command prompt, the engine continues to run.

By default, a tarball installation script runs an engine instance as a foreground process. A Docker installation script runs an engine instance as a background process.
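The difference between the two modes can be illustrated with any long-running command. The following sketch uses sleep as a stand-in for an engine process. It does not show how the installation script itself starts the engine, and it does not cover what happens when the command prompt is closed.

```shell
#!/usr/bin/env bash
# Stand-in for an engine process; `sleep` is used purely for illustration.

# Foreground: the shell blocks until the process exits.
sleep 1
echo "foreground process finished; the prompt is available again"

# Background: `&` returns control immediately and `$!` holds the process ID.
sleep 5 &
bg_pid=$!

# `kill -0` sends no signal; it only tests whether the process exists.
if kill -0 "$bg_pid" 2>/dev/null; then
  bg_state="running"
else
  bg_state="stopped"
fi
echo "background process $bg_pid is $bg_state while the prompt stays usable"

# Clean up the stand-in process.
kill "$bg_pid" 2>/dev/null || true
wait "$bg_pid" 2>/dev/null || true
```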

Launching an Engine for a Deployment

After creating a self-managed deployment, you set up a machine that meets the engine requirements. The machine can be a local on-premises machine or a cloud computing machine. Then, you manually run the engine installation script to install and launch an engine instance on the machine.

When the machine resides behind a firewall, you also must allow the required inbound and outbound traffic to each machine, as described in Firewall Configuration Overview.

You can increase the number of engine instances for a self-managed deployment by setting up another machine that meets the engine requirements and then running the engine installation script on that machine. To launch multiple engine instances on the same machine, launch the engine instances one at a time.
Important: If your pipelines require external resources, you must set up an external resource archive that all engine instances can access before increasing the number of instances.

The steps to install and launch an engine instance depend on the engine and installation type. For a Transformer engine, the steps also depend on whether Transformer works with Apache Spark that runs locally on a single machine or that runs on a cluster. The following sections describe each case.

Launching a Data Collector Docker Image

Complete the following steps on the machine where you want to launch the Data Collector Docker image.

  1. Verify that the machine meets the minimum requirements for a Data Collector engine.
  2. Install Docker and start the Docker daemon.
  3. Open a command prompt on the machine.
  4. To verify that Docker is running, run the following command:
    docker info
  5. Paste and then run the installation script command that you copied from the self-managed deployment.
  6. If you chose to check the engine status, view the status in the Control Hub UI.
  7. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
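The Docker check in step 4 can be wrapped in a small guard so that you run the installation command only when the Docker daemon responds. The following is a minimal sketch and is not part of the StreamSets tooling. The function name docker_is_running and the DOCKER_BIN variable are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when `docker info` can reach the daemon.
# DOCKER_BIN is an assumed override for the Docker CLI; it defaults to "docker".
docker_is_running() {
  "${DOCKER_BIN:-docker}" info >/dev/null 2>&1
}

if docker_is_running; then
  echo "Docker is running. Paste and run the copied installation command."
else
  echo "Docker is not reachable. Start the Docker daemon and try again." >&2
fi
```

The same guard applies to the Transformer Docker image procedures, which use the same docker info check.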

Launching a Data Collector Tarball

Complete the following steps on the machine where you want to install and launch the Data Collector tarball.

  1. Verify that the machine meets the minimum requirements for a Data Collector engine.
  2. Download and install one of the supported Java versions.

    When choosing the Java version, review the Data Collector functionality that is available with each Java version.

  3. Open a command prompt and set your file descriptors limit to at least 32768.
  4. Paste and then run the installation script command that you copied from the self-managed deployment. Respond to the command prompts to enter download and installation directories.
    Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
  5. View the engine status in the command prompt or, if you chose to check the engine status, in the Control Hub UI.
  6. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
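Step 3 does not prescribe how to set the file descriptors limit. In Bash, the ulimit -n builtin reports the limit for the current session, and ulimit -Sn sets the soft limit. The following is a minimal sketch, assuming a Bash shell. The function name fd_limit_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Minimum file descriptors limit named in the steps above.
REQUIRED_FD_LIMIT=32768

# Hypothetical helper: succeeds when the given limit meets the requirement.
# A value of "unlimited" is treated as sufficient.
fd_limit_ok() {
  local current="$1"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$REQUIRED_FD_LIMIT" ]
}

current_limit="$(ulimit -n)"
if fd_limit_ok "$current_limit"; then
  echo "File descriptors limit is $current_limit: OK"
else
  echo "File descriptors limit is $current_limit: raising to $REQUIRED_FD_LIMIT"
  # Raises the soft limit for this shell session only.
  # This fails when the hard limit is lower than the requested value.
  ulimit -Sn "$REQUIRED_FD_LIMIT" \
    || echo "Could not raise the limit. Check the hard limit with: ulimit -Hn" >&2
fi
```

Because the change applies only to the current session, run the installation script from the same command prompt.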

Launching Transformer when Spark Runs Locally

To get started with Transformer, you can use a local Spark installation that runs on the same machine as Transformer.

This allows you to easily develop and test local pipelines, which run on the local Spark installation.

Launching a Transformer Docker Image

To use a Transformer Docker image when Spark runs locally, complete the following steps on the machine where you want to launch Transformer. The Docker image includes a local Spark installation that matches the Scala version selected for the engine version.

  1. Verify that the machine meets all the requirements for a Transformer engine.
  2. Install Docker and start the Docker daemon.
  3. Open a command prompt on the machine.
  4. To verify that Docker is running, run the following command:
    docker info
  5. Paste and then run the installation script command that you copied from the self-managed deployment.
  6. If you chose to check the engine status, view the status in the Control Hub UI.
  7. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.

Launching a Transformer Tarball

To use a Transformer tarball when Spark runs locally, complete the following steps on the machine where you want to install and launch Transformer.

  1. Verify that the machine meets all the requirements for a Transformer engine.
  2. Download and install the Java JDK version for the Scala version selected for the engine version.
  3. Open a command prompt and set your file descriptors limit to at least 32768.
  4. Download Apache Spark from the Apache Spark Download page.
    Download a supported Spark version that is valid for the Transformer features that you want to use.
    Make sure that the Spark version is prebuilt with the same Scala version as Transformer. For more information, see Scala Match Requirement.
  5. Open a command prompt and then extract the Apache Spark tarball by running the following command:
    tar xvzf <spark tarball name>

    For example:

    tar xvzf spark-2.4.7-bin-hadoop2.7.tgz
  6. Run the following command to set the SPARK_HOME environment variable to the directory where you extracted the Apache Spark tarball:
    export SPARK_HOME=<spark path>

    For example:

    export SPARK_HOME=/opt/spark-2.4.7-bin-hadoop2.7/
  7. Paste and then run the installation script command that you copied from the self-managed deployment. Respond to the command prompts to enter download and installation directories.
    Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
  8. View the engine status in the command prompt or, if you chose to check the engine status, in the Control Hub UI.
  9. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
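Before running the installation script, you might confirm that SPARK_HOME points to the extracted Spark directory. The following is a minimal sketch. It relies on the fact that Apache Spark distributions include a bin/spark-submit script. The function name spark_home_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the directory looks like an extracted
# Spark distribution, that is, it contains an executable bin/spark-submit.
spark_home_ok() {
  local dir="$1"
  [ -n "$dir" ] && [ -x "$dir/bin/spark-submit" ]
}

if spark_home_ok "${SPARK_HOME:-}"; then
  echo "SPARK_HOME looks valid: $SPARK_HOME"
else
  echo "SPARK_HOME is unset or does not contain bin/spark-submit" >&2
fi
```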

Launching Transformer when Spark Runs on a Cluster

In a production environment, use a Spark installation that runs on a cluster to leverage the performance and scale that Spark offers.

Install Transformer on a machine that is configured to submit Spark jobs to the cluster. When you run Transformer pipelines, Spark distributes the processing across nodes in the cluster.

For information about each cluster type, see Cluster Types in the Transformer engine documentation.

Launching a Transformer Docker Image

To use a Transformer Docker image when Spark runs on a cluster, complete the following steps on the machine where you want to launch Transformer.

  1. Verify that the machine meets all the requirements for a Transformer engine.
  2. Verify that the machine is configured to submit Spark jobs to the cluster.
  3. Grant the Spark cluster access to Transformer.

    The Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.

  4. Install Docker and start the Docker daemon.
  5. Open a command prompt on the machine.
  6. To verify that Docker is running, run the following command:
    docker info
  7. Paste and then run the installation script command that you copied from the self-managed deployment.
  8. If you chose to check the engine status, view the status in the Control Hub UI.
  9. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.

Launching a Transformer Tarball

To use a Transformer tarball when Spark runs on a cluster, complete the following steps on the machine where you want to launch Transformer.

  1. Verify that the machine meets all the requirements for a Transformer engine.
  2. Verify that the machine is configured to submit Spark jobs to the cluster.
  3. Grant the Spark cluster access to Transformer.

    The Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.

  4. Download and install the Java JDK version for the Scala version selected for the engine version.
  5. Open a command prompt and set your file descriptors limit to at least 32768.
  6. Run the following command to set the JAVA_HOME environment variable to the Java installation on the machine:
    export JAVA_HOME=<java path>

    For example:

    export JAVA_HOME=/opt/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
  7. If using a Hadoop YARN or Spark standalone cluster, set the following additional environment variables:
    Environment Variable Description
    SPARK_HOME Path to the Spark installation on the machine.

    Clusters can include multiple Spark installations. Be sure to point to a supported Spark version that is valid for the Transformer features that you want to use.

    On Cloudera clusters, Spark is generally installed into the parcels directory. For example, for CDH 5.11, you might use: /opt/cloudera/parcels/SPARK2/lib/spark2.

    Tip: To verify the version of a Spark installation, you can run the spark-shell command. Then, use sc.getConf.get("spark.home") to return the installation location.
    HADOOP_CONF_DIR or YARN_CONF_DIR Directory that contains the client side configuration files for the Hadoop cluster.

    For more information about these environment variables, see the Apache Spark documentation.

  8. Paste and then run the installation script command that you copied from the self-managed deployment. Respond to the command prompts to enter download and installation directories.
    Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
  9. View the engine status in the command prompt or, if you chose to check the engine status, in the Control Hub UI.
  10. To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
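The environment variables from steps 6 and 7 can be checked together before you run the installation script. The following is a minimal sketch that assumes a Hadoop YARN or Spark standalone cluster, where SPARK_HOME and one of HADOOP_CONF_DIR or YARN_CONF_DIR are needed in addition to JAVA_HOME. The function name check_cluster_env is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check: prints each required variable that is missing
# and returns non-zero when at least one is missing.
check_cluster_env() {
  local missing=0
  [ -n "${JAVA_HOME:-}" ] || { echo "JAVA_HOME is not set"; missing=1; }
  [ -n "${SPARK_HOME:-}" ] || { echo "SPARK_HOME is not set"; missing=1; }
  # Either one of the two configuration directory variables is sufficient.
  if [ -z "${HADOOP_CONF_DIR:-}" ] && [ -z "${YARN_CONF_DIR:-}" ]; then
    echo "Neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set"
    missing=1
  fi
  return "$missing"
}

if check_cluster_env; then
  echo "Environment variables are set. Run the copied installation command."
fi
```

The check confirms only that the variables are set, not that they point to valid locations.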

Retrieving the Installation Script

You can retrieve the installation script generated for a self-managed deployment.

Important: Each deployment generates a unique command that configures the engine as defined for that specific deployment. Be sure that you retrieve the installation script for the correct deployment.
  1. In the Control Hub Navigation panel, click Set Up > Deployments.
  2. Locate the self-managed deployment that you want to launch an engine instance for.
  3. In the Actions column, click the More icon and then click Get Install Script.
  4. Select whether the installation script runs an engine instance as a foreground or background process.
  5. Click the Copy to Clipboard icon to copy the generated command, and then click Close.

Running the Installation Script without Prompts

When you run the engine installation script for a tarball installation, you must respond to command prompts to enter download and installation directories. To skip the prompts, you can optionally define the directories as command arguments.

You might skip the command prompts if you set up an automation tool such as Ansible to install and launch engines. Or you might skip the prompts if you prefer to define the directories at the same time that you run the command.

To skip the prompts, include the following arguments in the installation script command:

Argument         Value
--no-prompt      None. Indicates that the script should run without prompts.
--download-dir   Full path to an existing download directory.
--install-dir    Full path to an existing installation directory.

Include the arguments before the last quotation mark in the copied installation script. For example:
http_proxy= https_proxy= bash -c 'set -eo pipefail; curl -fsS https://na01.hub.streamsets.com/streamsets-engine-install.sh | bash -s -- --deployment-id="<deployment_ID>" --deployment-token="<deployment_token>" --sch-url="https://na01.hub.streamsets.com" --engine-shutdown-timeout="10" --foreground --no-prompt --download-dir=/tmp/streamsets --install-dir=/opt/streamsets-datacollector'
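Because both directories must already exist, an automated run might verify them before building the arguments. The following is a minimal sketch. The helper name no_prompt_args is hypothetical, and the directory paths are the example values from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verifies that both directories exist, and then prints
# the arguments to insert before the last quotation mark of the copied command.
no_prompt_args() {
  local download_dir="$1" install_dir="$2" dir
  for dir in "$download_dir" "$install_dir"; do
    if [ ! -d "$dir" ]; then
      echo "Directory does not exist: $dir" >&2
      return 1
    fi
  done
  echo "--no-prompt --download-dir=$download_dir --install-dir=$install_dir"
}

# Example usage with the directories from the command above.
no_prompt_args /tmp/streamsets /opt/streamsets-datacollector \
  || echo "Create the directories first, for example with mkdir -p." >&2
```

The sketch does not quote the paths in its output, so it is suited to paths without spaces.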

Increasing the Engine Timeout for the Installation Script

By default, the installation script waits a maximum of five minutes for the engine to start. In most situations, the default timeout is sufficient. However, when the engine takes longer to start, the installation script fails with the following errors:
Step 2 of 4: Waiting up to 5 minutes for engine to respond on http://<host name>:<port>
Step 2 of 4 failed: Timed out while waiting for engine to respond on http://<host name>:<port>

When you encounter these errors, run the installation script again using the STREAMSETS_ENGINE_TIMEOUT_MINS environment variable to increase the engine timeout value.

For example, to set an eight-minute timeout for a tarball installation, add the environment variable to the installation script as follows:

STREAMSETS_ENGINE_TIMEOUT_MINS=8 http_proxy= https_proxy= bash -c 'set -eo pipefail; curl -fsS https://na01.hub.streamsets.com/streamsets-engine-install.sh | bash -s -- --deployment-id="<deployment_ID>" --deployment-token="<deployment_token>" --sch-url="https://na01.hub.streamsets.com" --engine-shutdown-timeout="10" --foreground '

To set an eight-minute timeout for a Docker installation, add the environment variable to the installation script as follows:
docker run -d -e STREAMSETS_ENGINE_TIMEOUT_MINS=8 -e http_proxy= -e https_proxy= -e STREAMSETS_DEPLOYMENT_SCH_URL=https://na01.hub.streamsets.com -e STREAMSETS_DEPLOYMENT_ID=<deployment_ID> -e STREAMSETS_DEPLOYMENT_TOKEN=<deployment_token> -e ENGINE_SHUTDOWN_TIMEOUT=10 streamsets/datacollector:5.5.0
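
In the tarball form, the placement of the variable matters: a shell passes an assignment written in front of a command to that command's environment. The following sketch demonstrates this behavior with a stand-in child command rather than the real installation script. The child variable and the printed text are illustrative only.

```shell
#!/usr/bin/env bash
# Stand-in for the copied installation command: a child bash process that
# reports the timeout value it received from its environment.
child='echo "timeout=${STREAMSETS_ENGINE_TIMEOUT_MINS:-default}"'

# Prefix form, as in the tarball example: the variable reaches the child.
with_prefix="$(STREAMSETS_ENGINE_TIMEOUT_MINS=8 bash -c "$child")"

# Without the variable in its environment, the child falls back to a default.
without_prefix="$(env -u STREAMSETS_ENGINE_TIMEOUT_MINS bash -c "$child")"

echo "$with_prefix"
echo "$without_prefix"
```

For a Docker installation, the -e option plays the same role by setting the variable inside the container.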