Self-Managed Deployments
You can create a self-managed deployment for an active self-managed environment.
When using a self-managed deployment, you take full control of procuring the resources needed to run engine instances. The resources can be local on-premises machines or cloud computing machines. You must set up the machines and complete the installation prerequisites required by the IBM StreamSets engine type. When the machines reside behind a firewall, you must allow the required inbound and outbound traffic to each machine, as described in Firewall Configuration Overview.
When you create a self-managed deployment, you define the engine type, version, and configuration to deploy.
After you create and start a self-managed deployment, Control Hub displays the engine installation script that you run to install and launch engine instances on the on-premises or cloud computing machines that you have set up. You can configure the installation script to run engine instances as a foreground or background process. You also select the installation type to use - either a Docker image or a tarball file.
Using a self-managed deployment is the simplest way to get started with IBM StreamSets. After getting started, you might consider using one of the cloud service provider integrations, such as the AWS and GCP environments and deployments. With these integrations, Control Hub automatically provisions the resources needed to run the engine type in your cloud service provider account, and then deploys engine instances to those resources.
Quick Start Deployment
When you deploy an engine as you build your first Data Collector pipeline, Control Hub presents a simplified process to help you quickly deploy your first Data Collector engine.
Control Hub creates a self-managed deployment for you and names the deployment Data Collector 1 (Quick Start). Control Hub also assigns a quick-start tag and a datacollector1quickstart engine label to the deployment.
You can rename the quick start deployment, remove the default tag or engine label, or edit the deployment just as you edit any other self-managed deployment.
Configuring a Self-Managed Deployment
Configure a self-managed deployment to define the group of engine instances to deploy to a self-managed environment.
To create a new deployment, click Deployments in the Navigation panel, and then click the Create Deployment icon.
To edit an existing deployment, click Deployments in the Navigation panel, click the deployment name, and then click Edit.
Define the Deployment
Define the deployment essentials, including the deployment name and type, the environment that the deployment belongs to, and the engine type and version to deploy.
Once saved, you cannot change the deployment type, the engine version, or the environment.
-
Configure the following properties:
- Deployment Name - Name of the deployment. Use a brief name that informs your team of the deployment use case.
- Deployment Type - Select Self-Managed.
- Environment - Active self-managed environment where engine instances will be deployed. In most cases, you can select the default self-managed environment.
- Engine Type - Type of engine to deploy:
  - Data Collector
  - Transformer
  - Transformer for Snowflake - Applicable when your organization uses a deployed Transformer for Snowflake engine.
- Engine Version - Engine version to deploy.
- Deployment Tags - Optional tags that identify similar deployments within Control Hub. Use deployment tags to easily search and filter deployments. Enter nested tags using the following format: <tag1>/<tag2>/<tag3>
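For example, a hypothetical nested tag for development deployments in a particular region might look like this:
dev/us-west/data-collector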
-
If creating the deployment, click one of the following buttons:
- Cancel - Cancels creating the deployment and exits the wizard.
- Save & Next - Saves the deployment and continues.
- Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
Configure the Engine
Define the configuration of the engine to deploy. You can use the defaults to get started.
-
Configure the following properties:
- Stage Libraries - Stage libraries to install on the engine. The available stage libraries depend on the selected engine type and version. Not applicable for a Transformer for Snowflake deployment.
- Advanced Configuration - Access to advanced configuration properties to further customize the engine. As you get started with StreamSets, the default values should work in most cases. The available properties depend on the selected engine type.
  Important: When using a Transformer engine that works with a Spark cluster, edit the transformer.base.http.url property on the Transformer Configuration tab. Uncomment the property and set it to the Transformer URL (a sketch follows this property list). For more information, see Granting the Spark Cluster Access to Transformer in the Transformer engine documentation.
- External Resource Source - Source of the external files and libraries, such as JDBC drivers, required by the engine:
  - None - External resources are not defined in the deployment. Select when using a single engine instance to get started with IBM StreamSets, or when your pipelines do not require external resources.
  - Archive File - External resources are included in an archive file defined in the deployment. Select when the deployment launches multiple engine instances and when your pipelines require external resources.
  Not applicable for a Transformer for Snowflake deployment.
- External Resource Location - Location of the archive file that contains the external resources used by the engine. The archive file must be in TGZ or ZIP format. Enter the location using one of the following formats:
  - File path. For example: /mnt/shared/externalResources.tgz
    Important: To use a file path when using a Docker image for the engine installation type, you must modify the installation script command to mount the file to the engine container.
  - URL. For example: https://<hostname>:<port>/shared/externalResources.tgz
  Tip: Click the download icon to download a sample externalResources.tgz file to view the required directory structure.
  Available when using an archive file as the source for external resources. Not applicable for Transformer for Snowflake.
- Engine Labels - Labels to assign to all engine instances launched for this deployment. Labels determine the group of engine instances that run a job. Default is the name of the deployment.
- Max CPU Load (%) - Maximum percentage of CPU on the host machine that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine. All engine instances belonging to the deployment inherit these resource threshold values. Default is 80.
- Max Memory (%) - Maximum percentage of the configured Java heap size that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine. Default is 100.
- Max Running Pipeline Count - Maximum number of pipelines that can be running on each engine instance. When an engine equals this threshold, Control Hub does not start new pipeline instances on the engine. Default is 1,000,000.
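The following is a minimal sketch of the transformer.base.http.url edit described above, assuming a hypothetical Transformer host and port; use the URL that your Spark cluster can actually reach:
transformer.base.http.url=http://transformer.example.com:19630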
-
If creating the deployment, click one of the following buttons:
- Back - Returns to the previous step in the wizard.
- Save & Next - Saves the deployment and continues.
- Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
Share the Deployment
By default, the deployment can only be seen by you. Share the deployment with other users and groups to grant them access to it.
- In the Select Users and Groups field, type a user email address or a group name.
-
Select users or groups from the list, and then click Add.
The added users and groups display in the User / Group table.
-
Modify permissions as needed. By default, each added user or group is granted the following permissions:
- Read - View the details of the deployment and of all engines managed by the deployment. Restart or shut down individual engines managed by the deployment in the Engines view.
- Write - Edit, start, stop, and delete the deployment. Delete engines managed by the deployment. Also requires read access on the parent environment.
- Execute - Start jobs on engines managed by the deployment. Starting jobs also requires execute access on the job and read access on the pipeline.
For more information, see Deployment Permissions.
-
Click one of the following buttons:
- Back - Returns to the previous step in the wizard.
- Save & Next - Saves the deployment and continues.
- Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
Review and Launch the Engine
You've successfully finished creating the deployment.
-
Click Start & Generate Install Script to start the
deployment and generate the command to launch the engine.
Note: If you click Exit, Control Hub saves the deployment and exits the wizard, displaying the Deactivated deployment in the Deployments view. You can start the deployment and retrieve the installation script at a later time.
- Select whether the installation script runs the engine instance as a foreground or background process.
-
Choose the installation type to use:
- Run Docker Image - Download and install Docker Desktop, and then deploy the engine in a Docker container.
- Download and Install from Script - Run an installation script that downloads and extracts a tarball file on your machine, and then deploys the engine so that it runs locally on the machine. Allows you to take full control of setting up the machine.
The selected type determines the installation prerequisites you complete.
- Click the Copy to Clipboard icon to copy the generated command.
-
Use the copied command to launch an engine for the deployment.
Optionally, click Check Engine Status after Running the Script to display the Engine Status window, where you can view the engine status after you run the generated command.
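For reference, a generated command for a tarball installation typically takes the following shape; this is a sketch, and Control Hub fills in the deployment ID, deployment token, and the Control Hub URL for your region:
bash -c "$(curl -fsSL https://na01.hub.streamsets.com/streamsets-engine-install.sh)" --deployment-id=<deployment_ID> --deployment-token=<deployment_token> --sch-url=https://na01.hub.streamsets.com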
Foreground or Background Process
- Foreground
- When the installation script runs an engine instance as a foreground process, you cannot run additional commands from that command prompt while the engine runs. The command prompt must remain open for the engine to continue to run. If you close the command prompt, the engine shuts down.
- Background
- When the installation script runs an engine instance as a background process, you regain access to the command prompt after the engine starts. You can run additional commands from that command prompt as the engine runs. If you close the command prompt, the engine continues to run.
By default, a tarball installation script runs an engine instance as a foreground process. A Docker installation script runs an engine instance as a background process.
Launching an Engine for a Deployment
After creating a self-managed deployment, you set up a machine that meets the engine requirements. The machine can be a local on-premises machine or a cloud computing machine. Then, you manually run the engine installation script to install and launch an engine instance on the machine.
When the machine resides behind a firewall, you also must allow the required inbound and outbound traffic to each machine, as described in Firewall Configuration Overview.
Launching a Data Collector Docker Image
Complete the following steps on the machine where you want to launch the Data Collector Docker image.
- Verify that the machine meets the minimum requirements for a Data Collector engine.
- Install Docker and start the Docker daemon.
- Open a command prompt on the machine.
-
To verify that Docker is running, run the following command:
docker info
-
Paste and then run the installation script command that you copied from the
self-managed deployment.
Note: If needed, you can retrieve the generated installation script.
- If you chose to check the engine status, view the status in the Control Hub UI.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
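The generated Docker command is specific to your deployment. As a rough sketch, assuming a Data Collector 5.5.0 image as in the timeout example later in this topic, it resembles the following; the deployment ID and token come from Control Hub:
docker run -d -e STREAMSETS_DEPLOYMENT_SCH_URL=https://na01.hub.streamsets.com -e STREAMSETS_DEPLOYMENT_ID=<deployment_ID> -e STREAMSETS_DEPLOYMENT_TOKEN=<deployment_token> streamsets/datacollector:5.5.0
After running the script, you can confirm that the engine container is up by running docker ps.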
Launching a Data Collector Tarball
Complete the following steps on the machine where you want to install and launch the Data Collector tarball.
- Verify that the machine meets the minimum requirements for a Data Collector engine.
-
Download and install one of the supported Java versions.
When choosing the Java version, review the Data Collector functionality that is available with each Java version.
- Open a command prompt and set your file descriptors limit to at least 32768 (see the example after these steps).
-
Paste and then run the installation script command that you copied from the
self-managed deployment. Respond to the command prompts to enter download and
installation directories.
Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
- View the engine status in the command prompt or in the Control Hub UI if you chose to check the engine status.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
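The following is a minimal sketch of setting the file descriptors limit from the steps above, assuming a Linux or macOS shell; adjust the command for your shell and operating system:
ulimit -n 32768   # raise the open file limit for the current shell session
ulimit -n         # verify the new limit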
Launching Transformer when Spark Runs Locally
To get started with Transformer, you can use a local Spark installation that runs on the same machine as Transformer.
This allows you to easily develop and test local pipelines, which run on the local Spark installation.
Launching a Transformer Docker Image
To use a Transformer Docker image when Spark runs locally, complete the following steps on the machine where you want to launch Transformer. The Docker image includes a local Spark installation that matches the Scala version selected for the engine version.
- Verify that the machine meets all the requirements for a Transformer engine.
- Install Docker and start the Docker daemon.
- Open a command prompt on the machine.
-
To verify that Docker is running, run the following command:
docker info
-
Paste and then run the installation script command that you copied from the
self-managed deployment.
Note: If needed, you can retrieve the generated installation script.
- If you chose to check the engine status, view the status in the Control Hub UI.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
Launching a Transformer Tarball
To use a Transformer tarball when Spark runs locally, complete the following steps on the machine where you want to install and launch Transformer.
- Verify that the machine meets all the requirements for a Transformer engine.
- Download and install the Java JDK version for the Scala version selected for the engine version.
- Open a command prompt and set your file descriptors limit to at least 32768.
-
Download Apache Spark from the Apache Spark Download page.
Download a supported Spark version that is valid for the Transformer features that you want to use. Make sure that the Spark version is prebuilt with the same Scala version as Transformer. For more information, see Scala Match Requirement.
-
Open a command prompt and then extract the Apache Spark tarball by running the
following command:
tar xvzf <spark tarball name>
For example:
tar xvzf spark-3.5.3-bin-hadoop3.tgz
-
Run the following command to set the SPARK_HOME environment variable to the directory where you extracted the Apache Spark tarball:
export SPARK_HOME=<spark path>
For example:
export SPARK_HOME=/opt/spark-3.5.3-bin-hadoop3/
-
Paste and then run the installation script command that you copied from the
self-managed deployment. Respond to the command prompts to enter download and
installation directories.
Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
- View the engine status in the command prompt or in the Control Hub UI if you chose to check the engine status.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
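Optionally, before building pipelines, you can confirm that the extracted Spark installation reports the expected Spark and Scala versions, which relates to the Scala match requirement mentioned above. A minimal check, assuming SPARK_HOME is set as in the previous steps:
$SPARK_HOME/bin/spark-submit --version   # prints the Spark version and the Scala version it was built with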
Launching Transformer when Spark Runs on a Cluster
In a production environment, use a Spark installation that runs on a cluster to leverage the performance and scale that Spark offers.
Install Transformer on a machine that is configured to submit Spark jobs to the cluster. When you run Transformer pipelines, Spark distributes the processing across nodes in the cluster.
For information about each cluster type, see Cluster Types in the Transformer engine documentation.
Launching a Transformer Docker Image
To use a Transformer Docker image when Spark runs on a cluster, complete the following steps on the machine where you want to launch Transformer.
- Verify that the machine meets all the requirements for a Transformer engine.
- Verify that the machine is configured to submit Spark jobs to the cluster.
-
Grant the Spark cluster access to Transformer.
The Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.
- Install Docker and start the Docker daemon.
- Open a command prompt on the machine.
-
To verify that Docker is running, run the following command:
docker info
-
Paste and then run the installation script command that you copied from the
self-managed deployment.
Note: If needed, you can retrieve the generated installation script.
- If you chose to check the engine status, view the status in the Control Hub UI.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
Launching a Transformer Tarball
To use a Transformer tarball when Spark runs on a cluster, complete the following steps on the machine where you want to launch Transformer.
- Verify that the machine meets all the requirements for a Transformer engine.
- Verify that the machine is configured to submit Spark jobs to the cluster.
-
Grant the Spark cluster access to Transformer.
The Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.
- Download and install the Java JDK version for the Scala version selected for the engine version.
- Open a command prompt and set your file descriptors limit to at least 32768.
-
Run the following command to set the JAVA_HOME environment variable to the Java installation on the machine:
export JAVA_HOME=<java path>
For example:
export JAVA_HOME=/opt/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
-
If using a Hadoop YARN or Spark standalone cluster, set the following additional environment variables (a sketch follows these steps):
- SPARK_HOME - Path to the Spark installation on the machine. Clusters can include multiple Spark installations. Be sure to point to a supported Spark version that is valid for the Transformer features that you want to use. On Cloudera clusters, Spark is generally installed into the parcels directory. For example, for CDP 7.1.9, you might use: /opt/cloudera/parcels/SPARK3/lib/spark3.
  Tip: To verify the version of a Spark installation, you can run the spark-shell command. Then, use sc.getConf.get("spark.home") to return the installation location.
- HADOOP_CONF_DIR or YARN_CONF_DIR - Directory that contains the client-side configuration files for the Hadoop cluster. For more information about these environment variables, see the Apache Spark documentation.
-
Paste and then run the installation script command that you copied from the
self-managed deployment. Respond to the command prompts to enter download and
installation directories.
Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
- View the engine status in the command prompt or in the Control Hub UI if you chose to check the engine status.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
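As a minimal sketch of the environment variable step above for a Hadoop YARN cluster managed by Cloudera, the Spark path comes from the description above, while the Hadoop configuration directory is an illustrative, commonly used location that may differ on your cluster:
export SPARK_HOME=/opt/cloudera/parcels/SPARK3/lib/spark3   # point to a supported Spark installation
export HADOOP_CONF_DIR=/etc/hadoop/conf                     # client-side Hadoop configuration files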
Launching a Transformer for Snowflake Docker Image
Complete the following steps on the machine where you want to launch the Transformer for Snowflake Docker image.
Applicable when your organization uses a deployed Transformer for Snowflake engine.
- Verify that the machine meets the minimum requirements for a Transformer for Snowflake engine.
- Install Docker and start the Docker daemon.
- Open a command prompt on the machine.
-
To verify that Docker is running, run the following command:
docker info
-
Paste and then run the installation script command that you copied from the
self-managed deployment.
Note: If needed, you can retrieve the generated installation script.
- If you chose to check the engine status, view the status in the Control Hub UI.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
Launching a Transformer for Snowflake Tarball
Complete the following steps on the machine where you want to install and launch the Transformer for Snowflake tarball.
Applicable when your organization uses a deployed Transformer for Snowflake engine.
- Verify that the machine meets the minimum requirements for a Transformer for Snowflake engine.
- Download and install the supported Java version.
- Open a command prompt and set your file descriptors limit to at least 32768.
-
Paste and then run the installation script command that you copied from the
self-managed deployment. Respond to the command prompts to enter download and
installation directories.
Note: If needed, you can retrieve the generated installation script. You can optionally skip the command prompts by defining the directories as command arguments.
- View the engine status in the command prompt or in the Control Hub UI if you chose to check the engine status.
- To deploy an additional engine instance for this deployment, simply repeat these steps on another machine.
Retrieving the Installation Script
You can retrieve the installation script generated for a self-managed deployment.
- In the Control Hub Navigation panel, click Deployments.
- Locate the self-managed deployment that you want to launch an engine instance for.
- In the Actions column, click the More icon and then click Get Install Script.
- Select whether the installation script runs the engine instance as a foreground or background process.
-
Choose the installation type to use:
- Run Docker Image - Download and install Docker Desktop, and then deploy the engine in a Docker container.
- Download and Install from Script - Run an installation script that downloads and extracts a tarball file on your machine, and then deploys the engine so that it runs locally on the machine. Allows you to take full control of setting up the machine.
The selected type determines the installation prerequisites you complete.
- Click the Copy to Clipboard icon to copy the generated command, and then click Close.
Running the Installation Script without Prompts
When you run the engine installation script for a tarball installation, you must respond to command prompts to enter download and installation directories. To skip the prompts, you can optionally define the directories as command arguments.
You might skip the command prompts if you set up an automation tool such as Ansible to install and launch engines. Or you might skip the prompts if you prefer to define the directories at the same time that you run the command.
To skip the prompts, include the following arguments in the installation script command:
Argument | Value
---|---
--no-prompt | None. Indicates that the script should run without prompts.
--download-dir | Full path to an existing download directory.
--install-dir | Full path to an existing installation directory.
For example, the following tarball installation command runs without prompts:
bash -c "$(curl -fsSL https://na01.hub.streamsets.com/streamsets-engine-install.sh)" --deployment-id=<deployment_ID> --deployment-token=<deployment_token> --sch-url=https://na01.hub.streamsets.com --no-prompt --download-dir=/tmp/streamsets --install-dir=/opt/streamsets-datacollector
Increasing the Engine Timeout for the Installation Script
By default, the installation script waits up to five minutes for the engine to respond after launching it. If the engine does not respond in time, the script fails with messages similar to the following:
Step 2 of 4: Waiting up to 5 minutes for engine to respond on http://<host name>:<port>
Step 2 of 4 failed: Timed out while waiting for engine to respond on http://<host name>:<port>
When you encounter this error, run the installation script again, using the STREAMSETS_ENGINE_TIMEOUT_MINS environment variable to increase the engine timeout value.
For example, to set an eight minute timeout for a tarball installation, add the environment variable to the installation script as follows:
STREAMSETS_ENGINE_TIMEOUT_MINS=8 bash -c "$(curl -fsSL https://na01.hub.streamsets.com/streamsets-engine-install.sh)" --deployment-id=<deployment_ID> --deployment-token=<deployment_token> --sch-url=https://na01.hub.streamsets.com
For a Docker installation, pass the environment variable to the docker run command as follows:
docker run -d -e STREAMSETS_ENGINE_TIMEOUT_MINS=8 -e STREAMSETS_DEPLOYMENT_SCH_URL=https://na01.hub.streamsets.com -e STREAMSETS_DEPLOYMENT_ID=<deployment_ID> -e STREAMSETS_DEPLOYMENT_TOKEN=<deployment_token> streamsets/datacollector:5.5.0