Deployments Overview

A deployment is a group of identical engine instances deployed within an environment. A deployment defines the StreamSets engine type, version, and configuration to use. You can deploy and launch multiple instances of the configured engine.

Note: Deployments are required for Data Collector and Transformer pipelines and for Transformer for Snowflake pipelines that run on a deployed engine. They are not applicable for Transformer for Snowflake pipelines that run on the StreamSets hosted engine.

When you create a deployment, you select an active environment for that deployment. You must create and activate environments before creating deployments.

A deployment allows you to manage all deployed engine instances with a single configuration change. You can update a deployment to install an additional stage library on the engine or to customize engine configuration properties. After a deployment update, you instruct Control Hub to restart all engine instances in the deployment to replicate the changes to each instance.

When you deploy StreamSets engines to on-premises or cloud computing machines that reside behind a firewall, you must allow the required inbound and outbound connections to each machine.

Important: A deployment is the primary unit of tenancy in the StreamSets platform. Resources configured for a deployment, such as credential stores or AWS instance profiles, are accessible by all authorized users of that deployment. When multiple groups use the same environment, you can restrict access to deployment resources by creating different deployments for each group in the environment and assigning the groups appropriate permissions on the deployments.

You can create the following types of deployments:

Self-managed
In a self-managed deployment, you take full control of procuring the resources needed to run engine instances. The resources can be local on-premises machines or cloud computing machines.
You must set up the machines and complete the installation prerequisites required by the engine type. You manually run an installation script to install and launch an engine instance on each machine that you have set up.
Control Hub-managed
In a Control Hub-managed deployment, Control Hub connects to the external system represented by the parent environment and automatically provisions the resources needed to run the engine type, ensuring that the resources meet engine requirements. Engine instances are then automatically deployed and launched on those resources.
After an administrator completes the required prerequisites, you can create the following types of Control Hub-managed deployments:

Engine Types

A deployment defines the type of engine to deploy and launch.

When you create a deployment, you select one of the following engine types:
Data Collector
Use a Data Collector engine to run data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, CDC, or batch modes.
Transformer
Use a Transformer engine to run data processing pipelines on Apache Spark. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform set-based transformations such as joins, aggregates, and sorts on the entire data set.
Transformer for Snowflake
Most organizations use the Transformer for Snowflake engine hosted and managed by StreamSets. These organizations do not create deployments to use Transformer for Snowflake.
Based on the account agreement for your organization, you can deploy Transformer for Snowflake engines as you do other engine types. Use a Transformer for Snowflake engine to generate SQL queries based on your pipeline configuration and pass the queries to Snowflake for execution. Snowflake pipelines read from and write to Snowflake tables using Snowpark DataFrame-based processing.
At this time, you can create Amazon EC2 and self-managed deployments for a Transformer for Snowflake engine.

Once you save a deployment, you cannot change the engine type.

For more information about the pipeline types, see Comparing StreamSets Pipelines.

Hosted or Deployed Transformer for Snowflake Engine

Most organizations use the Transformer for Snowflake engine hosted and managed by StreamSets. Using the hosted engine is the easiest way to work with Transformer for Snowflake.

When needed, you can deploy a Transformer for Snowflake engine to your private network, which can be on-premises or on a protected cloud computing platform such as AWS. You might need to deploy a Transformer for Snowflake engine, rather than using the StreamSets hosted engine, due to company policies or security requirements.

Deploying a Transformer for Snowflake engine requires that your organization have the appropriate account agreement. For more information about your account agreement, contact your StreamSets account team.

Important: An organization can use either the hosted engine or the deployed engine, but not both at the same time. Also, pipelines created in an organization that uses one engine type cannot be imported into an organization that uses the other engine type.

Most functionality available with the hosted engine and the deployed engine is exactly the same. For example, you configure pipelines on the same canvas and then create and run jobs to run published pipelines. When you run a job, Transformer for Snowflake generates a SQL query based on your pipeline configuration and passes the query to Snowflake for execution. Since Snowflake performs the work, all data processing occurs within Snowflake. With both the hosted and deployed engine, your data never leaves Snowflake as a job runs a pipeline.

The difference lies in where the engine runs: either on the public cloud platform hosted by StreamSets or on a private network that you manage.

Here's a summary of the differences between hosted and deployed Transformer for Snowflake engines:

Control Hub environments and deployments
Hosted engine: Not applicable.
Deployed engine: Configure a Control Hub environment and deployment to deploy a Transformer for Snowflake engine to your private network. A deployment is the primary unit of tenancy in the StreamSets platform. When multiple groups use the same environment, you can restrict access to deployment resources by creating different deployments for each group in the environment and assigning the groups appropriate permissions.

Engine management
Hosted engine: StreamSets manages the Transformer for Snowflake engine for you. You cannot view any details or perform any actions on the hosted engine. The hosted engine is shared across organizations. Your data is only accessible by your organization. The logs for the hosted engine include activity from multiple organizations and are accessible only by StreamSets.
Deployed engine: You manage the Transformer for Snowflake engine in your own private network. You can stop, start, monitor, and view the logs of the deployed engine. The deployed engine is only accessible by your organization. The logs for the deployed engine include only activity for your organization and are accessible within your own infrastructure.

Connection information
Hosted engine: Default connection information, such as the warehouse, database, or schema to use, is securely stored in your StreamSets account. You can override this information in individual pipelines as needed.
Deployed engine: You must create one or more Snowflake connections to specify the connection information to use. Then, in pipelines, you select the appropriate Snowflake connection. You can override the following details in individual pipelines: role, warehouse, database, and schema.

Authentication methods
Hosted engine: Supports the following authentication methods, which you configure in the pipeline properties:
  • User credentials
  • Unencrypted private keys
Deployed engine: Supports the following authentication methods, which you configure in a Snowflake connection:
  • User credentials
  • Unencrypted private keys
  • Unencrypted private key files
  • None

Snowflake credentials
Hosted engine: Enter the credentials directly in the Control Hub user interface (UI). Snowflake credentials are validated and securely stored in your StreamSets account. You cannot view the existing Snowflake password or private key through the Control Hub UI. You can only view those values as you enter them.
Deployed engine: As a best practice, configure the deployed engine to use the AWS credential store to securely retrieve Snowflake credentials from AWS Secrets Manager.

Pipeline design
Hosted engine: Design pipelines using the latest Transformer for Snowflake release. With each new release, all existing pipelines are automatically updated to use the latest new features.
Deployed engine: Design pipelines using the selected authoring Transformer for Snowflake version. The authoring engine version determines the stages and functionality that display in the pipeline canvas. To use features available in a newer release, you must upgrade the engine.

Pipeline preview
Hosted engine: When you preview data in a pipeline, Snowflake data passes through encrypted connections beyond your own network into StreamSets Control Hub. You can optionally disable data preview for your organization if your company policies or best practices prohibit data from leaving your own network.
Deployed engine: When you preview data in a pipeline, Snowflake data passes through encrypted connections beyond your own network into StreamSets Control Hub. You can optionally change the default engine communication method from WebSocket tunneling to direct engine REST APIs if your company policies or best practices prohibit data from leaving your own network.

High availability
Hosted engine: StreamSets automatically handles any failover scenarios for you. As such, failover properties are omitted from Transformer for Snowflake jobs.
Deployed engine: Deploy multiple engines so that you have available backup engines in case of pipeline failover due to an unexpected engine shutdown. When you configure a Transformer for Snowflake job, you configure failover properties.

Communication during a job run
Hosted engine: When you run a job, the StreamSets Control Hub public cloud platform passes the query to your Snowflake account.
Deployed engine: When you run a job, the Transformer for Snowflake engine deployed to your private network passes the query to your Snowflake account. For example, if you use AWS PrivateLink to directly connect your Snowflake account to an AWS VPC, you can deploy the Transformer for Snowflake engine to the same AWS VPC. This ensures that StreamSets communications to your Snowflake account occur inside your own network.

For more information about the hosted and deployed engines, see the Transformer for Snowflake documentation.

Engine Versions

A deployment defines the engine version to deploy and launch. StreamSets recommends using the latest engine version to ensure that you have the latest updates and features.

New deployments support the following minimum engine versions:
Engine                       Minimum Supported Version
Data Collector               4.3.0
Transformer                  4.2.0
Transformer for Snowflake    5.0.0 (applicable when your organization uses a deployed Transformer for Snowflake engine)

Note: Existing deployments can continue to use Data Collector 4.0.x to 4.2.x or Transformer 4.0.x to 4.1.x.
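The minimum versions for new deployments can be expressed as a simple comparison. The following is a minimal sketch, not part of Control Hub or any StreamSets API: the dictionary and both function names are hypothetical, and the version strings are assumed to be plain released versions such as 4.3.0.

```python
# Minimum engine versions for NEW deployments, copied from the table above.
# The names below are illustrative only and are not a StreamSets API.
MIN_NEW_DEPLOYMENT_VERSION = {
    "Data Collector": (4, 3, 0),
    "Transformer": (4, 2, 0),
    "Transformer for Snowflake": (5, 0, 0),
}


def parse_version(text):
    """Turn a released version string such as '4.3.0' into (4, 3, 0)."""
    return tuple(int(part) for part in text.split("."))


def allowed_for_new_deployment(engine, version):
    """Return True if the version meets the minimum for a new deployment."""
    return parse_version(version) >= MIN_NEW_DEPLOYMENT_VERSION[engine]


print(allowed_for_new_deployment("Data Collector", "4.2.1"))  # False
print(allowed_for_new_deployment("Transformer", "4.2.0"))     # True
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "4.10.0" would sort before "4.3.0".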

Transformer engines provide engine versions based on the Scala version that the engine is built with, in addition to the engine version. For example, for Transformer 4.2.0, you can choose between engine version 4.2.0 (Scala 11) and 4.2.0 (Scala 12). For information about choosing a Scala version, see Choosing an Engine Version.

Note: When allowed on the parent environment, a deployment can use nightly engine builds in addition to released engine versions. The version number of a nightly build includes a -SNAPSHOT suffix and the build number. For example, 5.2.0-SNAPSHOT (Build 1013). Nightly builds are for testing features under development and should not be used in production systems.
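If you script around version labels, it can help to tell nightly builds apart from released versions. The following sketch assumes that nightly labels always look like the single example given above, 5.2.0-SNAPSHOT (Build 1013). The pattern and function names are hypothetical, and the exact label format may differ.

```python
import re

# Pattern modeled on the one example in the text: "5.2.0-SNAPSHOT (Build 1013)".
# This format is an assumption, not a documented specification.
NIGHTLY_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)-SNAPSHOT \(Build (\d+)\)$")


def parse_nightly(label):
    """Return ((major, minor, patch), build) for a nightly label, else None."""
    match = NIGHTLY_PATTERN.match(label)
    if match is None:
        return None
    major, minor, patch, build = (int(group) for group in match.groups())
    return (major, minor, patch), build


def is_nightly(label):
    """Return True if the label looks like a nightly build."""
    return parse_nightly(label) is not None


print(parse_nightly("5.2.0-SNAPSHOT (Build 1013)"))  # ((5, 2, 0), 1013)
print(is_nightly("5.2.0"))                           # False
```

A check like this could be used to keep nightly builds out of production configurations.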

Once you save a deployment, you cannot change the engine version. To upgrade to a later engine version, see Upgrading Engines for Self-Managed Deployments.

If you design and run pipelines across engine instances managed by different deployments, ensure that all engine versions are the same. Since engine functionality can differ from version to version, using a different engine version can result in a pipeline that is invalid. Use engine labels to ensure that you do not mix engine versions for a single pipeline.
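One way to check for mixed engine versions is to group deployments by version. The following is a rough sketch under stated assumptions: the deployment names and the input dictionary are invented for illustration, and in practice you would gather this information from Control Hub yourself.

```python
from collections import defaultdict


def group_by_engine_version(deployments):
    """Map each engine version to the sorted names of deployments that use it.

    `deployments` is a hypothetical dict of deployment name -> engine version.
    """
    groups = defaultdict(list)
    for name, version in deployments.items():
        groups[version].append(name)
    return {version: sorted(names) for version, names in groups.items()}


def has_mixed_versions(deployments):
    """Return True if the deployments do not all share one engine version."""
    return len(group_by_engine_version(deployments)) > 1


# Invented example data.
example = {"ingest-east": "5.11.0", "ingest-west": "5.11.0", "legacy": "5.9.0"}
print(group_by_engine_version(example))
print(has_mixed_versions(example))  # True
```

When the check reports mixed versions, engine labels are the mechanism described above for keeping a pipeline on engines of a single version.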

Engine Java Version

All deployed StreamSets engines require that the appropriate Java version be installed on the engine machine. Each engine type has different Java version requirements. In addition, some stage libraries and use cases require specific Java versions. For details, see the following engine documentation:

When you configure a self-managed deployment using an engine tarball file, you are responsible for installing the appropriate Java version as a prerequisite before you run the installation script command that installs and launches the engine tarball.

For all other deployment types, Control Hub deploys and installs the appropriate Java version for you. Based on the deployment type and the selected Data Collector version, you can choose between supported Java versions when configuring the deployment. StreamSets recommends using the default Java version unless you have a specific need for another version.

The following table lists the default Java version used for each Data Collector version:
Data Collector Version              Default Java Version
Data Collector 5.11.x or later      Java 17
Data Collector 5.10.x or earlier    Java 8
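The default Java version rule can be sketched as a small lookup. The function name is hypothetical, and the code only encodes the two rows listed above. It does not query Control Hub.

```python
def default_java_version(data_collector_version):
    """Return the default Java major version for a Data Collector release.

    Encodes the table above: 5.11.x or later -> 17, 5.10.x or earlier -> 8.
    Only the major and minor parts of the version string are inspected.
    """
    major, minor = (int(part) for part in data_collector_version.split(".")[:2])
    return 17 if (major, minor) >= (5, 11) else 8


print(default_java_version("5.11.0"))  # 17
print(default_java_version("5.10.2"))  # 8
```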

You can edit the Java version when you create the deployment or when you edit an existing deployment as long as the deployment is deactivated. In the Configure Engine step of the deployment wizard, click Advanced Configuration, then click Java Configuration. Select a version from the Java Version property.

The following table lists the Data Collector versions that support selecting a Java version for each deployment type:
Data Collector 5.11.x or later: All deployment types that provided Java version support in earlier Data Collector versions, as well as the following types:
  • Amazon EC2 deployment
  • GCE deployment
Data Collector 5.10.x or later:
  • Self-managed deployment using an engine Docker image
  • Azure VM deployment
  • Kubernetes deployment

Note: Deployments for Data Collector 5.9.x and earlier use the default Java 8 only.
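The cumulative nature of this list can be illustrated with a short sketch. The deployment type labels and the function name below are hypothetical strings chosen for this example. They are not identifiers used by Control Hub.

```python
# Deployment types that gain Java version selection at each Data Collector
# release, as listed above. The string labels are illustrative only.
SELECTABLE_SINCE = {
    (5, 10): {"Self-managed (Docker image)", "Azure VM", "Kubernetes"},
    (5, 11): {"Amazon EC2", "GCE"},
}


def types_allowing_java_selection(data_collector_version):
    """Return the deployment types that allow selecting a Java version."""
    major, minor = (int(part) for part in data_collector_version.split(".")[:2])
    allowed = set()
    for since, types in SELECTABLE_SINCE.items():
        if (major, minor) >= since:
            allowed |= types
    return allowed


print(sorted(types_allowing_java_selection("5.10.0")))
print(sorted(types_allowing_java_selection("5.11.0")))
print(types_allowing_java_selection("5.9.0"))  # set()
```

Each later release inherits the types supported by earlier releases, which is why the sketch unions the sets rather than returning a single row.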

Engine Configuration

A deployment defines the configuration of the engine to deploy and launch.

Note: As you get started, you can typically use the default engine configuration. You might need to modify the engine configuration as you further explore StreamSets.

You can define the engine configuration when you create or edit a deployment. If you edit the engine configuration for an existing deployment, you instruct Control Hub to restart all engine instances managed by the deployment to replicate the changes to the instances.

You can define the following engine configurations for a deployment: