Deployments Overview
A deployment is a group of identical engine instances deployed within an environment. A deployment defines the IBM StreamSets engine type, version, and configuration to use. You can deploy and launch multiple instances of the configured engine.
When you create a deployment, you select an active environment for that deployment. You must create and activate environments before creating deployments.
A deployment allows you to manage all deployed engine instances with a single configuration change. You can update a deployment to install an additional stage library on the engine or to customize engine configuration properties. After a deployment update, you instruct Control Hub to restart all engine instances in the deployment to replicate the changes to each instance.
You can create the following types of deployments:
- Self-managed
  In a self-managed deployment, you take full control of procuring the resources needed to run engine instances. The resources can be local on-premises machines or cloud computing machines.
- Control Hub-managed
  In a Control Hub-managed deployment, Control Hub connects to the external system represented by the parent environment and automatically provisions the resources needed to run the engine type, ensuring that the resources meet engine requirements. Engine instances are then automatically deployed and launched on those resources.
Engine Types
A deployment defines the type of engine to deploy and launch.
- Data Collector
  Use a Data Collector engine to run data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, CDC, or batch modes.
- Transformer
  Use a Transformer engine to run data processing pipelines on Apache Spark. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform set-based transformations such as joins, aggregates, and sorts on the entire data set.
- Transformer for Snowflake
  Most organizations use the Transformer for Snowflake engine hosted and managed by IBM. These organizations do not create deployments to use Transformer for Snowflake.
Once you save a deployment, you cannot change the engine type.
For more information about the pipeline types, see Comparing IBM StreamSets Pipelines.
Hosted or Deployed Transformer for Snowflake Engine
Most organizations use the Transformer for Snowflake engine hosted and managed by IBM. Using the hosted engine is the easiest way to work with Transformer for Snowflake.
When needed, you can deploy a Transformer for Snowflake engine to your private network, which can be on-premises or on a protected cloud computing platform such as AWS. You might need to deploy a Transformer for Snowflake engine, rather than using the hosted engine, due to company policies or security requirements.
Deploying a Transformer for Snowflake engine requires that your organization have the appropriate account agreement. For more information about your account agreement, contact your IBM StreamSets account team.
Most functionality available with the hosted engine and the deployed engine is exactly the same. For example, you configure pipelines on the same canvas and then create and run jobs to run published pipelines. When you run a job, Transformer for Snowflake generates a SQL query based on your pipeline configuration and passes the query to Snowflake for execution. Since Snowflake performs the work, all data processing occurs within Snowflake. With both the hosted and deployed engine, your data never leaves Snowflake as a job runs a pipeline.
The difference lies in where the engine runs - either the hosted public cloud service or a private network that you manage.
Here's a summary of the differences between hosted and deployed Transformer for Snowflake engines:
Category | Hosted Engine | Deployed Engine |
---|---|---|
Control Hub environments and deployments | Not applicable. | Configure a Control Hub environment and deployment to deploy a Transformer for Snowflake engine to your private network. A deployment is the primary unit of tenancy in IBM StreamSets. When multiple groups use the same environment, you can restrict access to deployment resources by creating different deployments for each group in the environment and assigning the groups appropriate permissions. |
Engine management | IBM manages the Transformer for Snowflake engine for you. You cannot view any details or perform any actions on the hosted engine. The hosted engine is shared across organizations. Your data is accessible only by your organization. The logs for the hosted engine include activity from multiple organizations and are accessible only by IBM. | You manage the Transformer for Snowflake engine in your own private network. You can stop, start, monitor, and view the logs of the deployed engine. The deployed engine is accessible only by your organization. The logs for the deployed engine include only activity for your organization and are accessible within your own infrastructure. |
Connection information | Default connection information, such as the warehouse, database, or schema to use, is securely stored in your IBM StreamSets account. You can override this information in individual pipelines as needed. | You must create one or more Snowflake connections to specify the connection information to use. Then, in pipelines, you select the appropriate Snowflake connection. You can override the following details in individual pipelines: role, warehouse, database, and schema. |
Authentication methods | Supports the following authentication methods, which you configure in the pipeline properties: | Supports the following authentication methods, which you configure in a Snowflake connection: |
Snowflake credentials | Enter the credentials directly in the Control Hub user interface (UI). Snowflake credentials are validated and securely stored in your IBM StreamSets account. You cannot view the existing Snowflake password or private key through the Control Hub UI. You can only view those values as you enter them. | As a best practice, configure the deployed engine to use the AWS credential store to securely retrieve Snowflake credentials from AWS Secrets Manager. |
Pipeline design | Design pipelines using the latest Transformer for Snowflake release. With each new release, all existing pipelines are automatically updated to use the latest features. | Design pipelines using the selected authoring Transformer for Snowflake version. The authoring engine version determines the stages and functionality that display in the pipeline canvas. To use features available in a newer release, you must upgrade the engine. |
Pipeline preview | When you preview data in a pipeline, Snowflake data passes through encrypted connections beyond your own network into Control Hub. You can optionally disable data preview for your organization if your company policies or best practices prohibit data from leaving your own network. | When you preview data in a pipeline, Snowflake data passes through encrypted connections beyond your own network into Control Hub. You can optionally change the default engine communication method from WebSocket tunneling to direct engine REST APIs if your company policies or best practices prohibit data from leaving your own network. |
High availability | IBM StreamSets automatically handles any failover scenarios for you. As such, failover properties are omitted from Transformer for Snowflake jobs. | Deploy multiple engines so that you have available backup engines in case of pipeline failover due to an unexpected engine shutdown. When you configure a Transformer for Snowflake job, you configure failover properties. |
Communication during a job run | When you run a job, Control Hub passes the query to your Snowflake account. | When you run a job, the Transformer for Snowflake engine deployed to your private network passes the query to your Snowflake account. For example, if you use AWS PrivateLink to directly connect your Snowflake account to an AWS VPC, you can deploy the Transformer for Snowflake engine to the same AWS VPC. This ensures that IBM StreamSets communications to your Snowflake account occur inside your own network. |
For more information about the hosted and deployed engines, see the Transformer for Snowflake documentation.
Engine Versions
A deployment defines the engine version to deploy and launch. StreamSets recommends using the latest engine version to ensure that you have the latest updates and features.
Engine | Minimum Supported Version |
---|---|
Data Collector | 4.3.0 |
Transformer | 4.2.0 |
Transformer for Snowflake | 5.0.0 (applicable when your organization uses a deployed Transformer for Snowflake engine) |
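The minimum versions in the table above can be checked with a simple numeric comparison. The following is a minimal sketch, assuming engine versions are plain dotted strings such as 4.3.0. The names `MINIMUM_SUPPORTED`, `parse_version`, and `is_supported` are hypothetical helpers for illustration and are not part of Control Hub or any IBM StreamSets API.

```python
from typing import Tuple

# Hypothetical lookup built from the table above; not a Control Hub API.
MINIMUM_SUPPORTED = {
    "Data Collector": (4, 3, 0),
    "Transformer": (4, 2, 0),
    "Transformer for Snowflake": (5, 0, 0),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '5.11.0' into (5, 11, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_supported(engine: str, version: str) -> bool:
    """Return True if the version meets the minimum listed for the engine."""
    return parse_version(version) >= MINIMUM_SUPPORTED[engine]
```

Comparing integer tuples rather than raw strings avoids ordering 4.10.0 before 4.3.0.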
Transformer engines provide engine versions based on the Scala version that the engine is built with, in addition to the engine version. For example, for Transformer 4.2.0, you can choose between engine version 4.2.0 (Scala 2.11) and 4.2.0 (Scala 2.12). For information about choosing a Scala version, see Choosing an Engine Version.
Once you save a deployment, you cannot change the engine version. To upgrade to a later engine version, see Upgrading Engines for Self-Managed Deployments.
If you design and run pipelines across engine instances managed by different deployments, ensure that all engine versions are the same. Since engine functionality can differ from version to version, using a different engine version can result in a pipeline that is invalid. Use engine labels to ensure that you do not mix engine versions for a single pipeline.
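One way to spot mixed engine versions is to group engines by label and flag any label that maps to more than one version. The sketch below assumes you already have an inventory of (label, engine version) pairs from your own records. The function name and the input shape are hypothetical and do not come from Control Hub.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_version_mismatches(
    engines: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Return each label whose engines run more than one version.

    engines: hypothetical inventory of (label, engine_version) pairs.
    """
    versions_by_label = defaultdict(set)
    for label, version in engines:
        versions_by_label[label].add(version)
    # Keep only labels that mix versions.
    return {
        label: sorted(versions)
        for label, versions in versions_by_label.items()
        if len(versions) > 1
    }
```

An empty result means that every label points to engines of a single version.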
Engine Java Version
- Data Collector Java requirements
- Transformer Java requirements
- Transformer for Snowflake Java requirements - Applicable when your organization uses a deployed Transformer for Snowflake engine.
When you configure a self-managed deployment using an engine tarball file, you are responsible for installing the appropriate Java version as a prerequisite before you run the installation script command that installs and launches the engine tarball.
For all other deployment types, Control Hub deploys and installs the appropriate Java version for you. Based on the deployment type and the selected Data Collector version, you can choose between supported Java versions when configuring the deployment. Use the default Java version unless you have a specific need for another version.
Data Collector Version | Default Java Version |
---|---|
Data Collector 5.11.x or later | Java 17 |
Data Collector 5.10.x or earlier | Java 8 |
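The default Java version in the table above depends only on the major and minor Data Collector version. The following is a minimal sketch of that rule, assuming the version string contains at least a major and a minor number. The function name `default_java_version` is hypothetical and is not part of Control Hub.

```python
def default_java_version(data_collector_version: str) -> int:
    """Return the default Java version listed for a Data Collector version.

    Assumes a dotted version string with at least major and minor parts,
    such as '5.11.0'.
    """
    major, minor = (int(part) for part in data_collector_version.split(".")[:2])
    # 5.11.x or later defaults to Java 17; 5.10.x or earlier defaults to Java 8.
    return 17 if (major, minor) >= (5, 11) else 8
```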
You can edit the Java version when you create the deployment or when you edit an existing deployment as long as the deployment is deactivated. In the Configure Engine step of the deployment wizard, click Advanced Configuration, then click Java Configuration. Select a version from the Java Version property.
Data Collector Version | Deployment Types that Allow Selecting a Java Version |
---|---|
Data Collector 5.11.x or later | All deployment types that provided Java version support in earlier Data Collector versions, as well as the following types: |
Data Collector 5.10.x or earlier | |
Engine Configuration
A deployment defines the configuration of the engine to deploy and launch.
You can define the engine configuration when you create or edit a deployment. If you edit the engine configuration for an existing deployment, you instruct Control Hub to restart all engine instances managed by the deployment to replicate the changes to the instances.