Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure the pipeline, you can start the pipeline.
A pipeline can include multiple origin, processor, and destination stages.
-
From the Home page or Getting Started page, click Create New Pipeline.
Tip: To get to the Home page, click the Home icon.
-
In the New Pipeline window, configure the following properties:
Pipeline Property - Description
Title - Title of the pipeline. Transformer uses the alphanumeric characters entered for the pipeline title as a prefix for the generated pipeline ID. For example, if you enter My Pipeline *&%&^^ 123 as the pipeline title, then the pipeline ID has the following value:
MyPipeline123tad9f592-5f02-4695-bb10-127b2e41561c
Description - Optional description of the pipeline.
Pipeline Label - Optional labels to assign to the pipeline. Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.
You can use nested labels to create a hierarchy of pipeline groupings. Enter nested labels using the following format:
<label1>/<label2>/<label3>
For example, you might want to group pipelines in the test environment by the origin system. You add the labels Test/Databricks and Test/Snowflake to the appropriate pipelines.
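The title-to-ID prefix rule and the nested label format described above can be sketched in a few lines of Python. This is only an illustration, not Transformer code: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and a random UUID stands in for whatever unique suffix Transformer actually appends.

```python
import re
import uuid


def pipeline_id_prefix(title):
    """Keep only the letters and digits of the title, in their original order."""
    # Assumption: "alphanumeric" means ASCII letters and digits.
    return re.sub(r"[^A-Za-z0-9]", "", title)


def sketch_pipeline_id(title):
    """Prefix plus a unique suffix; the UUID is a hypothetical stand-in."""
    return pipeline_id_prefix(title) + str(uuid.uuid4())


def label_levels(label):
    """Expand a nested label such as Test/Databricks into each hierarchy level."""
    parts = label.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


print(pipeline_id_prefix("My Pipeline *&%&^^ 123"))  # MyPipeline123
print(label_levels("Test/Databricks"))               # ['Test', 'Test/Databricks']
```

Because the ID is generated once from the title, later edits to the title would not change the value returned by the first call.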
-
Click Save.
The pipeline canvas displays the pipeline title, the generated pipeline ID, and an error icon. The error icon indicates that the pipeline is empty. The Properties panel displays the pipeline properties.
-
In the Properties panel, on the General tab, configure the following properties:
Pipeline Property - Description
Title - Optionally edit the title of the pipeline. Because the generated pipeline ID is used to identify the pipeline, any changes to the pipeline title are not reflected in the pipeline ID.
Description - Optionally edit or add a description of the pipeline.
Labels - Optionally edit or add labels assigned to the pipeline.
Execution Mode - Execution mode of the pipeline:
- Batch - Processes available data, limited by the configuration of pipeline origins, and then the pipeline stops.
- Streaming - Maintains connections to origin systems and processes data as it becomes available. The pipeline runs continuously until you manually stop it.
Trigger Interval - Milliseconds to wait between processing batches of data. In pipelines with origins configured to read from multiple tables, milliseconds to wait after processing one batch for each of the tables.
For streaming execution mode only.
Enable Ludicrous Mode - Enables predicate and filter pushdown to optimize queries so unnecessary data is not processed.
-
On the Cluster tab, select one of the following options
for the Cluster Manager Type property:
- None (local) - Run the pipeline locally on the Transformer machine.
- Cloudera Data Engineering - Run the pipeline on a CDE cluster.
- Databricks - Run the pipeline on a Databricks cluster.
- Dataproc - Run the pipeline on a Dataproc cluster.
- EMR - Run the pipeline on an EMR cluster.
- EMR Serverless - Run the pipeline on an EMR Serverless application.
- Hadoop YARN - Run the pipeline on a Hadoop YARN cluster.
- Spark Standalone - Run the pipeline on a Spark standalone cluster. Spark Standalone clusters are supported for development workloads only.
-
Configure the remaining properties on the Cluster tab
based on the selected cluster manager type.
For a pipeline that runs on a Cloudera Data Engineering cluster, also configure the following properties. Then, continue configuring pipeline properties.
CDE Property - Description
Jobs API URL - Jobs API URL for the CDE virtual cluster where you want the pipeline to run.
Application Name - Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Job Resource - Name of the CDE job resource to store pipeline resources.
Resource File Prefix - Prefix to add to resource files stored in the job resource.
Authentication API URL - Authentication API URL for the virtual cluster where the pipeline runs. Used to obtain a CDE access token.
Workload User - User name to use to obtain the access token.
Workload Password - Password to use to obtain the access token.
Log Level - Log level to use for the launched Spark application.
Extra Spark Configuration - Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
For a pipeline that runs on a Databricks cluster, also configure the following properties. Then, continue configuring pipeline properties.
Databricks Property - Description
URL to Connect to Databricks - Databricks URL for your account. Use the following format: https://<your_domain>.cloud.databricks.com
Application Name - Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Staging Directory - Staging directory on Databricks File System (DBFS) where Transformer stores the StreamSets resources and files needed to run the pipeline as a Databricks job. When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.
When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.
Default is /streamsets.
Credential Type - Type of credential used to connect to Databricks: Username/Password or Token.
Username - Databricks user name.
Password - Password for the account.
Token - Personal access token for the account.
Provision a New Cluster - Provisions a new Databricks job cluster to run the pipeline upon the initial run of the pipeline. Clear this option to run the pipeline on an existing interactive cluster.
Init Scripts - Cluster-scoped init scripts to execute before processing data. Configure the following properties for each init script that you want to use:
- Script Type - Location of the script:
- DBFS from Pipeline - Databricks File System (DBFS) init script defined in the pipeline. When provisioning the cluster, Transformer temporarily stores the script in DBFS and removes it after the pipeline run.
- DBFS from Location - Databricks File System init script stored on Databricks.
- S3 from Location - Amazon S3 init script stored on AWS. Use only when provisioning a Databricks cluster on AWS.
- ABFSS from Location - Azure init script stored on Azure Blob File System (ABFS).
Use only when provisioning a Databricks cluster on Azure.
Note: To use this option, you must provide an access key to access the init script.
- DBFS Script - Contents of the Databricks
cluster-scoped init script.
Available when you select the DBFS from Pipeline script type.
- DBFS Script Location - Path to the script on DBFS.
For example:
dbfs:/databricks/scripts/postgresql-install.sh
Available when you select the DBFS from Location script type.
- S3 Script Location - Path to the script on Amazon
S3. For example:
s3://databricks/scripts/postgresql-install.sh
Available when you select the S3 from Location script type.
- AWS Region - AWS region where the init script is
located.
Available when you select the S3 from Location script type.
- ABFSS Script Location - Location of the script on
Azure Blob File System.
Available when you select the ABFSS from Location script type.
Cluster Configuration - Configuration properties for a provisioned Databricks job cluster. Configure the listed properties and add additional Databricks cluster properties as needed, in JSON format. Transformer uses the Databricks default values for Databricks properties that are not listed.
Include the instance_pool_id property to provision a cluster that uses an existing instance pool.
Use the property names and values as expected by Databricks.
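As a rough illustration of the Cluster Configuration property, the following Python sketch builds a JSON fragment that includes instance_pool_id. The helper name and all values are placeholders chosen for this example. The keys num_workers and spark_version are Databricks cluster property names, but the set of properties that Transformer lists by default is not shown here, so treat the fragment as an assumption to adapt.

```python
import json


def cluster_configuration(pool_id=None):
    """Return a JSON string for the Cluster Configuration property."""
    config = {
        "num_workers": 2,                    # placeholder value
        "spark_version": "<spark_version>",  # placeholder, replace with a real runtime version
    }
    if pool_id is not None:
        # Provision the cluster from an existing instance pool.
        config["instance_pool_id"] = pool_id
    return json.dumps(config, indent=2)


print(cluster_configuration("<instance_pool_id>"))
```

Properties left out of the JSON fall back to the Databricks defaults, as the description above notes.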
Terminate Cluster - Terminates the provisioned job cluster when the pipeline stops.
Tip: Terminating a provisioned cluster after the pipeline stops is a cost-effective method of running a Transformer pipeline.
Cluster ID - ID of an existing Databricks interactive cluster to run the pipeline. Specify a cluster ID when not provisioning a cluster to run the pipeline.
Note: When using an existing interactive cluster, all Transformer pipelines that the cluster runs must be built by the same version of Transformer.
Log Level - Log level to use for the launched Spark application.
Extra Spark Configuration - Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
For a pipeline that runs on a Dataproc cluster, configure the following properties on the Cluster tab:
Cluster Property - Description
Application Name - Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Log Level - Log level to use for the launched Spark application.
Extra Spark Configuration - Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
Also configure the following properties on the Dataproc tab:
Dataproc Property - Description
Project ID - Google Cloud project ID.
Region - Region to create the cluster in. Select a region or select Custom and enter a region name.
Custom - Custom region to create a cluster in.
Credentials Provider - Credentials to use:
- Default credentials provider - Uses Google Cloud default credentials.
- Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
- Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
Credentials File Path (JSON) - Path to the Google Cloud service account credentials file that the pipeline uses to connect. The credentials file must be a JSON file. Enter a path relative to the Transformer resources directory, $TRANSFORMER_RESOURCES, or enter an absolute path.
Credentials File Content (JSON) - Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.
GCS Staging URI - Staging location in Dataproc where Transformer stores the StreamSets resources and files needed to run the pipeline as a Dataproc job. When a pipeline runs on an existing cluster, configure pipelines to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.
When a pipeline runs on a provisioned cluster, using the same staging directory for pipelines is best practice, but not required.
Default is /streamsets.
Create Cluster - Provisions a new Dataproc cluster to run the pipeline upon the initial run of the pipeline. Clear this option to run the pipeline on an existing cluster.
Cluster Name - Name of the existing cluster to run the pipeline. Use the full Dataproc cluster name.
Cluster Prefix - Optional prefix to add to the provisioned cluster name.
Image Version - Image version to use for the provisioned cluster. Specify the full image version name, such as 1.4-ubuntu18 or 1.3-debian10.
When not specified, Transformer uses the default Dataproc image version.
For a list of Dataproc image versions, see the Dataproc documentation.
Master Machine Type - Master machine type to use for the provisioned cluster.
Worker Machine Type - Worker machine type to use for the provisioned cluster.
Network Type - Network type to use for the provisioned cluster:
- Auto - Uses a VPC network type in auto mode.
- Custom - Uses a VPC network type with the specified subnet name.
- Default VPC for project and region - Uses the default VPC for the project ID and region specified for the cluster.
For more information about network types, see the Dataproc documentation.
Subnet Name - Subnet name for the custom VPC network.
Network Tags - Optional network tags to apply to the provisioned cluster. For more information, see the Dataproc documentation.
Worker Count - Number of workers to use for a provisioned cluster. Minimum is 2. Using an additional worker for each partition can improve pipeline performance.
This property is ignored if you enable dynamic allocation using spark.dynamicAllocation.enabled as an extra Spark configuration property.
Terminate Cluster - Terminates the provisioned cluster when the pipeline stops.
Tip: Terminating a provisioned cluster after the pipeline stops is a cost-effective method of running a Transformer pipeline.
Then, continue configuring pipeline properties.
For a pipeline that runs on an EMR cluster, also configure the following properties. Then, continue configuring pipeline properties.
EMR Property - Description
Connection - Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Authentication Method - Authentication method used to connect to Amazon Web Services (AWS):
- AWS Keys - Authenticates using an AWS access key pair.
- Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
- None - Connects to a public bucket using no authentication.
Access Key ID - AWS access key ID. Required when using AWS keys to authenticate with AWS.
Secret Access Key - AWS secret access key. Required when using AWS keys to authenticate with AWS.
Assume Role - Temporarily assumes another role to authenticate with AWS.
Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
Application Name - Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Role Session Name - Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.
Available when assuming another role.
Role ARN - Amazon resource name (ARN) of the role to assume, entered in the following format:
arn:aws:iam::<account_id>:role/<role_name>
Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.
Available when assuming another role.
Session Timeout - Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.
Set to a value between 3,600 seconds and 43,200 seconds.
Available when assuming another role.
Set Session Tags - Sets a session tag to record the name of the currently logged in IBM StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.
Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.
When cleared, the connection does not set a session tag.
External ID - External ID included in an IAM trust policy that allows the specified role to be assumed. Available when assuming another role.
Staging Directory - Path of an intermediate location, under the location specified in the S3 Staging URI property, used to store needed files. To determine the staging location, Transformer concatenates the specified path with the S3 Staging URI property as follows: <S3 staging URI>/<staging directory>.
The path must exist at the staging URI before you run the pipeline.
Default is /streamsets.
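The way the Staging Directory value combines with the S3 Staging URI can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name and the bucket are hypothetical, and joining the two parts with exactly one slash is an assumption about normalization, not documented Transformer behavior.

```python
def staging_location(s3_staging_uri, staging_directory="/streamsets"):
    """Join the S3 staging URI and the staging directory with a single slash."""
    # Assumption: duplicate slashes at the join point are collapsed.
    return s3_staging_uri.rstrip("/") + "/" + staging_directory.strip("/")


# Hypothetical bucket and path, using the default staging directory.
print(staging_location("s3://my-bucket/transformer"))
# s3://my-bucket/transformer/streamsets
```

Whatever the resulting location is, it must already exist in the bucket before the pipeline runs.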
Log Level - Log level to use for the launched Spark application.
Extra Spark Configuration - Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
AWS Region - AWS region that contains the EMR cluster. Select one of the available regions. If the region is not listed, select Other and then enter the name of the AWS region.
S3 Staging URI - Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline.
Use the following format: s3://<bucket>/<path>
Define Cluster Start Option - Defines how the pipeline accesses the cluster to run the pipeline:
- Existing Cluster - Uses an existing cluster. Specify the cluster by cluster ID or name and tags.
- Provision New Cluster - Uses a cluster provisioned by Transformer based on the specified properties.
- AWS Service Catalog - Uses a cluster provisioned by AWS Service Catalog based on an EMR cluster product template and specified properties. Provisioning a cluster with AWS Service Catalog requires completing prerequisite tasks.
Cluster by Name and Tags - Enables specifying a cluster by cluster name and tags instead of by cluster ID. Available when using an existing cluster.
Cluster ID - ID of the existing cluster to run the pipeline. Available when using an existing cluster and not specifying the cluster by name and tags.
Cluster Name - Name of the existing cluster to run the pipeline. This property is case-sensitive. Available when specifying an existing cluster by name and tags.
Cluster Tags - Tag name and values to use to differentiate between multiple clusters with the specified cluster name. Click Add to define a tag. Click Add Another to define additional tags.
Available when specifying an existing cluster by name and tags.
Define Bootstrap Actions - Enables defining bootstrap actions to execute before processing data. Available for clusters provisioned by Transformer.
Bootstrap Actions Source - Location of bootstrap actions scripts:
- Executable Files in S3
- Defined in Pipeline
Available when defining bootstrap actions.
Bootstrap Actions Scripts - Contents of a bootstrap actions script. Click the Add icon to add additional scripts. Available when the bootstrap actions source is Defined in Pipeline.
Bootstrap Actions - Bootstrap actions scripts to execute. Define the following properties for each script that you want to use:
- Location - Path to the script in S3.
- Arguments - Comma-separated list of arguments to use with the script.
Click the Add icon to add additional scripts.
Available when the bootstrap actions source is Executable Files in S3.
EMR Version - EMR cluster version to provision. Available for clusters provisioned by Transformer.
Cluster Name Prefix - Prefix for the name of the provisioned EMR cluster. Available for clusters provisioned by Transformer.
Terminate Cluster - Terminates the provisioned cluster when the pipeline stops. When cleared, the cluster remains active after the pipeline stops.
Available for clusters provisioned by Transformer.
Logging Enabled - Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates. Available for clusters provisioned by Transformer.
S3 Log URI - Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: s3://<bucket>/<path>
The bucket must exist before you start the pipeline.
Available when you enable logging for a cluster provisioned by Transformer.
Service Role - EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks. Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.
Available for clusters provisioned by Transformer.
Job Flow Role - EMR role for the EC2 instances within the cluster used to perform pipeline tasks. Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.
Available for clusters provisioned by Transformer.
SSH EC2 Key ID - SSH key used to access the EMR cluster nodes. Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes.
For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation.
Available for clusters provisioned by Transformer.
Visible to All Users - Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster. Available for clusters provisioned by Transformer.
EC2 Subnet ID - EC2 subnet identifier to launch the provisioned cluster in. Available for clusters provisioned by Transformer.
Master Security Group - ID of the security group on the master node in the cluster.
Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.
Available for clusters provisioned by Transformer.
Slave Security Group - Security group ID for the slave nodes in the cluster. Available for clusters provisioned by Transformer.
Instance Count - Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster. Minimum is 2. Using an additional instance for each partition can improve pipeline performance.
Available for clusters provisioned by Transformer.
Master Instance Type - EC2 instance type for the master node in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type.
Available for clusters provisioned by Transformer.
Master Instance Type (Custom) - Custom EC2 instance type for the master node. Available when you select Custom for the Master Instance Type property.
Slave Instance Type - EC2 instance type for the slave nodes in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type.
Available for clusters provisioned by Transformer.
Slave Instance Type (Custom) - Custom EC2 instance type for the slave nodes. Available when you select Custom for the Slave Instance Type property.
AWS Cluster Tags - AWS tags assigned to each EMR cluster provisioned. For each tag, specify a tag name or key and a tag value. Available for clusters provisioned by Transformer.
Provisioned Product Name - Name for the provisioned cluster. Available for clusters provisioned by AWS Service Catalog when not generating the product name.
Generate Product Name - Transformer generates the product name for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog.
Project ID - Project ID for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog.
Version Name - Version name for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog.
Parameters - Optional parameter names and values to pass to AWS Service Catalog. They must correspond to parameters allowed by your product template. For more information, see the AWS documentation. Available for clusters provisioned by AWS Service Catalog.
Terminate Provisioned Product - Terminates the provisioned cluster and associated AWS Service Catalog product when the pipeline stops. Available for clusters provisioned by AWS Service Catalog.
Max Retries - Maximum number of times to retry a failed request or throttling error.
Retry Base Delay - Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Throttling Retry Base Delay - Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Max Backoff - The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
Enable Server-Side Encryption - Option that Amazon S3 uses to manage encryption keys for server-side encryption:
- None - Do not use server-side encryption.
- SSE-S3 - Use Amazon S3-managed keys.
- SSE-KMS - Use Amazon Web Services KMS-managed keys.
Default is None.
AWS KMS Key ARN - Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format: arn:<partition>:kms:<region>:<account-id>:key/<key-id>
For example:
arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
Used for SSE-KMS encryption only.
Execution Role - EMR runtime role of the Spark job that Transformer submits to the cluster. Runtime roles determine access to AWS resources. For more information about configuring runtime roles, see the Amazon EMR documentation.
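The interaction between the Max Retries, Retry Base Delay, and Max Backoff properties described in this table can be sketched as a capped doubling schedule. The Python below is a minimal illustration, not the Transformer or AWS SDK implementation: the function name and the sample values are hypothetical, and it assumes the first retry waits exactly the base delay with no random jitter.

```python
def retry_delays(max_retries, base_delay_ms, max_backoff_ms):
    """Delay in milliseconds before each retry: doubled each time, capped at the max backoff."""
    return [
        min(base_delay_ms * 2 ** attempt, max_backoff_ms)
        for attempt in range(max_retries)
    ]


# Hypothetical settings: 5 retries, 100 ms base delay, 1000 ms max backoff.
print(retry_delays(5, 100, 1000))  # [100, 200, 400, 800, 1000]
```

The same schedule would apply to throttling errors, with the Throttling Retry Base Delay value as the starting point.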
For a pipeline that runs on an EMR Serverless application, also configure the following properties. Then, continue configuring pipeline properties.
EMR Serverless Property - Description
Connection - Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Authentication Method - Authentication method used to connect to Amazon Web Services (AWS):
- AWS Keys - Authenticates using an AWS access key pair.
- Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
- None - Connects to a public bucket using no authentication.
Access Key ID - AWS access key ID. Required when using AWS keys to authenticate with AWS.
Secret Access Key - AWS secret access key. Required when using AWS keys to authenticate with AWS.
Assume Role - Temporarily assumes another role to authenticate with AWS.
Role Session Name - Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.
Available when assuming another role.
Role ARN - Amazon resource name (ARN) of the role to assume, entered in the following format:
arn:aws:iam::<account_id>:role/<role_name>
Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.
Available when assuming another role.
Session Timeout - Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.
Set to a value between 3,600 seconds and 43,200 seconds.
Available when assuming another role.
Set Session Tags - Sets a session tag to record the name of the currently logged in IBM StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.
Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.
When cleared, the connection does not set a session tag.
External ID - External ID included in an IAM trust policy that allows the specified role to be assumed. Available when assuming another role.
Application Name - Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Staging Directory - Path of an intermediate location, under the location specified in the S3 Staging URI property, used to store needed files. To determine the staging location, Transformer concatenates the specified path with the S3 Staging URI property as follows: <S3 staging URI>/<staging directory>.
The path must exist at the staging URI before you run the pipeline.
Default is /streamsets.
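The Application Name property, described earlier in this table and repeated for each cluster type, states that Spark lowercases the name, removes spaces, and appends the run number. A minimal Python sketch of that transformation follows. The function name is hypothetical, and the "_run" separator is inferred from the single documented example, so treat it as an assumption.

```python
def launched_application_name(name, run_number):
    """Lowercase the name, drop spaces, and append the pipeline run number."""
    # Assumption: the suffix format "_run<N>" is taken from the documented example.
    return name.lower().replace(" ", "") + "_run" + str(run_number)


print(launched_application_name("My Application", 1))  # myapplication_run1
```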
Log Level Log level to use for the launched Spark application.
Extra Spark Configuration Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
AWS Region AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
Endpoint Endpoint to connect to when you select Other for the region. Enter the endpoint name.
S3 Staging URI Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline.
Use the following format: s3://<bucket>/<path>
Create a New Application Creates a new EMR Serverless application to run the pipeline.
Application by Name and Tags Enables specifying an application by application name and tags instead of by application ID. Available when using an existing application.
EMR Application Name Name of the existing application to run the pipeline. This property is case-sensitive. Available when specifying an application by name and tags.
EMR Application Tags Tag name and values to use to differentiate between multiple applications with the specified application name. Click Add to define a tag. Click Add Another to define additional tags.
Available when specifying an application by name and tags.
Application ID ID of an existing EMR Serverless application to run the pipeline. Available when not creating a new application and when not specifying an application by name and tags.
Runtime Role ARN Identity and Access Management (IAM) role used by the job. The role must have access to the data sources, targets, scripts, and libraries that the job uses. Enter the role in the following format: arn:aws:iam::<account_id>:role/<role_name>
Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.
EMR Version Amazon EMR version of the created application. Transformer supports version 6.9.0 and later 6.x. Available when creating a new application.
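The Runtime Role ARN format described above can be assembled and checked with a short sketch. The helper names and the sample account ID and role name are hypothetical. The pattern assumes a 12-digit AWS account ID and the characters that IAM permits in role names and paths. It checks only the shape of the string, not whether the role exists or can be assumed.

```python
import re

# Shape of arn:aws:iam::<account_id>:role/<role_name>.
ROLE_ARN_PATTERN = re.compile(
    r"arn:aws:iam::(?P<account_id>\d{12}):role/(?P<role_name>[\w+=,.@/-]+)",
    re.ASCII,
)


def build_role_arn(account_id: str, role_name: str) -> str:
    """Format an account ID and role name as a role ARN."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def parse_role_arn(arn: str) -> tuple[str, str]:
    """Split a role ARN into (account_id, role_name), or raise ValueError."""
    match = ROLE_ARN_PATTERN.fullmatch(arn)
    if match is None:
        raise ValueError(f"Not a role ARN in the expected format: {arn!r}")
    return match["account_id"], match["role_name"]


# Hypothetical account ID and role name.
arn = build_role_arn("123456789012", "transformer-emr-role")
print(arn)
print(parse_role_arn(arn))
```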
Application Name Prefix Prefix automatically added to the EMR Serverless application name. A prefix can help you identify the created applications in the EMR Studio console.
Available when creating a new application.
Stop Application Stops the EMR Serverless application when the pipeline stops. Available when creating a new application.
Subnet IDs IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline. Specify subnets in the virtual private cloud (VPC) where the EMR Serverless application resides. Available when creating a new application.
Security Group IDs ID of one or more security groups that can communicate with Transformer and the origin and destination systems configured in the pipeline. Available when creating a new application.
Maximum CPU (vCPU) Maximum number of vCPUs that the application can scale to. Available when creating a new application.
Maximum Memory (GB) Maximum memory, specified in GB, that the application can scale to. Available when creating a new application.
Maximum Disk (GB) Maximum disk size, specified in GB, that the application can scale to. Available when creating a new application.
Logging Enabled Copies log files from job runs to a specified Amazon S3 location. Select to provide access to log data after the application stops.
S3 Log URI Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: s3://<bucket>/<path>
The bucket must exist before you start the pipeline.
Available when Logging Enabled is selected.
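The retry properties listed next in this table (Max Retries, Retry Base Delay, Throttling Retry Base Delay, and Max Backoff) describe a doubling delay with an upper limit. The following sketch shows the resulting schedule of waits. The function name is hypothetical, and the sketch assumes that the first retry waits exactly the base delay and that no random jitter is added. Neither detail is confirmed by the property descriptions.

```python
def retry_delays(max_retries: int, base_delay_ms: int, max_backoff_ms: int) -> list[int]:
    """Return the wait in milliseconds before each retry attempt.

    The base delay doubles for each subsequent retry and is capped at
    max_backoff_ms, as described for the Max Backoff property.
    """
    return [
        min(base_delay_ms * 2 ** attempt, max_backoff_ms)
        for attempt in range(max_retries)
    ]


# Example values chosen for illustration only.
print(retry_delays(max_retries=5, base_delay_ms=100, max_backoff_ms=1000))
# [100, 200, 400, 800, 1000]
```

The same calculation applies to throttling errors, using the Throttling Retry Base Delay value as the base delay.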
AWS Tags AWS tags assigned to all applications and job runs created. For each tag, specify a tag name or key and a tag value.
Max Retries Maximum number of times to retry a failed request or throttling error.
Retry Base Delay Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Throttling Retry Base Delay Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Max Backoff The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
For a pipeline that runs on a Hadoop YARN cluster, also configure the following properties, and then continue configuring pipeline properties:
Hadoop YARN Property Description
Deployment Mode Deployment mode to use:
- Client - Launches the Spark driver program locally.
- Cluster - Launches the Spark driver program remotely on one of the nodes inside the cluster.
For more information about deployment modes, see the Apache Spark documentation.
Hadoop User Name Name of the Hadoop user that Transformer impersonates to launch the Spark application and to access files in the Hadoop system. When using this property, make sure impersonation is enabled for the Hadoop system. When not configured, Transformer impersonates the user who starts the pipeline.
When Transformer uses Kerberos authentication or is configured to always impersonate the user who starts the pipeline, this property is ignored. For more information, see Kerberos Authentication and Hadoop Impersonation Mode.
Application Name Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Log Level Log level to use for the launched Spark application.
Extra Spark Configuration Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
Use YARN Kerberos Keytab Use a Kerberos principal and keytab to launch the Spark application and to access files in the Hadoop system. Transformer includes the keytab file with the launched Spark application. When not selected, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system.
Enable for long-running pipelines when Transformer is enabled for Kerberos authentication.
Keytab Source Source to use for the pipeline keytab file:
- Transformer Configuration File - Use the same Kerberos keytab and principal configured for Transformer in the Transformer configuration file.
- Pipeline Configuration - File - Use a specific Kerberos keytab file and principal for this pipeline. Store the keytab file on the Transformer machine.
- Pipeline Configuration - Credential Store - Use a specific Kerberos keytab file and principal for this pipeline. Store the Base64-encoded keytab file in a credential store.
Available when using a Kerberos principal and keytab for the pipeline.
YARN Kerberos Keytab Path Absolute path to the keytab file stored on the Transformer machine. Available when using Pipeline Configuration - File as the keytab source.
Keytab Credential Function Credential function used to retrieve the Base64-encoded keytab from the credential store. Use the credential:get() or credential:getWithOptions() credential function.
For example, the following expression retrieves a Base64-encoded keytab stored in the clusterkeytab secret within the azure credential store: ${credential:get("azure", "devopsgroup", "clusterkeytab")}
Note: The user who starts the pipeline must be in the Transformer group specified in the credential function, devopsgroup in the example above. When Transformer requires a group secret, the user must also be in a group associated with the keytab.
Available when using Pipeline Configuration - Credential Store as the keytab source.
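Because the credential store must hold the keytab in Base64-encoded form, the encoding step can be sketched with the Python standard library. The function names are hypothetical, and the byte string stands in for the contents of a real keytab file. How the encoded value is uploaded depends on the credential store and is not shown.

```python
import base64


def encode_keytab(keytab_bytes: bytes) -> str:
    """Base64-encode raw keytab bytes as text suitable for a secret value."""
    return base64.b64encode(keytab_bytes).decode("ascii")


def decode_keytab(encoded: str) -> bytes:
    """Reverse the encoding, for example to verify the stored value."""
    return base64.b64decode(encoded)


# Placeholder bytes. In practice, read the keytab file in binary mode.
raw = b"\x05\x02placeholder-keytab-contents"
secret_value = encode_keytab(raw)
assert decode_keytab(secret_value) == raw
```

Checking the round trip before storing the secret helps catch a keytab that was read in text mode or truncated.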
YARN Kerberos Principal Kerberos principal name that the pipeline runs as. The specified keytab file must contain the credentials for this Kerberos principal. Available when using either pipeline configuration as the keytab source.
For a pipeline that runs locally, also configure the following properties:
Local Property Description
Master URL Local master URL to use to connect to Spark. You can define any valid local master URL as described in the Spark Master URL documentation. Default is local[*], which runs the pipeline in the local Spark installation using the same number of worker threads as logical cores on the machine.
Application Name Name of the launched Spark application. Enter a name containing alphanumeric characters and underscores or enter a StreamSets expression that evaluates to the name. Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.
When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name: myapplication_run1
Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
Log Level Log level to use for the launched Spark application.
Extra Spark Configuration Additional Spark configuration properties to use. To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by Spark.
Important: Databricks clusters only implement these configuration properties if you enable the Provision a New Cluster property.
- Script Type - Location of the script:
-
To define runtime
parameters, on the Parameters tab, click the
Add icon and define the name and the default value
for each parameter.
You can use simple or bulk edit mode to configure the parameters.
-
On the Advanced tab, optionally configure the following
properties:
Advanced Property Description
Cluster Callback URL Callback URL for the Spark cluster to use to communicate with Transformer. Overrides the Transformer URL configured in the Transformer configuration file.
Important: Do not define a cluster callback URL when you plan to enable pipeline failover for the job that includes this pipeline. To support failover, the pipeline must use the default Transformer URL.
Cache Level Specifies how and where Spark caches data, when needed:
- None
- Disk only
- Memory only
- Memory only with serialization
- Memory and disk
- Memory and disk with serialization
- Off heap
For more information about these options, see the Spark documentation.
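When looking up these options in the Spark documentation, it can help to know the likely corresponding Spark storage level names. The mapping below is an assumption based on matching the option labels to the constants defined by Spark's StorageLevel class. The pairing is not stated on this page, and the lookup function is hypothetical.

```python
# Assumed correspondence between Cache Level options and Spark
# StorageLevel constant names. Not confirmed by the Transformer documentation.
CACHE_LEVEL_TO_STORAGE_LEVEL = {
    "None": "NONE",
    "Disk only": "DISK_ONLY",
    "Memory only": "MEMORY_ONLY",
    "Memory only with serialization": "MEMORY_ONLY_SER",
    "Memory and disk": "MEMORY_AND_DISK",
    "Memory and disk with serialization": "MEMORY_AND_DISK_SER",
    "Off heap": "OFF_HEAP",
}


def storage_level_name(cache_level: str) -> str:
    """Return the assumed Spark storage level name for a Cache Level option."""
    try:
        return CACHE_LEVEL_TO_STORAGE_LEVEL[cache_level]
    except KeyError:
        raise ValueError(f"Unknown cache level: {cache_level!r}") from None


print(storage_level_name("Memory and disk"))  # MEMORY_AND_DISK
```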
Cache Replicas Determines the number of cache replicas to keep.
Preprocessing Script Scala script to run before the pipeline starts. Develop the script using the Spark APIs for the version of Spark installed on your cluster.
-
Use the Stage Library panel to add an origin stage. In the Properties panel,
configure the stage properties.
For configuration details about origin stages, see Origins.
-
Use the Stage Library panel to add the next stage that you want to use, connect
the origin to the new stage, and configure the new stage.
For configuration details about processors, see Processors.
For configuration details about destinations, see Destinations.
- Add additional stages as necessary.
-
At any point, use the Preview icon to
preview data to help configure the pipeline.
Preview becomes available in partial pipelines when all existing stages are connected and configured.
-
When the pipeline is validated and complete, use the
Start icon to run the pipeline.
When Transformer starts the pipeline, monitor mode displays real-time statistics for the pipeline.