Configuring a Pipeline

Configure a pipeline to define the flow of data. After you configure the pipeline, you can start it.

A pipeline can include multiple origin, processor, and destination stages.

  1. From the Home page or Getting Started page, click Create New Pipeline.
    Tip: To get to the Home page, click the Home icon.
  2. In the New Pipeline window, configure the following properties:
    Pipeline Property Description
    Title Title of the pipeline.

    Transformer uses the alphanumeric characters entered for the pipeline title as a prefix for the generated pipeline ID. For example, if you enter My Pipeline *&%&^^ 123 as the pipeline title, then the pipeline ID has the following value: MyPipeline123tad9f592-5f02-4695-bb10-127b2e41561c.

    Description Optional description of the pipeline.
    Pipeline Label Optional labels to assign to the pipeline.

    Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.

    You can use nested labels to create a hierarchy of pipeline groupings. Enter nested labels using the following format:
    <label1>/<label2>/<label3>
    For example, you might want to group pipelines in the test environment by the origin system. You add the labels Test/HDFS and Test/Elasticsearch to the appropriate pipelines.
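    The pipeline ID derivation described for the Title property can be sketched as follows. This is an illustrative approximation, not Transformer's actual implementation: keep only the alphanumeric characters of the title, then append a generated unique ID.

```python
import re
import uuid

def generate_pipeline_id(title: str) -> str:
    # Keep only the alphanumeric characters of the title, then append
    # a generated unique ID, mirroring the prefix behavior described above.
    prefix = re.sub(r"[^A-Za-z0-9]", "", title)
    return f"{prefix}{uuid.uuid4()}"

print(generate_pipeline_id("My Pipeline *&%&^^ 123"))
# The prefix is always "MyPipeline123"; the unique suffix varies per call.
```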
  3. Click Save.
    The pipeline canvas displays the pipeline title, the generated pipeline ID, and an error icon. The error icon indicates that the pipeline is empty. The Properties panel displays the pipeline properties.
  4. In the Properties panel, on the General tab, configure the following properties:
    Pipeline Property Description
    Title Optionally edit the title of the pipeline.

    Because the generated pipeline ID is used to identify the pipeline, any changes to the pipeline title are not reflected in the pipeline ID.

    Description Optionally edit or add a description of the pipeline.
    Labels Optionally edit or add labels assigned to the pipeline.
    Execution Mode Execution mode of the pipeline:
    • Batch - Processes available data, limited by the configuration of pipeline origins, and then the pipeline stops.
    • Streaming - Maintains connections to origin systems and processes data as it becomes available. The pipeline runs continuously until you manually stop it.
    Trigger Interval Milliseconds to wait between processing batches of data.

    In pipelines with origins configured to read from multiple tables, milliseconds to wait after processing one batch for each of the tables.

    For streaming execution mode only.

    Enable Ludicrous Mode Enables predicate and filter pushdown to optimize queries so unnecessary data is not processed.
    Collect Input Metrics Collects and displays pipeline input statistics for a pipeline running in ludicrous mode.
    By default, only pipeline output statistics display when a pipeline runs in ludicrous mode.
    Note: For data formats that do not include metrics in the metadata, such as Avro, CSV, and JSON, Transformer must re-read origin data to generate the input statistics. This can slow pipeline performance.
  5. On the Cluster tab, select the cluster manager type for the pipeline in the Cluster Manager Type property.
  6. Configure the remaining properties on the Cluster tab based on the selected cluster manager type.
    For all cluster manager types, configure the following properties:
    Cluster Property Description
    Application Name Name of the launched Spark application. Enter a name or a StreamSets expression that evaluates to the name.

    Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.

    When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name:
    myapplication_run1

    Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
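    The name transformation described above, where Spark lowercases the name, removes spaces, and appends the run number, can be sketched as:

```python
def spark_app_name(name: str, run_number: int) -> str:
    # Spark lowercases the name, removes spaces, and appends the run number.
    return f"{name.lower().replace(' ', '')}_run{run_number}"

print(spark_app_name("My Application", 1))  # myapplication_run1
```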

    Log Level Log level to use for the launched Spark application.
    Extra Spark Configuration Additional Spark configuration properties to use.

    To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by Spark.
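    For example, you might add Spark properties such as the following. The property names are standard Spark configuration properties; the values shown are illustrative:

```properties
spark.executor.memory=4g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
```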

    For a pipeline that runs on a Cloudera Data Engineering cluster, also configure the following properties:
    CDE Property Description
    Jobs API URL Jobs API URL for the CDE virtual cluster where you want the pipeline to run.
    Job Resource Name of the CDE job resource to store pipeline resources.
    Resource File Prefix Prefix to add to resource files stored in the job resource.
    Authentication API URL Authentication API URL for the virtual cluster where the pipeline runs. Used to obtain a CDE access token.
    Workload User User name to use to obtain the access token.
    Workload Password Password to use to obtain the access token.
    For a pipeline that runs on a Databricks cluster, also configure the following properties:
    Databricks Property Description
    URL to Connect to Databricks Databricks URL for your account. Use the following format:

    https://<your_domain>.cloud.databricks.com

    Staging Directory Staging directory on Databricks File System (DBFS) where Transformer stores the StreamSets resources and files needed to run the pipeline as a Databricks job.

    When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Credential Type Type of credential used to connect to Databricks: Username/Password or Token.
    Username Databricks user name.
    Password Password for the account.
    Token Personal access token for the account.
    Provision a New Cluster Provisions a new Databricks job cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing interactive cluster.

    Init Scripts Cluster-scoped init scripts to execute before processing data. Configure the following properties for each init script that you want to use:
    • Script Type - Location of the script:
      • DBFS from Pipeline - Databricks File System (DBFS) init script defined in the pipeline. When provisioning the cluster, Transformer temporarily stores the script in DBFS and removes it after the pipeline run.
      • DBFS from Location - Databricks File System init script stored on Databricks.
      • S3 from Location - Amazon S3 init script stored on AWS. Use only when provisioning a Databricks cluster on AWS.
      • ABFSS from Location - Azure init script stored on Azure Blob File System (ABFS). Use only when provisioning a Databricks cluster on Azure.
        Note: To use this option, you must provide an access key to access the init script.
    • DBFS Script - Contents of the Databricks cluster-scoped init script.

      Available when you select the DBFS from Pipeline script type.

    • DBFS Script Location - Path to the script on DBFS. For example: dbfs:/databricks/scripts/postgresql-install.sh

      Available when you select the DBFS from Location script type.

    • S3 Script Location - Path to the script on Amazon S3. For example: s3://databricks/scripts/postgresql-install.sh

      Available when you select the S3 from Location script type.

    • AWS Region - AWS region where the init script is located.

      Available when you select the S3 from Location script type.

    • ABFSS Script Location - Location of the script on Azure Blob File System.

      Available when you select the ABFSS from Location script type.

    Cluster Configuration Configuration properties for a provisioned Databricks job cluster.

    Configure the listed properties and add additional Databricks cluster properties as needed, in JSON format. Transformer uses the Databricks default values for Databricks properties that are not listed.

    Include the instance_pool_id property to provision a cluster that uses an existing instance pool.

    Use the property names and values as expected by Databricks.
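    As a sketch, the cluster configuration JSON might look like the following. The property names follow the Databricks cluster specification, and every value shown is illustrative:

```json
{
  "num_workers": 2,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge"
}
```

    To provision a cluster that uses an existing instance pool, you would add the instance_pool_id property, typically in place of a directly specified node type.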

    Terminate Cluster Terminates the provisioned job cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    Cluster ID ID of an existing Databricks interactive cluster to run the pipeline. Specify a cluster ID when not provisioning a cluster to run the pipeline.
    Note: When using an existing interactive cluster, all Transformer pipelines that the cluster runs must be built by the same version of Transformer.
    For a pipeline that runs on a Dataproc cluster, also configure the following properties:
    Dataproc Property Description
    Project ID Google Cloud project ID.
    Region Region to create the cluster in. Select a region or select Custom and enter a region name.
    Custom Custom region to create the cluster in. Available when you select Custom for the Region property.
    Credentials Provider Provider for the Google Cloud credentials used to connect.
    Credentials File Path (JSON) Path to the Google Cloud service account credentials file that the pipeline uses to connect. The credentials file must be a JSON file.

    Enter a path relative to the Transformer resources directory, $TRANSFORMER_RESOURCES, or enter an absolute path.

    Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.
    GCS Staging URI Staging location in Dataproc where Transformer stores the StreamSets resources and files needed to run the pipeline as a Dataproc job.

    When a pipeline runs on an existing cluster, configure pipelines to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Create Cluster Provisions a new Dataproc cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing cluster.

    Cluster Name Name of the existing cluster to run the pipeline. Use the full Dataproc cluster name.
    Cluster Prefix Optional prefix to add to the provisioned cluster name.
    Image Version Image version to use for the provisioned cluster.

    Specify the full image version name, such as 1.4-ubuntu18 or 1.3-debian10.

    When not specified, Transformer uses the default Dataproc image version.

    For a list of Dataproc image versions, see the Dataproc documentation.

    Master Machine Type Master machine type to use for the provisioned cluster.
    Worker Machine Type Worker machine type to use for the provisioned cluster.
    Network Type Network type to use for the provisioned cluster:
    • Auto - Uses a VPC network type in auto mode.
    • Custom - Uses a VPC network type with the specified subnet name.
    • Default VPC for project and region - Uses the default VPC for the project ID and region specified for the cluster.

    For more information about network types, see the Dataproc documentation.

    Subnet Name Subnet name for the custom VPC network.
    Network Tags Optional network tags to apply to the provisioned cluster.

    For more information, see the Dataproc documentation.

    Worker Count Number of workers to use for a provisioned cluster.

    Minimum is 2. Using an additional worker for each partition can improve pipeline performance.

    This property is ignored if you enable dynamic allocation using spark.dynamicAllocation.enabled as an extra Spark configuration property.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    For a pipeline that runs on an EMR cluster, also configure the following properties:
    EMR Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
    • None - Connects to a public bucket using no authentication.
    Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
    Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Assume Role Temporarily assumes another role to authenticate with AWS.
    Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.
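    A minimal trust policy of this kind might look like the following sketch. The principal shown is illustrative; in practice, narrow it to the identity that actually assumes the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<account_id>:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```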

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    Staging Directory Staging directory on the Amazon S3 staging URI.

    Used to store the Transformer resources and files needed to run the pipeline as an EMR job. The specified staging directory is used with the S3 Staging URI property as follows: <S3 staging URI>/<staging directory>.

    When a pipeline runs on an existing cluster, you might configure pipelines to use the same S3 staging URI and staging directory. This allows EMR to reuse the common files stored in that location. Pipelines that run on different clusters can also use the same staging locations as long as the pipelines are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging locations.

    Both the staging directory and S3 staging URI must exist before you run the pipeline.

    Default is /streamsets.

    AWS Region AWS region that contains the EMR cluster. Select one of the available regions.

    If the region is not listed, select Other and then enter the name of the AWS region.

    S3 Staging URI URI for an Amazon S3 directory to store the Transformer resources and files needed to run the pipeline.

    The specified URI is used with the staging directory defined on the Cluster tab of the pipeline as follows: <S3 staging URI>/<staging directory>

    Both the S3 staging URI and staging directory must exist before you run the pipeline.

    Provision a New Cluster
    Provisions a new cluster to run the pipeline.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    For more information about running a pipeline on a provisioned cluster, see Provisioned Cluster.
    Define Bootstrap Actions Enables defining bootstrap actions to execute before processing data.

    Available only for provisioned clusters.

    Bootstrap Actions Source Location of bootstrap actions scripts:
    • Executable Files in S3
    • Defined in Pipeline

    Available when defining bootstrap actions.

    Bootstrap Actions Scripts Contents of a bootstrap actions script. Click the Add icon to add additional scripts.

    Available when the bootstrap actions source is Defined in Pipeline.

    Bootstrap Actions Bootstrap actions scripts to execute. Define the following properties for each script that you want to use:
    • Location - Path to the script in S3.
    • Arguments - Comma-separated list of arguments to use with the script.

    Click the Add icon to add additional scripts.

    Available when the bootstrap actions source is Executable Files in S3.

    Cluster ID ID of the existing cluster to run the pipeline. Available when not provisioning a cluster.

    For more information about running a pipeline on an existing cluster, see Existing Cluster.

    EMR Version EMR cluster version to provision. Transformer supports version 5.20.0 and later 5.x versions.

    Available only for provisioned clusters.

    Cluster Name Prefix Prefix for the name of the provisioned EMR cluster.

    Available only for provisioned clusters.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.

    When cleared, the cluster remains active after the pipeline stops.

    Available only for provisioned clusters.

    Logging Enabled Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates.

    Available only for provisioned clusters.

    S3 Log URI Location in Amazon S3 to store pipeline log data.
    Location must be unique for each pipeline. Use the following format:
    s3://<bucket>/<path>

    The bucket must exist before you start the pipeline.

    Available when you enable logging for a provisioned cluster.

    Service Role EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks.

    Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Job Flow Role EMR role for the EC2 instances within the cluster used to perform pipeline tasks.

    Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    SSH EC2 Key ID SSH key used to access the EMR cluster nodes.

    Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes.

    For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Visible to All Users Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster.

    Available only for provisioned clusters.

    EC2 Subnet ID EC2 subnet identifier to launch the provisioned cluster in.

    Available only for provisioned clusters.

    Master Security Group ID of the security group on the master node in the cluster.
    Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Slave Security Group Security group ID for the slave nodes in the cluster.

    Available only for provisioned clusters.

    Instance Count Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster.

    Minimum is 2. Using an additional instance for each partition can improve pipeline performance.

    Available only for provisioned clusters.

    Master Instance Type EC2 instance type for the master node in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Master Instance Type (Custom) Custom EC2 instance type for the master node. Available when you select Custom for the Master Instance Type property.
    Slave Instance Type EC2 instance type for the slave nodes in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Slave Instance Type (Custom) Custom EC2 instance type for the slave nodes. Available when you select Custom for the Slave Instance Type property.
    Max Retries Maximum number of times to retry a failed request or throttling error.
    Retry Base Delay Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
    Throttling Retry Base Delay Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
    Max Backoff Maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
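    The behavior these retry properties describe is standard exponential backoff with a cap: the base delay doubles with each subsequent retry, up to the maximum backoff. A sketch of the delay calculation, as an illustrative model rather than Transformer's exact implementation:

```python
def retry_delay_ms(base_delay_ms: int, retry_number: int, max_backoff_ms: int) -> int:
    # Delay before retry N (0-based): the base delay doubles with each
    # subsequent retry, capped at the configured maximum backoff.
    return min(base_delay_ms * (2 ** retry_number), max_backoff_ms)

# With a 100 ms base delay and a 5000 ms maximum backoff:
print([retry_delay_ms(100, n, 5000) for n in range(7)])
# [100, 200, 400, 800, 1600, 3200, 5000]
```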
    Enable Server-Side Encryption Option that Amazon S3 uses to manage encryption keys for server-side encryption:
    • None - Do not use server-side encryption.
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.

    Default is None.

    AWS KMS Key ARN Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format:
    arn:<partition>:kms:<region>:<account-id>:key/<key-id>

    For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-KMS encryption only.

    Execution Role EMR runtime role of the Spark job that Transformer submits to the cluster. Runtime roles determine access to AWS resources.

    For more information about configuring runtime roles, see the Amazon EMR documentation.

    For a pipeline that runs on a Hadoop YARN cluster, also configure the following properties:
    Hadoop YARN Property Description
    Deployment Mode Deployment mode to use:
    • Client - Launches the Spark driver program locally.
    • Cluster - Launches the Spark driver program remotely on one of the nodes inside the cluster.

    For more information about deployment modes, see the Apache Spark documentation.

    Hadoop User Name Name of the Hadoop user that Transformer impersonates to launch the Spark application and to access files in the Hadoop system. When using this property, make sure impersonation is enabled for the Hadoop system.

    When not configured, Transformer impersonates the user who starts the pipeline.

    When Transformer uses Kerberos authentication or is configured to always impersonate the user who starts the pipeline, this property is ignored. For more information, see Kerberos Authentication and Hadoop Impersonation Mode.

    Use YARN Kerberos Keytab Use a Kerberos principal and keytab to launch the Spark application and to access files in the Hadoop system. Transformer includes the keytab file with the launched Spark application.

    When not selected, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system.

    Enable for long-running pipelines when Transformer is enabled for Kerberos authentication.

    Keytab Source Source to use for the pipeline keytab file:
    • Transformer Configuration File - Use the same Kerberos keytab and principal configured for Transformer in the Transformer configuration properties.
    • Pipeline Configuration - File - Use a specific Kerberos keytab file and principal for this pipeline. Store the keytab file on the Transformer machine.
    • Pipeline Configuration - Credential Store - Use a specific Kerberos keytab file and principal for this pipeline. Store the Base64-encoded keytab file in a credential store.

    Available when using a Kerberos principal and keytab for the pipeline.

    YARN Kerberos Keytab Path Absolute path to the keytab file stored on the Transformer machine.

    Available when using Pipeline Configuration - File as the keytab source.

    Keytab Credential Function Credential function used to retrieve the Base64-encoded keytab from the credential store. Use the credential:get() or credential:getWithOptions() credential function.
    For example, the following expression retrieves a Base64-encoded keytab stored in the clusterkeytab secret within the azure credential store:
    ${credential:get("azure", "devopsgroup", "clusterkeytab")}
    Note: The user who starts the pipeline must be in the Transformer group specified in the credential function, devopsgroup in the example above. When Transformer requires a group secret, the user must also be in a group associated with the keytab.

    Available when using Pipeline Configuration - Credential Store as the keytab source.

    YARN Kerberos Principal Kerberos principal name that the pipeline runs as. The specified keytab file must contain the credentials for this Kerberos principal.

    Available when using either pipeline configuration as the keytab source.

    For a pipeline that runs locally, also configure the following property:
    Local Property Description
    Master URL Local master URL to use to connect to Spark. You can define any valid local master URL as described in the Spark Master URL documentation.

    Default is local[*], which runs the pipeline in the local Spark installation using the same number of worker threads as logical cores on the machine.

    For a pipeline that runs on SQL Server 2019 BDC, also configure the following properties:
    SQL Server 2019 Big Data Cluster Property Description
    Livy Endpoint SQL Server 2019 BDC Livy endpoint that enables submitting Spark jobs.

    For information about retrieving the Livy endpoint, see Retrieving Connection Information.

    User Name

    Controller user name to submit Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Password

    Knox password for the controller user name that allows submitting Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Staging Directory Staging directory on SQL Server 2019 BDC where Transformer stores the StreamSets resources and files needed to run the pipeline.

    Default is /streamsets.

    Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

  7. To define runtime parameters, on the Parameters tab, click the Add icon and define the name and the default value for each parameter.
    You can use simple or bulk edit mode to configure the parameters.
  8. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Cluster Callback URL Callback URL for the Spark cluster to use to communicate with Transformer. Define a URL when the web browser and the Spark cluster must use a different URL to access Transformer.

    Overrides the Transformer URL configured in Transformer configuration properties.

    Important: Do not define a cluster callback URL when you plan to enable pipeline failover for the job that includes this pipeline. To support failover, the pipeline must use the default Transformer URL.
    Preprocessing Script Scala script to run before the pipeline starts.

    Develop the script using the Spark APIs for the version of Spark installed on your cluster.

  9. Use the Stage Library panel to add an origin stage. In the Properties panel, configure the stage properties.

    For configuration details about origin stages, see Origins.

  10. Use the Stage Library panel to add the next stage that you want to use, connect the origin to the new stage, and configure the new stage.

    For configuration details about processors, see Processors.

    For configuration details about destinations, see Destinations.

  11. Add additional stages as necessary.
  12. At any point, use the Preview icon to preview data to help configure the pipeline.

    Preview becomes available in partial pipelines when all existing stages are connected and configured.

  13. When the pipeline is validated and complete, use the Start icon to run the pipeline.
    When Transformer starts the pipeline, monitor mode displays real-time statistics for the pipeline.