Configuring a Pipeline

Configure a pipeline to define the flow of data. After you configure the pipeline, you can start it.

A pipeline can include multiple origin, processor, and destination stages.

  1. From the Home page or Getting Started page, click Create New Pipeline.
    Tip: To get to the Home page, click the Home icon.
  2. In the New Pipeline window, configure the following properties:
    Pipeline Property Description
    Title Title of the pipeline.

    Transformer uses the alphanumeric characters entered for the pipeline title as a prefix for the generated pipeline ID. For example, if you enter My Pipeline *&%&^^ 123 as the pipeline title, then the pipeline ID has the following value: MyPipeline123tad9f592-5f02-4695-bb10-127b2e41561c.

    Description Optional description of the pipeline.
    Pipeline Label Optional labels to assign to the pipeline.

    Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.

    You can use nested labels to create a hierarchy of pipeline groupings. Enter nested labels using the following format:
    <label1>/<label2>/<label3>
    For example, you might want to group pipelines in the test environment by the origin system. You add the labels Test/HDFS and Test/Elasticsearch to the appropriate pipelines.
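    The pipeline ID derivation described for the Title property can be sketched as follows. This is an illustrative approximation, not Transformer's actual implementation: keep only the alphanumeric characters of the title, then append a generated unique ID.

```python
import re
import uuid

def generate_pipeline_id(title: str) -> str:
    # Keep only the alphanumeric characters of the title, then append
    # a generated unique ID, mirroring the prefix behavior described above.
    prefix = re.sub(r"[^A-Za-z0-9]", "", title)
    return f"{prefix}{uuid.uuid4()}"

print(generate_pipeline_id("My Pipeline *&%&^^ 123"))
# The prefix is always "MyPipeline123"; the unique suffix varies per call.
```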
  3. Click Save.
    The pipeline canvas displays the pipeline title, the generated pipeline ID, and an error icon. The error icon indicates that the pipeline is empty. The Properties panel displays the pipeline properties.
  4. In the Properties panel, on the General tab, configure the following properties:
    Pipeline Property Description
    Title Optionally edit the title of the pipeline.

    Because the generated pipeline ID is used to identify the pipeline, any changes to the pipeline title are not reflected in the pipeline ID.

    Description Optionally edit or add a description of the pipeline.
    Labels Optionally edit or add labels assigned to the pipeline.
    Execution Mode Execution mode of the pipeline:
    • Batch - Processes available data, limited by the configuration of pipeline origins, and then the pipeline stops.
    • Streaming - Maintains connections to origin systems and processes data as it becomes available. The pipeline runs continuously until you manually stop it.
    Trigger Interval Milliseconds to wait between processing batches of data.

    In pipelines with origins configured to read from multiple tables, milliseconds to wait after processing one batch for each of the tables.

    For streaming execution mode only.

    Enable Ludicrous Mode Enables predicate and filter pushdown to optimize queries so unnecessary data is not processed.
    Collect Input Metrics Collects and displays pipeline input statistics for a pipeline running in ludicrous mode.
    By default, only pipeline output statistics display when a pipeline runs in ludicrous mode.
    Note: For data formats that do not include metrics in the metadata, such as Avro, CSV, and JSON, Transformer must re-read origin data to generate the input statistics. This can slow pipeline performance.
  5. On the Cluster tab, select the cluster manager type for the pipeline in the Cluster Manager Type property.
  6. Configure the remaining properties on the Cluster tab based on the selected cluster manager type.
    For all cluster manager types, configure the following properties:
    Cluster Property Description
    Application Name Name of the launched Spark application. Enter a name or a StreamSets expression that evaluates to the name.

    Press Ctrl + Space Bar to view the list of valid functions you can use in an expression.

    When the application is launched, Spark lowercases the name, removes spaces in the name, and appends the pipeline run number to the name. For example, if you enter the name My Application and then start the initial pipeline run, Spark launches the application with the following name:
    myapplication_run1

    Default is the expression ${pipeline:title()}, which uses the pipeline title as the application name.
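    The name transformation described above, where Spark lowercases the name, removes spaces, and appends the run number, can be sketched as:

```python
def spark_app_name(name: str, run_number: int) -> str:
    # Spark lowercases the name, removes spaces, and appends the run number.
    return f"{name.lower().replace(' ', '')}_run{run_number}"

print(spark_app_name("My Application", 1))  # myapplication_run1
```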

    Log Level Log level to use for the launched Spark application.
    Extra Spark Configuration Additional Spark configuration properties to use.

    To add properties, click Add and define the property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by Spark.
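    For example, you might add Spark properties such as the following. The property names are standard Spark configuration properties; the values shown are illustrative:

```properties
spark.executor.memory=4g
spark.executor.cores=2
spark.dynamicAllocation.enabled=true
```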

    For a pipeline that runs on a Cloudera Data Engineering cluster, also configure the following properties:
    CDE Property Description
    Jobs API URL Jobs API URL for the CDE virtual cluster where you want the pipeline to run.
    Job Resource Name of the CDE job resource to store pipeline resources.
    Resource File Prefix Prefix to add to resource files stored in the job resource.
    Authentication API URL Authentication API URL for the virtual cluster where the pipeline runs. Used to obtain a CDE access token.
    Workload User User name to use to obtain the access token.
    Workload Password Password to use to obtain the access token.
    For a pipeline that runs on a Databricks cluster, also configure the following properties:
    Databricks Property Description
    URL to Connect to Databricks Databricks URL for your account. Use the following format:

    https://<your_domain>.cloud.databricks.com

    Staging Directory Staging directory on Databricks File System (DBFS) where Transformer stores the StreamSets resources and files needed to run the pipeline as a Databricks job.

    When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Credential Type Type of credential used to connect to Databricks: Username/Password or Token.
    Username Databricks user name.
    Password Password for the account.
    Token Personal access token for the account.
    Provision a New Cluster Provisions a new Databricks job cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing interactive cluster.

    Init Scripts Cluster-scoped init scripts to execute before processing data. Configure the following properties for each init script that you want to use:
    • Script Type - Location of the script:
      • DBFS from Pipeline - Databricks File System (DBFS) init script defined in the pipeline. When provisioning the cluster, Transformer temporarily stores the script in DBFS and removes it after the pipeline run.
      • DBFS from Location - Databricks File System init script stored on Databricks.
      • S3 from Location - Amazon S3 init script stored on AWS. Use only when provisioning a Databricks cluster on AWS.
      • ABFSS from Location - Azure init script stored on Azure Blob File System (ABFS). Use only when provisioning a Databricks cluster on Azure.
        Note: To use this option, you must provide an access key to access the init script.
    • DBFS Script - Contents of the Databricks cluster-scoped init script.

      Available when you select the DBFS from Pipeline script type.

    • DBFS Script Location - Path to the script on DBFS. For example: dbfs:/databricks/scripts/postgresql-install.sh

      Available when you select the DBFS from Location script type.

    • S3 Script Location - Path to the script on Amazon S3. For example: s3://databricks/scripts/postgresql-install.sh

      Available when you select the S3 from Location script type.

    • AWS Region - AWS region where the init script is located.

      Available when you select the S3 from Location script type.

    • ABFSS Script Location - Location of the script on Azure Blob File System.

      Available when you select the ABFSS from Location script type.

    Cluster Configuration Configuration properties for a provisioned Databricks job cluster.

    Configure the listed properties and add additional Databricks cluster properties as needed, in JSON format. Transformer uses the Databricks default values for Databricks properties that are not listed.

    Include the instance_pool_id property to provision a cluster that uses an existing instance pool.

    Use the property names and values as expected by Databricks.
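    As a sketch, the cluster configuration JSON might look like the following. The property names follow the Databricks cluster specification, and every value shown is illustrative:

```json
{
  "num_workers": 2,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge"
}
```

    To provision a cluster that uses an existing instance pool, you would add the instance_pool_id property, typically in place of a directly specified node type.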

    Terminate Cluster Terminates the provisioned job cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    Cluster ID ID of an existing Databricks interactive cluster to run the pipeline. Specify a cluster ID when not provisioning a cluster to run the pipeline.
    Note: When using an existing interactive cluster, all Transformer pipelines that the cluster runs must be built by the same version of Transformer.
    For a pipeline that runs on a Dataproc cluster, also configure the following properties:
    Dataproc Property Description
    Project ID Google Cloud project ID.
    Region Region to create the cluster in. Select a region or select Custom and enter a region name.
    Custom Custom region to create the cluster in. Available when you select Custom for the Region property.
    Credentials Provider Provider for the Google Cloud credentials used to connect.
    Credentials File Path (JSON) Path to the Google Cloud service account credentials file that the pipeline uses to connect. The credentials file must be a JSON file.

    Enter a path relative to the Transformer resources directory, $TRANSFORMER_RESOURCES, or enter an absolute path.

    Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.
    GCS Staging URI Staging location in Dataproc where Transformer stores the StreamSets resources and files needed to run the pipeline as a Dataproc job.

    When a pipeline runs on an existing cluster, configure pipelines to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

    When a pipeline runs on a provisioned cluster, using the same staging directory for pipelines is best practice, but not required.

    Default is /streamsets.

    Create Cluster Provisions a new Dataproc cluster to run the pipeline upon the initial run of the pipeline.

    Clear this option to run the pipeline on an existing cluster.

    Cluster Name Name of the existing cluster to run the pipeline. Use the full Dataproc cluster name.
    Cluster Prefix Optional prefix to add to the provisioned cluster name.
    Image Version Image version to use for the provisioned cluster.

    Specify the full image version name, such as 1.4-ubuntu18 or 1.3-debian10.

    When not specified, Transformer uses the default Dataproc image version.

    For a list of Dataproc image versions, see the Dataproc documentation.

    Master Machine Type Master machine type to use for the provisioned cluster.
    Worker Machine Type Worker machine type to use for the provisioned cluster.
    Network Type Network type to use for the provisioned cluster:
    • Auto - Uses a VPC network type in auto mode.
    • Custom - Uses a VPC network type with the specified subnet name.
    • Default VPC for project and region - Uses the default VPC for the project ID and region specified for the cluster.

    For more information about network types, see the Dataproc documentation.

    Subnet Name Subnet name for the custom VPC network.
    Network Tags Optional network tags to apply to the provisioned cluster.

    For more information, see the Dataproc documentation.

    Worker Count Number of workers to use for a provisioned cluster.

    Minimum is 2. Using an additional worker for each partition can improve pipeline performance.

    This property is ignored if you enable dynamic allocation using spark.dynamicAllocation.enabled as an extra Spark configuration property.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    For a pipeline that runs on an EMR cluster, also configure the following properties:
    EMR Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
    • None - Connects to a public bucket using no authentication.
    Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
    Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Assume Role Temporarily assumes another role to authenticate with AWS.
    Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.
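    A minimal trust policy of this kind might look like the following sketch. The principal shown is illustrative; in practice, narrow it to the identity that actually assumes the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<account_id>:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```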

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    Staging Directory Staging directory on the Amazon S3 staging URI.

    Used to store the Transformer resources and files needed to run the pipeline as an EMR job. The specified staging directory is used with the S3 Staging URI property as follows: <S3 staging URI>/<staging directory>.

    When a pipeline runs on an existing cluster, you might configure pipelines to use the same S3 staging URI and staging directory. This allows EMR to reuse the common files stored in that location. Pipelines that run on different clusters can also use the same staging locations as long as the pipelines are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging locations.

    Both the staging directory and S3 staging URI must exist before you run the pipeline.

    Default is /streamsets.

    AWS Region AWS region that contains the EMR cluster. Select one of the available regions.

    If the region is not listed, select Other and then enter the name of the AWS region.

    S3 Staging URI URI for an Amazon S3 directory to store the Transformer resources and files needed to run the pipeline.

    The specified URI is used with the staging directory defined on the Cluster tab of the pipeline as follows: <S3 staging URI>/<staging directory>

    Both the S3 staging URI and staging directory must exist before you run the pipeline.

    Provision a New Cluster
    Provisions a new cluster to run the pipeline.
    Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.
    For more information about running a pipeline on a provisioned cluster, see Provisioned Cluster.
    Define Bootstrap Actions Enables defining bootstrap actions to execute before processing data.

    Available only for provisioned clusters.

    Bootstrap Actions Source Location of bootstrap actions scripts:
    • Executable Files in S3
    • Defined in Pipeline

    Available when defining bootstrap actions.

    Bootstrap Actions Scripts Contents of a bootstrap actions script. Click the Add icon to add additional scripts.

    Available when the bootstrap actions source is Defined in Pipeline.

    Bootstrap Actions Bootstrap actions scripts to execute. Define the following properties for each script that you want to use:
    • Location - Path to the script in S3.
    • Arguments - Comma-separated list of arguments to use with the script.

    Click the Add icon to add additional scripts.

    Available when the bootstrap actions source is Executable Files in S3.

    Cluster ID ID of the existing cluster to run the pipeline. Available when not provisioning a cluster.

    For more information about running a pipeline on an existing cluster, see Existing Cluster.

    EMR Version EMR cluster version to provision. Transformer supports version 5.20.0 and later 5.x versions.

    Available only for provisioned clusters.

    Cluster Name Prefix Prefix for the name of the provisioned EMR cluster.

    Available only for provisioned clusters.

    Terminate Cluster Terminates the provisioned cluster when the pipeline stops.

    When cleared, the cluster remains active after the pipeline stops.

    Available only for provisioned clusters.

    Logging Enabled Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates.

    Available only for provisioned clusters.

    S3 Log URI Location in Amazon S3 to store pipeline log data.
    Location must be unique for each pipeline. Use the following format:
    s3://<bucket>/<path>

    The bucket must exist before you start the pipeline.

    Available when you enable logging for a provisioned cluster.

    Service Role EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks.

    Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Job Flow Role EMR role for the EC2 instances within the cluster used to perform pipeline tasks.

    Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    SSH EC2 Key ID SSH key used to access the EMR cluster nodes.

    Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes.

    For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Visible to All Users Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster.

    Available only for provisioned clusters.

    EC2 Subnet ID EC2 subnet identifier to launch the provisioned cluster in.

    Available only for provisioned clusters.

    Master Security Group ID of the security group on the master node in the cluster.
    Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.

    Available only for provisioned clusters.

    Slave Security Group Security group ID for the slave nodes in the cluster.

    Available only for provisioned clusters.

    Instance Count Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster.

    Minimum is 2. Using an additional instance for each partition can improve pipeline performance.

    Available only for provisioned clusters.

    Master Instance Type EC2 instance type for the master node in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Master Instance Type (Custom) Custom EC2 instance type for the master node. Available when you select Custom for the Master Instance Type property.
    Slave Instance Type EC2 instance type for the slave nodes in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Available only for provisioned clusters.

    Slave Instance Type (Custom) Custom EC2 instance type for the slave nodes. Available when you select Custom for the Slave Instance Type property.
    Max Retries Maximum number of times to retry a failed request or throttling error.
    Retry Base Delay Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
    Throttling Retry Base Delay Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
    Max Backoff Maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
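    The behavior these retry properties describe is standard exponential backoff with a cap: the base delay doubles with each subsequent retry, up to the maximum backoff. A sketch of the delay calculation, as an illustrative model rather than Transformer's exact implementation:

```python
def retry_delay_ms(base_delay_ms: int, retry_number: int, max_backoff_ms: int) -> int:
    # Delay before retry N (0-based): the base delay doubles with each
    # subsequent retry, capped at the configured maximum backoff.
    return min(base_delay_ms * (2 ** retry_number), max_backoff_ms)

# With a 100 ms base delay and a 5000 ms maximum backoff:
print([retry_delay_ms(100, n, 5000) for n in range(7)])
# [100, 200, 400, 800, 1600, 3200, 5000]
```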
    Enable Server-Side Encryption Option that Amazon S3 uses to manage encryption keys for server-side encryption:
    • None - Do not use server-side encryption.
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.

    Default is None.

    AWS KMS Key ARN Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format:
    arn:<partition>:kms:<region>:<account-id>:key/<key-id>

    For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-KMS encryption only.

    Execution Role EMR runtime role of the Spark job that Transformer submits to the cluster. Runtime roles determine access to AWS resources.

    For more information about configuring runtime roles, see the Amazon EMR documentation.

    For a pipeline that runs on a Hadoop YARN cluster, also configure the following properties:
    Hadoop YARN Property Description
    Deployment Mode Deployment mode to use:
    • Client - Launches the Spark driver program locally.
    • Cluster - Launches the Spark driver program remotely on one of the nodes inside the cluster.

    For more information about deployment modes, see the Apache Spark documentation.

    Hadoop User Name Name of the Hadoop user that Transformer impersonates to launch the Spark application and to access files in the Hadoop system. When using this property, make sure impersonation is enabled for the Hadoop system.

    When not configured, Transformer impersonates the user who starts the pipeline.

    When Transformer uses Kerberos authentication or is configured to always impersonate the user who starts the pipeline, this property is ignored. For more information, see Kerberos Authentication and Hadoop Impersonation Mode.

    Use YARN Kerberos Keytab Use a Kerberos principal and keytab to launch the Spark application and to access files in the Hadoop system. Transformer includes the keytab file with the launched Spark application.

    When not selected, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system.

    Enable for long-running pipelines when Transformer is enabled for Kerberos authentication.

    Keytab Source Source to use for the pipeline keytab file:
    • Transformer Configuration File - Use the same Kerberos keytab and principal configured for Transformer in the Transformer configuration properties.
    • Pipeline Configuration - File - Use a specific Kerberos keytab file and principal for this pipeline. Store the keytab file on the Transformer machine.
    • Pipeline Configuration - Credential Store - Use a specific Kerberos keytab file and principal for this pipeline. Store the Base64-encoded keytab file in a credential store.

    Available when using a Kerberos principal and keytab for the pipeline.

    YARN Kerberos Keytab Path Absolute path to the keytab file stored on the Transformer machine.

    Available when using Pipeline Configuration - File as the keytab source.

    Keytab Credential Function Credential function used to retrieve the Base64-encoded keytab from the credential store. Use the credential:get() or credential:getWithOptions() credential function.
    For example, the following expression retrieves a Base64-encoded keytab stored in the clusterkeytab secret within the azure credential store:
    ${credential:get("azure", "devopsgroup", "clusterkeytab")}
    Note: The user who starts the pipeline must be in the Transformer group specified in the credential function, devopsgroup in the example above. When Transformer requires a group secret, the user must also be in a group associated with the keytab.

    Available when using Pipeline Configuration - Credential Store as the keytab source.

    YARN Kerberos Principal Kerberos principal name that the pipeline runs as. The specified keytab file must contain the credentials for this Kerberos principal.

    Available when using either pipeline configuration as the keytab source.

    For a pipeline that runs locally, also configure the following property:
    Local Property Description
    Master URL Local master URL to use to connect to Spark. You can define any valid local master URL as described in the Spark Master URL documentation.

    Default is local[*], which runs the pipeline in the local Spark installation using the same number of worker threads as logical cores on the machine.

    For a pipeline that runs on SQL Server 2019 BDC, also configure the following properties:
    SQL Server 2019 Big Data Cluster Property Description
    Livy Endpoint SQL Server 2019 BDC Livy endpoint that enables submitting Spark jobs.

    For information about retrieving the Livy endpoint, see Retrieving Connection Information.

    User Name

    Controller user name to submit Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Password

    Knox password for the controller user name that allows submitting Spark jobs through the Livy endpoint.

    For more information, see Retrieving Connection Information.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Staging Directory Staging directory on SQL Server 2019 BDC where Transformer stores the StreamSets resources and files needed to run the pipeline.

    Default is /streamsets.

    Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

  7. To define runtime parameters, on the Parameters tab, click the Add icon and define the name and the default value for each parameter.
    You can use simple or bulk edit mode to configure the parameters.
  8. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Cluster Callback URL Callback URL for the Spark cluster to use to communicate with Transformer. Define a URL when the web browser and the Spark cluster must use a different URL to access Transformer.

    Overrides the Transformer URL configured in Transformer configuration properties.

    Important: Do not define a cluster callback URL when you plan to enable pipeline failover for the job that includes this pipeline. To support failover, the pipeline must use the default Transformer URL.
    Preprocessing Script Scala script to run before the pipeline starts.

    Develop the script using the Spark APIs for the version of Spark installed on your cluster.

  9. Use the Stage Library panel to add an origin stage. In the Properties panel, configure the stage properties.

    For configuration details about origin stages, see Origins.

  10. Use the Stage Library panel to add the next stage that you want to use, connect the origin to the new stage, and configure the new stage.

    For configuration details about processors, see Processors.

    For configuration details about destinations, see Destinations.

  11. Add additional stages as necessary.
  12. At any point, use the Preview icon to preview data to help configure the pipeline.

    Preview becomes available in partial pipelines when all existing stages are connected and configured.

  13. When the pipeline is validated and complete, use the Start icon to run the pipeline.
    When Transformer starts the pipeline, monitor mode displays real-time statistics for the pipeline.