Cloudera Data Engineering

You can run Transformer pipelines using Spark deployed on an existing Cloudera Data Engineering (CDE) virtual cluster. Transformer supports several CDE versions. For a complete list, see Cluster Compatibility Matrix.

To run a pipeline on a CDE virtual cluster, configure the pipeline to use Cloudera Data Engineering as the cluster manager type on the Cluster tab of the pipeline properties.

Important: The cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in Granting the Spark Cluster Access to Transformer.

When you configure a pipeline to run on a CDE cluster, you specify a jobs API URL, an authentication API URL, a workload user and password, and the log level to use. You also specify a job resource and a resource file prefix for pipeline-related files.

Note: Before using a CDE pipeline to read from or write to Amazon S3 or HDFS, complete the prerequisite tasks.

When configuring the pipeline, be sure to include only the supported stages.

The following image shows the Cluster tab of a pipeline configured to run on a Cloudera Data Engineering cluster:

Supported Stages

Transformer does not support using all available stages in pipelines that run on Cloudera Data Engineering clusters.

When you configure a CDE pipeline, you can use stages that are included in the following stage libraries:

AWS cluster-provided libraries
Basic
File
Hive
Kafka cluster-provided libraries

Job Resource and Resource File Prefix

To run pipelines on a Cloudera Data Engineering cluster, Transformer stores resource files in the Cloudera Data Engineering job resource that you specify in the pipeline properties.

To differentiate those files from unrelated resources, Transformer adds a prefix to the resource file names. You can customize the prefix that the pipeline uses. By default, the prefix is streamsets, which results in a directory-like structure to the file name, as follows:

streamsets/<file name>

Transformer stores the following files in the job resource defined for the pipeline:

Files that can be reused across pipelines

Transformer stores files in the job resource that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers. Reusable files include the Transformer version in the name, as follows:

<resource file prefix>/<Transformer version>/<file name>

For example, say you use the default resource file prefix for a pipeline run on Transformer version 6.0.0. Then, Transformer names all reusable files stored in the job resource as follows:

/streamsets/6.0.0/<file name>

Files specific to each pipeline

Transformer stores files in the job resource that are specific to each pipeline, such as the pipeline JSON file. Pipeline-specific files include the pipeline ID and run ID, as follows:

<resource file prefix>/<pipelineId>/<runId>/<file name>

For example, say you use the default staging directory and run a pipeline named KafkaToJDBC. Then, Transformer might name all pipeline-specific files stored in the job resource as follows:

streamsets/KafkaToJDBC03a0-f622-4a68-b161-7f22/run15576328/<file name>

Amazon S3 and HDFS Prerequisites

Before reading from or writing to Amazon S3 or HDFS, complete the following prerequisite tasks:

Amazon S3

To read from or write to Amazon S3, complete the following tasks:

In Cloudera Data Engineering, ensure that the workload user specified in the CDE pipeline properties has a role mapping that includes the following permissions for the Amazon S3 bucket specified in the Amazon S3 stage:
- s3:ListBucket
- s3:GetBucketLocation
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
In the Amazon S3 origin or destination, on the Advanced tab, add one of the following Spark properties to the Additional Configuration property:
- For Spark 3.x clusters, add the spark.kerberos.access.hadoopFileSystems property and set it to the Amazon S3 bucket to access. For example: s3://<s3-bucket-name>

HDFS

To read from or write to HDFS, on the File tab of the Files origin or destination, add the following Spark properties to the Additional Configuration property:

Add one of the following properties based on the Spark version that the cluster uses:
- For Spark 3.x clusters, add the spark.kerberos.access.hadoopFileSystems property and set it to the HDFS file system to access. For example: hdfs://<hdfs-host>:<port>
To avoid connection errors, also add the spark.hadoop.fs.defaultFS property and set it to the same HDFS file system as above.
To avoid Kerberos errors, add both of the following properties, and set them to your HDFS principal:
- spark.hadoop.dfs.datanode.kerberos.principal
- spark.hadoop.dfs.namenode.kerberos.principal