SQL Server 2019 Big Data Cluster

You can run Transformer pipelines using Spark deployed on SQL Server 2019 Big Data Cluster (BDC). Transformer supports SQL Server 2019 Cumulative Update 5 or later. Transformer uses Apache Livy to submit Spark jobs to SQL Server 2019 BDC.

To run a pipeline on SQL Server 2019 BDC, configure the pipeline to use SQL Server 2019 BDC as the cluster manager type on the Cluster tab of pipeline properties.
Important: The SQL Server 2019 BDC cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in Granting the Spark Cluster Access to Transformer.

You specify the Livy endpoint, as well as the user name and password to access the cluster through the endpoint. When you start the pipeline, Transformer uses these credentials to launch the Spark application. You also define the staging directory within the cluster to store the StreamSets libraries and resources needed to run the pipeline.
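Transformer performs this Livy submission for you when the pipeline starts, but it can help to see the shape of what is sent. The sketch below builds (without sending) a Livy batch request using only the Python standard library. The endpoint URL, credentials, and file path are placeholders, and the payload is a simplified assumption, not Transformer's actual request.

```python
import base64
import json
import urllib.request

# Placeholder values -- substitute your actual Livy endpoint and credentials.
LIVY_ENDPOINT = "https://example-bdc-gateway:30443/gateway/default/livy/v1"
USER = "admin"                 # SQL Server 2019 BDC controller user name
PASSWORD = "knox-password"     # the Knox password for the controller user

# A minimal Livy batch payload: the Spark application file plus resource settings.
# The jar path is hypothetical, shown only to illustrate the staged-files idea.
payload = {
    "file": "hdfs:///streamsets/5.7.0/transformer.jar",
    "conf": {
        "spark.driver.memory": "4g",
        "spark.executor.memory": "4g",
    },
}

# Livy on SQL Server 2019 BDC sits behind Knox, which uses HTTP Basic auth.
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
request = urllib.request.Request(
    f"{LIVY_ENDPOINT}/batches",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    },
    method="POST",
)
# urllib.request.urlopen(request) would submit the batch; not executed here.
```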

Note the minimum Spark recommendations for running pipelines on SQL Server 2019 BDC.

Important: Due to an unresolved SQL Server 2019 BDC issue, you must complete the following task before running a pipeline. On SQL Server 2019 BDC, remove the mssql-mleap-lib-assembly-1.0.jar file from the following ZIP file in HDFS: /system/spark/spark_libs.zip. This issue should be fixed in the next SQL Server 2019 BDC release.
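The workaround requires rewriting the archive without that entry, since a ZIP member cannot be deleted in place. Assuming you first copy the file out of HDFS (for example, with hdfs dfs -get /system/spark/spark_libs.zip) and push the result back afterward, the local rewrite step might look like this sketch:

```python
import os
import zipfile

def remove_zip_member(zip_path: str, member: str) -> None:
    """Rewrite the archive at zip_path, dropping the named member."""
    tmp_path = zip_path + ".tmp"
    with zipfile.ZipFile(zip_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy every entry except the one being removed.
        for item in src.infolist():
            if item.filename != member:
                dst.writestr(item, src.read(item.filename))
    os.replace(tmp_path, zip_path)

# After fetching spark_libs.zip locally:
# remove_zip_member("spark_libs.zip", "mssql-mleap-lib-assembly-1.0.jar")
# ...then upload it back, for example: hdfs dfs -put -f spark_libs.zip /system/spark/
```

The commented hdfs dfs commands are the usual HDFS copy-in/copy-out steps; adjust them to however you normally move files in and out of your cluster.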

The following image displays a pipeline configured to run using Spark deployed on SQL Server 2019 BDC at the specified Livy endpoint:

Note: The first time that you run a pipeline on SQL Server 2019 BDC, it can take 5-10 minutes for the pipeline to start. The delay occurs because Transformer must deploy its files across the cluster, which happens only the first time that you run a Transformer pipeline on the cluster.

Transformer Installation Location

When you use SQL Server 2019 BDC as a cluster manager, Transformer must be installed in a location that allows submitting Spark jobs to the cluster.

StreamSets recommends installing Transformer in the Kubernetes pod where SQL Server 2019 BDC is located.

Recommended Spark Settings

The following table lists the minimum Spark settings recommended when running pipelines on SQL Server 2019 BDC. You can configure these properties on the cluster or in the pipeline:

Spark Property                           Recommended Minimum Setting
spark.driver.memory                      4 GB
spark.driver.cores                       1
spark.executor.instances or
spark.dynamicAllocation.minExecutors     5
spark.executor.memory                    4 GB
spark.executor.cores                     1
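If you set these properties in the pipeline rather than on the cluster, a quick sanity check against the recommended minimums can catch undersized configurations before a run. The helper below is illustrative only; the property names match the table above, but the check itself is not part of Transformer.

```python
# Recommended minimums from the table above, memory expressed in MB.
RECOMMENDED = {
    "spark.driver.memory": 4 * 1024,
    "spark.driver.cores": 1,
    "spark.executor.instances": 5,
    "spark.executor.memory": 4 * 1024,
    "spark.executor.cores": 1,
}

def parse_memory_mb(value: str) -> int:
    """Convert a Spark memory string such as '4g' or '2048m' to MB."""
    value = value.strip().lower()
    if value.endswith("g"):
        return int(value[:-1]) * 1024
    if value.endswith("m"):
        return int(value[:-1])
    raise ValueError(f"unrecognized memory value: {value}")

def undersized(conf: dict) -> list:
    """Return the configured properties that fall below the recommended minimum."""
    problems = []
    for prop, minimum in RECOMMENDED.items():
        if prop not in conf:
            continue
        raw = conf[prop]
        actual = parse_memory_mb(raw) if prop.endswith("memory") else int(raw)
        if actual < minimum:
            problems.append(prop)
    return problems
```

For example, undersized({"spark.driver.memory": "2g"}) reports spark.driver.memory, because 2 GB is below the recommended 4 GB.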

Retrieving Connection Information

When using SQL Server 2019 BDC as a cluster manager for a pipeline, you need to provide the following connection information:
Livy endpoint
The SQL Server 2019 BDC Livy endpoint enables submitting Spark jobs. You can retrieve the Livy endpoint using the command line or using a client application such as Azure Data Studio. For information about using the command line, see the SQL Server 2019 BDC documentation.
In the results of the command line request, the Livy endpoint appears at the bottom of the list:
In Azure Data Studio, the Livy endpoint appears as follows:
User name and password
For the user name, use the SQL Server 2019 BDC controller user name, which can submit Spark jobs through the Livy endpoint. When you start the pipeline, Transformer uses these credentials to launch the Spark application.
The controller user name can have several passwords that provide access to different functionality. To access Spark through the Livy endpoint, use the Knox password.

For more information about the controller user name and related passwords, see the SQL Server 2019 BDC workshop on GitHub.

Staging Directory

To run pipelines on SQL Server 2019 BDC, Transformer must store files in a staging directory on SQL Server 2019 BDC.

You can configure the root directory to use as the staging directory. The default staging directory is /streamsets.

Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

Transformer stores the following files in the staging directory:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
/<staging_directory>/<Transformer version>
For example, say you use the default staging directory for Transformer version 5.7.0. Then, Transformer stores the reusable files in the following location:
/streamsets/5.7.0
Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
/<staging_directory>/staging/<pipelineId>/<runId>
For example, say you use the default staging directory and run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a directory like the following:
/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
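The directory layout described above reduces to two path templates. The helpers below simply restate that convention for illustration; Transformer generates these paths itself, and the identifiers passed in are taken from the examples above.

```python
def reusable_files_dir(staging_dir: str, transformer_version: str) -> str:
    """Location of libraries and external resources shared across pipelines."""
    return f"{staging_dir}/{transformer_version}"

def pipeline_files_dir(staging_dir: str, pipeline_id: str, run_id: str) -> str:
    """Location of files specific to a single pipeline run."""
    return f"{staging_dir}/staging/{pipeline_id}/{run_id}"

# Reproducing the examples above:
print(reusable_files_dir("/streamsets", "5.7.0"))
# /streamsets/5.7.0
print(pipeline_files_dir(
    "/streamsets",
    "KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052",
    "run1557350076328",
))
# /streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
```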