SQL Server 2019 Big Data Cluster

You can run Transformer pipelines using Spark deployed on SQL Server 2019 Big Data Cluster (BDC). Transformer supports SQL Server 2019 Cumulative Update 5 or later. SQL Server 2019 BDC uses Apache Livy to submit Spark jobs.

To run a pipeline on SQL Server 2019 BDC, configure the pipeline to use SQL Server 2019 BDC as the cluster manager type on the Cluster tab of pipeline properties.
Important: The SQL Server 2019 BDC cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in the installation instructions.

You specify the Livy endpoint, as well as the user name and password to access the cluster through the endpoint. When you start the pipeline, Transformer uses these credentials to launch the Spark application. You also define the staging directory within the cluster to store the StreamSets libraries and resources needed to run the pipeline.

Note the minimum Spark recommendations for running pipelines on SQL Server 2019 BDC.

Important: Due to an unresolved SQL Server 2019 BDC issue, you must complete the following task before running a pipeline. On SQL Server 2019 BDC, remove the mssql-mleap-lib-assembly-1.0.jar file from the following HDFS ZIP file: /system/spark/spark_libs.zip. This issue should be fixed in the next SQL Server 2019 BDC release.
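
As an illustration of the removal step, the following Python sketch rewrites a local copy of the ZIP file without the JAR. It assumes that you have already downloaded spark_libs.zip from HDFS and that you upload the result back to /system/spark/spark_libs.zip afterward. The function name and the local file names are hypothetical and are not part of Transformer or SQL Server 2019 BDC.

```python
import posixpath
import zipfile


def remove_entry_from_zip(src_zip, dst_zip, file_name):
    """Copy src_zip to dst_zip, skipping entries whose base name is file_name.

    Returns the names of the entries that were left out.
    """
    removed = []
    with zipfile.ZipFile(src_zip, "r") as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # ZIP entry names always use forward slashes, so posixpath is safe here.
            if posixpath.basename(info.filename) == file_name:
                removed.append(info.filename)
                continue
            # Reuse the original ZipInfo so entry metadata is preserved.
            dst.writestr(info, src.read(info.filename))
    return removed
```

For example, calling remove_entry_from_zip("spark_libs.zip", "spark_libs_fixed.zip", "mssql-mleap-lib-assembly-1.0.jar") writes a new archive and returns the entries that were dropped. An empty list means the JAR was not found in the archive.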

The following image displays a pipeline configured to run using Spark deployed on SQL Server 2019 BDC at the specified Livy endpoint:

Note: The first time that you run a pipeline on SQL Server 2019 BDC, it can take 5-10 minutes for the pipeline to start. This occurs because Transformer must deploy Transformer files across the cluster. This should only occur the first time that you run a Transformer pipeline on the cluster.

StreamSets provides a quick start deployment script that lets you try SQL Server 2019 BDC as a cluster manager for Transformer pipelines without additional configuration. For example, you might use the script if you want to try SQL Server 2019 BDC as a cluster manager but aren't ready to upgrade to Transformer 3.13.x or later.

Transformer Installation Location

When you use SQL Server 2019 BDC as a cluster manager, Transformer must be installed in a location that allows submitting Spark jobs to the cluster.

StreamSets recommends installing Transformer in the Kubernetes pod where SQL Server 2019 BDC is located.

Recommended Spark Settings

The following table lists the minimum Spark settings recommended when running pipelines on SQL Server 2019 BDC. You can configure these properties on the cluster or in the pipeline:

Spark Property                                 Recommended Minimum Setting
spark.driver.memory                            4 GB
spark.driver.cores                             1
spark.executor.instances or
spark.dynamicAllocation.minExecutors           5
spark.executor.memory                          4 GB
spark.executor.cores                           1
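
To compare a set of Spark properties against the table above, a small check like the following Python sketch might help. It is not part of Transformer. The function and constant names are hypothetical, memory values must carry a unit suffix such as 4g or 4096m, and a property that is not set is reported as a problem because the sketch does not know the defaults configured on the cluster.

```python
import re

# Minimums from the table above; memory is expressed in MiB.
RECOMMENDED_MINIMUMS = {
    "spark.driver.memory": 4096,
    "spark.driver.cores": 1,
    "spark.executor.memory": 4096,
    "spark.executor.cores": 1,
}
MIN_EXECUTORS = 5
EXECUTOR_LABEL = "spark.executor.instances or spark.dynamicAllocation.minExecutors"

_UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def memory_to_mib(value):
    """Convert a memory string such as '4g', '4096m', or '4 GB' to MiB."""
    match = re.fullmatch(r"(\d+)\s*([kmgt])b?", value.strip().lower())
    if not match:
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(match.group(1)) * _UNITS_MIB[match.group(2)]


def below_minimum(conf):
    """Return the properties in conf that are missing or below the minimums."""
    problems = []
    for name, minimum in RECOMMENDED_MINIMUMS.items():
        raw = conf.get(name)
        if raw is None:
            problems.append(name)
            continue
        actual = memory_to_mib(raw) if name.endswith(".memory") else int(raw)
        if actual < minimum:
            problems.append(name)
    # Either a fixed executor count or a dynamic allocation floor satisfies the table.
    executors = max(
        int(conf.get("spark.executor.instances", 0)),
        int(conf.get("spark.dynamicAllocation.minExecutors", 0)),
    )
    if executors < MIN_EXECUTORS:
        problems.append(EXECUTOR_LABEL)
    return problems
```

An empty result means that every property in the table meets the recommended minimum.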

Retrieving Connection Information

When using SQL Server 2019 BDC as a cluster manager for a pipeline, you need to provide the following connection information:
Livy endpoint
The SQL Server 2019 BDC Livy endpoint enables submitting Spark jobs. You can retrieve the Livy endpoint using the command line or using a client application such as Azure Data Studio. For information about using the command line, see the SQL Server 2019 BDC documentation.
In the results of the command line request, the Livy endpoint appears at the bottom of the list:
In Azure Data Studio, the Livy endpoint appears as follows:
User name and password
For the user name, use the SQL Server 2019 BDC controller user name, which can submit Spark jobs through the Livy endpoint. When you start the pipeline, Transformer uses these credentials to launch the Spark application.
The controller user name can have several passwords that provide access to different functionality. To access Spark through the Livy endpoint, use the Knox password.

For more information about the controller user name and related passwords, see the SQL Server 2019 BDC workshop on GitHub.
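
Before configuring the pipeline, you might want to confirm that the endpoint and credentials work. The following Python sketch builds a request for the Livy REST API batches resource. It assumes that the gateway in front of Livy accepts HTTP basic authentication with the controller user name and Knox password. The function name and the sample values in the usage note are hypothetical.

```python
import base64
import urllib.request


def build_livy_check_request(livy_endpoint, user, knox_password):
    """Build a GET request for the Livy /batches resource using HTTP basic auth."""
    url = livy_endpoint.rstrip("/") + "/batches"
    credentials = f"{user}:{knox_password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Accept", "application/json")
    return request
```

To send the request, pass it to urllib.request.urlopen, for example with an endpoint such as https://bdc.example.com:30443/gateway/default/livy/v1. A successful response suggests that the endpoint and credentials are usable. If the cluster uses a self-signed certificate, you also need to supply an ssl.SSLContext that trusts it.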

Staging Directory

To run pipelines on SQL Server 2019 BDC, Transformer must store files in a staging directory on SQL Server 2019 BDC.

You can configure the root directory to use as the staging directory. The default staging directory is /streamsets.

Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories.

Transformer stores the following files in the staging directory:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
/<staging_directory>/<Transformer version>
For example, say you use the default staging directory for Transformer version 5.8.0. Then, Transformer stores the reusable files in the following location:
/streamsets/5.8.0
Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
/<staging_directory>/staging/<pipelineId>/<runId>
For example, say you use the default staging directory and run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a directory like the following:
/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
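
The two layouts above can be expressed as a short Python sketch, which might be useful when you script cleanup or inspection of the staging directory. The function names are hypothetical. Transformer builds these paths itself, so the sketch only mirrors the patterns documented here.

```python
import posixpath


def reusable_files_dir(staging_directory, transformer_version):
    """Return /<staging_directory>/<Transformer version>."""
    return posixpath.join("/", staging_directory.strip("/"), transformer_version)


def pipeline_files_dir(staging_directory, pipeline_id, run_id):
    """Return /<staging_directory>/staging/<pipelineId>/<runId>."""
    return posixpath.join(
        "/", staging_directory.strip("/"), "staging", pipeline_id, run_id
    )
```

With the default staging directory, reusable_files_dir("/streamsets", "5.8.0") returns /streamsets/5.8.0, which matches the first example above.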

Quick Start Script

StreamSets provides a deployment script that you can run to quickly try using SQL Server 2019 BDC with StreamSets Control Hub, Transformer, and Data Collector.

The script deploys a Control Hub Provisioning Agent, as well as a SQL Server 2019 BDC-enabled Transformer and Data Collector on a Kubernetes cluster. The Data Collector is enabled for authoring SQL Server 2019 BDC pipelines. The Transformer is enabled for authoring SQL Server 2019 BDC pipelines and executing them on SQL Server 2019 BDC.

Use the script for development only. For more information, see the deployment script on GitHub.

Note: If you already have a Control Hub organization with an authoring Data Collector and a registered Transformer, you can skip the script and configure pipelines to use SQL Server 2019 BDC, as follows:
  • In Transformer version 3.13.x or later, you can select SQL Server 2019 BDC as the cluster manager type to run a Transformer pipeline.
  • In Data Collector version 3.12.x or later with the SQL Server 2019 BDC enterprise stage library installed, you can use the SQL Server 2019 BDC origin and destination in Data Collector pipelines.