SQL Server 2019 Big Data Cluster

You can run Transformer pipelines using Spark deployed on SQL Server 2019 Big Data Cluster (BDC). Transformer supports SQL Server 2019 Cumulative Update 4 or later. SQL Server 2019 BDC uses Apache Livy to submit Spark jobs.

To run a pipeline on SQL Server 2019 BDC, configure the pipeline to use SQL Server 2019 BDC as the cluster manager type on the Cluster tab of pipeline properties.
Important: The SQL Server 2019 BDC cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in Granting the Spark Cluster Access to Transformer in the installation instructions.

You specify the Livy endpoint, as well as the user name and password to access the cluster through the endpoint. When you start the pipeline, Transformer uses these credentials to launch the Spark application.
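Before starting a pipeline, it can help to confirm that the Livy endpoint and credentials are valid. The following is a minimal sketch, not part of Transformer: it builds a request against the standard Livy REST resource `GET /batches`, assuming that the cluster gateway accepts HTTP basic authentication. The endpoint URL, user name, and password shown are hypothetical placeholders.

```python
import base64
import urllib.request


def build_livy_request(livy_endpoint: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that lists Livy batches.

    Assumes the gateway in front of Livy accepts HTTP basic authentication.
    """
    # Normalize the endpoint so a trailing slash does not produce "//batches".
    url = livy_endpoint.rstrip("/") + "/batches"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Hypothetical endpoint and credentials, for illustration only.
    req = build_livy_request(
        "https://bdc.example.com:30443/gateway/default/livy/v1", "admin", "secret"
    )
    print(req.full_url)
    # To actually send the request:
    #   urllib.request.urlopen(req, timeout=10)
```

A successful response to the request suggests that the same endpoint, user name, and password can be entered in the pipeline properties.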

You also define the staging directory within the cluster to store the StreamSets libraries and resources needed to run the pipeline.

Important: Due to an unresolved SQL Server 2019 BDC issue, you must complete the following task before running a pipeline. On the SQL Server 2019 BDC cluster, remove the mssql-mleap-lib-assembly-1.0.jar file from the following HDFS ZIP file: /system/spark/spark_libs.zip. This issue should be fixed in the next SQL Server 2019 BDC release.
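One way to perform this task is to copy the ZIP file out of HDFS, rewrite it locally without the JAR file, and copy it back. The following sketch covers only the local rewrite step. It assumes that you have already downloaded the file, for example with `hdfs dfs -get /system/spark/spark_libs.zip .` on a machine with a Hadoop client, and that you upload the result afterward with `hdfs dfs -put -f`. The function name `remove_zip_member` is hypothetical, and because the location of the JAR file inside the archive is not stated here, the sketch matches entries by base name.

```python
import posixpath
import zipfile


def remove_zip_member(src_zip: str, dst_zip: str, jar_name: str) -> int:
    """Copy src_zip to dst_zip, skipping entries whose base name is jar_name.

    Returns the number of entries that were skipped.
    """
    removed = 0
    with zipfile.ZipFile(src_zip, "r") as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # ZIP entry names always use forward slashes.
            if posixpath.basename(info.filename) == jar_name:
                removed += 1
                continue
            # Passing the original ZipInfo preserves the entry's metadata.
            dst.writestr(info, src.read(info.filename))
    return removed


if __name__ == "__main__":
    import os

    # Run only if a local copy of the archive is present.
    if os.path.exists("spark_libs.zip"):
        count = remove_zip_member(
            "spark_libs.zip",
            "spark_libs_fixed.zip",
            "mssql-mleap-lib-assembly-1.0.jar",
        )
        print(f"Removed {count} entries")
```

Writing to a new file rather than modifying the archive in place keeps the original as a backup until the rewritten file has been verified and uploaded.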

The following image displays a pipeline configured to run using Spark deployed on SQL Server 2019 BDC at the specified Livy endpoint:

Note: The first time that you run a pipeline on SQL Server 2019 BDC, it can take 5-10 minutes for the pipeline to start. This occurs because Transformer must deploy Transformer files across the cluster. This should only occur the first time that you run a Transformer pipeline on the cluster.

StreamSets provides a quick start deployment script that enables you to try SQL Server 2019 BDC as a cluster manager for Transformer pipelines without additional configuration. For example, you might use the script if you want to try SQL Server 2019 BDC as a cluster manager but aren't ready to upgrade to Transformer 3.13.x or later.