Cluster Configuration

When you provision a cluster for a pipeline, Databricks creates a new job cluster the first time the pipeline runs. You define the cluster properties to use in the Cluster Configuration pipeline property. For any cluster property that you do not define there, Transformer uses the Databricks default value.

When needed, you can override the Databricks default values by defining additional cluster properties in the Cluster Configuration pipeline property. For example, to provision a cluster that uses an instance pool, you can add and define the instance_pool_id property in the Cluster Configuration property.

When defining cluster configuration properties, use the property names and values as expected by Databricks. The Cluster Configuration property defines cluster properties in JSON format.

When provisioning a Databricks cluster for a pipeline, you must define the following properties in the Cluster Configuration property:

Databricks Cluster Property	Description
num_workers	Number of worker nodes in the cluster.
spark_version	Databricks Runtime and Apache Spark version.
node_type_id	Type of worker node.
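
As a minimal sketch, a Cluster Configuration value that defines only the three required properties might look like the following. The values are illustrative assumptions, not recommendations: the worker count is arbitrary, and the runtime version string and node type must be ones that your Databricks workspace and cloud provider actually offer.

```json
{
  "num_workers": 2,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge"
}
```

Any property omitted from this JSON object falls back to its Databricks default, so additional properties only need to be added when a default should be overridden.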

Note: When provisioning a cluster for a pipeline that includes a PySpark processor, you must include additional cluster details. For more information, see the PySpark processor documentation.

For information about other Databricks cluster properties, see the Databricks documentation.