Existing Databricks Cluster
Before using the PySpark processor in pipelines that run on an existing Databricks cluster, set the required environment variables on the cluster.
When running the pipeline on a provisioned Databricks cluster, you configure the environment variables in the pipeline cluster configuration property. For more information, see Provisioned Databricks Cluster Requirements.
On an existing Databricks cluster, the PySpark processor requires the following environment variables to be configured as follows:
PYSPARK_PYTHON=/databricks/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
PYTHONPATH=/databricks/spark/python/lib/py4j-<version>-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Note that the PYTHONPATH variable requires the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses py4j version 0.10.7:
PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
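As a sketch of how the version-specific segment fits together, the snippet below extracts the py4j version from the source-zip filename found under /databricks/spark/python/lib/ and builds the matching PYTHONPATH value. The helper name `py4j_version` is illustrative, not part of any Databricks or StreamSets API.

```python
import re

def py4j_version(zip_name):
    """Extract the py4j version from a py4j source-zip filename.

    Hypothetical helper: the filename shipped on the cluster (for example
    py4j-0.10.7-src.zip) determines the version segment of PYTHONPATH.
    """
    m = re.search(r"py4j-(\d+(?:\.\d+)*)-src\.zip", zip_name)
    return m.group(1) if m else None

version = py4j_version("py4j-0.10.7-src.zip")
pythonpath = (
    f"/databricks/spark/python/lib/py4j-{version}-src.zip:"
    "/databricks/spark/python:/databricks/spark/bin/pyspark"
)
print(pythonpath)
```

On a live cluster you could list /databricks/spark/python/lib/ (for example, with a `%sh ls` cell in a notebook) to find the actual py4j zip name before building the value.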
Tip: You can configure environment variables in your Databricks workspace by selecting Clusters > Advanced Options > Spark, then entering the variables in the Environment Variables property. Restart the cluster to apply your changes.
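After restarting the cluster, you may want to confirm the variables are visible to the Python runtime. The sketch below, assuming it runs in a notebook cell on the cluster, checks a mapping such as `os.environ` for the three required names; the helper `missing_vars` is illustrative, not a documented API.

```python
import os

# The three variables this section requires on an existing cluster.
REQUIRED = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYTHONPATH")

def missing_vars(env, required=REQUIRED):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]

# On the cluster you would call: missing_vars(os.environ)
# Demonstration with a partial environment mapping:
print(missing_vars({"PYSPARK_PYTHON": "/databricks/python3/bin/python3"}))
```

An empty list means all three variables are set; any names printed still need to be configured on the cluster.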