Existing Databricks Cluster
Before using the PySpark processor in pipelines that run on an existing Databricks cluster, set the required environment variables on the cluster.
When running the pipeline on a provisioned Databricks cluster, you configure the environment variables in the pipeline cluster configuration property. For more information, see Provisioned Databricks Cluster Requirements.
On an existing Databricks cluster, the PySpark processor requires the following environment variables to be configured as follows:
PYSPARK_PYTHON=/databricks/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
PYTHONPATH=/databricks/spark/python/lib/py4j-<version>-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Note that the PYTHONPATH variable requires the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses py4j version 0.10.7:
PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
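As a sketch of how the version-specific segment fits together, the snippet below extracts the py4j version from the source-zip filename found under /databricks/spark/python/lib/ and builds the matching PYTHONPATH value. The helper name `py4j_version` is illustrative, not part of any Databricks or StreamSets API.

```python
import re

def py4j_version(zip_name):
    """Extract the py4j version from a py4j source-zip filename.

    Hypothetical helper: the filename shipped on the cluster (for example
    py4j-0.10.7-src.zip) determines the version segment of PYTHONPATH.
    """
    m = re.search(r"py4j-(\d+(?:\.\d+)*)-src\.zip", zip_name)
    return m.group(1) if m else None

version = py4j_version("py4j-0.10.7-src.zip")
pythonpath = (
    f"/databricks/spark/python/lib/py4j-{version}-src.zip:"
    "/databricks/spark/python:/databricks/spark/bin/pyspark"
)
print(pythonpath)
```

On a live cluster you could list /databricks/spark/python/lib/ (for example, with a `%sh ls` cell in a notebook) to find the actual py4j zip name before building the value.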
Tip: You can configure environment variables in your Databricks workspace by selecting Clusters > Advanced Options > Spark, then entering the variables in the Environment Variables property. Restart the cluster to apply your changes.
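After restarting the cluster, you may want to confirm the variables are visible to the Python runtime. The sketch below, assuming it runs in a notebook cell on the cluster, checks a mapping such as `os.environ` for the three required names; the helper `missing_vars` is illustrative, not a documented API.

```python
import os

# The three variables this section requires on an existing cluster.
REQUIRED = ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYTHONPATH")

def missing_vars(env, required=REQUIRED):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]

# On the cluster you would call: missing_vars(os.environ)
# Demonstration with a partial environment mapping:
print(missing_vars({"PYSPARK_PYTHON": "/databricks/python3/bin/python3"}))
```

An empty list means all three variables are set; any names printed still need to be configured on the cluster.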