Other Existing Clusters and Local Pipelines
- Install Python 3 on all nodes in the Spark cluster, or on the Transformer machine for local pipelines.
  The processor can use any Python 3.x version; however, StreamSets recommends installing the latest version.
  You can use a package manager to install Python from the command line, or you can download and install Python from the Python download page.
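  For example, on a Linux node you might install Python 3 with the system package manager. The commands below are a sketch only; the package manager and package name depend on your distribution.
    # Illustrative only: adjust for your distribution and required Python version.
    sudo apt-get install -y python3    # Debian/Ubuntu
    sudo yum install -y python3        # RHEL/CentOS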
- Set the following Python environment variable on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
  PYTHONPATH - Lists one or more directory paths that contain Python modules available for import. Include the following paths:
    $SPARK_HOME/libexec/python/lib/py4j-<version>-src.zip
    $SPARK_HOME/libexec/python
    $SPARK_HOME/bin/pyspark
  For example:
    export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH
  For more information about this environment variable, see the Python documentation.
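  As an optional sanity check (not part of the documented steps), you can verify that the PySpark modules resolve from the configured paths. This assumes python3 is on the PATH and SPARK_HOME is already set:
    # Should print the pyspark package location under $SPARK_HOME if PYTHONPATH is correct.
    python3 -c "import pyspark; print(pyspark.__file__)"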
- Set the following Spark environment variables on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
  PYSPARK_PYTHON - Path to the Python binary executable to use for PySpark, for both the Spark driver and workers.
  PYSPARK_DRIVER_PYTHON - Path to the Python binary executable to use for PySpark, for the Spark driver only. When not set, the Spark driver uses the path defined in PYSPARK_PYTHON.
  For more information about these environment variables, see the Apache Spark documentation.
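  For example, if your Python 3 binary is installed at /usr/bin/python3 (an assumed path; adjust for your installation), you might set:
    # Assumed path shown for illustration; point both variables at your Python 3 executable.
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3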