Other Existing Clusters and Local Pipelines

Complete the following prerequisite tasks before using the PySpark processor in a pipeline that runs on an existing cluster other than Databricks or EMR, or in a local pipeline, which runs on the local Transformer machine.
  1. Install Python 3 on all nodes in the Spark cluster, or on the Transformer machine for local pipelines.

    The processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.

    You can use a package manager to install Python from the command line, or you can download and install Python from the Python download page.
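
    For example, on a Debian-based system you might install Python 3 with the following command; the package name and package manager differ by distribution:
    sudo apt-get install python3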

  2. Set the following Python environment variable on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYTHONPATH
        Lists one or more directory paths that contain Python modules available for import.

        Include the following paths:
        $SPARK_HOME/libexec/python/lib/py4j-<version>-src.zip
        $SPARK_HOME/libexec/python
        $SPARK_HOME/bin/pyspark
    For example:
    export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH

    For more information about this environment variable, see the Python documentation.
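
    As a quick sanity check, assuming Python 3 is available as python3 on the PATH, you can confirm that the py4j and pyspark modules resolve through PYTHONPATH:
    python3 -c "import py4j; import pyspark"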

  3. Set the following Spark environment variables on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYSPARK_PYTHON
        Path to the Python binary executable to use for PySpark, for both the Spark driver and workers.

    PYSPARK_DRIVER_PYTHON
        Path to the Python binary executable to use for PySpark, for the Spark driver only. When not set, the Spark driver uses the path defined in PYSPARK_PYTHON.
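
    For example, assuming Python 3 is installed at /usr/bin/python3 (adjust the path to match your installation):
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3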

    For more information about these environment variables, see the Apache Spark documentation.