Existing EMR Cluster

Before using the PySpark processor in pipelines that run on an existing EMR cluster, complete all of the following prerequisite tasks.

Perform the tasks on the master instance and on all worker instances in the cluster:
Master instance
Complete the following steps to set the required environment variables on the master instance; sample commands appear after the steps:
  1. Add the following variables to the /etc/spark/conf/spark-env.sh file.
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH

    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:

    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
  2. Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark

    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:

    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
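For example, if you are connected to the master instance over SSH, you might append the step 1 variables to the Spark environment file with commands similar to the following sketch. The Python 3.6 and py4j 0.10.7 paths are illustrative; replace them with the versions used by your cluster:

    # Append the required variables to the Spark environment file (run on the master instance).
    # Single quotes keep $SPARK_HOME and $PYTHONPATH literal so they are resolved when Spark runs.
    echo 'export PYSPARK_PYTHON=/usr/bin/python3' | sudo tee -a /etc/spark/conf/spark-env.sh
    echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3' | sudo tee -a /etc/spark/conf/spark-env.sh
    echo 'export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH' | sudo tee -a /etc/spark/conf/spark-env.sh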
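Similarly, for step 2 you might create the environment variable file with commands like these. Again, the Python 3.6 and py4j 0.10.7 paths are illustrative and must match your cluster:

    # Create /etc/profile.d/transformer.sh with the required variables (run on the master instance).
    # The first command creates or overwrites the file; the remaining commands append to it.
    echo 'export PYSPARK_PYTHON=/usr/bin/python3' | sudo tee /etc/profile.d/transformer.sh
    echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3' | sudo tee -a /etc/profile.d/transformer.sh
    echo 'export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark' | sudo tee -a /etc/profile.d/transformer.sh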
Worker instance
Complete the following steps to set up each worker instance in the cluster; sample commands appear after the steps:
  1. Run the following commands to install wheel, which manages package dependencies, and PySpark for Python 3:
    pip-3.6 install wheel
    pip-3.6 install pyspark==<Spark cluster version>

    For more information about wheel, see the Python Package Index (PyPI) documentation.

  2. Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.<minor version>/dist-packages

    Note that the PYTHONPATH variable requires the Python version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6:

    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages
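For example, on a worker instance in a cluster that runs Spark 2.4.4 with Python 3.6, the step 1 installation might look like the following sketch. The Spark version is illustrative; install the PySpark version that matches the Spark version your cluster actually runs, and optionally verify the installation afterward:

    # Install wheel and a PySpark version that matches the cluster's Spark version (2.4.4 here is illustrative).
    pip-3.6 install wheel
    pip-3.6 install pyspark==2.4.4

    # Optional check: confirm that Python 3 can import the installed PySpark package.
    python3 -c "import pyspark; print(pyspark.__version__)"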
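And for step 2 on the worker instance, you might create the environment variable file with commands like these. The Python 3.6 path is illustrative:

    # Create /etc/profile.d/transformer.sh on the worker instance.
    # The first command creates or overwrites the file; the remaining commands append to it.
    echo 'export PYSPARK_PYTHON=/usr/bin/python3' | sudo tee /etc/profile.d/transformer.sh
    echo 'export PYSPARK_DRIVER_PYTHON=/usr/bin/python3' | sudo tee -a /etc/profile.d/transformer.sh
    echo 'export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages' | sudo tee -a /etc/profile.d/transformer.sh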

For more information about EMR instance types, see the Amazon EMR documentation.