Existing EMR Cluster
Before using the PySpark processor in pipelines that run on an existing EMR cluster, complete all of the following prerequisite tasks.
- Master instance
  Complete the following steps to set the required environment variables on the master instance:
  - Add the following variables to the /etc/spark/conf/spark-env.sh file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7 (the sketch after these steps shows one way to look up these values on the cluster):
    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
  - Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:
    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
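If you are not sure which Python and py4j versions the cluster uses, you can look them up on the master instance before editing the files above. The following sketch is an optional helper rather than part of the documented steps; it assumes a standard EMR layout where SPARK_HOME points to the Spark installation (typically /usr/lib/spark) and python3 is the cluster's Python 3 interpreter:
  # Look up the Python 3 minor version and the py4j archive bundled with Spark.
  PY_VER=$(python3 -c 'import sys; print("{}.{}".format(*sys.version_info[:2]))')
  PY4J_ZIP=$(ls ${SPARK_HOME:-/usr/lib/spark}/python/lib/py4j-*-src.zip)
  echo "Python version: ${PY_VER}"
  echo "py4j archive:   ${PY4J_ZIP}"
  # Substitute the discovered values into the PYTHONPATH exports described above, for example:
  export PYTHONPATH=/usr/lib/python${PY_VER}/dist-packages:${PY4J_ZIP}:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
The same values apply to both the spark-env.sh exports and the /etc/profile.d/transformer.sh exports.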
- Worker instance
  Complete the following steps to set up each worker instance in the cluster:
  - Run the following commands to install wheel to manage package dependencies and to install PySpark for Python 3:
    pip-3.6 install wheel
    pip-3.6 install pyspark==<Spark cluster version>
    For more information about wheel, see the Python Package Index (PyPI) documentation.
  - Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.<minor version>/dist-packages
    Note that the PYTHONPATH variable requires the Python version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6:
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages
    A quick way to verify the worker setup is shown in the sketch after these steps.
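After completing the worker steps, you can optionally confirm the setup from a new login shell on the worker instance, so that /etc/profile.d/transformer.sh has been sourced. This is a sanity-check sketch assuming the pip-3.6 and python3 commands used above, not a required step:
  # Optional check on a worker instance, run from a fresh login shell.
  pip-3.6 show pyspark                                        # installed PySpark version
  python3 -c 'import pyspark; print(pyspark.__version__)'     # PySpark imports under Python 3
  echo "$PYSPARK_PYTHON"                                      # expect /usr/bin/python3
  echo "$PYTHONPATH"                                          # expect /usr/bin/pyspark and the dist-packages path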
For more information about EMR instance types, see the Amazon EMR documentation.