Stage-Related Prerequisites
Some stages require that you complete prerequisite tasks before using them in a pipeline. If you do not use a particular stage, you can skip the tasks for the stage.
Local Pipeline Prerequisites for Amazon S3 and ADLS Gen2
- Amazon S3
- Microsoft Azure Data Lake Storage Gen2
To run pipelines that connect to these systems, Spark requires access to Hadoop client libraries and system-related client libraries.
Cluster pipelines require no action because supported cluster managers include the required libraries. However, before you run a local pipeline that connects to these systems, you must complete the following prerequisite tasks.
Ensure Access to Hadoop Client Libraries
To run a local pipeline that connects to Amazon S3 or Azure Data Lake Storage Gen2, the Spark installation on the Transformer machine must be configured to use Hadoop 3.2.0 client libraries.
Some Spark installations include Hadoop by default, but they include Hadoop versions earlier than 3.2.0. To ensure that Spark uses the required Hadoop client libraries, complete the following tasks:
- Install Spark without Hadoop on the Transformer machine.
You can download Spark without Hadoop from the Spark website. Select the version named Pre-built with user-provided Apache Hadoop.
- Install Hadoop 3.2.0.
You can download Hadoop 3.2.0 from the Apache website. A combined sketch of these installation steps appears after this list.
- Configure the SPARK_DIST_CLASSPATH environment variable to include the Hadoop libraries.
Spark recommends adding an entry to the conf/spark-env.sh file. For more information, see the Spark documentation. For example:
export SPARK_DIST_CLASSPATH=$(/<path to>/hadoop/bin/hadoop classpath)
- If the Transformer machine includes multiple versions of Spark, configure the machine to use the Spark version without Hadoop. For example:
export SPARK_HOME="/<path to>/spark-2.4.3-bin-without-hadoop"
Use the Transformer Stage Library
Before running a local pipeline that connects to Amazon S3 or Azure Data Lake Storage Gen2, configure each related stage to use the Transformer stage library.
The Transformer stage library includes the required system-related client library with the pipeline, enabling a local Spark installation to execute the pipeline.
- In the stage properties, on the General tab, configure the Stage Library property to use the Transformer stage library:
  - For Amazon S3 stages, select AWS Library from Transformer (for Hadoop 3.2.0).
  - For ADLS Gen2 stages, select ADLS Library from Transformer (for Hadoop 3.2.0).
- Repeat for any stage that connects to Amazon S3 or Azure Data Lake Storage Gen2.
PySpark Processor Prerequisites
You can use the PySpark processor to develop custom PySpark code in pipelines that provision a Databricks cluster, in standalone pipelines, and in pipelines that run on any existing cluster except for Dataproc. Do not use the processor in Dataproc pipelines or in pipelines that provision non-Databricks clusters.
Before using the PySpark processor in an existing cluster, you must complete several prerequisite tasks. The tasks that you perform depend on where the pipeline runs:
- Existing Databricks cluster
- Existing EMR cluster
- Other existing cluster or local Transformer machine
When using the processor in a pipeline that provisions a Databricks cluster, perform the tasks described in Provisioned Databricks Cluster Requirements.
Existing Databricks Cluster
Before using the PySpark processor in pipelines that run on an existing Databricks cluster, set the required environment variables on the cluster.
When running the pipeline on a provisioned Databricks cluster, you configure the environment variables in the pipeline cluster configuration property. For more information, see Provisioned Databricks Cluster Requirements.
PYSPARK_PYTHON=/databricks/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
PYTHONPATH=/databricks/spark/python/lib/py4j-<version>-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Note that the PYTHONPATH variable requires the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses py4j version 0.10.7:
PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Existing EMR Cluster
Before using the PySpark processor in pipelines that run on an existing EMR cluster, complete all of the following prerequisite tasks.
- Master instance
- Complete the following steps to set the required environment variables on the master instance:
- Add the following variables to the /etc/spark/conf/spark-env.sh file:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:
export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
- Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:
export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
- Worker instance
- Complete the following steps to set up each worker instance in the cluster:
- Run the following commands to install wheel to manage package dependencies and to install PySpark for Python 3:
pip-3.6 install wheel
pip-3.6 install pyspark==<Spark cluster version>
For more information about wheel, see the Python Package Index (PyPI) documentation.
- Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.<minor version>/dist-packages
Note that the PYTHONPATH variable requires the Python version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6:
export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages
For more information about EMR instance types, see the Amazon EMR documentation.
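As an optional, hypothetical sanity check that is not part of the documented prerequisites, you might confirm on a worker instance that Python 3 can import the pip-installed PySpark package:
# Optional check (assumption, not a documented step): verify that Python 3
# on the worker can import the pip-installed PySpark package.
python3 -c "import pyspark; print(pyspark.__version__)"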
Other Existing Clusters and Local Pipelines
- Install Python 3 on all nodes in the Spark cluster, or on the Transformer machine for local pipelines.
The processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.
You can use a package manager to install Python from the command line, or you can download and install Python from the Python download page. A combined sketch of these steps appears after this list.
- Set the following Python environment variable on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
PYTHONPATH - Lists one or more directory paths that contain Python modules available for import. Include the following paths:
$SPARK_HOME/libexec/python/lib/py4j-<version>-src.zip
$SPARK_HOME/libexec/python
$SPARK_HOME/bin/pyspark
For example:
export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH
For more information about this environment variable, see the Python documentation.
- Set the following Spark environment variables on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
PYSPARK_PYTHON - Path to the Python binary executable to use for PySpark, for both the Spark driver and workers.
PYSPARK_DRIVER_PYTHON - Path to the Python binary executable to use for PySpark, for the Spark driver only. When not set, the Spark driver uses the path defined in PYSPARK_PYTHON.
For more information about these environment variables, see the Apache Spark documentation.
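As a rough illustration only, the following sketch shows how these steps might look on a Debian-based cluster node. The package manager command, the py4j version, and the Python path are assumptions drawn from the examples above; adjust them for your environment.
# Hypothetical example for a Debian-based node; adjust the package manager,
# paths, and py4j version for your environment.
sudo apt-get install -y python3

# Python environment variable, for example in the shell profile or spark-env.sh:
export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH

# Spark environment variables:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3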
Scala Processor and Preprocessing Script Prerequisites
Before using a Scala processor or preprocessing script in a pipeline, complete the following prerequisite tasks:
- Ensure that Transformer uses the version of Spark installed on your cluster.
- Set the SPARK_HOME environment variable on all nodes in the Spark cluster.
Set the SPARK_HOME Environment Variable
When using a Scala processor or preprocessing script, set the SPARK_HOME environment variable on all nodes in the Spark cluster.
The SPARK_HOME environment variable must be set to the version of Spark installed on your cluster.
You can also use the Extra Spark Configuration pipeline property to set the environment variable for the duration of the pipeline run.
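For example, assuming a hypothetical installation directory of /usr/lib/spark on each node, you might set the variable as follows:
# Hypothetical path; point SPARK_HOME at the Spark version installed on your cluster.
export SPARK_HOME=/usr/lib/spark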