Stage-Related Prerequisites

Some stages require that you complete prerequisite tasks before using them in a pipeline. If you do not use a particular stage, you can skip the tasks for the stage.

The following stages and features include prerequisite tasks:

Local Pipeline Prerequisites for Amazon S3 and ADLS

Transformer uses Hadoop APIs to connect to the following external systems:
  • Amazon S3
  • Microsoft Azure Data Lake Storage Gen1 and Gen2

To run pipelines that connect to these systems, Spark requires access to Hadoop client libraries and system-related client libraries.

Cluster pipelines require no action because supported cluster managers include the required libraries. However, before you run a local pipeline that connects to these systems, you must complete the following prerequisite tasks.

Ensure Access to Hadoop Client Libraries

To run a local pipeline that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2, the Spark installation on the Transformer machine must be configured to use Hadoop 3.2.0 client libraries.

Some Spark installations include Hadoop by default, but they include versions earlier than Hadoop 3.2.0. To ensure access to the required client libraries, complete the following steps:

  1. Install Spark without Hadoop on the Transformer machine.

    You can download Spark without Hadoop from the Spark website. Select the package named Pre-built with user-provided Apache Hadoop.

  2. Install Hadoop 3.2.0.

    You can download Hadoop 3.2.0 from the Apache website.

  3. Configure the SPARK_DIST_CLASSPATH environment variable to include the Hadoop libraries.
    Spark recommends adding an entry to the conf/spark-env.sh file. For example:
    export SPARK_DIST_CLASSPATH=$(/<path to>/hadoop/bin/hadoop classpath)
    For more information, see the Spark documentation.
  4. If the Transformer machine includes multiple versions of Spark, configure the machine to use the Spark version without Hadoop.
    For example:
    export SPARK_HOME="/<path to>/spark-2.4.3-bin-without-hadoop"
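
Taken together, the configuration on the Transformer machine might look like the following sketch. The installation paths are assumptions; substitute the locations used on your machine:
    # In the Spark conf/spark-env.sh file:
    export SPARK_DIST_CLASSPATH=$(/opt/hadoop-3.2.0/bin/hadoop classpath)
    # In the shell profile of the user that runs Transformer, when multiple Spark versions are installed:
    export SPARK_HOME="/opt/spark-2.4.3-bin-without-hadoop"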

Use the Transformer Stage Library

Before running a local pipeline that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2, configure each related stage to use the Transformer stage library.

The Transformer stage library bundles the required system-related client library with the pipeline, which enables the local Spark installation to run the pipeline.

  1. In the stage properties, on the General tab, configure the Stage Library property to use the Transformer stage library:
    • For Amazon S3 stages, select AWS Library from Transformer (for Hadoop 3.2.0).
    • For ADLS stages, select ADLS Library from Transformer (for Hadoop 3.2.0).
  2. Repeat for any stage that connects to Amazon S3, or Azure Data Lake Storage Gen1 or Gen2.

PySpark Processor Prerequisites

You can use the PySpark processor to develop custom PySpark code in pipelines that provision a Databricks cluster, in standalone pipelines, and in pipelines that run on any existing cluster except for Dataproc. Do not use the processor in Dataproc pipelines or in pipelines that provision non-Databricks clusters.

Before using the PySpark processor in a pipeline that runs on an existing cluster or on the local Transformer machine, you must complete several prerequisite tasks. The tasks that you perform depend on where the pipeline runs, as described in the following sections.

When using the processor in a pipeline that provisions a Databricks cluster, perform the tasks described in Provisioned Databricks Cluster Requirements.

Existing Databricks Cluster

Before using the PySpark processor in pipelines that run on an existing Databricks cluster, set the required environment variables on the cluster.

When running the pipeline on a provisioned Databricks cluster, you configure the environment variables in the pipeline cluster configuration property. For more information, see Provisioned Databricks Cluster Requirements.

On an existing Databricks cluster, the PySpark processor requires the following environment variables:
PYSPARK_PYTHON=/databricks/python3/bin/python3
PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
PYTHONPATH=/databricks/spark/python/lib/py4j-<version>-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Note that the PYTHONPATH variable requires the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses py4j version 0.10.7:
PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark
Tip: You can configure environment variables in your Databricks Workspace by clicking Clusters > Advanced Options > Spark. Then, enter the environment variables in the Environment Variables property. Restart the cluster to enable your changes.
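
For example, the Environment Variables property on the cluster might contain the following entries. This is a sketch that assumes py4j version 0.10.7; use the py4j version installed on your cluster:
    PYSPARK_PYTHON=/databricks/python3/bin/python3
    PYSPARK_DRIVER_PYTHON=/databricks/python3/bin/python3
    PYTHONPATH=/databricks/spark/python/lib/py4j-0.10.7-src.zip:/databricks/spark/python:/databricks/spark/bin/pyspark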

Existing EMR Cluster

Before using the PySpark processor in pipelines that run on an existing EMR cluster, complete all of the following prerequisite tasks.

The tasks must be performed on the master and all worker instances in the cluster:
Master instance
Complete the following steps to set the required environment variables on the master instance:
  1. Add the following variables to the /etc/spark/conf/spark-env.sh file.
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH

    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:

    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark:$PYTHONPATH
  2. Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/lib/python3.<minor version>/dist-packages:$SPARK_HOME/python/lib/py4j-<version>-src.zip:$SPARK_HOME/python:/usr/bin/pyspark

    Note that the PYTHONPATH variable requires the Python version and the py4j version used by the cluster. For example, you might set the variable as follows for a cluster that uses Python version 3.6 and py4j version 0.10.7:

    export PYTHONPATH=/usr/lib/python3.6/dist-packages:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python:/usr/bin/pyspark
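
After updating both files on the master instance, you can optionally verify the configuration with a quick check like the following sketch. The check is illustrative and not part of the required setup:
    # Load the new profile script and confirm that the variables resolve as expected
    source /etc/profile.d/transformer.sh
    env | grep -E 'PYSPARK|PYTHONPATH'
    # Optionally confirm that Python 3 can import py4j (assumes SPARK_HOME is set in this shell)
    /usr/bin/python3 -c "import py4j"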
Worker instance
Complete the following steps to set up each worker instance in the cluster:
  1. Run the following commands to install wheel, which manages package dependencies, and to install PySpark for Python 3:
    pip-3.6 install wheel
    pip-3.6 install pyspark==<Spark cluster version>

    For more information about wheel, see the Python Package Index (PyPI) documentation.

  2. Create an environment variable file, /etc/profile.d/transformer.sh, then add the following variables to the file:
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.<minor version>/dist-packages
    Note that the PYTHONPATH variable requires the Python version. For example, you might set the variable as follows for a cluster that uses Python version 3.6:
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages
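
For example, on a worker instance in a cluster that uses Python 3.6 and Spark 2.4.3, the setup might look like the following sketch. The Spark version shown is an assumption; install the PySpark version that matches the Spark version on your cluster:
    # Install wheel and a matching PySpark version for Python 3
    pip-3.6 install wheel
    pip-3.6 install pyspark==2.4.3
    # Contents of /etc/profile.d/transformer.sh
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    export PYTHONPATH=/usr/bin/pyspark:/usr/lib/python3.6/dist-packages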

For more information about EMR instance types, see the Amazon EMR documentation.

Other Existing Clusters and Local Pipelines

Complete the following prerequisite tasks before using the PySpark processor in a pipeline that runs on an existing cluster other than Databricks or EMR, or in a local pipeline that runs on the local Transformer machine.
  1. Install Python 3 on all nodes in the Spark cluster, or on the Transformer machine for local pipelines.

    The processor can use any Python 3.x version. However, StreamSets recommends installing the latest version.

    You can use a package manager to install Python from the command line. Or you can download and install Python from the Python download page.

  2. Set the following Python environment variable on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYTHONPATH - Lists one or more directory paths that contain Python modules available for import. Include the following paths:
    • $SPARK_HOME/libexec/python/lib/py4j-<version>-src.zip
    • $SPARK_HOME/libexec/python
    • $SPARK_HOME/bin/pyspark
    For example:
    export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH

    For more information about this environment variable, see the Python documentation.

  3. Set the following Spark environment variables on all nodes in the Spark cluster, or on the Transformer machine for local pipelines:
    PYSPARK_PYTHON - Path to the Python binary executable to use for PySpark, for both the Spark driver and workers.
    PYSPARK_DRIVER_PYTHON - Path to the Python binary executable to use for PySpark, for the Spark driver only. When not set, the Spark driver uses the path defined in PYSPARK_PYTHON.

    For more information about these environment variables, see the Apache Spark documentation.
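
Putting these settings together for a local pipeline, the exports on the Transformer machine might look like the following sketch. The Python executable path and py4j version are assumptions; use the values for your installation:
    # Python executable used by PySpark for the Spark driver and workers
    export PYSPARK_PYTHON=/usr/bin/python3
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
    # Make the PySpark and py4j modules available for import
    export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/libexec/python:$SPARK_HOME/bin/pyspark:$PYTHONPATH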

Scala Processor and Preprocessing Script Prerequisites

Before you use a Scala processor or preprocessing script in a pipeline, complete the following prerequisite tasks:
  • Ensure that Transformer uses the version of Spark installed on your cluster.
  • Set the SPARK_HOME environment variable on all nodes in the Spark cluster.

Set the SPARK_HOME Environment Variable

When using a Scala processor or preprocessing script, set the SPARK_HOME environment variable on all nodes in the Spark cluster.

The SPARK_HOME environment variable must point to the installation directory of the Spark version installed on your cluster.

You can also use the Extra Spark Configuration pipeline property to set the environment variable for the duration of the pipeline run.
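
For example, on a cluster node where Spark is installed under /usr/lib/spark, you might set the variable as follows. The path is an assumption; point the variable at the Spark installation used by your cluster:
    export SPARK_HOME=/usr/lib/spark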