PySpark Processor Prerequisites
You can use the PySpark processor to develop custom PySpark code. The processor can run in pipelines that provision a Databricks cluster, in standalone pipelines, and in pipelines that run on any existing cluster except Dataproc. Do not use the processor in Dataproc pipelines or in pipelines that provision non-Databricks clusters.
Before using the PySpark processor in a pipeline that runs on an existing cluster, you must complete several prerequisite tasks. The tasks that you perform depend on where the pipeline runs:
- Existing Databricks cluster
- Existing EMR cluster
- Other existing cluster or local Transformer machine
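For an existing non-Databricks cluster or a local Transformer machine, the prerequisites generally amount to making a compatible Python interpreter and the Spark Python libraries visible to the Spark installation that runs the pipeline. The sketch below shows the kind of environment setup typically involved; the specific paths and variable values are assumptions for illustration, so follow the prerequisite task list for your cluster type rather than copying these values verbatim.

```shell
# Hypothetical paths -- adjust for your installation.
# Point Spark at the Python interpreter to use for PySpark code.
export PYSPARK_PYTHON=/usr/bin/python3

# Make the PySpark and Py4J libraries bundled with Spark importable.
export SPARK_HOME=/opt/spark
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-src.zip:$PYTHONPATH"
```

On a multi-node cluster, equivalent settings must be in effect on every node that runs Spark executors, not only on the machine where Transformer runs.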
When you use the processor in a pipeline that provisions a Databricks cluster, complete the prerequisite tasks for provisioned Databricks clusters before you run the pipeline.