PySpark
The PySpark processor transforms data based on custom PySpark code. You develop the custom code using PySpark, the Python API for Spark. The processor supports Python 3.
The processor can receive multiple input streams, but can produce only a single output
stream. When the processor receives multiple input streams, it receives one Spark
DataFrame from each input stream. The custom PySpark code must produce a single
DataFrame.
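For example, here is a minimal sketch of custom code that combines two input streams into one output. It assumes the processor exposes the incoming DataFrames as a list named inputs and reads the result from a variable named output; verify these names against the code template in your version of the processor:

    # Sketch of custom PySpark code for the processor. Assumes the
    # processor binds incoming DataFrames to a list named `inputs` and
    # reads the result from `output`; check your version's code template.
    from pyspark.sql import functions as F

    # One DataFrame arrives per input stream; union them into one frame.
    # unionByName matches columns by name rather than position.
    combined = inputs[0]
    for df in inputs[1:]:
        combined = combined.unionByName(df)

    # The custom code must produce a single DataFrame.
    output = combined.withColumn("processed_at", F.current_timestamp())

With a single input stream, inputs[0] is the only DataFrame and the loop does nothing, so the same code works for one or many input streams.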
Tip: In streaming pipelines, you can use a Window
processor upstream from this processor to generate larger batch sizes for
evaluation.
You can use the PySpark processor in standalone pipelines, in pipelines that provision a Databricks cluster, and in pipelines that run on any existing cluster except Dataproc. Do not use the processor in Dataproc pipelines or in pipelines that provision a non-Databricks cluster.