Preprocessing Script

You can specify a Scala preprocessing script in pipeline properties to perform a task before the pipeline starts. Develop the script using the Spark APIs for the version of Spark installed on your cluster. Complete the prerequisite tasks before using a preprocessing script in a pipeline.

You might use the preprocessing script to register a user-defined function (UDF) that you want to use in the pipeline. After you register a UDF, you can use it anywhere in the pipeline that allows the use of Spark SQL, including the Spark SQL Expression or Spark SQL Query processors, the Join or Filter processors, and so on.

Note that Spark cannot optimize the logic inside a UDF, so use UDFs with care.
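As a rough illustration of that caution, the sketch below uses the built-in Spark SQL function upper where a hand-written uppercase UDF might otherwise be tempting. Built-in expressions stay visible to the Spark optimizer. The local Spark session, the column name name, and the sample values are assumptions made for this example; in a pipeline you would enter only the expression in a stage property.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

// Assumption: a local session, created here only so the sketch is self-contained.
val spark = SparkSession.builder().master("local[1]").appName("builtin-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data with a single string column called "name".
val df = Seq("alice", "bob").toDF("name")

// The same built-in function, once through the Column API and once as a SQL expression.
val viaColumn = df.select(upper(col("name")).as("name_upper"))
val viaSql = df.selectExpr("upper(name) AS name_upper")

val columnValues = viaColumn.collect().map(_.getString(0)).toSeq
val sqlValues = viaSql.collect().map(_.getString(0)).toSeq
```

A UDF remains a reasonable choice when no built-in function covers the logic that you need.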

For example, the following Scala script defines a Scala function named inc that increments an integer by one, then registers it as a Spark UDF that is also named inc:

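// Scala function that returns its input plus one.
def inc(i: Integer): Integer = {
  i + 1
}
// Register the function with Spark SQL under the name "inc".
spark.udf.register("inc", inc _)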

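After the preprocessing script registers this UDF, you can call it in pipeline stages as the Spark SQL function inc, passing the value to increment as an argument. The inc _ syntax is needed only inside the Scala script, where it passes the function to the register call.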
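As a minimal sketch of what such a call might look like, the following standalone snippet registers the same UDF and calls it from a Spark SQL query. The local Spark session, the view name numbers, and the column name n are assumptions made for illustration. In a pipeline, the spark variable is already available to the preprocessing script, and you would enter only an expression such as inc(n) in a stage property.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only. A preprocessing
// script uses the spark variable that already exists.
val spark = SparkSession.builder().master("local[1]").appName("inc-udf-sketch").getOrCreate()
import spark.implicits._

// Same function and registration as in the preprocessing script.
def inc(i: Integer): Integer = {
  i + 1
}
spark.udf.register("inc", inc _)

// Hypothetical input data, exposed as a temporary view named "numbers".
Seq(1, 2, 3).toDF("n").createOrReplaceTempView("numbers")

// The UDF is called by its registered name, like any other Spark SQL function.
val result = spark.sql("SELECT inc(n) AS n_plus_one FROM numbers")
val values = result.collect().map(_.getInt(0)).toSeq
```

Note that this version of inc does not guard against null input, so a null value in the column would cause the function to fail.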

To specify a preprocessing script for a pipeline, in the pipeline properties panel, click the Advanced tab and define the script in the Preprocessing Script property.

For more information about Spark Scala APIs, see the Spark documentation.