Preprocessing Script
You can specify a Scala preprocessing script in pipeline properties to perform a task before the pipeline starts. Develop the script using the Spark APIs for the version of Spark installed on your cluster. Complete the prerequisite tasks before using a preprocessing script in a pipeline.
You might use the preprocessing script to register a user-defined function (UDF) that you want to use in the pipeline. After you register a UDF, you can use it anywhere in the pipeline that allows the use of Spark SQL, including the Spark SQL Expression or Spark SQL Query processors, the Join or Filter processors, and so on.
Note that UDFs are not optimized by Spark, so they should be used with care.
For example, the following preprocessing script defines a Scala function named inc and registers it as a Spark function named inc:

def inc(i: Integer): Integer = {
  i + 1
}
// Register inc so it can be called from Spark SQL in pipeline stages
spark.udf.register("inc", inc _)

When you have this preprocessing script defined for the pipeline, you can call inc as a Spark SQL function in any pipeline stage that supports Spark SQL.
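As a minimal sketch of how the registered function behaves, the following standalone Scala program (not pipeline code; the SparkSession setup, object name, and column alias are illustrative assumptions) registers the same UDF and calls it through Spark SQL:

// Standalone sketch only: in a pipeline, the SparkSession is provided for you
// and the preprocessing script contains just the function and the register call.
import org.apache.spark.sql.SparkSession

object IncUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inc-udf-example")
      .master("local[*]") // local master for illustration only
      .getOrCreate()

    // Same function and registration as in the preprocessing script above
    def inc(i: Integer): Integer = {
      i + 1
    }
    spark.udf.register("inc", inc _)

    // Once registered, the UDF can be used wherever Spark SQL is accepted,
    // for example in a SELECT expression. Prints a one-row result of 42.
    spark.sql("SELECT inc(41) AS answer").show()

    spark.stop()
  }
}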
To specify a preprocessing script for a pipeline, in the pipeline properties panel, click the Advanced tab and define the script in the Preprocessing Script property.
For more information about Spark Scala APIs, see the Spark documentation.