Preprocessing Script
You can specify a Scala preprocessing script in pipeline properties to perform a task before the pipeline starts. Complete the prerequisite tasks before using a preprocessing script in a pipeline.
You might use the preprocessing script to register a user-defined function (UDF) that you want to use in the pipeline. After you register a UDF, you can use it anywhere in the pipeline that allows the use of Spark SQL, including the Spark SQL Expression and Spark SQL Query processors, the Join and Filter processors, and so on.
Note that UDFs are not optimized by Spark, so they should be used with care.
For example, the following preprocessing script defines a Scala method named inc and registers it as a Spark function, also named inc:
def inc(i: Integer): Integer = {
  i + 1
}
spark.udf.register("inc", inc _)
After the preprocessing script registers this UDF, pipeline stages can call inc as a function in any Spark SQL expression or query.
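As an illustrative sketch of calling the registered UDF from Spark SQL, the snippet below applies inc to a column through a query. The view name orders and the column name quantity are hypothetical, and the snippet assumes the spark session that Transformer provides to the pipeline:

```scala
// Sketch only: "orders" and "quantity" are hypothetical names, and spark
// is the SparkSession that Transformer supplies to the pipeline.
// The inc UDF must already be registered by the preprocessing script.
val incremented = spark.sql(
  "SELECT quantity, inc(quantity) AS quantity_plus_one FROM orders"
)
incremented.show()
```

In a Spark SQL Expression processor, the equivalent expression would simply be inc(quantity).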
To specify a preprocessing script for a pipeline, in the pipeline properties panel, click the Advanced tab and define the script in the Preprocessing Script property.
For more information about Spark Scala APIs, see the Spark documentation.
Preprocessing Script Requirements
To ensure that a preprocessing script runs as expected, make sure that the following requirements are met:
- Compatible Scala version on the cluster: The Scala version on the cluster must be compatible with the Scala processing mechanism included with Transformer.
- Valid code: The code in the preprocessing script must be compatible with the Scala and Spark versions used by the cluster.