Preprocessing Script

You can specify a Scala preprocessing script in pipeline properties to perform a task before the pipeline starts. Complete the prerequisite tasks before using a preprocessing script in a pipeline.

You might use the preprocessing script to register a user-defined function (UDF) that you want to use in the pipeline. After you register a UDF, you can use it anywhere in the pipeline that allows the use of Spark SQL, including the Spark SQL Expression or Spark SQL Query processors, the Join or Filter processors, and so on.

Note that UDFs are not optimized by Spark, so they should be used with care.

For example, the following Scala script defines a function named inc that increments an integer by one, then registers it as a Spark SQL function, also named inc:
// Scala function that increments an integer by one
def inc(i: Integer): Integer = {
  i + 1
}
// Register the function as a Spark SQL function named "inc"
spark.udf.register("inc", inc _)

When you register this UDF in the pipeline preprocessing script, you can call inc as a Spark SQL function in pipeline stages.
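To illustrate how the registered UDF behaves at runtime, the following standalone Scala sketch registers inc the same way and calls it from a Spark SQL query. The table name input, the column name n, and the local SparkSession are assumptions for this example only; in a pipeline, Transformer supplies the spark session and the stage data.

```scala
import org.apache.spark.sql.SparkSession

object IncUdfDemo {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for illustration; in a pipeline, Transformer provides `spark`.
    val spark = SparkSession.builder()
      .appName("inc-udf-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same registration as the preprocessing script above
    def inc(i: Integer): Integer = i + 1
    spark.udf.register("inc", inc _)

    // "input" and column "n" are hypothetical names for this sketch
    Seq(1, 2, 3).toDF("n").createOrReplaceTempView("input")
    spark.sql("SELECT n, inc(n) AS n_plus_one FROM input").show()

    spark.stop()
  }
}
```

The same SELECT expression, inc(n), could appear in a Spark SQL Query processor, and inc(n) alone in a Spark SQL Expression processor.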

To specify a preprocessing script for a pipeline, in the pipeline properties panel, click the Advanced tab and define the script in the Preprocessing Script property.

For more information about Spark Scala APIs, see the Spark documentation.

Preprocessing Script Requirements

To ensure that a preprocessing script runs as expected, make sure that the following requirements are met:

Compatible Scala version on the cluster
The Scala version on the cluster must be compatible with the Scala processing mechanism included with Transformer.
Transformer is prebuilt with a specific Scala version. To handle the preprocessing script, Transformer uses the scala.tools.nsc package in the Scala API, which can change between Scala patch releases. For example, the package changed between Scala 2.12.10 and 2.12.14.
To ensure that a preprocessing script runs as expected, use one of the following recommended Scala versions on the cluster:
Transformer Version          Recommended Cluster Runtime Scala Versions
Prebuilt with Scala 2.12     2.12.10 or 2.12.14
Note: If the cluster includes a Scala compiler tool in the pipeline application classpath at runtime, the version of the Scala compiler tool takes precedence over the one included with Transformer.
For example, if the cluster includes a Scala 2.12 compiler tool, then use a cluster runtime with a Scala version as recommended in the table above for Scala 2.12.
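To confirm which Scala and Spark versions the cluster runtime actually uses before writing the script, you might run a check like the following in a spark-shell session on the cluster. Both values come from standard APIs: scala.util.Properties is part of the Scala standard library, and SPARK_VERSION is a public constant in the org.apache.spark package.

```scala
// Run in spark-shell on the cluster to verify runtime versions.
println(scala.util.Properties.versionString)  // Scala version, e.g. "version 2.12.10"
println(org.apache.spark.SPARK_VERSION)       // Spark version in use
```

Compare the reported Scala version against the recommended versions in the table above.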
Valid code
The code in the preprocessing script must be compatible with the Scala and Spark versions used by the cluster.
Develop the script using the Spark APIs for the version of Spark installed on your cluster.