Expressions in Pipeline and Stage Properties

Some pipeline and stage properties allow you to specify an expression. When configuring an expression, use one of the following languages:
Spark SQL query language
Spark SQL is the relational query language used with Spark. Because processing for Transformer pipelines occurs on a Spark cluster, you must use Spark SQL for all expressions that manipulate pipeline data.
For example, when using the Filter processor to remove data from the pipeline, you define the filter condition using any Spark SQL syntax that can be used in the WHERE clause of a query.
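As a rough illustration of what a WHERE-clause predicate looks like in practice, the sketch below uses Python's standard-library sqlite3 module as a stand-in for Spark SQL, since basic WHERE-clause syntax is shared. The table and column names are hypothetical, not from Transformer:

```python
import sqlite3

# Illustration only: sqlite3 stands in for Spark SQL here, since simple
# WHERE-clause predicates share much of their syntax. The orders table
# and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 50.0), (2, 250.0), (3, 400.0)])

# The same kind of predicate you might enter as a Filter condition:
condition = "amount > 100"
rows = conn.execute(f"SELECT id FROM orders WHERE {condition}").fetchall()
# rows -> [(2,), (3,)]
```

In the Filter processor you would enter only the predicate itself (here, amount > 100); the stage supplies the rest of the query.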
Stages that require using Spark SQL include examples of the syntax that you might use. For information on specifying field names in Spark SQL expressions, see Referencing Fields in Spark SQL Expressions. For more information about Spark SQL functions, see the Apache Spark SQL Functions documentation.
StreamSets expression language
The StreamSets expression language is based on the JSP 2.0 expression language. If you use StreamSets Data Collector or Control Hub, you are probably familiar with the StreamSets expression language.
In Transformer, you can use the StreamSets expression language in pipeline or stage properties that are evaluated only once, before pipeline processing begins. This includes properties such as connection details and runtime parameters.
For example, you can use the following expression in the Password property of a stage to use a Base64-encoded password:
${base64:decodeString("bXlwYXNzd29yZA==", "UTF-8")}
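As a quick check outside Transformer, the same Base64 value can be decoded with Python's standard library to see what the expression evaluates to before pipeline processing begins:

```python
import base64

# Decode the example value from the expression above.
password = base64.b64decode("bXlwYXNzd29yZA==").decode("UTF-8")
# password -> "mypassword"
```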
Note: You cannot use the StreamSets expression language in properties that evaluate pipeline data. As a result, some functions you might be accustomed to using in other StreamSets products, such as the record or field functions, are not supported in Transformer.
For more information about using the StreamSets expression language with Transformer, see StreamSets Expression Language.

Referencing Fields in Spark SQL Expressions

To reference a first-level field in a record in a Spark SQL expression, simply specify the field name. Transformer does not evaluate field names case-sensitively within a pipeline.

For example, to deduplicate data based on an ID field, you configure a Deduplicate processor to deduplicate based on specified fields. Because field names are not case-sensitive, you can specify ID, Id, iD, or id as the field to use.

To reference a field within a Map field, use dot notation (.) to specify the path to the field, as follows:
<top level>.<next level>.<next level>.<field to use>

For example, customer.transactions.2019.
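To see how such a path resolves level by level, the sketch below models a Map field as a nested Python dict. The field names and value are hypothetical illustrations, not Transformer APIs:

```python
# Illustration only: a Map field modeled as a nested Python dict, showing
# how the dot-notation path customer.transactions.2019 walks each level.
record = {
    "customer": {
        "transactions": {
            "2019": 42,
        }
    }
}

value = record["customer"]["transactions"]["2019"]
# value -> 42
```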

To reference an item in a List field, use bracket notation ([#]) to indicate the position in a list. Use 0 to indicate the first item in the list, 1 to indicate the second, and so on.

For example, to reference the second item in an appt_date List field, enter appt_date[1].
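Bracket notation is zero-based, just like list indexing in most programming languages. As an illustration (the dates below are hypothetical), modeling the List field as a Python list:

```python
# Illustration only: a List field modeled as a Python list. Indexing is
# zero-based in both cases, so appt_date[1] is the second item.
appt_date = ["2019-03-01", "2019-06-15", "2019-09-30"]

second = appt_date[1]
# second -> "2019-06-15"
```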
Tip: After running preview for a pipeline, you can also copy a field path from the preview results or from the input and output schema view for a stage.