Expressions in Pipeline and Stage Properties
Some pipeline and stage properties allow you to specify an expression. When configuring
an expression, use one of the following languages:
- Spark SQL query language
  Spark SQL is the relational query language used with Spark. Because processing for Transformer pipelines occurs on a Spark cluster, you must use Spark SQL for all expressions that manipulate pipeline data.
- StreamSets expression language
  The StreamSets expression language is based on the JSP 2.0 expression language. If you use StreamSets Data Collector or Control Hub, you are probably familiar with the StreamSets expression language.
Referencing Fields in Spark SQL Expressions
To reference a first-level field in a record in a Spark SQL expression, specify the field name. Transformer does not evaluate field names case-sensitively within a pipeline.
For example, to deduplicate data based on an ID field, you configure a Deduplicate processor to deduplicate based on fields. Then, you can specify ID, Id, iD, or id as the field to use.
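The case-insensitive behavior can be sketched with a plain Spark SQL query. This is a minimal illustration, assuming Spark's default setting of spark.sql.caseSensitive (false). The inline one-row table and its id value are hypothetical and only stand in for pipeline data:

```sql
-- Hypothetical one-row input with a single field named id.
-- With case-insensitive resolution, all four spellings
-- refer to the same field.
SELECT ID, Id, iD, id
FROM (SELECT 7 AS id) AS t;
```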
To reference a field within a Map field, use dot notation (.) to specify the path to the field, as follows:
<top level>.<next level>.<next level>.<field to use>
For example, customer.transactions.2019.
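The following is a minimal sketch of dot notation inside a Spark SQL expression. The customer record, its field names, and its values are hypothetical and are built inline only so that the query is self-contained. The nested record is modeled here as a Spark struct, which is an assumption about how the Map field is represented. In a stage property you would enter only the expression, such as upper(customer.address.city):

```sql
SELECT
  customer.name                AS name,  -- one level below the top-level field
  upper(customer.address.city) AS city   -- two levels down, passed to a function
FROM (
  -- Hypothetical nested record standing in for pipeline data.
  SELECT named_struct(
           'name', 'Ada',
           'address', named_struct('city', 'Oslo')
         ) AS customer
) AS t;
```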
To reference an item in a List field, use bracket notation ([#]) to indicate the position in a list. Use 0 to indicate the first item in the list, 1 to indicate the second, and so on.
For example, to reference the second item in an appt_date List field, enter appt_date[1].
Tip: After running preview for a pipeline, you can also copy a field path from the preview results or when you view the input and output schema for a stage.
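The bracket notation for List fields described above can be sketched in a standalone Spark SQL query. The appt_date values are hypothetical, and the List field is modeled as a Spark array, whose bracket indexing is zero-based in line with the rule that 0 is the first item:

```sql
SELECT
  appt_date[0] AS first_appt,   -- first item in the list
  appt_date[1] AS second_appt   -- second item in the list
FROM (
  -- Hypothetical two-item list standing in for pipeline data.
  SELECT array('2019-01-05', '2019-02-10') AS appt_date
) AS t;
```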