Filter

The Filter processor passes records that match the filter condition to downstream stages. Records that do not match the filter condition are removed from the pipeline.

Use the Filter processor to remove unwanted records from the pipeline. To route data to separate downstream branches based on different conditions, use the Stream Selector processor.

When you configure the Filter processor, you specify the filter condition to use.

Filter Condition

The filter condition determines the data that passes downstream. The filter condition must evaluate to true or false. Records that evaluate to true pass to the rest of the pipeline.

You can use a condition as simple as AccountID is NOT NULL or you can create as complex a condition as needed.

Here are some guidelines for filter conditions:

When you define a condition, you typically base it on field values in the record. For information about referencing fields in the condition, see Referencing Fields in Spark SQL Expressions.
You can use any Spark SQL syntax that can be used in the WHERE clause of a query, including functions such as isnull or trim and operators such as = or <=.
You can also use user-defined functions (UDFs), but you must define the UDFs in the pipeline. Use a pipeline preprocessing script to define UDFs.
For more information about Spark SQL functions, see the Apache Spark SQL Functions documentation.

For example, the following condition passes only records where the year of the transaction date value is 2000 or later:

year(transaction_date) >= 2000

Sample Conditions

The following table lists some common scenarios that you might adapt for your use:


Condition Example	Description
`total > 0`	If the value in the `total` field is greater than 0, the record passes downstream. If not, the record is dropped from the pipeline.
`total <= 0`	If the value in the `total` field is less than or equal to 0, the record passes downstream. If not, the record is dropped from the pipeline.
`accountId is NOT NULL`	If the record has a value in the `accountId` field, the record passes downstream. If the field contains a null value, the record is dropped from the pipeline. Note that `NULL` is not case sensitive. For example, you can alternatively use `null` or `Null` in the condition.
`upper(message) like '%ERROR%'`	If the `message` field contains the string, `ERROR`, the record passes downstream. If not, the record is dropped from the pipeline. The condition changes the strings in the `message` field to uppercase before performing the evaluation. This allows the condition to also apply to `error` and `Error`, for example.
`initcap(country) like 'China' OR initcap(country) like 'Japan'`	If the value in the `country` field is `China` or `Japan`, the record passes downstream. If not, the record is dropped from the pipeline. The condition changes the strings in the `country` field to capitalize the first letter before performing the evaluation. This allows the condition to also apply to `CHINA` and `japan`, for example.

Configuring a Filter Processor

Configure a Filter processor to allow only the records that match a specified condition to pass downstream.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

On the Filter tab, specify the filter condition to use.