Union

The Union processor merges data from two or more incoming streams.

When you configure the Union processor, you connect the upstream stages to the processor and then specify the operation type and column handling option to use. In some cases, you also indicate whether to evaluate all columns or only matching columns.

For example, say you want to merge data from all connected streams and to pass all data, regardless of whether they are duplicates or if they have columns that do not exist in the other streams. To do this, you configure the processor as follows:
  • Operation property: Union All
  • Column Handling property: Pass All Columns

Operation Types

The operation type determines how the Union processor merges data from multiple incoming streams. Row evaluation depends on the column handling method that you select.

The Union processor provides the following operation types:
  • Union - Passes unique rows from all incoming streams.
  • Union All - Passes all rows from all incoming streams, including duplicates.
  • Intersect - Passes only rows that exist in all incoming streams.
  • Except - Passes only the rows from Input 1 of the processor that do not have matching rows from the other incoming streams.
    Note: With the Except operation, the input order of upstream stages is important. Stages are assigned an input order based on the order that you connect them to the processor.

For information about how this property works with different column handling methods, see Output Rows.

Column Handling Method

The column handling method determines how the Union processor evaluates for matching rows. It also defines the columns that are included in the output rows.

The Union processor provides the following column handling methods:
  • Require the same columns in all data sets - The processor adjusts the order of the columns in incoming data based on the column order of the first batch of Input 1. This provides uniformity in column order which enables Snowflake to perform the union, when possible.

    Though not concerned with column order, the processor generates a run error when incoming data has different schemas.

  • Only pass the columns that exist in all data sets - The processor drops columns that do not exist in all rows, and then performs the selected operation type.
  • Pass all columns - By default, the processor evaluates all columns when performing the selected operation type, and passes all columns downstream.

    When used with the Intersect or Except operation, the processor can evaluate all columns or matching columns. In both cases, the processor passes all columns downstream. Matching columns are those with the same name that are populated with non-null values.

For information about how this property works with different operation types, see Output Rows.

Output Rows

The rows that the Union processor passes downstream depends on both the operation type and column handling method that you select.

For example, if you use the Union operation and pass all columns, the processor passes the maximum number of rows downstream, without duplicates. In contrast, if you use the Union operation and only pass the columns that exist in all data sets, the processor passes only the rows that have exact matching columns in all incoming streams.

To see how the properties work together, let's start with some sample data and look at the possible combinations:

Input 1

ID Name Inventory
1 toolbox 8
2 hammer 11
3 multitool 6

Input 2

ID Name Inventory Bin
1 toolbox 8 2-1
3 multitool 25 2-1
4 wrench 30 2-2

Require the Same Columns / All Operation Types

When you use the Require the same columns with all data sets column handling method with any operation type, the processor requires the incoming rows from all incoming streams to have the same schema. However, the column order can differ.

When the incoming data has the same schema, then the operation type behaves as documented. When the incoming data has different schemas, the processor generates a run error and stops the pipeline.

Since the sample data has different schemas, the processor generates an error like the following before stopping the pipeline:
The pipeline is configured to require that all data sets have the same schema, but the schema <input1 column names> and <other input column names> do not match.

Only Pass Columns in All Data Sets / Union Operation

The Union operation passes all unique rows from all incoming streams. When you use the Only pass columns that exist in all data sets column handling method with the Union operation type, the processor drops columns that do not exist in all rows, then passes all unique rows downstream.

The sample data results in the following output:

ID Name Inventory
1 toolbox 8
2 hammer 11
3 multitool 6
3 multitool 25
4 wrench 30
Notice how the Bin column has been dropped. Also, the toolbox row, which has an exact match in both incoming streams, appears only once because the processor does not pass duplicate rows.

Only Pass Columns in All Data Sets / Union All Operation

The Union All operation passes all rows from all incoming streams, including duplicates. When you use the Only pass columns that exist in all data sets column handling method with the Union All operation type, the processor drops columns that do not exist in all rows, then passes all rows downstream.

The sample data results in the following output:
ID Name Inventory
1 toolbox 8
1 toolbox 8
2 hammer 11
3 multitool 6
3 multitool 25
4 wrench 30
Notice how the Bin column has been dropped. Also, note how this output includes a duplicate toolbox row, since that row exists in both incoming streams.

Only Pass Columns in All Data Sets / Intersect Operation

The Intersect operation passes only rows that exist in all incoming streams. When you use the Only pass columns that exist in all data sets column handling option with the Intersect operation type, the processor ignores any columns that do not exist in all rows.

With the sample data, the processor ignores the Bin column, then passes only the rows that exist in all incoming streams.

The sample data results in the following output because only the toolbox row has an exact match in both incoming streams:
ID Name Inventory
1 toolbox 8

Only Pass Columns in All Data Sets / Except Operation

The Except operation passes only rows from Input 1 that do not have matching rows in the other incoming streams. When you use the Only pass columns that exist in all data sets column handling option with the Except operation type, the processor ignores any columns that do not exist in all rows.

With the sample data above, the processor ignores the Bin column, then passes only the rows in Input 1 that do not have matches in Input 2.

For the sample data, the processor passes only two of the rows from Input 1 because the toolbox row has a matching row in Input 2:
ID Name Inventory
2 hammer 11
3 multitool 6

Pass All Columns / Union Operation

The Union operation passes all unique rows from all incoming streams. When you use the Pass all columns column handling method with the Union operation type, the processor passes all unique rows while ensuring that the output rows include all columns from all incoming rows.

The sample data results in the following output:

ID Name Inventory Bin
1 toolbox 8
2 hammer 11
3 multitool 6
1 toolbox 8 2-1
3 multitool 25 2-1
4 wrench 30 2-2

Notice how the first three rows, which are from Input 1, now have a null Bin column.

Pass All Columns / Union All Operation

The Union All operation passes all rows from all incoming streams, including duplicates. When you use the Pass all columns column handling method with the Union All operation type, the processor passes all unique rows while ensuring that the output rows include all columns from all incoming rows.

The sample data results in the following output:
ID Name Inventory Bin
1 toolbox 8
2 hammer 11
3 multitool 6
1 toolbox 8 2-1
3 multitool 25 2-1
4 wrench 30 2-2

Notice how the first three rows, which are from Input 1, now have a null Bin column. With this configuration, no duplicate rows exist to be passed downstream.

Pass All Columns / Intersect Operation

The Intersect operation passes only rows that exist in all incoming streams. When you use the Pass all columns column handling option with the Intersect operation type, you can configure an additional property that determines if the processor compares all columns, or only the columns that exist in all rows:
Compare all columns
When comparing all columns, the processor passes only rows with matching columns that exist in all incoming streams. Since the rows in the sample data above have different schemas, when comparing all columns, there are no matching rows. No rows pass downstream.
Compare only matching columns
When comparing only matching columns, the processor ignores any columns that do not exist in all rows. With the sample data above, the processor ignores the Bin column, then passes only the rows that exist in all data sets.
The sample data results in the following output:
ID Name Inventory
1 toolbox 8

Pass All Columns / Except Operation

The Except operation passes only rows from Input 1 that do not have matching rows in the other incoming streams. When you use the Pass all columns column handling option with the Except operation type, you can configure an additional property that determines if the processor compares all columns, or only the columns that exist in all rows:
Compare all columns
When comparing all columns, the processor passes only rows from Input 1 that do not have rows with matching columns in the other incoming streams.
For the sample data, the processor passes all rows from Input 1 because Input 2 does not include any matching rows:
ID Name Inventory
1 toolbox 8
2 hammer 11
3 multitool 6
Compare only matching columns
When comparing only matching columns, the processor ignores columns that do not exist in all rows and passes only rows from Input 1 that do not have matches in other incoming streams.
For the sample data, the processor passes only two of the rows from Input 1 because the toolbox row in Input 1 has a matching row in Input 2:
ID Name Inventory
2 hammer 11
3 multitool 6

Input Order

Input order for the Union processor is important only when you use the Except operation type.

Upstream stages are assigned an input order based on the order that you connect them to the processor. To assign a stage to Input 1, connect the stage to the processor first, before other stages.

When working with two incoming streams, you can reorder the inputs by selecting the processor to display the pop-up menu, then clicking the Reorder icon: .

Swapping is not available when the processor has more than two incoming streams.

Configuring a Union Processor

Configure a Union processor to merge data from two or more incoming streams.

  1. On the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches processed data.
  2. On the Union tab, configure the following property:
    Union Property Description
    Operation Operation type to perform. Used with the Column Handling property to determine the rows that pass to downstream stages.
    Column Handling Determines how the processor evaluates for matching rows. Also defines the columns that are included in the output rows.
    Only Compare Matching Columns Compares only matching columns, ignoring non-matching columns.

    Available when using the Intersect or Except operation type with the Pass All Columns column handling option.

    For examples of how this property works with the Operation and Column Handling properties, see Output Rows.