Union
The Union processor merges data from two or more incoming streams.
When you configure the Union processor, you connect the upstream stages to the processor and then specify the operation type and column handling option to use. In some cases, you also indicate whether to evaluate all columns or only matching columns.
- Operation property: Union All
- Column Handling property: Pass All Columns
Operation Types
The operation type determines how the Union processor merges data from multiple incoming streams. Row evaluation depends on the column handling method that you select.
- Union - Passes unique rows from all incoming streams.
- Union All - Passes all rows from all incoming streams, including duplicates.
- Intersect - Passes only rows that exist in all incoming streams.
- Except - Passes only the rows from Input 1 of the processor that do not have
matching rows from the other incoming streams.Note: With the Except operation, the input order of upstream stages is important. Stages are assigned an input order based on the order that you connect them to the processor.
For information about how this property works with different column handling methods, see Output Rows.
Column Handling Method
The column handling method determines how the Union processor evaluates for matching rows. It also defines the columns that are included in the output rows.
- Require the same columns in all data sets - The processor adjusts the order of
the columns in incoming data based on the column order of the first batch of
Input 1. This provides uniformity in column order which enables Snowflake to
perform the union, when possible.
Though not concerned with column order, the processor generates a run error when incoming data has different schemas.
- Only pass the columns that exist in all data sets - The processor drops columns that do not exist in all rows, and then performs the selected operation type.
- Pass all columns - By default, the processor evaluates all columns when performing
the selected operation type, and passes all columns downstream.
When used with the Intersect or Except operation, the processor can evaluate all columns or matching columns. In both cases, the processor passes all columns downstream. Matching columns are those with the same name that are populated with non-null values.
For information about how this property works with different operation types, see Output Rows.
Output Rows
The rows that the Union processor passes downstream depends on both the operation type and column handling method that you select.
For example, if you use the Union operation and pass all columns, the processor passes the maximum number of rows downstream, without duplicates. In contrast, if you use the Union operation and only pass the columns that exist in all data sets, the processor passes only the rows that have exact matching columns in all incoming streams.
To see how the properties work together, let's start with some sample data and look at the possible combinations:
Input 1
ID | Name | Inventory |
---|---|---|
1 | toolbox | 8 |
2 | hammer | 11 |
3 | multitool | 6 |
Input 2
ID | Name | Inventory | Bin |
---|---|---|---|
1 | toolbox | 8 | 2-1 |
3 | multitool | 25 | 2-1 |
4 | wrench | 30 | 2-2 |
Require the Same Columns / All Operation Types
When you use the Require the same columns with all data sets
column
handling method with any operation type, the processor requires the incoming rows
from all incoming streams to have the same schema. However, the column order can
differ.
When the incoming data has the same schema, then the operation type behaves as documented. When the incoming data has different schemas, the processor generates a run error and stops the pipeline.
The pipeline is configured to require that all data sets have the same schema, but the schema <input1 column names> and <other input column names> do not match.
Only Pass Columns in All Data Sets / Union Operation
The Union operation passes all unique rows from all incoming streams. When you use
the Only pass columns that exist in all data sets
column handling
method with the Union operation type, the processor drops columns that do not exist
in all rows, then passes all unique rows downstream.
The sample data results in the following output:
ID | Name | Inventory |
---|---|---|
1 | toolbox | 8 |
2 | hammer | 11 |
3 | multitool | 6 |
3 | multitool | 25 |
4 | wrench | 30 |
Bin
column has been dropped. Also, the
toolbox row, which has an exact match in both incoming streams, appears only once
because the processor does not pass duplicate rows.Only Pass Columns in All Data Sets / Union All Operation
The Union All operation passes all rows from all incoming streams, including
duplicates. When you use the Only pass columns that exist in all data
sets
column handling method with the Union All operation type, the
processor drops columns that do not exist in all rows, then passes all rows
downstream.
ID | Name | Inventory |
---|---|---|
1 | toolbox | 8 |
1 | toolbox | 8 |
2 | hammer | 11 |
3 | multitool | 6 |
3 | multitool | 25 |
4 | wrench | 30 |
Bin
column has been dropped. Also, note how
this output includes a duplicate toolbox row, since that row exists in both incoming
streams.Only Pass Columns in All Data Sets / Intersect Operation
The Intersect operation passes only rows that exist in all incoming streams. When
you use the Only pass columns that exist in all data sets
column
handling option with the Intersect operation type, the processor ignores any columns
that do not exist in all rows.
With the sample data, the processor ignores the Bin
column, then
passes only the rows that exist in all incoming streams.
ID | Name | Inventory |
---|---|---|
1 | toolbox | 8 |
Only Pass Columns in All Data Sets / Except Operation
The Except operation passes only rows from Input 1 that do not have matching
rows in the other incoming streams. When you use the Only pass columns
that exist in all data sets
column handling option with the Except
operation type, the processor ignores any columns that do not exist in all rows.
With the sample data above, the processor ignores the Bin
column,
then passes only the rows in Input 1 that do not have matches in Input 2.
ID | Name | Inventory |
---|---|---|
2 | hammer | 11 |
3 | multitool | 6 |
Pass All Columns / Union Operation
The Union operation passes all unique rows from all incoming streams. When you use
the Pass all columns
column handling method with the Union operation
type, the processor passes all unique rows while ensuring that the output rows
include all columns from all incoming rows.
The sample data results in the following output:
ID | Name | Inventory | Bin |
---|---|---|---|
1 | toolbox | 8 | |
2 | hammer | 11 | |
3 | multitool | 6 | |
1 | toolbox | 8 | 2-1 |
3 | multitool | 25 | 2-1 |
4 | wrench | 30 | 2-2 |
Notice how the first three rows, which are from Input 1, now have a null
Bin
column.
Pass All Columns / Union All Operation
The Union All operation passes all rows from all incoming streams, including
duplicates. When you use the Pass all columns
column handling
method with the Union All operation type, the processor passes all unique rows while
ensuring that the output rows include all columns from all incoming rows.
ID | Name | Inventory | Bin |
---|---|---|---|
1 | toolbox | 8 | |
2 | hammer | 11 | |
3 | multitool | 6 | |
1 | toolbox | 8 | 2-1 |
3 | multitool | 25 | 2-1 |
4 | wrench | 30 | 2-2 |
Notice how the first three rows, which are from Input 1, now have a null
Bin
column. With this configuration, no duplicate rows exist to
be passed downstream.
Pass All Columns / Intersect Operation
Pass all columns
column handling option with the
Intersect operation type, you can configure an additional property that determines
if the processor compares all columns, or only the columns that exist in all rows: - Compare all columns
- When comparing all columns, the processor passes only rows with matching columns that exist in all incoming streams. Since the rows in the sample data above have different schemas, when comparing all columns, there are no matching rows. No rows pass downstream.
- Compare only matching columns
- When comparing only matching columns, the processor ignores any columns
that do not exist in all rows. With the sample data above, the
processor ignores the
Bin
column, then passes only the rows that exist in all data sets.
Pass All Columns / Except Operation
Pass all
columns
column handling option with the Except operation type, you can
configure an additional property that determines if the processor compares all
columns, or only the columns that exist in all rows: - Compare all columns
- When comparing all columns, the processor passes only rows from Input 1 that do not have rows with matching columns in the other incoming streams.
- Compare only matching columns
- When comparing only matching columns, the processor ignores columns that do not exist in all rows and passes only rows from Input 1 that do not have matches in other incoming streams.
Input Order
Input order for the Union processor is important only when you use the Except operation type.
Upstream stages are assigned an input order based on the order that you connect them to the processor. To assign a stage to Input 1, connect the stage to the processor first, before other stages.
When working with two incoming streams, you can reorder the inputs by selecting the processor to display the pop-up menu, then clicking the Reorder icon: .
Swapping is not available when the processor has more than two incoming streams.
Configuring a Union Processor
Configure a Union processor to merge data from two or more incoming streams.
-
On the General tab, configure the following
properties:
General Property Description Name Stage name. Description Optional description. Cache Data Caches processed data. -
On the Union tab, configure the following property:
Union Property Description Operation Operation type to perform. Used with the Column Handling property to determine the rows that pass to downstream stages. Column Handling Determines how the processor evaluates for matching rows. Also defines the columns that are included in the output rows. Only Compare Matching Columns Compares only matching columns, ignoring non-matching columns. Available when using the Intersect or Except operation type with the Pass All Columns column handling option.
For examples of how this property works with the Operation and Column Handling properties, see Output Rows.