Union
The Union processor merges data from two or more input streams. All input data must have the same schema. The processor can perform one of the following operations:
- Union - Passes all records from all incoming streams.
- Intersect - Passes only the records that exist in all incoming streams.
- Except - Passes only the records from Input 1 of the processor that do not have matching records from the other input streams.
When you configure the Union processor, you connect the upstream stages to the processor and then specify the operation to use.
When you use the Except operation, the input order of upstream stages is important. Stages are assigned to input streams based on the order that you connect them to the processor. To assign a stage to Input 1, connect it to the processor before any other stages. When working with two input streams, you can swap the inputs by clicking the processor in the canvas, and then clicking . Swapping is not available when the processor has more than two input streams.
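To see why input order matters for Except, consider the following plain-Python sketch. It is only an illustration: the records, the stream names, and the `except_first` helper are hypothetical and not part of the processor, and it assumes records match when all of their fields are equal.

```python
# Hypothetical records modeled as (username, color) tuples.
stream_a = [("tmorrison", "blue"), ("cbronte", "purple")]
stream_b = [("tmorrison", "blue"), ("maya", "violet")]

def except_first(input_1, *other_inputs):
    """Keep records from input_1 that appear in none of the other inputs."""
    seen_elsewhere = set()
    for stream in other_inputs:
        seen_elsewhere.update(stream)
    return [record for record in input_1 if record not in seen_elsewhere]

# stream_a connected first, so it acts as Input 1.
a_first = except_first(stream_a, stream_b)
# Inputs swapped: stream_b now acts as Input 1.
b_first = except_first(stream_b, stream_a)

print(a_first)  # [('cbronte', 'purple')]
print(b_first)  # [('maya', 'violet')]
```

The same two streams produce different results depending on which one is treated as Input 1, which is why the connection order (or a swap, for two inputs) determines the output.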
Operation Examples
To illustrate how the different operations work, say you want to merge the following simple records from three input streams. Note that the record schemas are the same for all of the data, as required.
Input 1:
username | color |
---|---|
tmorrison | blue |
h-lee | red |
cbronte | purple |
ntozake.s | green |
Input 2:
username | color |
---|---|
tmorrison | blue |
h-lee | green |
a_walker | red |
Input 3:
username | color |
---|---|
tmorrison | blue |
h-lee | red |
a_walker | red |
maya | violet |
Union
The Union operation passes all data from all inputs, without regard for duplicates.
username | color |
---|---|
tmorrison | blue |
h-lee | red |
cbronte | purple |
ntozake.s | green |
tmorrison | blue |
h-lee | green |
a_walker | red |
tmorrison | blue |
h-lee | red |
a_walker | red |
maya | violet |
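As a rough illustration, the Union result above is equivalent to concatenating the three inputs. The sketch below models each record as a `(username, color)` tuple in plain Python; the variable names and the `union` helper are assumptions made for the example, not the processor's implementation.

```python
# The three example inputs, modeled as (username, color) tuples.
input_1 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("a_walker", "red"), ("maya", "violet")]

def union(*streams):
    """Concatenate all streams in order, keeping duplicate records."""
    return [record for stream in streams for record in stream]

union_result = union(input_1, input_2, input_3)
print(len(union_result))                          # 11
print(union_result.count(("tmorrison", "blue")))  # 3
```

All 11 input records pass through, and the tmorrison record appears three times because duplicates are not removed.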
Intersect
The Intersect operation passes only the records that exist in all incoming streams.
username | color |
---|---|
tmorrison | blue |
This is the only record that exists in all input streams.
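The Intersect result can be reproduced with a small plain-Python sketch. It assumes records match only when every field is equal, and it does not address how the processor treats duplicates within a single input. The `intersect` helper and variable names are hypothetical.

```python
# The three example inputs, modeled as (username, color) tuples.
input_1 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("a_walker", "red"), ("maya", "violet")]

def intersect(first, *others):
    """Keep records from the first stream that appear in every other stream."""
    other_sets = [set(stream) for stream in others]
    return [record for record in first
            if all(record in records for records in other_sets)]

intersect_result = intersect(input_1, input_2, input_3)
print(intersect_result)  # [('tmorrison', 'blue')]
```

The h-lee record is dropped because its color differs in Input 2, so it is not the same record in every stream.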
Except
The Except operation passes only the records in Input 1 that do not have matching records in the other incoming streams. Using this operation results in the following output:
username | color |
---|---|
cbronte | purple |
ntozake.s | green |
The tmorrison and h-lee records are not included because the tmorrison record is in both Input 2 and Input 3, and the h-lee record is in Input 3.
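The Except result for the three example inputs can be sketched in plain Python as follows. This is an illustration under the assumption that records match when all fields are equal; the `except_records` helper and variable names are hypothetical.

```python
# The three example inputs, modeled as (username, color) tuples.
input_1 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"),
           ("a_walker", "red"), ("maya", "violet")]

def except_records(first, *others):
    """Keep records from the first stream that match no record elsewhere."""
    seen_elsewhere = set().union(*others)
    return [record for record in first if record not in seen_elsewhere]

except_result = except_records(input_1, input_2, input_3)
print(except_result)  # [('cbronte', 'purple'), ('ntozake.s', 'green')]
```

Note that the h-lee record from Input 1 is excluded only because of Input 3. Input 2 contains an h-lee record with a different color, which does not count as a match.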
Configuring a Union Processor
Configure a Union processor to merge data from two or more input streams. All data must share the same schema.
- On the General tab, configure the following properties:
General Property | Description |
---|---|
Name | Stage name. |
Description | Optional description. |
Cache Data | Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode. |
- On the Union tab, configure the following property:
Union Property | Description |
---|---|
Operation | Operation to perform when merging data: Union - Passes all records from all incoming streams. Intersect - Passes only the records that exist in all incoming streams. Except - Passes only the records from Input 1 of the processor that do not have matching records from the other input streams. |