Union

The Union processor merges data from two or more input streams. All data must have the same schema.
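As a rough illustration of the same-schema requirement, the following Python sketch compares schemas before merging. The schema representation (an ordered list of field name and type name pairs) and the `schemas_match` helper are hypothetical and are not part of Transformer. Treating field order as significant is also an assumption made for this sketch.

```python
# Hypothetical schema representation: an ordered list of (field name, type name) pairs.
def schemas_match(*schemas):
    """Return True when every schema equals the first one, field for field."""
    first, *rest = schemas
    return all(schema == first for schema in rest)


stream_a = [("username", "string"), ("color", "string")]
stream_b = [("username", "string"), ("color", "string")]
stream_c = [("username", "string"), ("shade", "string")]  # second field name differs

print(schemas_match(stream_a, stream_b))            # True
print(schemas_match(stream_a, stream_b, stream_c))  # False
```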

The Union processor can generate different output based on the operation that you select:
  • Union - Passes all records from all incoming streams.
  • Intersect - Passes only the records that exist in all incoming streams.
  • Except - Passes only the records from Input 1 of the processor that do not have matching records from the other input streams.

When you configure the Union processor, you connect the upstream stages to the processor and then specify the operation to use.

When you use the Except operation, the input order of upstream stages is important. Stages are assigned to input streams based on the order that you connect them to the processor. To assign a stage to Input 1, connect it to the processor before any other stages. When working with two input streams, you can swap the inputs by clicking the processor in the canvas, and then clicking the swap icon. Swapping is not available when the processor has more than two input streams.
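To see why input order matters for Except, consider this minimal Python sketch. It models records as tuples and uses a hypothetical `except_from_first` helper. It illustrates the semantics described here and is not the processor's implementation. The two small streams are made up for the example.

```python
def except_from_first(first, *others):
    """Keep the records of the first stream that appear in none of the other streams."""
    excluded = set()
    for stream in others:
        excluded.update(stream)
    return [record for record in first if record not in excluded]


# Illustrative data: each record is a (username, color) tuple.
stream_a = [("tmorrison", "blue"), ("cbronte", "purple")]
stream_b = [("tmorrison", "blue"), ("a_walker", "red")]

# stream_a plays the role of Input 1.
print(except_from_first(stream_a, stream_b))  # [('cbronte', 'purple')]

# After swapping, stream_b plays the role of Input 1 and the result changes.
print(except_from_first(stream_b, stream_a))  # [('a_walker', 'red')]
```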

Note: Due to metadata added to Transformer records within the pipeline, the Union processor does not display output when you preview pipelines. When including the processor in pipeline development, you must run the pipeline to review how the Union processor and subsequent stages process the data.

Operation Examples

To illustrate how the different operations work, say you want to merge the following simple records from three input streams. Note that the record schemas are the same for all of the data, as required.

Input 1:

username color
tmorrison blue
h-lee red
cbronte purple
ntozake.s green

Input 2:

username color
tmorrison blue
h-lee green
a_walker red

Input 3:

username color
tmorrison blue
h-lee red
a_walker red
maya violet

Union

The Union operation passes all data from all inputs, without regard for duplicates.

Using this operation results in the following output:

username color
tmorrison blue
h-lee red
cbronte purple
ntozake.s green
tmorrison blue
h-lee green
a_walker red
tmorrison blue
h-lee red
a_walker red
maya violet
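The Union output above can be reproduced with a short Python sketch. Records are modeled as (username, color) tuples, and `union_all` is a hypothetical helper, not a Transformer API. The sketch lists records in input order to match the example. The text does not state whether the processor guarantees any output order.

```python
# The three example inputs, modeled as lists of (username, color) tuples.
input_1 = [("tmorrison", "blue"), ("h-lee", "red"), ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"), ("a_walker", "red"), ("maya", "violet")]


def union_all(*streams):
    """Concatenate every stream in the order given, keeping duplicates."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged


result = union_all(input_1, input_2, input_3)
print(len(result))                           # 11
print(result.count(("tmorrison", "blue")))   # 3
```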

Intersect

The Intersect operation passes only the records that exist in all incoming streams.

Using this operation results in the following output:

username color
tmorrison blue

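This is the only record that exists in all input streams. The h-lee record does not qualify because its color in Input 2 (green) differs from its color in Input 1 and Input 3 (red).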
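The Intersect result can be checked with a minimal Python sketch. The `intersect` helper is hypothetical, and records are modeled as (username, color) tuples, so a record matches only when every field is equal. The sketch also removes duplicates within a stream, which is an assumption. The text does not say how the processor treats them.

```python
input_1 = [("tmorrison", "blue"), ("h-lee", "red"), ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"), ("a_walker", "red"), ("maya", "violet")]


def intersect(*streams):
    """Return the records present in every stream, in first-stream order."""
    first, *rest = streams
    common = set(first)
    for stream in rest:
        common &= set(stream)
    seen = set()
    result = []
    for record in first:
        # Assumption: each distinct record is emitted once.
        if record in common and record not in seen:
            seen.add(record)
            result.append(record)
    return result


print(intersect(input_1, input_2, input_3))  # [('tmorrison', 'blue')]
```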

Except

The Except operation passes only the records in Input 1 that do not have matching records in the other incoming streams.

Using this operation results in the following output:

username color
cbronte purple
ntozake.s green

The tmorrison and h-lee records are not included because the tmorrison record is in both Input 2 and Input 3, and the h-lee record is in Input 3. The h-lee record in Input 2 does not match because its color is green rather than red.
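The reasoning above can be traced with a small Python sketch that reports which other inputs contain each Input 1 record. The `matching_inputs` helper and the tuple record model are hypothetical and only illustrate the described behavior.

```python
input_1 = [("tmorrison", "blue"), ("h-lee", "red"), ("cbronte", "purple"), ("ntozake.s", "green")]
input_2 = [("tmorrison", "blue"), ("h-lee", "green"), ("a_walker", "red")]
input_3 = [("tmorrison", "blue"), ("h-lee", "red"), ("a_walker", "red"), ("maya", "violet")]

# The other inputs, keyed by input number.
others = {2: input_2, 3: input_3}


def matching_inputs(record, other_streams):
    """List the input numbers whose stream contains an identical record."""
    return [number for number, stream in other_streams.items() if record in stream]


# For each Input 1 username, the inputs that hold a matching record.
matches = {record[0]: matching_inputs(record, others) for record in input_1}
print(matches)
# {'tmorrison': [2, 3], 'h-lee': [3], 'cbronte': [], 'ntozake.s': []}

# Except keeps only the Input 1 records with no match anywhere else.
result = [record for record in input_1 if not matching_inputs(record, others)]
print(result)  # [('cbronte', 'purple'), ('ntozake.s', 'green')]
```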

Configuring a Union Processor

Configure a Union processor to merge data from two or more input streams. All data must share the same schema.

  1. On the General tab, configure the following properties:
    General Property    Description
    Name                Stage name.
    Description         Optional description.
    Cache Data          Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

                        Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Union tab, configure the following property:
    Union Property      Description
    Operation           Operation to perform when merging data:
                        • Union - Passes all records from all incoming streams.
                        • Intersect - Passes only the records that exist in all incoming streams.
                        • Except - Passes only the records from Input 1 of the processor that do not have matching records from the other input streams.