Join

The Join processor joins data from two tables, generating rows based on related input rows.

You can use the Join processor to join data from two different origins, or to join two branches of data in your pipeline. For example, say you created a second branch in your pipeline to perform aggregate calculations and now you want to merge that data back with the other branch of data. You can use the Join processor to do that.

When joining data from two different origins, you can add the Join processor immediately after the origin stages. Or, you can add other processors after the origins to perform additional transformations and then use the Join processor to join the data.

Each Join processor can join data from two tables. To join more than two tables in a single pipeline, use additional Join processors in the pipeline. Or to merge two or more tables, use the Union processor.

When you configure the Join processor, you specify the type of join and the names of columns with matching values used to perform the join.

When the joined tables have identical column names, the processor adds a t1_ or t2_ prefix to the column name to avoid duplicate names. The prefixes indicate whether the column is from the left or first table or from the right or second table.

For example, if you configure the processor to join tables that both include a column named order_id, the processor includes a t1_order_id column and a t2_order_id column in the generated output. To remove or rename one of the columns, use the Column Remover or Column Renamer processor after the Join processor.

Join Columns

The Join processor joins data based on matching values in specific columns. In the Columns to Join On property, you specify the names of the columns used to perform the join. The specified columns can have identical or unique names. When the processor performs the join, it adds both join columns to the output.

For example, you configure the processor to join data by department name when the dept column in the left table matches the value of the department column in the right table. In the generated output, the processor includes a dept column and a department column.

To remove one of the join columns, use the Column Remover processor after the Join processor.

Join Types

The Join processor supports the following types of joins:
  • Inner - Returns rows that have matching values in both tables.
  • Cross - Returns the Cartesian product of two sets of data.
  • Full outer - Returns all rows, including rows that have matching values in both tables and rows from either table that do not have a match.
  • Left anti - Returns rows from the left table that do not have a match in the right table.
  • Left outer - Returns rows from the left table, and the matched rows from the right table.
  • Left semi - Returns rows that have matching values in both tables, but includes only the data from the left table.
  • Right anti - Returns rows from the right table that do not have a match in the left table.
  • Right outer - Returns rows from the right table, and the matched rows from the left table.

In the pipeline canvas, the first input stream that you connect to the Join processor represents the left table. The second input stream that you connect to the processor represents the right table.

For example, in the following image, the customers input stream represents the left table because it was connected to the processor first. The orders input stream represents the right table because it was connected to the processor second.

You can reorder the left and right inputs by selecting the processor to display the pop-up menu, then clicking the Reorder icon: .

Join Type Examples

Let's look at an example for each join type. The example joins customer and order data using matching values in the column named customer_id.

Because the join column name is identical in both tables, the processor adds a t1_ or t2_ prefix to the column name in the generated output.

The customer data is the left table and contains the following rows:

customer_id customer_name
2 Anna Smith
47 Raquel Trujillo
98 Theo Barnes
The order data is the right table and contains the following rows:
customer_id order_id amount
2 1075623 34.56
47 1076645 234.67
342 1050945 126.05

Inner Join

An inner join returns rows that have matching values in both tables.

When the Join processor performs an inner join on the sample data using customer_id as the join column, the processor produces the following output:

t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67

Cross Join

A cross join returns the Cartesian product of two sets of data. A Cartesian product is the set of all possible ordered pairs between the two tables.

When configuring the Join processor to perform a cross join, you do not specify the columns used to perform the join.

When the Join processor performs a cross join on the sample data, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
2 Anna Smith 47 1076645 234.67
2 Anna Smith 342 1050945 126.05
47 Raquel Trujillo 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
47 Raquel Trujillo 342 1050945 126.05
98 Theo Barnes 2 1075623 34.56
98 Theo Barnes 47 1076645 234.67
98 Theo Barnes 342 1050945 126.05

Full Outer Join

A full outer join returns all rows, including rows that have matching values in both tables and rows from either table that do not have a match.

When the Join processor performs a full outer join on the sample data using customer_id as the join column, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
342 1050945 126.05
98 Theo Barnes

Left Anti Join

A left anti join returns rows from the left table that do not have a match in the right table.

When the Join processor performs a left anti join on the sample data using customer_id as the join column, the processor produces the following output:
t1_customer_id customer_name
98 Theo Barnes

Left Outer Join

A left outer join returns rows from the left table, and the matched rows from the right table.

When the Join processor performs a left outer join on the sample data using customer_id as the join column, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
98 Theo Barnes

Left Semi Join

A left semi join returns rows that have matching values in both tables, but does not include the merged data from both tables. The results include only the data from the left table.

When the Join processor performs a left semi join on the sample data using customer_id as the join column, the processor produces the following output:
t1_customer_id customer_name
2 Anna Smith
47 Raquel Trujillo

Right Anti Join

A right anti join returns rows from the right table that do not have a match in the left table.

When the Join processor performs a right anti join on the sample data using customer_id as the join column, the processor produces the following output:
t2_customer_id order_id amount
342 1050945 126.05

Right Outer Join

A right outer join returns rows from the right table, and the matched rows from the left table.

When the Join processor performs a right outer join on the sample data using customer_id as the join column, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
342 1050945 126.05

Configuring a Join Processor

Configure a Join processor to join data from two tables.

  1. On the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches processed data.
  2. On the Join tab, configure the following properties:
    Join Property Description
    Join Type Type of join to perform.
    Columns to Join On Names of the columns with matching values used to perform the join. Specify the column name in the first table and the second table. You can enter column names or select columns from preview data.

    Columns can have identical or unique names.

    Click Add Another to specify another set of join columns.

    Not used for a cross join.