Join
The Join processor joins data from two tables, generating rows based on related input rows.
You can use the Join processor to join data from two different origins, or to join two branches of data in your pipeline. For example, say you created a second branch in your pipeline to perform aggregate calculations and now you want to merge that data back with the other branch of data. You can use the Join processor to do that.
When joining data from two different origins, you can add the Join processor immediately after the origin stages. Or, you can add other processors after the origins to perform additional transformations and then use the Join processor to join the data.
Each Join processor can join data from two tables. To join more than two tables in a single pipeline, use additional Join processors in the pipeline. Or to merge two or more tables, use the Union processor.
When you configure the Join processor, you specify the type of join and the names of columns with matching values used to perform the join.
When the joined tables have identical column names, the processor adds a
t1_
or t2_
prefix to the column name to avoid
duplicate names. The prefixes indicate whether the column is from the left or first
table or from the right or second table.
For example, if you configure the processor to join tables that both include a column
named order_id
, the processor includes a t1_order_id
column and a t2_order_id
column in the generated output. To remove or
rename one of the columns, use the Column Remover or Column Renamer processor after the
Join processor.
Join Columns
The Join processor joins data based on matching values in specific columns. In the Columns to Join On property, you specify the names of the columns used to perform the join. The specified columns can have identical or unique names. When the processor performs the join, it adds both join columns to the output.
For example, you configure the processor to join data by department name when the
dept
column in the left table matches the value of the
department
column in the right table. In the generated output, the
processor includes a dept
column and a department
column.
To remove one of the join columns, use the Column Remover processor after the Join processor.
Join Types
- Inner - Returns rows that have matching values in both tables.
- Cross - Returns the Cartesian product of two sets of data.
- Full outer - Returns all rows, including rows that have matching values in both tables and rows from either table that do not have a match.
- Left anti - Returns rows from the left table that do not have a match in the right table.
- Left outer - Returns rows from the left table, and the matched rows from the right table.
- Left semi - Returns rows that have matching values in both tables, but includes only the data from the left table.
- Right anti - Returns rows from the right table that do not have a match in the left table.
- Right outer - Returns rows from the right table, and the matched rows from the left table.
In the pipeline canvas, the first input stream that you connect to the Join processor represents the left table. The second input stream that you connect to the processor represents the right table.
For example, in the following image, the customers input stream represents the left table because it was connected to the processor first. The orders input stream represents the right table because it was connected to the processor second.
You can reorder the left and right inputs by selecting the processor to display the
pop-up menu, then clicking the Reorder icon: .
Join Type Examples
Let's look at an example for each join type. The example joins customer and order data
using matching values in the column named customer_id
.
Because the join column name is identical in both tables, the processor adds a
t1_
or t2_
prefix to the column name in the
generated output.
The customer data is the left table and contains the following rows:
customer_id | customer_name |
---|---|
2 | Anna Smith |
47 | Raquel Trujillo |
98 | Theo Barnes |
customer_id | order_id | amount |
---|---|---|
2 | 1075623 | 34.56 |
47 | 1076645 | 234.67 |
342 | 1050945 | 126.05 |
Inner Join
An inner join returns rows that have matching values in both tables.
When the Join processor performs an inner join on the sample data using
customer_id
as the join column, the processor produces the
following output:
t1_customer_id | customer_name | t2_customer_id | order_id | amount |
---|---|---|---|---|
2 | Anna Smith | 2 | 1075623 | 34.56 |
47 | Raquel Trujillo | 47 | 1076645 | 234.67 |
Cross Join
A cross join returns the Cartesian product of two sets of data. A Cartesian product is the set of all possible ordered pairs between the two tables.
When configuring the Join processor to perform a cross join, you do not specify the columns used to perform the join.
t1_customer_id | customer_name | t2_customer_id | order_id | amount |
---|---|---|---|---|
2 | Anna Smith | 2 | 1075623 | 34.56 |
2 | Anna Smith | 47 | 1076645 | 234.67 |
2 | Anna Smith | 342 | 1050945 | 126.05 |
47 | Raquel Trujillo | 2 | 1075623 | 34.56 |
47 | Raquel Trujillo | 47 | 1076645 | 234.67 |
47 | Raquel Trujillo | 342 | 1050945 | 126.05 |
98 | Theo Barnes | 2 | 1075623 | 34.56 |
98 | Theo Barnes | 47 | 1076645 | 234.67 |
98 | Theo Barnes | 342 | 1050945 | 126.05 |
Full Outer Join
A full outer join returns all rows, including rows that have matching values in both tables and rows from either table that do not have a match.
customer_id
as the join column, the processor produces the
following output: t1_customer_id | customer_name | t2_customer_id | order_id | amount |
---|---|---|---|---|
2 | Anna Smith | 2 | 1075623 | 34.56 |
47 | Raquel Trujillo | 47 | 1076645 | 234.67 |
342 | 1050945 | 126.05 | ||
98 | Theo Barnes |
Left Anti Join
A left anti join returns rows from the left table that do not have a match in the right table.
customer_id
as the join column, the processor produces the
following output:t1_customer_id | customer_name |
---|---|
98 | Theo Barnes |
Left Outer Join
A left outer join returns rows from the left table, and the matched rows from the right table.
customer_id
as the join column, the processor produces the
following output:t1_customer_id | customer_name | t2_customer_id | order_id | amount |
---|---|---|---|---|
2 | Anna Smith | 2 | 1075623 | 34.56 |
47 | Raquel Trujillo | 47 | 1076645 | 234.67 |
98 | Theo Barnes |
Left Semi Join
A left semi join returns rows that have matching values in both tables, but does not include the merged data from both tables. The results include only the data from the left table.
customer_id
as the join column, the processor produces the
following output:t1_customer_id | customer_name |
---|---|
2 | Anna Smith |
47 | Raquel Trujillo |
Right Anti Join
A right anti join returns rows from the right table that do not have a match in the left table.
customer_id
as the join column, the processor produces the
following output:t2_customer_id | order_id | amount |
---|---|---|
342 | 1050945 | 126.05 |
Right Outer Join
A right outer join returns rows from the right table, and the matched rows from the left table.
customer_id
as the join column, the processor produces the
following output:t1_customer_id | customer_name | t2_customer_id | order_id | amount |
---|---|---|---|---|
2 | Anna Smith | 2 | 1075623 | 34.56 |
47 | Raquel Trujillo | 47 | 1076645 | 234.67 |
342 | 1050945 | 126.05 |
Configuring a Join Processor
Configure a Join processor to join data from two tables.
-
On the General tab, configure the following
properties:
General Property Description Name Stage name. Description Optional description. Cache Data Caches processed data. -
On the Join tab, configure the following properties:
Join Property Description Join Type Type of join to perform. Columns to Join On Names of the columns with matching values used to perform the join. Specify the column name in the first table and the second table. You can enter column names or select columns from preview data. Columns can have identical or unique names.
Click Add Another to specify another set of join columns.
Not used for a cross join.