Join

The Join processor joins related data from two inputs, generating rows that include data from both inputs. If you simply want to merge two or more sets of data into a single data flow, use the Union processor.

You can use the Join processor to join data from two different origins by connecting the Join processor to both origins. You can also use the Join processor to join two branches of data in your pipeline.

For example, you can use the processor immediately after two origins to merge that data. Now say you create a second branch in your pipeline to perform aggregate calculations. You can then use the Join processor to merge the results of the calculations with the other branch of data.

The Join processor can join data from two inputs. To join more than two inputs in a single pipeline, add additional Join processors to the pipeline.

When you configure the Join processor, you specify the type of join to perform and the join condition to use.

When the joined inputs have identical column names, the processor adds a t1_ or t2_ prefix to the column name to avoid duplicate names: t1_ for the left input and t2_ for the right input. To remove or rename one of the columns, use a Column Remover or Column Renamer processor downstream.

For example, if you configure the processor to join inputs that both include a column named comments, the processor includes a t1_comments column and a t2_comments column in the generated output. If you want to keep both columns, you might use the Column Renamer processor to replace the t1_ and t2_ prefixes with the originating table names.

Join Types

The Join processor supports the following types of joins:
  • Inner - Returns rows that have matching values in both inputs.
  • Cross - Returns the Cartesian product of two sets of data.
  • Full outer - Returns all rows, including rows that have matching values in both inputs, and rows from either input that do not have a match.
  • Left anti - Returns rows from the left input that do not have a match in the right input.
  • Left outer - Returns rows from the left input, and matching rows from the right input.
  • Left semi - Returns rows that have matching values in both inputs, but includes only the data from the left input.
  • Right anti - Returns rows from the right input that do not have a match in the left input.
  • Right outer - Returns rows from the right input, and matching rows from the left input.

In the pipeline canvas, the first input stream that you connect to the Join processor represents the left input. The second input stream that you connect to the processor represents the right input.

For example, in the following image, the customers input stream represents the left input because it was connected to the processor first. The orders input stream represents the right input because it was connected to the processor second.

You can reorder the left and right inputs by selecting the processor to display the pop-up menu, then clicking the Reorder icon: .

Join Type Examples

Let's look at an example for each join type. The examples join customer and order data using matching values in the column named customer_id and the default Columns to Join On join condition.

Because the join column name is identical in both inputs, the processor adds a t1_ or t2_ prefix to the column name in the generated output. The processor uses t1_ for the left input and t2_ for the right input.

The customer data is the left input and contains the following rows:

customer_id customer_name
2 Anna Smith
47 Raquel Trujillo
98 Theo Barnes
The order data is the right input and contains the following rows:
customer_id order_id amount
2 1075623 34.56
47 1076645 234.67
342 1050945 126.05

Inner Join

An inner join returns rows that have matching values in both inputs.

When the Join processor performs an inner join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:

t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67

Cross Join

A cross join returns the Cartesian product of two sets of data. A Cartesian product is the set of all possible ordered pairs between the two inputs.

When configuring the Join processor to perform a cross join, you do not specify a join condition.

When the Join processor performs a cross join on the sample data, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
2 Anna Smith 47 1076645 234.67
2 Anna Smith 342 1050945 126.05
47 Raquel Trujillo 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
47 Raquel Trujillo 342 1050945 126.05
98 Theo Barnes 2 1075623 34.56
98 Theo Barnes 47 1076645 234.67
98 Theo Barnes 342 1050945 126.05

Full Outer Join

A full outer join returns all rows, including rows that have matching values in both inputs and rows from either input that do not have a match.

When the Join processor performs a full outer join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
342 1050945 126.05
98 Theo Barnes

Left Anti Join

A left anti join returns rows from the left input that do not have a match in the right input.

When the Join processor performs a left anti join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t1_customer_id customer_name
98 Theo Barnes

Left Outer Join

A left outer join returns rows from the left input, and matched rows from the right input.

When the Join processor performs a left outer join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
98 Theo Barnes

Left Semi Join

A left semi join returns rows that have matching values in both inputs, but does not include the merged data from both inputs. The results include only the data from the left input.

When the Join processor performs a left semi join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t1_customer_id customer_name
2 Anna Smith
47 Raquel Trujillo

Right Anti Join

A right anti join returns rows from the right input that do not have a match in the left input.

When the Join processor performs a right anti join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t2_customer_id order_id amount
342 1050945 126.05

Right Outer Join

A right outer join returns rows from the right input, and matched rows from the left input.

When the Join processor performs a right outer join on the sample data using Columns to Join On as the join condition and customer_id for both join columns, the processor produces the following output:
t1_customer_id customer_name t2_customer_id order_id amount
2 Anna Smith 2 1075623 34.56
47 Raquel Trujillo 47 1076645 234.67
342 1050945 126.05

Join Conditions

When you configure the Join processor, you specify the join condition using one of the following options:
Columns to Join On property
The Columns to Join On property enables specifying the join columns for each input. The specified columns can have identical or unique names. When the processor performs the join, it adds each set of join columns to the output, in addition to the other columns added for the join type.

For example, you specify the dept column as the join column for the Input 1 (Left) property and the department column for the Input 2 (Right) property. In the generated output, the processor includes both the dept and department columns.

This option is best used when join column names are not the same or when you want to include each set of join columns in the output. When needed, you can remove columns with a Column Remover processor.

This is the default join condition. To configure this option, define the Columns to Join On properties.

Join condition: Using
The Using join condition also enables specifying join columns. However, you specify a single column name for both inputs, so the column names must be the same. When the processor performs the join, it adds only one of the join columns to the output, in addition to the other columns added for the join type.
For example, if you specify the dept column for the join, both inputs must include a dept column that can be used to match rows. In the generated output, the processor includes only one dept column instead of including a duplicate dept column.
Use this option when the join columns have the same name, and when you want to include only one of each set of join columns in the output. If the column names do not match, use one of the other join condition options.
To configure a Using join condition, enable the Define Join Condition property and set the Join Command property to Using. Then, specify the join columns in the Join Using Columns property.
Join condition: On
The On join condition enables you to define a custom condition for the join. When defining the join condition, use prefixes when referencing the two sets of data. The processor uses t1_ and t2_ by default, t1_ for the left input and t2_ for the right input. If you specify custom prefixes, you must use them in the join condition.
For example, you can create an On join condition that performs a FLOOR operation on one join column before the match and uses the default prefixes, as follows: FLOOR(t1_sales_total) = (t2_TOTAL_SALES).
You might also add a condition that ensures that rows are joined only when the values in both join columns are positive, as follows: FLOOR(t1_sales_total) = (t2_TOTAL_SALES) AND t1_sales_total > 0 AND t2_TOTAL_SALES > 0.
If you define custom prefixes, left_ and right_, then you must use the prefixes in the condition, as follows: FLOOR(left_sales_total) = (right_TOTAL_SALES) AND left_sales_total > 0 AND right_TOTAL_SALES > 0.
Use the On join condition to specify a custom join condition.
To configure the On join condition, enable the Define Join Condition property and set the Join Command property to On. Then, specify the condition to use in the Join Condition property. You can optionally define custom prefixes for the inputs, replacing the default t1_ and t2_ prefixes.

Configuring a Join Processor

Configure a Join processor to join related data from two inputs and generate rows that include data from both inputs. If you simply want to merge two or more sets of data into a single data flow, use the Union processor.

  1. On the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches processed data.
  2. On the Join tab, configure the following properties:
    Join Property Description
    Join Type Type of join to perform.
    Define Join Condition Enables using the On or Using join condition.

    Leave clear to use the default Columns to Join On condition.

    Join Command Join condition to use:
    • On
    • Using

    Available when Define Join Condition is enabled.

    Join Condition Condition to use with the On join condition. Specify any valid Snowflake condition.

    Available with the On join command. Not available for Cross joins.

    Input 1 Prefix Prefix for the first, or left, input. Use the specified prefix when defining the left input in the On join condition.

    The processor adds the prefix to columns from the left input when identical column names exist in the right input.

    Default is t1_.

    Available with the On join command.

    Input 2 Prefix Prefix for the second, or right, input. Use the specified prefix when defining the right input in the On join condition.

    The processor adds the prefix to columns from the right input when identical column names exist in the left input.

    Default is t2_.

    Available with the On join command.

    Join Using Columns Columns to use with the Using join condition. You can enter column names or select columns from preview data.

    Available with the Using join command.

    Columns to Join On Names of the columns with values used to perform the join. Specify a column name for the Input 1 (Left) and the Input 2 (Right) properties. You can enter column names or select columns from preview data.

    Columns can have identical or unique names.

    Click Add Another to specify another set of join columns.

    Not available for Cross joins.