Join

The Join processor joins data from two input streams. When you use more than one origin in a pipeline, you must use the Join processor to join the data read by the origins. You can also use a Join processor to join lookup data to primary pipeline data.

You can add the Join processor immediately after the origin stages. Alternatively, you can add other processors after the origins to perform additional transformations, and then use the Join processor to join the data.

Each Join processor can join data from two input streams. To join more than two input streams in a single pipeline, use additional Join processors in the pipeline. However, be aware that each Join processor causes Spark to shuffle the data, redistributing it so that it is grouped differently across the partitions. Shuffling can be an expensive operation.
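
To make the cost of a shuffle more concrete, the following is a minimal conceptual sketch in plain Python. It is not Spark code and does not reflect how Spark implements a shuffle internally. The function name `shuffle_by_key`, the sample records, and the modulo-based partition assignment are all assumptions made for illustration. The sketch only shows the general idea: records that share a join key must be moved into the same partition before they can be matched, so records may have to move between partitions.

```python
from collections import defaultdict

def shuffle_by_key(partitions, key, num_partitions):
    """Regroup records so that equal key values land in the same partition.

    Hypothetical helper for illustration only; assumes integer key values.
    """
    shuffled = defaultdict(list)
    for partition in partitions:
        for record in partition:
            # Pick the target partition from the key value (assumed scheme).
            target = record[key] % num_partitions
            shuffled[target].append(record)
    return [shuffled[i] for i in range(num_partitions)]

# Before the shuffle, the two records with id 1 sit in different partitions.
before = [
    [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    [{"id": 1, "v": "c"}, {"id": 4, "v": "d"}],
]

after = shuffle_by_key(before, "id", 2)
```

In this sketch, three of the four records end up in a different partition than the one they started in. In a cluster, that kind of movement generally means transferring data between nodes, which is why each additional Join processor can add noticeable cost.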

When you configure the Join processor, you specify the type of join and the criteria used to perform the join. To avoid duplicate field names in the resulting data, you can also specify prefixes to add to field names from each input stream.
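
The effect of field name prefixes can be sketched in plain Python. This is an illustrative model only, not the processor's implementation or configuration syntax. The function names `add_prefix` and `inner_join`, the prefixes `left_` and `right_`, and the sample records are assumptions. The sketch performs an inner join on a single matching field and prefixes every field name so that fields with the same name in both streams do not collide.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

def add_prefix(record: Record, prefix: str) -> Record:
    """Return a copy of the record with the prefix added to each field name."""
    return {f"{prefix}{name}": value for name, value in record.items()}

def inner_join(left: List[Record], right: List[Record], key: str,
               left_prefix: str, right_prefix: str) -> List[Record]:
    """Inner join on one matching field (hypothetical helper for illustration)."""
    # Index the right stream by key value for fast lookup.
    index: Dict[Any, List[Record]] = {}
    for record in right:
        index.setdefault(record[key], []).append(record)

    joined = []
    for left_record in left:
        for right_record in index.get(left_record[key], []):
            merged = add_prefix(left_record, left_prefix)
            merged.update(add_prefix(right_record, right_prefix))
            joined.append(merged)
    return joined

# Both streams contain fields named "id" and "name".
left_stream = [{"id": 1, "name": "widget"}, {"id": 2, "name": "gadget"}]
right_stream = [{"id": 1, "name": "Ana"}, {"id": 3, "name": "Bo"}]

joined = inner_join(left_stream, right_stream, "id", "left_", "right_")
```

Without the prefixes, the `name` value from one stream would overwrite the other in the merged record. With them, both values survive as `left_name` and `right_name`. Because this is an inner join, the records with `id` 2 and `id` 3 have no match and are dropped.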

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.