Comparing IBM StreamSets Pipelines
In Control Hub, you can build Data Collector, Transformer, and Transformer for Snowflake pipelines. To choose which type of pipeline to build, you need to understand how they are similar and different.
At a high level, here's how the pipelines compare:
- Data Collector pipelines
- Data Collector pipelines are data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, CDC, or batch modes.
- Transformer pipelines
- Transformer pipelines are data processing pipelines that run on Apache Spark. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform set-based transformations such as joins, aggregates, and sorts on the entire data set.
- Transformer for Snowflake pipelines
- Transformer for Snowflake pipelines generate SQL queries based on your pipeline configuration and pass the queries to Snowflake for execution. Transformer for Snowflake pipelines read from and write to Snowflake tables using Snowpark DataFrame-based processing.
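The difference between record-based and set-based processing described above can be sketched in plain Python. This is only an illustration and does not use any Data Collector or Transformer API: the `batches`, `record_based_uppercase`, `sort_per_batch`, and `sort_whole_set` helpers and the sample records are all hypothetical. It shows why a sort applied batch by batch differs from a sort applied to the entire data set.

```python
from typing import Iterator

Record = dict  # hypothetical stand-in for a pipeline record


def batches(records: list[Record], size: int) -> Iterator[list[Record]]:
    """Yield the records in fixed-size batches."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def record_based_uppercase(batch: list[Record]) -> list[Record]:
    """Record-based transformation: each record is changed on its own,
    so the result does not depend on how records are batched."""
    return [{**r, "name": r["name"].upper()} for r in batch]


def sort_per_batch(records: list[Record], size: int) -> list[Record]:
    """Sort inside each batch only; ordering across batches is not guaranteed."""
    out: list[Record] = []
    for batch in batches(records, size):
        out.extend(sorted(batch, key=lambda r: r["id"]))
    return out


def sort_whole_set(records: list[Record]) -> list[Record]:
    """Set-based transformation: sort the entire data set at once."""
    return sorted(records, key=lambda r: r["id"])


data = [
    {"id": 3, "name": "c"},
    {"id": 1, "name": "a"},
    {"id": 4, "name": "d"},
    {"id": 2, "name": "b"},
]

per_batch_ids = [r["id"] for r in sort_per_batch(data, size=2)]
whole_set_ids = [r["id"] for r in sort_whole_set(data)]
print(per_batch_ids)  # [1, 3, 2, 4]
print(whole_set_ids)  # [1, 2, 3, 4]
```

In this sketch, the per-record transformation gives the same result under any batching, while the sort is only globally correct when it sees all of the data together.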
You configure all pipeline types on the same canvas. The difference lies in the available functionality and how the engine executes the pipeline.
Some functionality is exactly the same. For example, you can use runtime parameters in all pipelines, and you can use origin, processor, and destination stages to define pipeline processing.
All pipeline types include stages with similar names. Though these stages generally function the same way, they can also operate differently because each engine processes data differently.
All pipeline types also include some unique functionality. For example, a Data Collector pipeline allows only one origin, a Transformer pipeline can include multiple origins, and a Transformer for Snowflake pipeline can have multiple origins but only reads from and writes to Snowflake. A Data Collector pipeline typically reads data in multiple batches, a Transformer pipeline can read all available data in a single batch, and a Transformer for Snowflake pipeline passes queries to Snowflake for execution.
Category | Data Collector Pipeline | Transformer Pipeline |
---|---|---|
Execution engine | Runs on a StreamSets open source engine as a single JVM on bare metal, a VM, a container, or in the cloud. | Runs on a Spark cluster. Can run on a local Transformer machine for development. |
Control Hub job | Can run multiple pipeline instances on multiple Data Collectors for each job. You manually scale out the pipeline processing by increasing the number of pipeline instances for a job. | Runs a single pipeline instance on one Transformer for each job. Spark automatically scales out the pipeline processing across nodes in the cluster. |
Number of origins | Allows one origin. | Allows multiple origins. |
Schema | Allows records within a batch to have different schemas. | Requires all records in a batch to have the same schema. File-based origins require that all files processed in the same pipeline run have the same schema. As a batch passes through the pipeline, the schema for the data can change, but all data must have the same schema. As a result, if a processor alters the schema of a subset of records in the batch, then the remaining records are similarly altered to ensure they have the same schema. For example, if a processor generates a new field for a subset of records, that field is added to the remaining records with null values. This is expected Spark behavior. |
Streaming pipeline execution | Processes streaming data by default, not stopping until you stop the pipeline. | Provides streaming execution mode to process streaming data. Streaming pipelines do not stop until you stop the pipeline. |
Batch pipeline execution | Enabled by configuring the pipeline to pass a dataflow trigger to a Pipeline Finisher executor to stop a pipeline after processing all data. | Provides batch execution mode to process all available data in a single batch, then stop the pipeline. This is the default execution mode. |
Dataflow triggers and executor stages | Available. | Not available. |
Calculations across records within a batch | Not available. | Available in stages such as the Aggregate, Deduplicate, Rank, and Sort processors. |
Merging streams | Allows merging multiple streams by simply connecting multiple stages to a single stage. | Provides a Union processor to merge multiple data streams. |
Joining data | Provides lookup processors to enhance data in the primary data stream. | Provides a Join processor to join records from two data streams. |
Expression language | Supports using the IBM StreamSets expression language in expressions. | Supports using the Spark SQL query language for data processing. See the individual stage documentation for examples, such as the Filter processor or Spark SQL Expression processor. Also supports the IBM StreamSets expression language in properties that are evaluated only once, before pipeline processing begins. |
Merge consecutive batches | Consecutive batches cannot be merged in the pipeline. | Provides the Window processor to merge small streaming batches into larger batches for processing. |
Repartition data | Not available. | Provides the Repartition processor to repartition data. |
Stage library versions | Allows using different stage library versions in a pipeline to access the same external system. | Requires all stages that access the same external system in a pipeline to use the same stage library version. |
Preview display | Lists records based on the input order of the stage. | Lists records based on the input order of the stage. Processors can display records based on the input order or output order of the processor. |
Record format | Records include record header attributes and can include field attributes. | Records do not include record header attributes or field attributes. |
Internal record representation | SDC record data format. | Spark data frame. |
JSON file processing | Files can contain multiple JSON objects or a single JSON array. | Files must contain data in the JSON Lines format. JSON objects can be on a single line or span multiple lines. |
Fields and Field Paths | When referencing a field, you typically use a leading slash to denote the field path. | When referencing a field, you do not use a leading slash. |
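
The Schema row above notes that when a processor in a Transformer pipeline adds a field to only some records, the remaining records receive that field with null values. A minimal plain-Python sketch of that behavior follows. It does not call Spark or Transformer; the `add_field_to_subset` helper, the `discount` field, and the sample batch are assumptions made purely for illustration.

```python
def add_field_to_subset(batch, field, compute, applies_to):
    """Add `field` to the records that match `applies_to`.

    Records that do not match still receive the field, set to None,
    so every record in the batch keeps the same schema.
    """
    return [
        {**record, field: compute(record) if applies_to(record) else None}
        for record in batch
    ]


# Hypothetical batch in which every record starts with the same schema.
batch = [
    {"id": 1, "country": "US", "amount": 100},
    {"id": 2, "country": "DE", "amount": 200},
]

result = add_field_to_subset(
    batch,
    field="discount",
    compute=lambda r: r["amount"] // 10,
    applies_to=lambda r: r["country"] == "US",
)

for record in result:
    print(record)
# {'id': 1, 'country': 'US', 'amount': 100, 'discount': 10}
# {'id': 2, 'country': 'DE', 'amount': 200, 'discount': None}
```

A Data Collector pipeline, by contrast, allows records within a batch to have different schemas, so the second record could simply omit the new field.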