Cluster Pipelines (deprecated)

A cluster pipeline is a pipeline that runs in cluster execution mode.

Important: This functionality is deprecated and may be removed in a future release. You can use StreamSets Transformer instead. For more information, see the Transformer documentation.

You can run a pipeline in standalone execution mode or cluster execution mode. In standalone mode, a single Data Collector process runs the pipeline. A pipeline runs in standalone mode by default.

In cluster mode, Data Collector uses a cluster manager and a cluster application to spawn additional workers as needed. Use cluster mode to read data from a Kafka cluster, MapR cluster, HDFS, or Amazon S3.
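
The choice between the two modes can be sketched as a small decision helper. This is only an illustration of the rule described above, not part of Data Collector or any StreamSets API: the function name, the `needs_extra_workers` flag, and the source labels are hypothetical.

```python
# Hypothetical illustration only; none of these names exist in Data Collector.

# Sources that the text above lists as readable in cluster mode.
CLUSTER_MODE_SOURCES = {"kafka", "mapr", "hdfs", "amazon s3"}


def suggest_execution_mode(source: str, needs_extra_workers: bool) -> str:
    """Suggest an execution mode for a pipeline.

    Standalone is the default. Cluster mode is suggested only when the
    source is one of the listed cluster sources and the workload is
    assumed to need additional workers.
    """
    normalized = source.strip().lower()
    if needs_extra_workers and normalized in CLUSTER_MODE_SOURCES:
        return "cluster"
    return "standalone"


if __name__ == "__main__":
    for source, heavy in [("HDFS", True), ("Kafka", False), ("local files", True)]:
        print(source, "->", suggest_execution_mode(source, heavy))
```

The helper falls back to standalone mode in every other case, which mirrors the default behavior described above.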

When would you choose standalone or cluster mode? Suppose you want to ingest logs from application servers and perform a computationally expensive transformation. You might use a set of standalone pipelines to stream log data from each application server to a Kafka or MapR cluster, and then use a cluster pipeline to read the data from that cluster and perform the expensive transformation.

Or, you might use cluster mode to move data from HDFS to another destination, such as Elasticsearch.