Cluster Batch and Streaming Execution Modes

Data Collector can run a cluster pipeline using cluster batch or cluster streaming execution mode.

The execution mode that Data Collector can use depends on the origin system that the cluster pipeline reads from:

Kafka cluster
Data Collector can process data from a Kafka cluster in cluster streaming mode. In cluster streaming mode, Data Collector processes data continuously until you stop the pipeline.
Data Collector runs as an application within Spark Streaming, the stream-processing extension of Apache Spark, an open-source cluster-computing framework.
Spark Streaming runs on either the Mesos or YARN cluster manager to process data from a Kafka cluster. The cluster manager and Spark Streaming spawn a Data Collector worker for each topic partition in the Kafka cluster. As a result, each partition has a Data Collector worker to process data. If you add a partition to the Kafka topic, you must restart the pipeline so that Data Collector can spawn a new worker to read from the new partition.
When Spark Streaming runs on YARN, you can limit the number of workers spawned by configuring the Worker Count cluster pipeline property. You can also use the Extra Spark Configuration property to pass Spark configurations to the spark-submit script. In addition, you can configure the Kafka Consumer origin in a cluster streaming pipeline on YARN to connect securely through SSL/TLS, Kerberos, or both.

Use the Kafka Consumer origin to process data from a Kafka cluster in cluster streaming mode.
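Because each topic partition maps to one Data Collector worker, it can help to check the partition count before starting or restarting a cluster streaming pipeline. Below is a minimal sketch using the kafka-python client; the broker address and topic name are placeholders, and the script only reports the partition count and the worker count it implies:

```python
from kafka import KafkaConsumer

# Hypothetical broker address and topic name, used only for illustration.
BOOTSTRAP_SERVERS = "kafka-broker:9092"
TOPIC = "weblogs"

# Ask the cluster how many partitions the topic currently has.
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP_SERVERS)
partitions = consumer.partitions_for_topic(TOPIC) or set()
consumer.close()

# In cluster streaming mode, each partition gets its own Data Collector
# worker, so the partition count is the expected number of workers
# (unless the Worker Count pipeline property caps it on YARN).
print(f"Topic {TOPIC!r} has {len(partitions)} partitions")
print(f"Expected Data Collector workers: {len(partitions)}")
```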

MapR cluster
Data Collector can process data from a MapR cluster in cluster batch mode.
In cluster batch mode, Data Collector processes all available data and then stops the pipeline. Data Collector runs as an application on top of MapReduce, an open-source cluster-computing framework. MapReduce runs on a YARN cluster manager. YARN and MapReduce spawn additional Data Collector workers as needed. MapReduce creates one map task for each MapR FS block.

Use the MapR FS origin to process data from MapR in cluster batch mode.
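Because map tasks follow the block layout, you can estimate the parallelism of a batch run from file and block sizes. A rough sketch follows, assuming a hypothetical 256 MB block size and made-up file paths and sizes:

```python
import math

# Hypothetical MapR FS block size and file sizes (bytes); real values
# come from your cluster, 256 MB is only an assumed default here.
BLOCK_SIZE = 256 * 1024 * 1024
file_sizes = {
    "/mapr/cluster1/data/events-01.json": 3_200_000_000,
    "/mapr/cluster1/data/events-02.json": 950_000_000,
}

# One map task is created per block, so a file that spans N blocks
# contributes N map tasks to the batch run.
total_tasks = 0
for path, size in file_sizes.items():
    tasks = math.ceil(size / BLOCK_SIZE)
    total_tasks += tasks
    print(f"{path}: {tasks} map task(s)")

print(f"Total map tasks for the batch run: {total_tasks}")
```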

HDFS
Data Collector can process data from HDFS in cluster batch mode. In cluster batch mode, Data Collector processes all available data and then stops the pipeline.
Data Collector runs as an application on top of MapReduce, an open-source cluster-computing framework. MapReduce runs on a YARN cluster manager. YARN and MapReduce spawn additional Data Collector workers as needed. MapReduce creates one map task for each HDFS block.

Use the Hadoop FS origin to process data from HDFS in cluster batch mode.
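The one-map-task-per-block behavior comes from the standard Hadoop input-split rule: the split size defaults to the block size unless the split minimum or maximum overrides it. A small sketch of that rule follows; the property names in the comments are standard Hadoop settings, not Data Collector properties:

```python
# Standard Hadoop FileInputFormat rule for sizing input splits, shown only
# to explain why the default works out to one map task per HDFS block.
def split_size(block_size: int, min_split: int = 1, max_split: int = 2**63 - 1) -> int:
    # min_split and max_split correspond to the Hadoop properties
    # mapreduce.input.fileinputformat.split.minsize and .maxsize.
    return max(min_split, min(max_split, block_size))

# With the defaults, the split size equals the block size, so a 1 GB file
# stored with a 128 MB block size yields 8 map tasks.
BLOCK = 128 * 1024 * 1024
assert split_size(BLOCK) == BLOCK
print((1 * 1024**3) // split_size(BLOCK))  # -> 8
```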

Amazon S3
Data Collector can process data from Amazon S3 in the following cluster batch modes:
  • Cluster EMR batch mode - In cluster EMR batch mode, Data Collector runs on an Amazon EMR cluster to process Amazon S3 data. Data Collector can run on an existing EMR cluster or on a new EMR cluster that is provisioned when the pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops, as sketched at the end of this section.
  • Cluster batch mode - In cluster batch mode, Data Collector runs on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process Amazon S3 data.

In either mode, Data Collector processes all available data and then stops the pipeline.

Data Collector runs as an application on top of MapReduce in the EMR, CDH, or HDP cluster. MapReduce runs on a YARN cluster manager. MapReduce creates one map task for each HDFS block.

Use the Hadoop FS origin to process data from Amazon S3 in cluster EMR or cluster batch mode.
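For reference, the choice between keeping a provisioned cluster active and terminating it when the pipeline stops mirrors how EMR itself distinguishes long-running from transient clusters. Below is a hedged sketch using the AWS boto3 SDK; the region, cluster name, release label, and instance types are placeholders, and it illustrates the EMR-side concept only, not the call Data Collector makes when it provisions a cluster for a pipeline:

```python
import boto3

# Hypothetical region and cluster settings, for illustration only.
emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="cluster-batch-example",
    ReleaseLabel="emr-5.20.0",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        # False makes the cluster transient: it terminates when there is no
        # more work, mirroring the "terminate when the pipeline stops" option.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Provisioned cluster:", response["JobFlowId"])
```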