Data Collector can run a cluster pipeline using cluster batch or cluster streaming execution mode.
The execution mode that Data Collector can use depends on the origin system that the cluster pipeline reads from:
- Kafka cluster
- Data Collector can process data from a Kafka cluster in cluster streaming mode. In cluster
streaming mode, Data Collector processes data continuously until you stop the pipeline.
- Data Collector runs as an application within Spark Streaming, an open-source cluster-computing application.
- Spark Streaming runs on either the Mesos or YARN cluster manager to process data from a Kafka cluster. The cluster manager and Spark Streaming spawn a Data Collector worker for each topic partition in the Kafka cluster, so each partition has its own Data Collector worker to process data. If you add a partition to the Kafka topic, you must restart the pipeline so that Data Collector can generate a new worker to read from the new partition.
- When Spark Streaming runs on YARN, you can limit the number of workers spawned by configuring the Worker Count cluster pipeline property. You can also use the Extra Spark Configuration property to pass Spark configurations to the spark-submit script. In addition, you can configure the Kafka Consumer origin in a cluster streaming pipeline on YARN to connect securely through SSL/TLS, Kerberos, or both.
Use the Kafka Consumer origin to process data from a Kafka cluster in cluster streaming mode.
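The worker behavior described for cluster streaming mode can be sketched as a small calculation. The following Python is an illustration only, not Data Collector code: the function name `planned_workers` and its parameters are hypothetical, and it assumes that a Worker Count of 0 means "no limit", so that one worker is spawned per partition.

```python
def planned_workers(partitions: int, worker_count: int = 0) -> int:
    """Estimate how many Data Collector workers a cluster streaming pipeline spawns.

    partitions:   number of partitions in the Kafka topic when the pipeline starts.
    worker_count: hypothetical stand-in for the Worker Count property;
                  0 is assumed to mean "no limit" (one worker per partition).
    """
    if partitions < 0 or worker_count < 0:
        raise ValueError("partitions and worker_count must be non-negative")
    if worker_count == 0:
        # No limit configured: one worker for each topic partition.
        return partitions
    # A limit can only reduce the number of workers, never raise it.
    return min(partitions, worker_count)


if __name__ == "__main__":
    print(planned_workers(6))      # no limit configured
    print(planned_workers(6, 4))   # limited on YARN
```

Because the partition count is read when the pipeline starts, a partition added later is not reflected until the pipeline is restarted and the calculation runs again.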
- MapR cluster
- Data Collector can process data from a MapR cluster in cluster batch mode.
- In cluster batch mode, Data Collector processes all available data and then stops the pipeline. Data Collector runs as an application on top of MapReduce, an open-source cluster-computing framework. MapReduce runs on a YARN cluster manager. YARN and MapReduce generate additional worker nodes as needed. MapReduce creates one map task for each MapR FS block.
Use the MapR FS origin to process data from MapR in cluster batch mode.
- HDFS
- Data Collector can process data from HDFS in cluster batch mode. In cluster batch mode, Data Collector processes all available data and then stops the pipeline.
- Data Collector runs as an application on top of MapReduce, an open-source cluster-computing framework. MapReduce runs on a YARN cluster manager. YARN and MapReduce generate additional worker nodes as needed. MapReduce creates one map task for each HDFS block.
Use the Hadoop FS origin to process data from HDFS in cluster batch mode.
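As a rough illustration of the one-map-task-per-block rule, the following Python sketch estimates how many map tasks a set of files would produce. The function name `estimate_map_tasks` is hypothetical, the 128 MB default block size is an assumption that should be replaced with the value configured on your cluster, and the sketch ignores any input-split tuning that could change the real count.

```python
from typing import Iterable

MB = 1024 * 1024


def estimate_map_tasks(file_sizes_bytes: Iterable[int],
                       block_size_bytes: int = 128 * MB) -> int:
    """Estimate map tasks as one task per block, summed over all files.

    Each file occupies ceil(size / block_size) blocks; empty files
    occupy no blocks and therefore contribute no map tasks.
    """
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    total = 0
    for size in file_sizes_bytes:
        if size < 0:
            raise ValueError("file sizes must be non-negative")
        # Integer ceiling division avoids floating-point rounding.
        total += (size + block_size_bytes - 1) // block_size_bytes
    return total


if __name__ == "__main__":
    # A 300 MB file and a 100 MB file with the assumed 128 MB block size.
    print(estimate_map_tasks([300 * MB, 100 * MB]))
```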
- Amazon S3
- Data Collector can process data from Amazon S3 in the following cluster batch modes:
- Cluster EMR batch mode - In cluster EMR batch mode, Data Collector runs on an Amazon EMR cluster to process Amazon S3 data. Data Collector can run on an existing EMR cluster or on a new EMR cluster that is provisioned when the pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops.
- Cluster batch mode - In cluster batch mode, Data Collector runs on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process Amazon S3 data.
- In either mode, Data Collector processes all available data and then stops the pipeline.
- Data Collector runs as an application on top of MapReduce in the EMR, CDH, or HDP cluster. MapReduce runs on a YARN cluster manager. MapReduce creates one map task for each HDFS block.
- Use the Hadoop FS origin to process data from Amazon S3 in cluster EMR batch or cluster batch mode.
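The relationships described above between origin systems, origins, and execution modes can be summarized as a lookup table. The following Python sketch only restates that mapping for quick reference. The names `ORIGIN_SYSTEMS` and `execution_options` are hypothetical and are not part of any Data Collector API.

```python
from typing import Dict, Tuple

# origin system -> (origin to use, execution modes available)
ORIGIN_SYSTEMS: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "Kafka cluster": ("Kafka Consumer", ("cluster streaming",)),
    "MapR cluster": ("MapR FS", ("cluster batch",)),
    "HDFS": ("Hadoop FS", ("cluster batch",)),
    "Amazon S3": ("Hadoop FS", ("cluster EMR batch", "cluster batch")),
}


def execution_options(origin_system: str) -> Tuple[str, Tuple[str, ...]]:
    """Return the origin and the execution modes for an origin system."""
    try:
        return ORIGIN_SYSTEMS[origin_system]
    except KeyError:
        raise ValueError(f"no cluster pipeline support listed for {origin_system!r}")


if __name__ == "__main__":
    for system in ORIGIN_SYSTEMS:
        origin, modes = execution_options(system)
        print(f"{system}: use the {origin} origin in {' or '.join(modes)} mode")
```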