Cluster Pipelines (deprecated)
You can run a pipeline in standalone execution mode or cluster execution mode. In standalone mode, a single Data Collector process runs the pipeline. A pipeline runs in standalone mode by default.
In cluster mode, Data Collector uses a cluster manager and a cluster application to spawn additional workers as needed. Use cluster mode to read data from a Kafka cluster, MapR cluster, HDFS, or Amazon S3.
When would you choose standalone or cluster mode? Say you want to ingest logs from application servers and perform a computationally expensive transformation. To do this, you might use a set of standalone pipelines to stream log data from each application server to a Kafka or MapR cluster, and then use a cluster pipeline to process the data from the cluster and perform the expensive transformation.
Cluster Batch and Streaming Execution Modes
Data Collector can run a cluster pipeline using cluster batch or cluster streaming execution mode.
The execution mode that Data Collector can use depends on the origin system that the cluster pipeline reads from:
- Kafka cluster
- Data Collector can process data from a Kafka cluster in cluster streaming mode. In cluster streaming mode, Data Collector processes data continuously until you stop the pipeline.
- MapR cluster
- Data Collector can process data from a MapR cluster in cluster batch mode.
- HDFS
- Data Collector can process data from HDFS in cluster batch mode. In cluster batch mode, Data Collector processes all available data and then stops the pipeline.
- Amazon S3
- Data Collector can process data from Amazon S3 in the following cluster batch modes:
- Cluster EMR batch mode - In cluster EMR batch mode, Data Collector runs on an Amazon EMR cluster to process Amazon S3 data. Data Collector can run on an existing EMR cluster or on a new EMR cluster that is provisioned when the pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops.
- Cluster batch mode - In cluster batch mode, Data Collector runs on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process Amazon S3 data.
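The origin-to-mode relationship listed above can be summarized as a small lookup table. This is only an illustrative sketch: the dictionary and the `modes_for` helper are hypothetical names and are not part of Data Collector.

```python
# Cluster execution modes per origin system, as listed in this section.
EXECUTION_MODES = {
    "Kafka": ["cluster streaming"],
    "MapR": ["cluster batch"],
    "HDFS": ["cluster batch"],
    "Amazon S3": ["cluster EMR batch", "cluster batch"],
}


def modes_for(origin: str) -> list[str]:
    """Return the cluster execution modes available for an origin system."""
    try:
        return EXECUTION_MODES[origin]
    except KeyError:
        # Anything else is not a cluster origin covered by this section.
        raise ValueError(f"{origin!r} is not a cluster origin listed here") from None
```

For example, `modes_for("Amazon S3")` returns both batch variants, while `modes_for("Kafka")` returns only cluster streaming.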
Data Collector Configuration
The Data Collector configuration file, $SDC_CONF/sdc.properties, defined on the gateway node is propagated to the worker nodes with the exception of the following properties:
- sdc.base.http.url
- http.bindHost
If you modify the sdc.base.http.url and http.bindHost properties on the gateway node to configure a specific host name or port number or to configure a specific IP address that Data Collector binds to, the modified values are not propagated to the worker nodes. The worker nodes always use the default values for the sdc.base.http.url and http.bindHost properties so that the worker nodes can dynamically determine the host name and can bind to any IP address.
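To exclude additional properties from propagation to the worker nodes, list them in the following property in the sdc.properties file on the gateway node:
cluster.slave.configs.remove=<property1>,<property2>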
For more information on configuring the Data Collector configuration file, see Data Collector Configuration in the Data Collector documentation.
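As a rough mental model of the propagation rule in this section, the following sketch filters a set of gateway properties down to what the workers would receive. It is not Data Collector's implementation. It assumes that cluster.slave.configs.remove holds a comma-separated list of additional property names to withhold, and the function name `worker_properties` is hypothetical.

```python
# Properties that are never propagated from the gateway node, per this section.
ALWAYS_EXCLUDED = {"sdc.base.http.url", "http.bindHost"}

# Assumption: this property lists extra names to withhold from the workers.
REMOVE_KEY = "cluster.slave.configs.remove"


def worker_properties(gateway_props: dict[str, str]) -> dict[str, str]:
    """Return the gateway properties that would be propagated to workers."""
    extra = {
        name.strip()
        for name in gateway_props.get(REMOVE_KEY, "").split(",")
        if name.strip()
    }
    excluded = ALWAYS_EXCLUDED | extra
    # Whether the remove property itself reaches the workers is not stated
    # in the text; this sketch leaves it in place.
    return {k: v for k, v in gateway_props.items() if k not in excluded}
```

Passing a dictionary that sets sdc.base.http.url on the gateway yields a result without that key, which mirrors the statement that workers fall back to the default value.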
Enable HTTPS
You can enable Data Collector to use HTTPS when you run cluster pipelines. By default Data Collector uses HTTP.
To configure HTTPS for cluster pipelines, you first must configure Data Collector to use HTTPS. Then you generate an SSL/TLS certificate for each worker node in the cluster. Data Collector runs on the master gateway node in the cluster, so the gateway node uses the same keystore file configured for Data Collector.
You then specify the generated keystore file and keystore password file for the worker nodes in the Data Collector configuration file, $SDC_CONF/sdc.properties. You can optionally generate a truststore file for the gateway and worker nodes.
For more information, see Enabling HTTPS in the Data Collector documentation.
Temporary Directory
Data Collector requires that the Java temporary directory on the gateway node in the cluster is writable.
The Java temporary directory is specified by the Java system property java.io.tmpdir. On UNIX, the default value of this property is typically /tmp and is writable.
Before running cluster pipelines, verify that the Java temporary directory on the gateway node is writable.
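One way to perform this check is to try creating a file in the directory. The sketch below is a generic helper, not a Data Collector tool. It does not read java.io.tmpdir itself, so you pass in the directory that the property resolves to on your gateway node (assumed here to be /tmp unless it has been overridden). The name `is_writable_dir` is hypothetical.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if a file can actually be created inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating a real (auto-deleted) file is more reliable than
        # inspecting permission bits alone.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

On a gateway node that uses the typical UNIX default, calling `is_writable_dir("/tmp")` as the user that runs Data Collector indicates whether the requirement is met.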
Logs
Because cluster pipelines run as either MapReduce or Spark applications, each Data Collector worker in the cluster manages its own log.
The Data Collector workers send log messages to different locations based on the cluster execution mode:
- Cluster batch mode pipelines
- For cluster batch mode pipelines, each Data Collector worker sends log messages to the syslog file on the worker node. You can use the YARN Resource Manager UI to view the syslog file for each MapReduce task.
- Cluster streaming mode pipelines
- For cluster streaming mode pipelines, each Data Collector worker sends log messages to stderr on the worker node. You can use the Spark UI to view stderr for each Spark application.
Cluster pipeline logs can grow in size over time, particularly for cluster streaming pipelines that run continuously. You can optionally configure the Data Collector installed on the gateway node to use the log4j rolling file appender to write log messages to an sdc.log file. This configuration is propagated to the worker nodes such that each Data Collector worker writes log messages to an sdc.log file within the YARN application directory.
The log4j rolling file appender automatically rolls or archives the current log file and then resumes logging in another file. The $SDC_CONF/sdc-log4j.properties file configured for the Data Collector installed on the gateway node determines how frequently the rolling file appender rolls files. By default, it writes log messages to a maximum of 10 files, rolling over to the next file when the current file reaches a size of 256 MB.
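The default limits above imply an approximate ceiling on the disk space that the rolled log files can occupy. The following sketch works through that arithmetic. The worker count is a hypothetical input, and the result is an estimate because a file can slightly exceed the threshold before it rolls.

```python
MAX_FILES = 10          # default maximum number of log files, per the text above
MAX_FILE_SIZE_MB = 256  # default size at which the appender rolls to the next file


def max_log_footprint_mb(max_files: int = MAX_FILES,
                         max_file_size_mb: int = MAX_FILE_SIZE_MB) -> int:
    """Approximate upper bound, in MB, on sdc.log files for one worker."""
    return max_files * max_file_size_mb


def cluster_log_footprint_mb(workers: int) -> int:
    """Approximate upper bound, in MB, across a number of workers."""
    return workers * max_log_footprint_mb()
```

With the defaults, one worker can retain roughly 10 x 256 MB = 2560 MB of logs, so four workers could retain about 10240 MB in total.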
When you configure Data Collector to use the rolling file appender, you can view the log files for each worker node by using the YARN Resource Manager UI to locate the sdc.log file within the YARN application directory.
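To enable the log4j rolling file appender for cluster pipelines, add the following property to the Data Collector configuration file, $SDC_CONF/sdc.properties, defined on the gateway node:
cluster.pipelines.logging.to.stderr=false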
Checkpoint Storage for Streaming Pipelines
When Data Collector runs a cluster streaming pipeline, Data Collector generates and stores checkpoint metadata. The checkpoint metadata provides the offset for the origin.
Data Collector stores the checkpoint metadata in the following path:
/user/$USER/.streamsets-spark-streaming/<DataCollector ID>/<Kafka topic>/<consumer group>/<pipelineName>
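To locate the checkpoint metadata for a given pipeline, you can fill in the placeholders of the path pattern above. The helper below is a minimal sketch of that substitution. The function name and the example values are hypothetical, and the code only builds the string without accessing any file system.

```python
def checkpoint_path(user: str, data_collector_id: str, kafka_topic: str,
                    consumer_group: str, pipeline_name: str) -> str:
    """Build the checkpoint metadata path from its components."""
    return "/".join([
        "/user",
        user,
        ".streamsets-spark-streaming",
        data_collector_id,
        kafka_topic,
        consumer_group,
        pipeline_name,
    ])
```

For example, `checkpoint_path("sdc", "abc123", "weblogs", "group1", "LogPipeline")` produces `/user/sdc/.streamsets-spark-streaming/abc123/weblogs/group1/LogPipeline`.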
Error Handling Limitation
Cluster pipelines have the following error handling limitation:
- Error Records - Write error records to Kafka or discard the records. Stopping the pipeline or writing error records to a file is not supported at this time.