Pipeline Statistics
When you monitor an active job in Control Hub, you can view real-time statistics and metrics about the running pipeline. However, when the job stops, the real-time statistics are no longer visible.
To monitor statistics and metrics for inactive jobs and for previous job runs, you must configure the pipeline to write statistics to Control Hub or to another system. When a pipeline is configured to write statistics, Control Hub saves the pipeline statistics for each job run.
When a job for an edge pipeline runs on SDC Edge or when a job for a standalone pipeline runs on a single Data Collector, you can configure the pipeline to write the statistics directly to Control Hub.
When a job for a standalone pipeline runs on multiple Data Collectors, a remote pipeline instance runs on each Data Collector. When a job for a cluster pipeline runs on a single Data Collector, remote pipeline instances run on multiple worker nodes in the cluster. To view aggregated statistics for these jobs within Control Hub, you must configure the pipeline to write the statistics to one of the following systems:
- SDC RPC
- Kafka cluster
- Amazon Kinesis Streams
- MapR Streams
When you start a job that includes a pipeline configured to write to Kafka, Kinesis, MapR Streams, or SDC RPC, Control Hub automatically generates and runs a system pipeline for the job. The system pipeline reads the statistics written by each running pipeline instance to Kafka, Kinesis, MapR Streams, or SDC RPC. Then, the system pipeline aggregates and sends the statistics to Control Hub.
Pipeline Execution Mode
Pipelines can run in standalone, cluster, or edge execution mode. Not all statistics destinations are available in every execution mode. For example, edge pipelines can write statistics only directly to Control Hub; the SDC RPC, Kafka, Kinesis, and MapR Streams options are not valid in Data Collector Edge pipelines.
Write Statistics Directly to Control Hub
When you write statistics directly to Control Hub, Control Hub does not generate a system pipeline for the job. Instead, the Data Collector or SDC Edge directly sends the statistics to Control Hub.
Write statistics directly to Control Hub in a development environment when the job for a standalone or edge pipeline runs on a single Data Collector or SDC Edge. If the job runs on multiple Data Collectors or SDC Edge instances, Control Hub can display the pipeline statistics for each individual pipeline instance. However, Control Hub cannot display an aggregated view of the statistics across all running pipeline instances.
When you write statistics directly to Control Hub, Control Hub cannot generate data delivery reports for the job or trigger data SLA alerts for the job.
Write Statistics to SDC RPC
When you write statistics to SDC RPC, Data Collector effectively adds an SDC RPC destination to the pipeline that you are configuring. Control Hub automatically generates and runs a system pipeline for the job. The system pipeline uses a Dev SDC RPC with Buffering origin to read the statistics passed from the SDC RPC destination, and then aggregates and sends the statistics to Control Hub.
Not valid in Data Collector Edge pipelines.
- SDC RPC connection - The host and port number of the Data Collector machine where Control Hub starts the system pipeline. The host must be a Data Collector machine registered with Control Hub that can run a pipeline for the job. A Data Collector can run the pipeline when it has all labels associated with the job.
For example, if you associate the job with the WestCoast label, then the host specified in the RPC connection must be a machine with a registered Data Collector that also has the WestCoast label.
- SDC RPC ID - A user-defined identifier that allows SDC RPC stages to recognize each other. To avoid mixing statistics from different jobs, use a unique ID for each job.
You can optionally enable encryption to pass data securely and define retry and timeout properties.
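For illustration only, the following Python sketch shows one way to assemble these two values; the host, port, and job name are hypothetical placeholders rather than properties defined by Control Hub:
    import uuid

    # Hypothetical host and port of a registered Data Collector that has all of
    # the labels associated with the job (placeholders for illustration only).
    sdc_rpc_host = "sdc-west-01.example.com"
    sdc_rpc_port = 20000
    sdc_rpc_connection = f"{sdc_rpc_host}:{sdc_rpc_port}"

    # Use a unique SDC RPC ID for each job so that statistics from different
    # jobs are not mixed together.
    job_name = "orders-to-s3"
    sdc_rpc_id = f"{job_name}-stats-{uuid.uuid4()}"

    print(sdc_rpc_connection)
    print(sdc_rpc_id)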
For more information about SDC RPC pipelines, see SDC RPC Pipeline Overview (deprecated).
Best Practices for SDC RPC
- To avoid mixing statistics from different jobs, use a unique SDC RPC ID for each job.
- Monitor the disk space where the Dev SDC RPC with Buffering origin in the system pipeline temporarily buffers the records to disk before passing the records to the next stage in the pipeline.
The Dev SDC RPC with Buffering origin in the system pipeline temporarily buffers the statistics to a queue on disk. If the system pipeline slows, the temporary location on disk might become full. The temporary statistics are written to the location specified in the java.io.tmpdir system property, to a file with the following name: sdc-fragments<file ID>.queueFile
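As a rough monitoring sketch, assuming the default java.io.tmpdir location of /tmp on the Data Collector machine that runs the system pipeline, a script like the following could report the size of the buffered queue files and the remaining disk space:
    import glob
    import os
    import shutil

    # Location of java.io.tmpdir; /tmp is an assumption -- check the Java system
    # property configured for the Data Collector that runs the system pipeline.
    tmp_dir = "/tmp"

    # Total size of the buffered statistics queue files.
    queue_files = glob.glob(os.path.join(tmp_dir, "sdc-fragments*.queueFile"))
    buffered_bytes = sum(os.path.getsize(f) for f in queue_files)

    # Free space on the file system that holds java.io.tmpdir.
    usage = shutil.disk_usage(tmp_dir)

    print(f"Buffered statistics: {buffered_bytes / 1024 / 1024:.1f} MB in {len(queue_files)} file(s)")
    print(f"Free space on {tmp_dir}: {usage.free / 1024 ** 3:.1f} GB")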
Write Statistics to Kafka
When you write statistics to a Kafka cluster, Data Collector effectively adds a Kafka Producer destination to the pipeline that you are configuring. Control Hub automatically generates and runs a system pipeline for the job. The system pipeline reads the statistics from Kafka, and then aggregates and sends the statistics to Control Hub.
Not valid in Data Collector Edge pipelines.
When you write statistics to a Kafka cluster, you define connection information and the topic to write to.
You also configure the partition strategy. The pipeline passes data to partitions in the Kafka topic based on the partition strategy that you choose. You can add additional Kafka configuration properties as needed. You can also configure the pipeline to connect securely to Kafka through SSL/TLS or Kerberos.
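For example, the additional Kafka configuration properties for an SSL/TLS-secured cluster might resemble the following sketch. The broker addresses, file paths, and password are placeholders, and the exact properties to add depend on how your Kafka cluster is secured:
    # Hypothetical extra Kafka producer properties for an SSL/TLS-secured cluster;
    # all values are placeholders, not defaults supplied by the pipeline.
    kafka_config_properties = {
        "bootstrap.servers": "kafka-01.example.com:9093,kafka-02.example.com:9093",
        "security.protocol": "SSL",
        "ssl.truststore.location": "/etc/sdc/truststore.jks",
        "ssl.truststore.password": "changeit",
    }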
Partition Strategy
The partition strategy determines how to write statistics to Kafka partitions. You can use a partition strategy to balance the workload or to write data semantically.
The pipeline can use one of the following partition strategies (a conceptual sketch follows the list):
- Round-Robin - Writes statistics to a different partition using a cyclical order. Use for load balancing.
- Random - Writes statistics to a different partition using a random order. Use for load balancing.
- Expression - Writes statistics to a partition based on the results of the partition expression. Use to perform semantic partitioning.
- Default - Writes statistics using the default partition strategy that Kafka provides.
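The following standalone Python sketch (not StreamSets code) illustrates how each strategy might choose a partition for a statistics record; the partition count and record fields are hypothetical:
    import itertools
    import random
    import zlib

    NUM_PARTITIONS = 6
    record = {"jobId": "orders-to-s3", "recordCount": 42}

    # Round-robin: cycle through partitions to balance the load evenly.
    round_robin = itertools.cycle(range(NUM_PARTITIONS))
    rr_partition = next(round_robin)

    # Random: pick any partition, also for load balancing.
    rand_partition = random.randrange(NUM_PARTITIONS)

    # Expression: derive the partition from record content (semantic partitioning),
    # here with a deterministic hash of a field so related records land together.
    expr_partition = zlib.crc32(record["jobId"].encode()) % NUM_PARTITIONS

    print(rr_partition, rand_partition, expr_partition)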
Best Practices for a Kafka Cluster
Consider the following best practices when you configure a pipeline to write statistics to a Kafka cluster:
- To avoid mixing statistics from different jobs, use a unique topic name for each job.
- Consider the Kafka retention policy.
Each running pipeline instance writes statistics to Kafka, and then the system pipeline consumes the statistics from Kafka. If the system pipeline unexpectedly shuts down, Kafka retains the statistics for the amount of time determined by the Kafka retention policy. If the system pipeline is down for longer than Kafka retains data, the statistics are lost.
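For example, assuming the confluent-kafka Python client and a hypothetical statistics topic name, you could check the topic's retention.ms setting and compare it against the longest system pipeline outage you need to tolerate:
    from confluent_kafka.admin import AdminClient, ConfigResource

    # Placeholders: broker address and the statistics topic used by the job.
    admin = AdminClient({"bootstrap.servers": "kafka-01.example.com:9092"})
    resource = ConfigResource(ConfigResource.Type.TOPIC, "job-stats-orders-to-s3")

    # describe_configs returns a future per resource; retention.ms holds the
    # topic's retention period in milliseconds.
    configs = admin.describe_configs([resource])[resource].result()
    retention_ms = int(configs["retention.ms"].value)
    print(f"Statistics topic retention: {retention_ms / 3600000:.1f} hours")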
Write Statistics to Kinesis Streams
When you write statistics to Amazon Kinesis Streams, Data Collector effectively adds a Kinesis Producer destination to the pipeline that you are configuring. Control Hub automatically generates and runs a system pipeline for the job. The system pipeline reads the statistics from Kinesis Streams, and then aggregates and sends the statistics to Control Hub.
Not valid in Data Collector Edge pipelines.
When you write statistics to Kinesis Streams, you define connection information and the stream to write to.
You also configure the partition strategy. The pipeline passes data to partitions in Kinesis shards based on the partition strategy that you choose. You can add additional Kinesis configuration properties as needed.
Authentication Method
When a pipeline writes aggregated statistics to Amazon Kinesis Streams, you can configure the pipeline to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys.
For more information about the authentication methods and details on how to configure each method, see Security in Amazon Stages.
Best Practices for Kinesis Streams
Consider the following best practices when you configure a pipeline to write statistics to Amazon Kinesis Streams:
- To avoid mixing statistics from different jobs, use a unique stream name for each job.
- Consider the Kinesis Streams retention policy.
Each running pipeline instance writes statistics to Kinesis Streams, and then the system pipeline reads the statistics from Kinesis Streams. If the system pipeline unexpectedly shuts down, Kinesis Streams retains the statistics for the amount of time determined by the Kinesis Streams retention policy. If the system pipeline is down for longer than Kinesis Streams retains data, the statistics are lost.
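For example, using boto3 and a hypothetical stream name, you could check the current retention period (24 hours by default) and extend it so that statistics survive a longer system pipeline outage:
    import boto3

    # Placeholder name of the Kinesis stream that the job writes statistics to.
    stream_name = "job-stats-orders-to-s3"
    kinesis = boto3.client("kinesis")

    # Check the stream's current retention period, in hours.
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    current_hours = summary["StreamDescriptionSummary"]["RetentionPeriodHours"]
    print(f"Current retention: {current_hours} hours")

    # Extend retention to 72 hours if it is currently shorter than that.
    if current_hours < 72:
        kinesis.increase_stream_retention_period(
            StreamName=stream_name, RetentionPeriodHours=72
        )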
Write Statistics to MapR Streams
When you write statistics to MapR Streams, Data Collector effectively adds a MapR Streams Producer destination to the pipeline that you are configuring. Control Hub automatically generates and runs a system pipeline for the job. The system pipeline reads the statistics from MapR Streams, and then aggregates and sends the statistics to Control Hub.
Not valid in Data Collector Edge pipelines.
When you write statistics to MapR Streams, you define the topic to write to. You also configure the partition strategy. The pipeline passes data to partitions in the MapR Streams topic based on the partition strategy that you choose. You can add additional MapR Streams configuration properties as needed.
Before you can write statistics to MapR Streams, you must perform additional steps to enable Data Collector to process MapR data. For more information, see MapR Prerequisites in the Data Collector documentation.
Partition Strategy
The partition strategy determines how to write statistics to MapR Streams partitions. You can use a partition strategy to balance the workload or to write data semantically.
The pipeline can use one of the following partition strategies:
- Round-Robin - Writes each record to a different partition using a cyclical order. Use for load balancing.
- Random - Writes each record to a different partition using a random order. Use for load balancing.
- Expression - Writes each record to a partition based on the results of the partition expression. Use to perform semantic partitioning.
- Default - Writes each record using the default partition strategy that MapR Streams provides.
Best Practices for MapR Streams
Consider the following best practices when you configure a pipeline to write statistics to MapR Streams:
- To avoid mixing statistics from different jobs, use a unique topic name for each job.
- Consider the MapR Streams retention policy.
Each running pipeline instance writes statistics to MapR Streams, and then the system pipeline consumes the statistics from MapR Streams. If the system pipeline unexpectedly shuts down, MapR Streams retains the statistics for the amount of time determined by the MapR Streams retention policy. If the system pipeline is down for longer than MapR Streams retains data, the statistics are lost.
Configuring a Pipeline to Write Statistics
Configure a pipeline to write statistics when you want to monitor statistics and metrics for inactive jobs and for previous job runs.