Pipeline Statistics

When you monitor an active job in Control Hub, you can view real-time statistics and metrics about the running pipeline. However, when the job stops, the real-time statistics are no longer visible.

To monitor statistics and metrics for inactive jobs and for previous job runs, you must configure the pipeline to write statistics to Control Hub or to another system. When a pipeline is configured to write statistics, Control Hub saves the pipeline statistics for each job run.

When a job for an edge pipeline runs on SDC Edge or when a job for a standalone pipeline runs on a single Data Collector, you can configure the pipeline to write the statistics directly to Control Hub.
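
For illustration only, the following Python sketch shows the general shape of such a direct write: a single pipeline instance posts its own counters to Control Hub over HTTP. The endpoint path and payload fields are hypothetical, not the documented Control Hub API, and Data Collector handles this exchange for you when the pipeline is configured to write statistics directly.

```python
# Illustrative sketch only: the general shape of a per-run statistics record
# and a direct HTTP write to Control Hub. The endpoint path and payload
# fields are hypothetical; Data Collector performs this exchange internally.
import requests

CONTROL_HUB_STATS_URL = "https://cloud.streamsets.com/stats/v1/metrics"  # hypothetical path

record = {
    "jobId": "job-1234",      # hypothetical identifiers and field names
    "sdcId": "sdc-01",
    "inputRecords": 15000,
    "outputRecords": 14987,
    "errorRecords": 13,
}

response = requests.post(CONTROL_HUB_STATS_URL, json=record, timeout=10)
response.raise_for_status()
```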

When a job for a standalone pipeline runs on multiple Data Collectors, a remote pipeline instance runs on each Data Collector. Similarly, when a job for a cluster pipeline runs on a single Data Collector, remote pipeline instances run on multiple worker nodes in the cluster. To view aggregated statistics for these jobs within Control Hub, you must configure the pipeline to write the statistics to one of the following systems (the sketch after this list illustrates the data flow):

  • SDC RPC
  • Kafka cluster
  • Amazon Kinesis Streams
  • MapR Streams

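To make the aggregation flow concrete, the following Python sketch shows the first half of that flow under the Kafka option: each remote pipeline instance publishes its own counts to a shared statistics topic. The topic name, field names, and kafka-python client are assumptions for illustration; Data Collector writes these records itself when the pipeline is configured to write statistics to Kafka.

```python
# Sketch of the data flow only, assuming a Kafka aggregation topic named
# "sdc-stats" (the topic name, field names, and kafka-python client are
# assumptions): each remote pipeline instance publishes its own counts.
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka-broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for full replication so stats survive a broker failure
)

instance_stats = {
    "jobId": "job-1234",      # hypothetical identifiers
    "instanceId": "sdc-02",
    "inputRecords": 5000,
    "outputRecords": 4996,
    "errorRecords": 4,
}

producer.send("sdc-stats", instance_stats)
producer.flush()
```
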
When you start a job that includes a pipeline configured to write to Kafka, Kinesis, MapR Streams, or SDC RPC, Control Hub automatically generates and runs a system pipeline for the job. The system pipeline reads the statistics that each running pipeline instance writes to that system, then aggregates the statistics and sends them to Control Hub.
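
Conceptually, the system pipeline performs an aggregation like the following Python sketch, which consumes the per-instance records and sums them into a running total per job before the totals are sent on to Control Hub. The topic and field names carry over from the previous sketch and remain assumptions.

```python
# A minimal sketch of what the system pipeline does conceptually: read the
# per-instance statistics records and sum them into one running aggregate
# per job. Topic and field names are assumptions from the producer sketch.
import json
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sdc-stats",
    bootstrap_servers="kafka-broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

totals = {}  # jobId -> aggregated counters
for message in consumer:
    stats = message.value
    job_totals = totals.setdefault(stats["jobId"], Counter())
    for key in ("inputRecords", "outputRecords", "errorRecords"):
        job_totals[key] += stats.get(key, 0)
    # The real system pipeline would now forward the aggregate to Control Hub.
```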

Important: For standalone and cluster pipelines in a production environment, use a Kafka cluster, Amazon Kinesis Streams, or MapR Streams to aggregate statistics. SDC RPC is not highly available and can lose data, so use it for development purposes only.
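
Kafka's replication is what makes it the more durable choice here. As a hedged sketch, you might provision the statistics topic with a replication factor greater than one so that statistics records survive a broker failure; the topic name and sizing below are assumptions.

```python
# Hedged sketch: provisioning a replicated Kafka topic for statistics, since
# replication is what makes Kafka safer than SDC RPC for aggregation. The
# topic name and sizing are assumptions; adjust them to your cluster.
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="kafka-broker:9092")
admin.create_topics([
    NewTopic(name="sdc-stats", num_partitions=3, replication_factor=3),
])
```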