Jobs Overview

A job defines the pipeline to run and the execution engine that runs the pipeline: Data Collector or Transformer. Jobs are the execution of the dataflow.

After you publish pipelines, you create a job to specify the published pipeline to run. You also assign labels to the job so that Control Hub knows which group of execution engines should run the pipeline.

By default when you start a job that contains a Data Collector pipeline, Control Hub sends an instance of the pipeline to one Data Collector with all labels after verifying that the Data Collector does not exceed its resource thresholds. The Data Collector remotely runs the pipeline instance. You can increase the number of pipeline instances that Control Hub runs for a Data Collector job.

In contrast, you cannot increase the number of pipeline instances that Control Hub runs for a Transformer job. When you start a job that contains a Transformer pipeline, Control Hub sends an instance of the pipeline to one Transformer with all labels after verifying that Transformer does not exceed its resource thresholds. Transformer remotely runs the pipeline instance on Apache Spark deployed to a cluster. Because Transformer runs pipelines on Spark, Spark runs the application just as it runs any other application, distributing the processing across nodes in the cluster.

To minimize downtime due to unexpected failures, enable pipeline failover for jobs. Control Hub manages pipeline failover differently for Data Collector and Transformer jobs.

When a Data Collector pipeline is configured to aggregate statistics, Control Hub also creates a system pipeline for the job and instructs one of the Data Collectors to run the system pipeline. The system pipeline collects, aggregates, and pushes metrics for all of the remote pipeline instances back to Control Hub so that you can monitor the progress of the job.

Note: Transformer pipelines cannot be configured to aggregate statistics.

If a job includes a pipeline that uses runtime parameters, you specify the parameter values that the job uses for the pipeline instances. Or, you can enable the job to work as a job template. A job template lets you run multiple job instances with different runtime parameter values from a single job definition.

When you stop a job, Control Hub instructs all execution engines running pipelines for the job to stop the pipelines.

After you create jobs, you create a topology to map multiple related jobs into a single view. Topologies are the end-to-end view of multiple dataflows. From a single topology view, you can start, stop, monitor, and synchronize all jobs included in the topology.