Monitoring Jobs

After you start a job, you can monitor the statistics, error information, logs, and alerts about all remote pipeline instances run from the job. During pipeline test runs or job runs for Data Collector, you can also capture and review a snapshot of the data being processed.

To monitor a job, click the name of an active job in the Jobs view. Control Hub displays the pipeline in the canvas and displays real-time statistics for the job in the Monitoring panel below the canvas. Click the canvas to view statistics for the entire job. Select a stage in the canvas to view statistics for the stage.

Tip: You can also monitor job statistics, error information, and alerts from a topology, and you can view statistics through data delivery reports.

The following image shows the job monitoring view:

The Monitoring panel includes the following tabs:
  • Realtime Summary - Real-time statistics and metrics for the active job. Also displays custom metrics for Data Collector stages that provide custom metrics. For more information, see Custom Metrics in the Data Collector documentation.

    When the job stops, these real-time statistics are no longer visible. To view these statistics, the web browser must be able to access the execution Data Collector or Transformer running the job.

    Statistics and metrics on the Realtime Summary tab are updated every two seconds.

  • Summary - Statistics and metrics saved to Control Hub when the pipeline is configured to write statistics to Control Hub or to another system. Configure the pipeline to write statistics when you design the pipeline. Also includes historical time series data when time series analysis is enabled for the job.

    Statistics and metrics are saved to Control Hub every minute by default. As a result, you must wait at least one minute to see data in the Summary tab after you start a job. You can change the default value by modifying the Statistics Refresh Interval property for the job. The charts in the Summary tab are updated based on the chart settings that you select from the More icon.

  • Job Status - Status of the job. For a description of each status, see Job Status.
  • Data Collectors or Transformers - List of execution engines running each remote pipeline instance.
  • Configuration - Configuration details for the pipeline or selected stage.
  • Errors - Error records encountered by a pipeline stage in an active Data Collector job. Displays when you select a stage in the canvas that has encountered errors.
  • Rules - Metric alert rules, data rules, and email IDs for alerts.
  • Info - General information about the pipeline or selected stage or link.
  • History - Job history, including the start and finish time of previous job runs and a summary of each run.
Note the following icons that display in the top and bottom toolbars for the Monitoring panel when you monitor a job. You'll use these icons frequently as you analyze the real-time statistics for a job:
  • View Logs - View logs for the execution engine running the remote pipeline instance. Available for Data Collector and Transformer jobs.
  • Auto Arrange - Arrange the stages in the pipeline.
  • Share - Share the job with other users and groups, as described in Permissions.
  • Schedule Job - Schedule the job to start, stop, or upgrade to the latest pipeline version on a regular basis.
  • Synchronize Job - Synchronize an active job after you have updated the labels assigned to execution engines.
  • Snapshots - Capture and review a snapshot of the data being processed during pipeline test runs or job runs for Data Collector.
  • Stop Job - Stop the job.

Time Series Analysis

When time series analysis is enabled for a job, you can view historical time series data when you view the Summary tab in the Monitoring panel.

You can view time series data when the pipeline is configured to write statistics to Control Hub or to another system. The time series charts contain no data when the pipeline is configured to discard statistics.

By default, all new jobs have time series analysis disabled. You might want to enable time series analysis for jobs for debugging purposes or to analyze dataflow performance. You can enable time series analysis for an inactive job when you edit the job.

When time series analysis is enabled, you can generate data delivery reports, monitor topologies with data SLAs, view the record count for a specific time period, and analyze time series charts for the record count, record throughput, and batch throughput. For example, the following image displays the location where you can select a time period for analysis and displays the Record Count Time Series chart:

When time series analysis is disabled, you can still view the total record count and throughput for a job, but you cannot view the data over a period of time. For example, you can’t view the record count for the last five minutes or for the last hour. You also cannot view the time series charts for the job, generate a data delivery report, or monitor topologies with data SLAs.

Job Status

When you view the list of jobs in the Jobs view or when you monitor a job, you can view the job status. You can also view the status of remote pipeline instances run from active jobs.

The job status is color-coded, providing an easy visual indicator of which jobs need your attention. A red status indicates that an error has occurred that you must resolve. A green status indicates that all is well.
Note: A job template is simply a job definition and does not have a status.

The following list describes each job status:

  • INACTIVE - Job is inactive. A job that has never run has an inactive status. A job transitions from an active to an inactive status when you stop the job or when all remote pipeline instances run from the job have reached a finished state.
  • INACTIVE (red) - Job is inactive after stopping automatically due to an error.

    For example, a red inactive status can occur when either the pipeline or Data Collector generates an error that causes Control Hub to stop the job.

  • INACTIVE_ERROR - Job is inactive and has an error that you must acknowledge.

    This status can occur when at least one execution engine reported an error while attempting to stop the remote pipeline instance. For example, one Data Collector might have shut down and so could not properly stop the remote pipeline instance.

    You cannot perform actions on jobs with an inactive_error status until you acknowledge the error message. To acknowledge the error, view the job details or monitor the job and acknowledge reading the error message. For more information, see Acknowledging Job Errors.

  • ACTIVATING - Control Hub is in the process of starting the job.

    You cannot perform actions on activating jobs.

  • ACTIVE (green) - Job is active and remote pipeline instances are running on the execution engines assigned the same labels as the job.
  • ACTIVE (red) - Job is active, but there are some issues you must look into. For example, a red active status can indicate one of the following issues:
    • One of the assigned execution engines is not currently running.
    • One of the assigned execution engines encountered an error while running the pipeline.
    • All assigned execution engines have exceeded their resource thresholds.
    • The Data Collector pipeline is running and is configured to write statistics to Amazon Kinesis Streams, Kafka, or MapR Streams, but the system pipeline is not running.
  • DEACTIVATING - Control Hub is in the process of stopping a job as requested or as expected. Control Hub is communicating with the execution engines to stop all remote pipeline instances.

    You cannot perform actions on deactivating jobs.

  • DEACTIVATING (due to an error) - Control Hub is in the process of stopping a job automatically due to an error. Control Hub is communicating with the execution engines to stop all remote pipeline instances.

    You cannot perform actions on deactivating jobs.

Pipeline Status

When you view the list of jobs in the Jobs view, you can view the status of remote pipeline instances run from active jobs. Inactive jobs do not display a pipeline status.

The Jobs view displays a pipeline status when all remote pipeline instances run from the active job have the same status. When remote pipeline instances run from the active job have different statuses, the Jobs view displays an asterisk (*) for the pipeline status. To view the pipeline statuses, you must log into the UI for each Data Collector or Transformer running the pipeline instances.
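
The display rule above can be sketched as a small function. This is only an illustration of the rule, not Control Hub code: the function name and the list-of-strings input are assumptions, and treating an empty list as an inactive job is a simplification.

```python
def displayed_pipeline_status(instance_statuses):
    """Return the value the Jobs view would show for a job.

    Hypothetical helper: takes the statuses reported by each remote
    pipeline instance and applies the rule described above.
    """
    unique = set(instance_statuses)
    if not unique:
        # Assumption: no instances means an inactive job, which shows no status.
        return None
    if len(unique) == 1:
        # All instances agree, so that single status is displayed.
        return unique.pop()
    # Instances disagree, so the view shows an asterisk instead.
    return "*"
```

For example, two instances that both report RUNNING display RUNNING, while one RUNNING and one STOPPING instance display an asterisk.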

The pipeline status is color-coded, providing an easy visual indicator of when that status was last reported by the engine running the pipeline. Each pipeline status can display in the following colors:
  • Green - The engine sent the pipeline status to Control Hub less than 2 minutes ago.
  • Red - The engine sent the pipeline status to Control Hub over 2 minutes ago. In this case, the status may no longer be accurate.

For example, a green RUNNING status indicates that the pipeline is running and that Data Collector updated Control Hub with that status within the last 2 minutes. A red RUNNING status indicates that Data Collector last reported the pipeline as running, but Data Collector has not updated the pipeline status for over 2 minutes. As a result, the pipeline status might have changed in that time.
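
The color rule can be expressed as a comparison against a two-minute threshold. The following sketch assumes you already know when the engine last reported the status; the function and constant names are illustrative. The text does not say which color applies at exactly 2 minutes, so this sketch treats that boundary as red.

```python
from datetime import datetime, timedelta

# Threshold taken from the color descriptions above.
STALE_AFTER = timedelta(minutes=2)


def pipeline_status_color(last_reported, now):
    """Return "green" for a recently reported status, otherwise "red"."""
    age = now - last_reported
    # Assumption: an age of exactly 2 minutes counts as stale.
    return "green" if age < STALE_AFTER else "red"
```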

The following pipeline statuses often display in the Jobs view:

  • EDITED - The pipeline has been created or modified, and has not run since the last modification.
  • FINISHED - The pipeline has completed all expected processing and has stopped running.
  • RUN_ERROR - The pipeline encountered an error while running and stopped.
  • RUNNING - The pipeline is running.
  • STOPPED - The pipeline was manually stopped.
  • START_ERROR - The pipeline encountered an error while starting and failed to start.
  • STOP_ERROR - The pipeline encountered an error while stopping.

The following pipeline statuses are transient and rarely display in the Jobs view:

  • CONNECT_ERROR - When running a cluster pipeline, the execution engine cannot connect to the underlying cluster manager, such as Hadoop YARN or Amazon EMR.
  • CONNECTING - The pipeline is preparing to restart after an execution engine restart.
  • DISCONNECTED - The pipeline is disconnected from external systems, typically because the engine is restarting or shutting down.
  • DISCONNECTING - The pipeline is in the process of disconnecting from external systems, typically because the engine is restarting or shutting down.
  • FINISHING - The pipeline is in the process of finishing all expected processing.
  • RETRY - The pipeline is trying to run after encountering an error while running. This occurs only when the pipeline is configured for a retry upon error.
  • RUNNING_ERROR - The pipeline encounters errors while running.
  • STARTING - The pipeline is initializing, but hasn't started yet.
  • STARTING_ERROR - The pipeline encounters errors while starting.
  • STOPPING - The pipeline is in the process of stopping after a manual request to stop.
  • STOPPING_ERROR - The pipeline encounters errors while stopping.

Pipeline Status Examples

Here are some examples of how pipelines can move through statuses:
Starting a pipeline
When a job successfully starts a pipeline for the first time, a pipeline transitions through the following statuses:
(EDITED)... STARTING... RUNNING
When a job starts a pipeline for the first time but it cannot start, the pipeline transitions through the following statuses:
(EDITED)... STARTING... STARTING_ERROR... START_ERROR
Stopping or restarting an engine

When an engine shuts down, running pipelines transition through the following statuses:

(RUNNING)... DISCONNECTING... DISCONNECTED
When an engine restarts, any pipelines that were running transition through the following statuses:
DISCONNECTED... CONNECTING... STARTING... RUNNING
Retrying a Data Collector pipeline
When a Data Collector pipeline is configured to retry upon error, Data Collector performs the specified number of retries when the pipeline encounters errors while running.
When retrying upon error and successfully retrying, a pipeline transitions through the following statuses:
(RUNNING)... RUNNING_ERROR... RETRY... STARTING... RUNNING
When retrying upon error and encountering another error, a pipeline transitions through the following statuses:
(RUNNING)... RUNNING_ERROR... RETRY... STARTING... RUNNING... RUNNING_ERROR... 
When performing a final retry and unable to return to a Running state, a pipeline transitions through the following statuses:
(RUNNING)... RUNNING_ERROR... RUN_ERROR
Stopping a pipeline
When you successfully stop a job, the pipeline transitions through the following statuses:
(RUNNING)... STOPPING... STOPPED
When you stop a job and the pipeline encounters errors, the pipeline transitions through the following statuses:
(RUNNING)... STOPPING... STOPPING_ERROR... STOP_ERROR
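
If you record pipeline statuses over time, the example sequences above can serve as a rough reference for which status changes are expected. The following sketch builds a transition table from those examples only, so it is not a complete model of the pipeline state machine; all names are illustrative.

```python
# Status sequences copied from the examples above.
EXAMPLE_SEQUENCES = [
    ["EDITED", "STARTING", "RUNNING"],
    ["EDITED", "STARTING", "STARTING_ERROR", "START_ERROR"],
    ["RUNNING", "DISCONNECTING", "DISCONNECTED"],
    ["DISCONNECTED", "CONNECTING", "STARTING", "RUNNING"],
    ["RUNNING", "RUNNING_ERROR", "RETRY", "STARTING", "RUNNING"],
    ["RUNNING", "RUNNING_ERROR", "RUN_ERROR"],
    ["RUNNING", "STOPPING", "STOPPED"],
    ["RUNNING", "STOPPING", "STOPPING_ERROR", "STOP_ERROR"],
]


def build_transitions(sequences):
    """Map each status to the set of statuses that follow it in the examples."""
    transitions = {}
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            transitions.setdefault(current, set()).add(following)
    return transitions


TRANSITIONS = build_transitions(EXAMPLE_SEQUENCES)


def first_unexpected_step(observed):
    """Return the first (current, next) pair not seen in the examples, or None."""
    for current, following in zip(observed, observed[1:]):
        if following not in TRANSITIONS.get(current, set()):
            return (current, following)
    return None
```

A pair returned by first_unexpected_step is not necessarily an error. It only means that the change does not appear in the examples listed here.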

Jobs and Unresponsive Data Collector Engines

When you start a Data Collector job, Control Hub sends an instance of the pipeline to a Data Collector engine. The engine remotely runs the pipeline instance, communicating with Control Hub at regular one-minute intervals to report a heartbeat, pipeline status, and last-saved offset.

If a Data Collector engine fails to communicate with Control Hub before the maximum engine heartbeat interval expires, 5 minutes by default, then Control Hub considers the engine unresponsive.
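
The two intervals described above divide the time since an engine last communicated into three ranges. The following sketch labels those ranges. The state labels and names are illustrative, and the handling of the exact boundary values is an assumption because the text does not specify it.

```python
from datetime import timedelta

# Values taken from the text above; the maximum is the stated default.
REPORT_INTERVAL = timedelta(minutes=1)
DEFAULT_MAX_HEARTBEAT_INTERVAL = timedelta(minutes=5)


def engine_contact_state(elapsed, max_interval=DEFAULT_MAX_HEARTBEAT_INTERVAL):
    """Classify an engine by the time elapsed since its last communication."""
    if elapsed > max_interval:
        # The maximum engine heartbeat interval has expired.
        return "unresponsive"
    if elapsed > REPORT_INTERVAL:
        # At least one regular communication interval was missed.
        return "missed interval"
    return "reporting"
```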

Engines can become unresponsive for the following reasons:

  • The engine loses its connection to Control Hub.
  • The engine gracefully shuts down due to a Control Hub request.
  • The engine unexpectedly shuts down.

Control Hub handles currently active jobs on unresponsive Data Collector engines differently, depending on the reason for the unresponsive engine and whether pipeline failover is enabled for the job.

Engine Loses Connection

When an unexpected network or system outage occurs, a Data Collector engine running a pipeline can lose its connection to Control Hub. The engine continues to remotely run the pipeline, temporarily saving the pipeline status and last-saved offset in data files on the engine machine.

After the engine misses the first communication interval, Control Hub displays a green ACTIVE status for the job and a green DISCONNECTED status for the pipeline.

If the engine reconnects to Control Hub before the maximum engine heartbeat interval expires, the engine reports the saved pipeline data to Control Hub. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status.

If the maximum engine heartbeat interval expires before the engine reconnects to Control Hub, Control Hub considers the engine unresponsive. Control Hub handles jobs on unresponsive engines based on whether pipeline failover is enabled for the job:
Pipeline failover disabled
The job transitions to a red ACTIVE status, and the pipeline transitions to a red DISCONNECTED status.
The unresponsive engine continues to remotely run the pipeline, unless you manually shut down the engine. The engine locally saves the pipeline status and last-saved offset in data files on the engine machine. When the engine reconnects, it reports this pipeline data to Control Hub. The job transitions to a green ACTIVE status, and the pipeline transitions to a green RUNNING status.
Pipeline failover enabled
Control Hub restarts the pipeline on another available engine. The new engine starts the pipeline at the last-saved offset. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status.
The unresponsive engine continues to run the pipeline, unless you manually shut down the engine. When the unresponsive engine reconnects to Control Hub, the engine is instructed to stop the pipeline. Data duplication can occur since two engines might run the same pipeline at the same time.

Engine Gracefully Shuts Down

A Data Collector engine gracefully shuts down when it receives a shut-down request from Control Hub. A Control Hub shut-down request can occur for several reasons, including when a user shuts down or restarts engines.
Important: As a best practice, stop all active jobs before restarting or shutting down engines.

When an engine gracefully shuts down while running a pipeline, Control Hub immediately displays a green ACTIVE status for the job and a green DISCONNECTED status for the pipeline.

After the engine misses the first communication interval, the job remains in a green ACTIVE status and the pipeline transitions to a red DISCONNECTED status.

If the engine restarts before the maximum engine heartbeat interval expires, the engine continues running the pipeline using the last-saved offset stored in data files on the engine machine. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status.

If the maximum engine heartbeat interval expires before the engine restarts, Control Hub considers the engine unresponsive. Control Hub handles jobs on unresponsive engines based on whether pipeline failover is enabled for the job:
Pipeline failover disabled
The job transitions to a red ACTIVE status, and the pipeline remains in a red DISCONNECTED status.
When the engine restarts, the engine continues running the pipeline using the last-saved offset stored in data files on the engine machine. The job transitions to a green ACTIVE status, and the pipeline transitions to a green RUNNING status.
Pipeline failover enabled
Control Hub restarts the pipeline on another available engine. The new engine starts the pipeline at the last-saved offset. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status.
When the unresponsive engine restarts, it immediately continues running the pipeline using the last-saved offset stored in data files on the engine machine. When the engine reconnects to Control Hub, the engine is instructed to stop the pipeline. Data duplication can occur since two engines might run the same pipeline for a brief amount of time.

Engine Unexpectedly Shuts Down

When a Data Collector engine unexpectedly shuts down while running a pipeline, engine processes do not terminate gracefully. As a result, unexpected behavior can occur.

After the engine misses the first communication interval, Control Hub displays a green ACTIVE status for the job and a red RUNNING status for the pipeline.

If the engine restarts before the maximum engine heartbeat interval expires, the engine continues running the pipeline using the last-saved offset stored in data files on the engine machine. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status. However, data loss or data duplication can occur because the offset might not have been correctly saved before the engine shut down.

If the maximum engine heartbeat interval expires before the engine restarts, Control Hub considers the engine unresponsive. Control Hub handles jobs on unresponsive engines based on whether pipeline failover is enabled for the job:
Pipeline failover disabled
The job transitions to a red ACTIVE status, and the pipeline remains in a red RUNNING status.
When the engine restarts, the engine continues running the pipeline using the last-saved offset stored in data files on the engine machine. The job transitions to a green ACTIVE status, and the pipeline transitions to a green RUNNING status. Data loss or data duplication can occur because the offset might not have been correctly saved before the engine shut down.
Pipeline failover enabled
Control Hub restarts the pipeline on another available engine. The new engine starts the pipeline at the last-saved offset. The job remains in a green ACTIVE status, and the pipeline transitions to a green RUNNING status. Data loss or data duplication can occur because the offset might not have been correctly saved before the engine shut down.
When the unresponsive engine restarts and reconnects to Control Hub, the engine is instructed to stop the pipeline.
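
The three scenarios above can be summarized as a lookup from the reason and the failover setting to the statuses that display once the maximum engine heartbeat interval has expired. The following sketch only restates those statuses. The reason keys, type name, and function name are hypothetical.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    job_status: str
    pipeline_status: str
    restarted_on_another_engine: bool


# Keys are (reason, pipeline failover enabled); values restate the text above.
OUTCOMES = {
    ("connection_lost", False): Outcome("red ACTIVE", "red DISCONNECTED", False),
    ("connection_lost", True): Outcome("green ACTIVE", "green RUNNING", True),
    ("graceful_shutdown", False): Outcome("red ACTIVE", "red DISCONNECTED", False),
    ("graceful_shutdown", True): Outcome("green ACTIVE", "green RUNNING", True),
    ("unexpected_shutdown", False): Outcome("red ACTIVE", "red RUNNING", False),
    ("unexpected_shutdown", True): Outcome("green ACTIVE", "green RUNNING", True),
}


def outcome_after_expiry(reason, failover_enabled):
    """Return the displayed statuses after the heartbeat interval expires."""
    return OUTCOMES[(reason, failover_enabled)]
```

The lookup covers only the displayed statuses. The notes about data loss and data duplication in each scenario still apply.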

Acknowledging Job Errors

By default, when a job has an inactive error status, you cannot perform actions on the job until you acknowledge the error message. You can acknowledge job errors from the Jobs view or when monitoring a job. You can also acknowledge job errors from a topology.

Note: You can optionally configure a job to skip job error acknowledgement when you create the job. However, be aware that skipping job error acknowledgement might hide errors that the job has encountered.
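
If you script checks around job status, the restrictions described on this page can be collected into one predicate. The following is a minimal sketch. The upper-case status identifiers are an assumed spelling, the function is hypothetical, and it does not model jobs configured to skip error acknowledgement.

```python
# Statuses for which this page says you cannot perform actions on the job.
TRANSITIONAL_STATUSES = {"ACTIVATING", "DEACTIVATING"}


def can_perform_actions(job_status, error_acknowledged=False):
    """Return True if actions on the job are allowed under the rules above."""
    if job_status in TRANSITIONAL_STATUSES:
        return False
    if job_status == "INACTIVE_ERROR":
        # Actions stay blocked until the error message is acknowledged.
        return error_acknowledged
    return True
```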
Acknowledging errors from the Jobs view

To acknowledge job errors from the Jobs view, click the row listing the job with the inactive error to display the job details. The details list the error message for the job and for the system job. Review the message, taking action as needed, and then click Acknowledge Error in the bottom left corner of the job details.

Acknowledging errors when monitoring a job

To acknowledge job errors when monitoring a job with an inactive error, click the Job Status tab in the monitoring panel. The Job Status tab lists the error message for the job and for the system job. Review the message, taking action as needed, and then click the Acknowledge Error icon in the top toolbar, as displayed in the following image:

Resetting Metrics for Jobs

When a job is inactive, you can reset the metrics for the job by resetting the origin for the job. You might want to reset metrics when you are testing jobs and want to view the metrics from the current job run only.

For more information about resetting the origin, see Resetting the Origin for Jobs.

Monitoring Stage-Related Errors

When you monitor an active Data Collector job, you can view the errors related to each pipeline stage. Stage-related errors include the error records that the stage produces and other errors encountered by the stage.

To monitor stage-related errors, the web browser must be able to access the execution Data Collector running the job.

Note: Stage-related errors do not display for Transformer jobs.

To view stage-related errors, select the stage in the canvas and then click the Errors tab in the Monitor panel. The Errors tab displays the following tabs:

Error Records

Displays a sample of the 10 most recent error records with related error messages. You can expand and review the data in each error record. If the error was produced by an exception, you can click View Stack Trace to view the full stack trace.

Stage Errors
Displays a list of stage errors. Stage errors are operational errors, such as an origin being unable to create a record because of invalid source data.

The following image displays a sample Errors tab for a Field Renamer processor that has encountered errors:

Logs

When monitoring an active Data Collector or Transformer job, you can view logs for the execution engine running the remote pipeline instance.

To view logs, the web browser must be able to access the execution engine running the job.

The information displayed in the logs depends on the execution engine type:
Data Collector
The Data Collector log includes information about the Data Collector application, such as start-up messages, user logins, or pipeline editing. The log also includes information about all pipelines running on the engine. As a result, you might see messages about pipelines being run from other active jobs when you view the log.
Transformer
You can view the following logs for Transformer jobs:
  • Transformer log - The Transformer log provides information about the Transformer application, such as start-up messages, user logins, or pipeline display in the canvas.

    The Transformer log can also include some information about local pipelines or cluster pipelines run on Hadoop YARN in client deployment mode. For these types of pipelines, the Spark driver program is launched on the local Transformer machine. As a result, some pipeline processing messages are included in the Transformer log. The Transformer log does not include information about other types of cluster pipelines.

  • Spark driver log - The Spark driver log provides information about how Spark runs, previews, and validates pipelines.

Viewing the Execution Engine Log

You can view and download the log for the execution engine, either Data Collector or Transformer, from the Control Hub UI.

Note: For information about the log format and how to modify the log level, see Log Format in the Data Collector documentation or Log Format in the Transformer documentation.
  1. As you monitor a job run, click the Select Executor icon in the bottom toolbar of the Monitoring panel and select the execution engine that you want to view logs for.
  2. Click the View Logs icon in the top toolbar of the Monitoring panel.

    Control Hub displays the log for the selected execution engine. The log displays the last 50 KiB of messages.

  3. To view earlier messages, click Load Previous Logs.
    Note: For Data Collector version 3.18.0 or earlier and Transformer version 3.15.0 or earlier, you cannot load previous logs. To view all log messages, you must access the logs from the Data Collector or Transformer UI by clicking Open in Executor.
  4. To filter the messages by log level, select a level from the Severity list.

    By default, the log displays messages for all severity levels.

  5. To refresh the log data, click Refresh.
  6. Click Close to close the log.
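
If you download the log and want to inspect the same amount of data that the log view initially displays, you can read only the last 50 KiB of the file. The following sketch works on any seekable binary stream. It illustrates reading a fixed-size tail and does not describe how Control Hub retrieves the log.

```python
import io

# 50 KiB, the amount of log data that the log view initially displays.
LOG_TAIL_BYTES = 50 * 1024


def tail_bytes(stream, limit=LOG_TAIL_BYTES):
    """Return at most the last `limit` bytes of a seekable binary stream."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    # Start at the beginning when the stream is shorter than the limit.
    stream.seek(max(0, size - limit))
    return stream.read()
```

To use the function with a downloaded file, open the file in binary mode and pass the file object to tail_bytes. The tail can begin in the middle of a log message, so you might need to discard the first partial line.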

Viewing the Spark Driver Log

You can view and download the Spark driver log from the Control Hub UI for some pipeline types. The supported pipeline types depend on the version of the execution Transformer.

The following list shows the pipeline types and the execution Transformer version that provides the Spark driver log through the UI:
  • Local pipelines - Transformer 3.16.0 or later
  • Cluster pipelines run in Spark standalone mode - Transformer 3.16.0 or later
  • Cluster pipelines run on Hadoop YARN in client deployment mode - Transformer 3.16.0 or later
  • Cluster pipelines run on Kubernetes - Transformer 3.16.0 or later
  • Cluster pipelines run on Amazon EMR - Transformer 3.17.0 or later
To access the Spark driver log for local pipelines or cluster pipelines run on Hadoop YARN in client deployment mode for earlier Transformer versions, open the Spark driver log file written to the following location on the Transformer machine: $TRANSFORMER_DATA/runInfo/<pipelineID>/run<timestamp>/driver-all.log
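
Because the run directory name includes a timestamp, it can be easier to search for the log file than to type the full path. The following sketch lists the driver log files for one pipeline under the location shown above. The function name is hypothetical, and you pass in the value of $TRANSFORMER_DATA and the pipeline ID yourself.

```python
from pathlib import Path


def find_driver_logs(transformer_data, pipeline_id):
    """List driver-all.log files for a pipeline, one per run directory."""
    run_info = Path(transformer_data) / "runInfo" / pipeline_id
    # Run directories are named run<timestamp>, so match them with a wildcard.
    return sorted(run_info.glob("run*/driver-all.log"))
```

The results are sorted by path. Whether that matches chronological order depends on the timestamp format, which is not described here.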

For all other cluster pipelines, the Spark driver program is launched remotely on one of the worker nodes inside the cluster. To view the Spark driver logs for these pipelines, access the Spark web UI for the application launched for the pipeline. Control Hub provides easy access to the Spark web UI for many cluster types.

Note: By default, messages in the Spark driver log are logged at the ERROR severity level. To modify the log level, change the Log Level property on the Cluster tab for the pipeline.
  1. As you monitor a Transformer job, click the Summary tab in the Monitoring panel, and then click Driver Logs in the Runtime Statistics section.

    To view the Spark driver log for a previous Transformer job run, click the History tab in the Monitoring panel, and then click Driver Logs in the Summary column.

    Control Hub displays the most recent driver log information.

  2. Click Refresh to view the latest data.
  3. To download the latest log data, click Download.

Cluster and Spark URLs

When you monitor a Transformer job for a cluster pipeline, the Monitoring panel provides URLs for the cluster or the Spark application that runs the pipeline.

Use the URL to access additional information about the cluster or Spark application. For example, the Spark web UI can include information such as completed jobs, memory usage, running executors, and the Spark driver log.

Cluster and Spark URLs display in the Runtime Statistics section of the Monitoring panel. For example, when you monitor a Databricks pipeline, the Databricks Job URL displays with the other runtime statistics, as follows:

The following list shows the URLs that display for each cluster manager type:
  • Amazon EMR - Amazon EMR cluster URL, Spark Web UI URL
  • Apache Spark for HDInsight - Livy Batch URL, Livy Spark URL, Livy YARN URL
  • Databricks - Databricks Job URL
  • Dataproc - Spark Web UI URL
  • Hadoop YARN - Spark Web UI URL
  • Spark Standalone - Spark Web UI URL
  • SQL Server 2019 Big Data Cluster - Livy Batch URL, Livy Spark URL, Livy YARN URL

Snapshots

A snapshot is a set of data captured as it moves through a running pipeline. You can capture and review snapshots during pipeline test runs or job runs for Data Collector.

To capture and review snapshots, the web browser must be able to access the execution Data Collector running the job.

Note: Snapshots are not available for pipeline test runs or job runs for Data Collector Edge and Transformer.

View a snapshot to verify how a Data Collector pipeline processes data. As with data preview, you can view how snapshot data moves through a pipeline stage by stage or across multiple stages. You can drill down to review the values of each record to determine if the stage or group of stages transforms data as expected.

Unlike data preview, you cannot edit data to perform testing when you review a snapshot. Instead, you can use the snapshot as source data for data preview. You might use a snapshot for data preview to test the pipeline with production data.

Snapshots captured for jobs are available for the duration of the job run. When you stop the job, all captured snapshots are deleted.

Snapshots captured for pipeline test runs are still available after the test run stops.

The Data Collector instance used for the snapshot depends on where you take the snapshot:
  • Snapshots taken from a pipeline test run use the selected authoring Data Collector.
  • Snapshots taken while monitoring a job use the execution Data Collector for the job run. When there is more than one execution Data Collector, the snapshot uses the Data Collector selected in the Monitoring panel.

Failure Snapshots

When a pipeline test run fails, Control Hub captures a failure snapshot which you can view to troubleshoot the problem. A failure snapshot is a partial snapshot that occurs automatically when the pipeline stops due to unexpected data.

Note: Failure snapshots are not captured for jobs.

A failure snapshot captures the data in the pipeline that was in memory when the problem occurred. As a result, a failure snapshot includes the data that caused the problem and might include other unrelated data, but does not include data in each stage like a full snapshot.

Data Collector standalone pipelines generate the failure snapshot by default. Data Collector cluster pipelines do not generate failure snapshots.

You can configure standalone pipelines to skip generating the failure snapshot by clearing the Create Failure Snapshot pipeline property.

Viewing a Failure Snapshot

After a pipeline test run generates a failure snapshot, you can review the snapshot to determine the cause of the error.

  1. As you view the draft pipeline in the pipeline canvas, click the More icon, and then click Snapshots.
  2. In the Snapshots dialog box, find the failure snapshot and click View.

    Failure snapshots use the following naming convention: Failure at <time of failure>.

  3. When the failure snapshot displays, click through the stages.

    Stages that encountered no errors will typically not display any data. The stage that contains data should be the stage that encountered the errors. Examine the data that caused the errors and edit the pipeline as needed.

  4. To exit the snapshot review, click Close Snapshot.

Capturing and Viewing a Snapshot

You can capture a snapshot of data when you run a test of a draft Data Collector pipeline or when you monitor a Data Collector job.

After you capture a snapshot, you can view the snapshot data stage by stage or through a group of stages, like data preview.

  1. As you monitor a pipeline test run or a job run, click the Snapshots icon.

    Or when viewing an inactive draft pipeline, click the More icon and then click Snapshots.

  2. In the Snapshots dialog box, click Capture Snapshot to capture a set of data.

    Control Hub captures a snapshot of the next batch that passes through the pipeline and displays it in the list.

  3. To view a snapshot, click View for the snapshot that you want to use.

    The canvas highlights the origin stage of the pipeline. The Monitor panel displays snapshot data in the Output Data column. Since this is the origin of the pipeline, no input data displays.

  4. To view data for a different stage, select the stage in the canvas.
  5. To view the snapshot for multiple stages, click Multiple.

    The canvas highlights the first stage and the last stage. The Monitor panel displays the input and output data for the selected group of stages.

    1. To change the first stage in the group, select the current first stage and then select the desired stage.
    2. To change the last stage in the group, select the current last stage and then select the desired stage.
  6. To exit the snapshot review, click Close Snapshot.

Renaming a Snapshot

Snapshots use the following naming convention: Snapshot<number>, for example Snapshot1 or Snapshot2. You can rename a snapshot captured for a pipeline test run or for a job run so that it is more easily identified.

For example, let's say that you've captured four snapshots for a pipeline test run and would like to use Snapshot3 as the source data for a preview of the pipeline. You rename Snapshot3 to SnapshotForPreview so it's easier to identify that snapshot when you configure the source data for the preview.

  1. As you monitor a pipeline test run or a job run, click the Snapshots icon.

    Or when viewing an inactive draft pipeline, click the More icon and then click Snapshots.

  2. In the Snapshots dialog box, click the name of the snapshot that you want to rename, and then type the new name.
  3. To exit the snapshot review, click Close Snapshot.
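If you track snapshot names outside Control Hub, for example in notes or scripts, the naming conventions described in this topic can be checked with a short helper. The following Python sketch is illustrative only. The function name and the category labels are assumptions, and it treats any name that matches neither convention as a renamed snapshot.

```python
import re

# Conventions described in this topic:
#   Snapshot<number>              default name of a captured snapshot
#   Failure at <time of failure>  name of a failure snapshot
_DEFAULT_NAME = re.compile(r"^Snapshot(\d+)$")
_FAILURE_PREFIX = "Failure at "


def classify_snapshot_name(name):
    """Classify a snapshot name as 'failure', 'default', or 'renamed'."""
    if name.startswith(_FAILURE_PREFIX):
        return "failure"
    if _DEFAULT_NAME.match(name):
        return "default"
    # Anything else is assumed to be a name that a user typed.
    return "renamed"
```

For example, `classify_snapshot_name("Snapshot3")` returns `"default"`, and `classify_snapshot_name("SnapshotForPreview")` returns `"renamed"`.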

Downloading a Snapshot

When needed, you can download a snapshot captured for a pipeline test run or job run. For example, you might download a snapshot so that the Dev Snapshot Replaying origin can read records from the downloaded file.

When you download a snapshot, it downloads to the default download location on your machine.

Downloaded snapshots for pipeline test runs begin with the prefix testRun.

Downloaded snapshots for job runs begin with the prefix snapshot.

  1. As you monitor a pipeline test run or a job run, click the Snapshots icon.

    Or when viewing an inactive draft pipeline, click the More icon and then click Snapshots.

    The Snapshots dialog box displays all available snapshots for the pipeline or job.

  2. Click Download for the snapshot that you want to download.
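After several downloads, the file name prefixes make it possible to separate test run snapshots from job run snapshots. The following Python sketch groups a list of file names by prefix. The function name and the sample file names, including the .json extension, are assumptions made for illustration. Only the testRun and snapshot prefixes come from this topic.

```python
def group_downloaded_snapshots(file_names):
    """Group downloaded snapshot file names by their documented prefix.

    Files that start with neither prefix are ignored.
    """
    groups = {"test_run": [], "job_run": []}
    for name in file_names:
        if name.startswith("testRun"):
            groups["test_run"].append(name)
        elif name.startswith("snapshot"):
            groups["job_run"].append(name)
    return groups


# Hypothetical file names in a download folder.
downloads = ["testRun_a.json", "snapshot_b.json", "notes.txt", "testRun_c.json"]
grouped = group_downloaded_snapshots(downloads)
```

The prefix check is case sensitive, so a file named Snapshot_b.json would not be counted as a job run snapshot.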

Deleting a Snapshot

Control Hub retains all snapshots for a pipeline test run, even after the test run stops. Control Hub retains all snapshots for a job run only for the duration of the job run. When needed, you can delete snapshots for a pipeline test run or for a job run.

Note: When you delete a snapshot, the information is irrevocably removed. You cannot retrieve a deleted snapshot.

  1. As you monitor a pipeline test run or a job run, click the Snapshots icon.

    Or when viewing an inactive draft pipeline, click the More icon and then click Snapshots.

    The Snapshots dialog box displays all available snapshots for the pipeline or job.

  2. Click Delete for the snapshot that you want to delete.

Viewing the Job Run History

You can view the run history of a job and a summary of each run when you configure or monitor a job.

Note: Control Hub automatically deletes the job history on a predetermined basis, as configured by administrators for your organization.

The history for a Data Collector job shows the following information:
  • Run count
  • Job status
  • Time the job started or finished
  • Input, output, and error record count for the job run
  • Access to each job run summary

The history for a Transformer job shows the following information:
  • Spark application ID and name
  • Job status
  • Time the job started or finished
  • Input, output, and error record count for the job run
  • Access to each job run summary and, depending on the pipeline type, a link to the Spark driver logs

Note: Control Hub displays timestamps using the browser time zone, which is determined by your local operating system.

Click the History tab in the job properties or monitor panel to view the run history. The following image shows a sample run history:

Viewing a Run Summary

You can view a run summary for each job run when you view the job history.

You can view run summaries for completed and currently active job runs when the pipeline is configured to write statistics to Control Hub or to another system. The run summaries contain no data when the pipeline is configured to discard statistics.

A run summary includes the following information:
  • Input, output, and error record count for the job
  • Input, output, and error record throughput for the job
  • Batch processing statistics

To view a run summary, on the History tab of the job, click View Summary for a specific job run.
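Record counts and record throughput in a run summary are related by the duration of the run. The following Python sketch computes an average throughput in records per second from the counts. This topic does not state how Control Hub calculates throughput, so treat the formula, the class, and the field names as assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    """Hypothetical container for the counts shown in a run summary."""
    input_records: int
    output_records: int
    error_records: int
    duration_seconds: float


def average_throughput(counts):
    """Return average records per second for input, output, and error records."""
    if counts.duration_seconds <= 0:
        raise ValueError("duration_seconds must be greater than zero")
    return {
        "input": counts.input_records / counts.duration_seconds,
        "output": counts.output_records / counts.duration_seconds,
        "error": counts.error_records / counts.duration_seconds,
    }


# Example: 1200 records read in 60 seconds, 60 of them sent to error handling.
rates = average_throughput(RunCounts(1200, 1140, 60, 60.0))
```

An average over the whole run hides short bursts and stalls, so it can differ from rates that are measured over shorter intervals.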