Monitor Topologies with Data SLAs

A data SLA (service level agreement) defines the data processing rates that jobs within a topology must meet.

In addition to measuring the performance of jobs within a topology, you can configure data SLAs to define the expected thresholds of the data throughput rate or the error record rate. Data SLAs trigger an alert when the specified threshold is reached. Data SLA alerts provide immediate feedback on the data processing rates expected by your team. They enable you to monitor your data operations and quickly investigate issues that arise, ensuring that data is delivered in a timely manner.

You configure data SLAs on the jobs included in a topology. Define each data SLA for one job in the topology.

When you configure a data SLA, you select one of the following QOS (quality of service) parameters to measure:
  • Throughput rate - Number of records processed per second.
  • Error rate - Number of error records encountered per second.

For each parameter, you define either a minimum or maximum value for the threshold. You also define the alert text that displays in the Alerts view when the data SLA triggers an alert.
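
The threshold logic described above can be sketched in a few lines of Python. This is only an illustrative model, not Control Hub code or a Control Hub API: the class name `DataSla`, its field names, and the string values are hypothetical. It also assumes that a maximum is breached when the measured rate rises above the value and a minimum when the rate falls below it; the exact behavior at the boundary is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataSla:
    """Hypothetical model of a data SLA; not a Control Hub API object."""
    label: str
    qos_parameter: str   # "throughput_rate" or "error_rate" (records per second)
    function_type: str   # "max" or "min"
    value: float         # threshold value
    alert_text: str

    def check(self, measured_rate: float) -> Optional[str]:
        """Return the alert text if the measured rate crosses the threshold."""
        if self.function_type == "max":
            breached = measured_rate > self.value   # assumption: strict comparison
        elif self.function_type == "min":
            breached = measured_rate < self.value   # assumption: strict comparison
        else:
            raise ValueError(f"unknown function type: {self.function_type}")
        return self.alert_text if breached else None


error_sla = DataSla("Too many errors", "error_rate", "max", 10, "Error rate above 10 records/s")
print(error_sla.check(12))  # alert text, because 12 exceeds the maximum of 10
print(error_sla.check(5))   # None, because 5 is within the threshold
```

Modeling the function type as a separate field mirrors the way a data SLA pairs one QOS parameter with either a maximum or a minimum, so two data SLAs are needed to bound a rate on both sides.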

Example

You are part of the data operations team at an oil and gas company that analyzes data to improve efficiency and reduce operational costs. Your company gathers data from equipment sensors, which must be monitored in real time to avoid failures. Your data operations team is responsible for collecting and cleansing sensor data to quickly make it available to the analytics team for further analysis.

You work under a strict timeline and have service level agreements with the analytics team to ensure that cleansed data is made available to them in a timely manner. One of the SLAs states that the Sensor Collection job must process approximately 2,000 records per second. You need to be notified immediately when the throughput rises above or falls below that average so that you can resolve any performance issues.

You define and activate a data SLA that triggers an alert when the Sensor Collection job encounters a throughput rate of more than 3,000 records per second. You define and activate another data SLA that triggers an alert when the same job encounters a throughput rate less than 1,500 records per second.

You enable time series analysis for all jobs in the topology, start the jobs, and begin monitoring their progress. After several days, one of the equipment sensors goes offline, causing the throughput rate for the Sensor Collection job to fall below the threshold of 1,500 records per second. As a result, the data SLA triggers an alert and displays a red Alerts icon in the top toolbar of Control Hub. The triggered data SLA displays a graph of the record throughput rate. The red line in the graph represents the defined threshold, as follows:

Since the triggered data SLA immediately notifies you that your service agreement with the analytics team is not being met, you can quickly investigate and resolve the issue.
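
To make the scenario concrete, the following sketch applies the two thresholds from the example to a short series of throughput readings. The readings are invented for illustration, and the function name `evaluate` is hypothetical. The sketch flags every reading outside the range; how often Control Hub itself raises an alert while a rate stays out of range is not covered here.

```python
# Thresholds from the example; the sample readings below are made up.
MAX_THROUGHPUT = 3000  # records per second
MIN_THROUGHPUT = 1500  # records per second


def evaluate(samples):
    """Return (index, message) pairs for readings outside the two thresholds."""
    alerts = []
    for i, rate in enumerate(samples):
        if rate > MAX_THROUGHPUT:
            alerts.append((i, "throughput above maximum"))
        elif rate < MIN_THROUGHPUT:
            alerts.append((i, "throughput below minimum"))
    return alerts


# Hypothetical readings: a sensor goes offline at index 3.
readings = [2010, 1980, 2050, 1400, 1350]
alerts = evaluate(readings)
print(alerts)
# [(3, 'throughput below minimum'), (4, 'throughput below minimum')]
```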

Prerequisites

Before you create a data SLA on a job included in a topology, note the following prerequisites:
  • The pipeline included in the job must be configured to write statistics to an external system. The pipeline cannot be configured to write statistics to Control Hub or to discard them.

    You configure a pipeline to write statistics when you design the pipeline.

  • The job must have time series analysis enabled.

    You enable time series analysis when you configure a job.
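
As a rough illustration, the two prerequisites amount to a simple check on a job before you define a data SLA for it. The function below is a hypothetical sketch, not a Control Hub API: the name `unmet_prerequisites` and the labels `"control_hub"` and `"discard"` for the statistics setting are assumptions made for this example.

```python
from typing import List


def unmet_prerequisites(statistics_target: str, time_series_enabled: bool) -> List[str]:
    """Return the prerequisites that a job does not meet (empty list if none).

    statistics_target is a hypothetical label for where the pipeline writes
    statistics, for example "control_hub", "discard", or an external system.
    """
    problems = []
    if statistics_target in ("control_hub", "discard"):
        problems.append("pipeline must write statistics to an external system")
    if not time_series_enabled:
        problems.append("job must have time series analysis enabled")
    return problems


print(unmet_prerequisites("kafka", True))      # []
print(unmet_prerequisites("discard", False))   # both prerequisites are reported
```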

Working with Data SLAs

View a topology to configure data SLAs. The Data SLA section in the topology detail pane lists all data SLAs created for the topology.

You can complete the following tasks in the Data SLA section:

  • Create data SLAs.
  • Share a data SLA with other users and groups.
  • Activate and deactivate data SLAs.
  • Delete data SLAs.

The following image shows a list of data SLAs in the Data SLA section of the topology detail pane. Each data SLA is listed with the last modified time and status:

Note the following icons that display for the Data SLA section when you select a data SLA. You'll use these icons frequently as you manage data SLAs:

Icon Name     Description
Add           Add a data SLA.
Share         Share the data SLA with other users and groups, as described in Permissions.
Activate      Activate the data SLA so that it can trigger an alert.
Deactivate    Deactivate the data SLA to temporarily stop it from measuring the defined threshold.
Delete        Delete the data SLA.
Refresh       Refresh the list of data SLAs.

Creating Data SLAs

Create data SLAs to define the expected thresholds of the data processing rates for a job. Define each data SLA for one job in the topology. The job must have time series analysis enabled.

You can create data SLAs for active or inactive jobs in the topology.

  1. View the details of a topology.
  2. In the Data SLA section, click the Add icon.
  3. On the Add Data SLA window, configure the following properties:
    Data SLA Property   Description
    Label               Label for the data SLA.
    Job                 Job that the data SLA is defined for. Verify that the job meets all prerequisites.
    QOS Parameter       Quality of service parameter to measure:
                        • Throughput rate - Number of records processed per second.
                        • Error rate - Number of error records encountered per second.
    Function Type       Specifies whether the data SLA measures a maximum or minimum value.
    Max or Min Value    Value of the expected threshold. The data SLA triggers an alert when this threshold is reached.
    Alert Text          Text to display when the alert is triggered.
  4. Click Add.
    By default, a new data SLA is inactive. You must activate it before it can trigger an alert.
  5. Select the new data SLA in the list, and then click the Activate icon.
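
The effect of the final activation step can be pictured with a small model: a newly created data SLA exists but stays silent until it is activated. The class `ManagedSla` and its fields are hypothetical names used only for this sketch; they do not correspond to a Control Hub API, and the strict "less than" comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManagedSla:
    """Hypothetical stand-in for a data SLA entry; not a Control Hub object."""
    label: str
    min_value: float          # minimum throughput, records per second
    alert_text: str
    active: bool = False      # a new data SLA starts inactive

    def activate(self) -> None:
        self.active = True

    def deactivate(self) -> None:
        self.active = False

    def check(self, measured_rate: float) -> Optional[str]:
        """Return the alert text only when active and below the minimum."""
        if self.active and measured_rate < self.min_value:
            return self.alert_text
        return None


sla = ManagedSla("Low throughput", 1500, "Throughput below 1,500 records/s")
print(sla.check(1200))   # None: the data SLA has not been activated yet
sla.activate()
print(sla.check(1200))   # Throughput below 1,500 records/s
```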

Managing Data SLA Alerts

When a data SLA triggers an alert, Control Hub displays a red Alerts icon in the top toolbar and lists the alert in the Alerts view.

The Alerts view lists all triggered alerts. You acknowledge and delete alerts in the Alerts view. After viewing an alert, select the alert in the Alerts view and click the Acknowledge icon to acknowledge the alert. You can filter the Alerts view by active or acknowledged alerts. When you filter by acknowledged alerts, you can select one or more acknowledged alerts and then delete them.
Note: Control Hub automatically displays each triggered data SLA alert in the Alerts view. You can also create a subscription to perform an action when a data SLA alert is triggered.
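
The alert workflow described above (active alerts, acknowledgment, then deletion of acknowledged alerts) can be sketched as a small state model. The classes `Alert` and `AlertsView` and their methods are hypothetical and exist only to illustrate the workflow; they are not part of Control Hub. The rule that only acknowledged alerts can be deleted is inferred from the description of the acknowledged filter and is treated here as an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    text: str
    acknowledged: bool = False   # a triggered alert starts as active


@dataclass
class AlertsView:
    """Hypothetical model of the Alerts view; not a Control Hub API."""
    alerts: List[Alert] = field(default_factory=list)

    def filter(self, acknowledged: bool) -> List[Alert]:
        """List either the acknowledged alerts or the active ones."""
        return [a for a in self.alerts if a.acknowledged == acknowledged]

    def acknowledge(self, alert: Alert) -> None:
        alert.acknowledged = True

    def delete(self, selected: List[Alert]) -> None:
        """Delete the selected alerts; assumes only acknowledged ones qualify."""
        if any(not a.acknowledged for a in selected):
            raise ValueError("only acknowledged alerts can be deleted")
        self.alerts = [a for a in self.alerts
                       if not any(a is s for s in selected)]


view = AlertsView([Alert("Throughput below 1,500 records/s"),
                   Alert("Throughput above 3,000 records/s")])
view.acknowledge(view.alerts[0])
view.delete(view.filter(acknowledged=True))
print([a.text for a in view.alerts])   # only the unacknowledged alert remains
```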

The following image displays an active alert in the Alerts view:

To view details about the data SLA threshold that was reached, click the alert message in the Alerts view. Control Hub opens the topology where the data SLA was configured. Open the topology detail pane, and then locate the Data SLA section in the detail pane. Click the data SLA label to display the details and the graph of the measured parameter. The red line in the graph represents the defined threshold, as follows:

You can modify the data SLA in real time. For example, while viewing a triggered alert, you might realize that you incorrectly configured the maximum or minimum value. Simply edit the value and click Save. The data SLA graph immediately reflects the modified threshold value.