Sequences

Sequences Overview

A sequence is a collection of jobs that run in a specified order based on conditions.

A sequence can include jobs that run on different types of StreamSets engines. For example, you can create a sequence that first runs a Data Collector job to load data to a data warehouse, and then runs a Transformer job to transform that data.

You can add a start condition to a sequence to schedule the time when the sequence starts. You can also configure a condition between steps that determines if the next step automatically starts when a job in the previous step encounters an error.

Although sequences run on a schedule, you can also manually run a sequence so that the sequence starts immediately.

By default, when you add a job to a sequence, the job is added as an additional step, so that the jobs run sequentially, one after the other. You can optionally add multiple jobs to a single step to run those jobs in parallel.

Each job can be included in only one sequence.

When a sequence is active, you can monitor the overall sequence status and the status of each step within the sequence. You can also view the sequence history and the list of errors encountered by sequence steps.

After you create a sequence, you can share the sequence with other users and groups, as described in Permissions.

Note: Sequences are designated a Technology Preview feature. They are not meant for use in production.

Example

As a retail company, you need to regularly update your data warehouse with sales information from various data sources. You've designed the following pipelines to meet all of your processing needs:
  • Data Collector pipeline that reads raw sales data from an Oracle database and writes the data to a temporary Amazon S3 bucket.
  • Data Collector pipeline that reads raw sales data from CSV files located on an FTP server and writes the data to another temporary Amazon S3 bucket.
  • Transformer pipeline that uses two origins to read from the temporary Amazon S3 buckets, a Join processor to join the data from both inputs, a Deduplicate processor to remove duplicate records, and a Field Renamer processor to rename several fields. This pipeline writes the cleansed data to a third temporary Amazon S3 bucket.
  • Transformer pipeline that reads from the third temporary Amazon S3 bucket and uses an Aggregate processor to aggregate the cleansed data to calculate the total sales data by store and region. The pipeline writes the aggregated data to a final temporary Amazon S3 bucket.
  • Data Collector pipeline that loads the final aggregated data to your Snowflake data warehouse.

You've finished testing the pipelines, published them, and added them to jobs. You can run the first two Data Collector jobs that read raw sales data in parallel because they do not depend on each other. The remaining jobs must run sequentially, each starting only after the previous job has completed successfully. You don't want to manually monitor the status of each job and then manually start the next one. You also want to run the jobs outside business hours and repeat the process every 7 days.

You create a sequence and add the jobs to the sequence in the following order:
  1. Data Collector jobs - Raw Sales Oracle to S3 and Raw Sales FTP to S3

    You add these jobs to the same sequence step so that they run in parallel. The sequence lists the jobs as steps 1a and 1b.

  2. Transformer job - Sales Cleansing
  3. Transformer job - Sales Aggregation
  4. Data Collector job - Refined Sales to Snowflake

You create a start condition for the sequence to start the sequence at 11 PM every 7 days. You also configure the sequence to start the next step only when the jobs in the previous step have run successfully.

When you finish configuring the sequence, you enable it. Note that the sequence remains inactive until the start condition is met.

Your finished sequence looks as follows:

Batch and Streaming Jobs

Before adding a job to a sequence, consider whether the job is a batch job or a streaming job:
Batch job
A batch job includes a pipeline that processes all available data, and then stops. When you add a batch job to a sequence, the sequence starts the next job step after the batch job completes.
Streaming job

A streaming job includes a pipeline that maintains a connection to the origin system and processes data as it becomes available. The pipeline runs continuously until you manually stop it because you expect data to arrive continuously.

When you add a streaming job to a sequence, the sequence runs the streaming job indefinitely until you manually stop the job in the Job Instances view. After you manually stop the streaming job, the sequence can start the next job step.

Alternatively, you can redesign the pipeline so that the pipeline stops after processing all available data. For example, for Data Collector pipelines, you can use the Pipeline Finisher executor to stop the pipeline when all data is processed.

Start Conditions

When you create a sequence, you add a start condition to schedule the time when the sequence starts.

You can also configure a start condition between steps that determines if the next step automatically starts when a job in the previous step encounters an error.

Sequence Start Condition

You can add a start condition to a sequence to schedule the time when the sequence starts. After you enable a sequence, the sequence remains inactive until the start condition is met or until you manually run the sequence.

When you add a start condition, you specify the start date and select the time zone.

You can optionally configure the start condition to repeat on a regular basis, such as daily, weekly, or monthly. When you configure a repeat start condition, you can also configure an end date for the condition.

When a repeat start condition is met while the sequence is inactive, the sequence starts as expected. This occurs when the sequence is in an INACTIVE or ERROR status.

When a repeat start condition is met while the sequence is running, the sequence logs an error message and does not start again. It simply completes the current sequence run, then stops. This occurs when the sequence is in an ACTIVE status. For example, you configure a sequence to start at 9 AM and to repeat every hour. The initial run of the sequence takes 90 minutes to complete, so the sequence does not start again at 10 AM because the sequence is already active.
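The skip behavior can be sketched as a small simulation. This is an illustrative model rather than Control Hub code: the function name simulate_repeats, its parameters, and the calendar date are hypothetical, and the sketch assumes that every run takes 90 minutes, not only the initial one.

```python
from datetime import datetime, timedelta


def simulate_repeats(first_start, repeat_every, run_duration, triggers):
    """Return (trigger_time, outcome) pairs for the first `triggers` scheduled times."""
    results = []
    active_until = None  # end time of the most recent run, if any
    for i in range(triggers):
        trigger = first_start + i * repeat_every
        if active_until is not None and trigger < active_until:
            # The sequence is still ACTIVE, so this scheduled start is skipped.
            results.append((trigger, "skipped"))
        else:
            # The sequence is not running, so it starts as expected.
            results.append((trigger, "started"))
            active_until = trigger + run_duration
    return results


# Start at 9 AM, repeat every hour, and assume each run takes 90 minutes.
schedule = simulate_repeats(
    first_start=datetime(2024, 1, 1, 9, 0),
    repeat_every=timedelta(hours=1),
    run_duration=timedelta(minutes=90),
    triggers=4,
)
for when, outcome in schedule:
    print(when.strftime("%H:%M"), outcome)
```

Under these assumptions, the 9 AM and 11 AM starts run, while the 10 AM and 12 PM starts are skipped because the previous run is still active.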

Time Zone

You select a time zone for the sequence start condition.

You can select any time zone, regardless of your current time zone or the time zone of your browser. For example, suppose you are currently located in the US/Pacific time zone but want to schedule a sequence to start in the US/Eastern time zone. You specify the US/Eastern time zone for the start condition and configure the sequence to start daily at 6 AM. The sequence starts daily at 6 AM in the US/Eastern time zone and at 3 AM in the US/Pacific time zone.

Time zones automatically adjust for daylight saving time when appropriate. For example, the US/Pacific time zone observes daylight saving time, but the US/Arizona time zone does not. Therefore, in June, 6:00 AM in the US/Pacific time zone is 6:00 AM in the US/Arizona time zone. But in January, 6:00 AM in the US/Pacific time zone is 7:00 AM in the US/Arizona time zone.
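The conversions in this section can be checked with Python's standard zoneinfo module. This is only a sketch of the arithmetic, not how Control Hub evaluates start conditions. It assumes that the IANA time zone database is available on the system, and it uses the canonical names America/Los_Angeles, America/New_York, and America/Phoenix in place of the US/Pacific, US/Eastern, and US/Arizona aliases. The helper convert and the dates are hypothetical.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Canonical IANA names, used here in place of the US/* aliases.
PACIFIC = ZoneInfo("America/Los_Angeles")
EASTERN = ZoneInfo("America/New_York")
ARIZONA = ZoneInfo("America/Phoenix")


def convert(year, month, day, hour, source, target):
    """Interpret a wall-clock hour in `source` and express it in `target`."""
    local = datetime(year, month, day, hour, tzinfo=source)
    return local.astimezone(target)


# A daily 6 AM start in the Eastern zone, seen from the Pacific zone.
eastern_start = convert(2024, 6, 15, 6, EASTERN, PACIFIC)

# 6 AM Pacific seen from Arizona, with and without daylight saving time.
june = convert(2024, 6, 15, 6, PACIFIC, ARIZONA)
january = convert(2024, 1, 15, 6, PACIFIC, ARIZONA)

print(eastern_start.strftime("%H:%M"))  # 03:00
print(june.strftime("%H:%M"))           # 06:00
print(january.strftime("%H:%M"))        # 07:00
```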

When viewing a sequence, Control Hub shows all scheduled and historical times in the time zone of the browser.

Step Start Condition

When a sequence contains multiple steps, you configure the Auto-Start Next Step When in Error condition between the steps. The step start condition determines if the next step automatically starts when a job in the previous step encounters an error:

Auto-start next step is enabled
When the Auto-Start Next Step When in Error condition is enabled, the sequence remains active and automatically starts the next step when a job in the previous step encounters an error.
By default, Control Hub enables the condition between each step.
For example, in the following sequence, the auto-start next step condition is enabled between each step. When the Cleanse Logs job in step 2 encounters an error, the sequence remains active and automatically starts running the job in step 3.

Auto-start next step is disabled
When the Auto-Start Next Step When in Error condition is disabled, the sequence transitions to an ERROR status and stops running the remaining steps when a job in the previous step encounters an error.
For example, in the following sequence, the auto-start next step condition is disabled between steps 2 and 3. When the Cleanse Logs job in step 2 encounters an error, the sequence transitions to an ERROR status and does not run the job in step 3.
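
The two cases can be sketched as a small simulation. This is an illustrative model, not Control Hub code: the function run_sequence, the job name Load Logs, and the one-job-per-step simplification are assumptions made for the example. The sketch does not model what happens when the last step encounters an error, because no condition follows the last step.

```python
def run_sequence(steps, auto_start_after):
    """Simulate one sequence run.

    steps: list of (job_name, succeeds) tuples, one job per step.
    auto_start_after: list of booleans, where auto_start_after[i] is the
        Auto-Start Next Step When in Error setting between step i and step i + 1.
    Returns (names_of_jobs_that_ran, stopped_in_error).
    """
    ran = []
    for index, (job, succeeds) in enumerate(steps):
        ran.append(job)
        is_last = index == len(steps) - 1
        if not succeeds and not is_last and not auto_start_after[index]:
            # Condition disabled: the sequence stops and skips the remaining steps.
            return ran, True
    return ran, False


# The Cleanse Logs job in step 2 encounters an error.
steps = [("Read Web Logs", True), ("Cleanse Logs", False), ("Load Logs", True)]

enabled = run_sequence(steps, auto_start_after=[True, True])
disabled = run_sequence(steps, auto_start_after=[True, False])

print(enabled)   # all three jobs run
print(disabled)  # the run stops after Cleanse Logs
```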

Parallel Jobs

By default, when you add a job to a sequence, the job is added as an additional step, so that the jobs run sequentially, one after the other. You can optionally add multiple jobs to a single step to run those jobs in parallel.

When a step includes multiple jobs, the sequence starts the next step only when all jobs in the previous step have completed.
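
The wait-for-all behavior can be illustrated with a thread pool. This is a hedged sketch of the concept, not how Control Hub schedules jobs: run_job and run_steps are hypothetical helpers, and the sleep durations stand in for real job run times.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_job(name, seconds, log):
    """Stand-in for a job run: wait, then record completion."""
    time.sleep(seconds)
    log.append(f"{name} completed")


def run_steps(steps):
    """Run each step in order. Jobs within a step run in parallel."""
    log = []
    for jobs in steps:
        with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
            futures = [pool.submit(run_job, name, seconds, log) for name, seconds in jobs]
            # Block until every job in this step has completed.
            for future in futures:
                future.result()
    return log


log = run_steps([
    [("Read Orders DB", 0.02), ("Read Web Logs", 0.01)],  # step 1: parallel jobs
    [("Cleanse Logs", 0.0)],                              # step 2: starts after step 1
])
print(log)
```

The two jobs in step 1 may finish in either order, but the step 2 job is always recorded last.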

You can add multiple jobs to a single step in the following ways:
  • When adding jobs to a sequence, select multiple jobs in the Select Jobs dialog box and then select Add jobs to the same step.
  • When viewing the details of a sequence, select the drag icon for a step and then drag the step into another step.

When you add multiple jobs to a single step, the sequence creates a substep for each job and labels the substeps with alphabetic characters. For example, if you add two jobs to step 2, the sequence labels the substeps as 2a and 2b.
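
The labeling scheme can be sketched as follows. The function label_steps is hypothetical and only mirrors the scheme described here. It assumes at most 26 jobs per step, because the source does not say how additional substeps would be labeled. The job names come from the retail example earlier in this document.

```python
from string import ascii_lowercase


def label_steps(steps):
    """Map step labels to job names. `steps` is a list of lists of job names."""
    labels = {}
    for number, jobs in enumerate(steps, start=1):
        if len(jobs) > len(ascii_lowercase):
            raise ValueError("this sketch supports at most 26 jobs per step")
        if len(jobs) == 1:
            # A step with a single job keeps its plain number.
            labels[str(number)] = jobs[0]
        else:
            # A step with several jobs gets one lettered substep per job.
            for letter, job in zip(ascii_lowercase, jobs):
                labels[f"{number}{letter}"] = job
    return labels


labels = label_steps([
    ["Raw Sales Oracle to S3", "Raw Sales FTP to S3"],
    ["Sales Cleansing"],
    ["Sales Aggregation"],
    ["Refined Sales to Snowflake"],
])
for label, job in labels.items():
    print(label, job)
```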

The following image displays a sequence with parallel jobs in step 1, numbered as 1a and 1b. The sequence starts the job in step 2 only after both the Read Orders DB and Read Web Logs jobs in step 1 have completed.

Managing Sequences

Create a sequence to run a collection of jobs in a specified order. When you finish configuring a sequence, you must enable it. After you enable a sequence, the sequence remains inactive until the start condition is met. When needed, you can manually run a sequence so that the sequence starts immediately.

You can edit an existing sequence to reorder the steps or to add or remove jobs from the sequence. You can also edit or delete the start condition.

To temporarily stop a sequence from running jobs, you can disable the sequence. When needed, you can delete a sequence.

You can also share a sequence with other users and groups, as described in Permissions.

Note: You cannot edit, disable, or delete a sequence that is actively running jobs.

Creating a Sequence

Create a sequence to run a collection of jobs in a specified order. Each job can be included in only one sequence.

  1. In the Navigation panel, click Run > Sequences, and then click the Add Job Sequence icon.
  2. Enter a name for the sequence and an optional description.
  3. Click Save.

    An empty sequence in a DISABLED status displays.

  4. Click Add a Start Condition to schedule the time when the sequence starts.

    Define the start condition and then click Save.

  5. Click Add Jobs.
    1. In the Select Jobs dialog box, you can search for the jobs to add.
    2. Select one or more jobs.

      By default, when you select multiple jobs, the sequence adds each job as a separate step. To add the selected jobs to the same step so that they run in parallel, select Add jobs to the same step.

    3. Click Add.

    The sequence lists the added jobs as ordered steps.

  6. To reorder a step, select the More icon for the step and then click Move to Previous Step or Move to Next Step.

    Alternatively, select the drag icon for the step and then drag the step to a new location.

  7. When you finish configuring the sequence, click the More icon to the right of the start condition, and then click Enable Sequence.

    The sequence transitions to an INACTIVE status until the start condition is met or until you manually run the sequence.

Reordering Steps

You can reorder the steps in a sequence when the sequence is not actively running jobs.

Note: The order of substeps, such as steps 1a, 1b, and 1c, does not matter. When a single step includes multiple jobs, the sequence runs the jobs in parallel.
  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. Select the More icon for the step and then click Move to Previous Step or Move to Next Step.

    Alternatively, select the drag icon for the step and then drag the step to a new location.

Modifying the Jobs in a Sequence

You can add jobs to or remove jobs from a sequence when the sequence is not actively running jobs.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. To add jobs to the sequence, click Add Jobs.
    1. In the Select Jobs dialog box, you can search for the jobs to add.
    2. Select one or more jobs.

      By default, when you select multiple jobs, the sequence adds each job as a separate step. To add the selected jobs to the same step so that they run in parallel, select Add jobs to the same step.

    3. Click Add.
  4. To remove a job from the sequence, select the More icon for a step and then click Remove from Sequence.

Editing a Start Condition

You can edit or delete a start condition when the sequence is not actively running jobs.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. To edit a start condition:
    1. Click the defined start condition.
    2. In the Edit Start Condition dialog box, edit the condition as needed.
    3. Click Save.
  4. To delete a start condition:
    1. Click the X icon next to the defined start condition.

    2. Click Delete in the confirmation dialog box.

Manually Running a Sequence

When needed, you can manually run a sequence so that the sequence starts immediately.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of an inactive sequence to display the sequence details.
  3. Click the More icon to the right of the start condition, and then click Run Now.

Disabling a Sequence

You can disable a sequence in an INACTIVE or ERROR status to temporarily stop the sequence from running jobs. A disabled sequence does not start at the next scheduled time.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. Click Disable Sequence.

Enabling a Sequence

You can enable a disabled sequence. After you enable a sequence, the sequence remains inactive until the start condition is met or until you manually run the sequence.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a disabled sequence to display the sequence details.
  3. Click Enable Sequence.

Deleting Sequences

You can delete sequences that are not actively running jobs.

  1. In the Navigation panel, click Run > Sequences.
  2. Select sequences in the list, and then click the Delete icon.
  3. Click Delete to confirm.

Monitoring Sequences

You can monitor the overall sequence status and the status of each step within the sequence. You can also view the sequence history and the list of errors encountered by sequence steps.

Tip: To monitor the details of a running job, click the job name from the sequence details.

Sequence Status

The Sequences view lists the status of each sequence. The sequence details also list the sequence status, in addition to the status of each step.

The following list describes each sequence status:

ACTIVE
Sequence has started and is actively running jobs.

DISABLED
Sequence is disabled and cannot run jobs.
You must enable the sequence before it can start again.

ERROR
Sequence has encountered an error and has stopped running jobs.
When a step encounters an error and the auto-start next step condition is disabled, the sequence also transitions to an ERROR status.
When a sequence with an ERROR status is configured with a repeat start condition, the sequence starts again at the next scheduled time.

INACTIVE
Sequence is waiting for the start condition to be met.
If a sequence does not have a defined start condition, the sequence remains inactive indefinitely unless you manually run the sequence.
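
The statuses above, combined with the rules stated elsewhere in this document, can be summarized in a short sketch. The enum and helper functions are hypothetical and are not part of any Control Hub API. They only restate three rules from the text: a repeat start condition starts a sequence that is in an INACTIVE or ERROR status, only sequences in an INACTIVE or ERROR status can be disabled, and a sequence that is actively running jobs cannot be edited or deleted.

```python
from enum import Enum


class SequenceStatus(Enum):
    ACTIVE = "ACTIVE"
    DISABLED = "DISABLED"
    ERROR = "ERROR"
    INACTIVE = "INACTIVE"


def starts_on_schedule(status):
    """A met repeat start condition starts an INACTIVE or ERROR sequence."""
    return status in (SequenceStatus.INACTIVE, SequenceStatus.ERROR)


def can_disable(status):
    """Only INACTIVE or ERROR sequences can be disabled."""
    return status in (SequenceStatus.INACTIVE, SequenceStatus.ERROR)


def can_edit_or_delete(status):
    """A sequence that is actively running jobs cannot be edited or deleted."""
    return status is not SequenceStatus.ACTIVE


for status in SequenceStatus:
    print(
        status.value,
        starts_on_schedule(status),
        can_disable(status),
        can_edit_or_delete(status),
    )
```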

Step Status

When you view the details of a sequence, you can monitor the status of each step within the sequence. The step status is simply a summary of the running job included in the step. The step status is not the same as the job status.

To monitor the details of a running job and view the job status, click the job name from the sequence details.

The following list describes each step status:

INACTIVE
Step is waiting to be started.

ERROR
Job included in the step has encountered an error.

RUNNING
Step is running a job.

Viewing Sequence History

You can view the global history for all steps in the sequence, or you can view the history for a single step. The history includes log messages for the last run of the sequence.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. To view the global history for all steps in the sequence, click the More icon to the right of the start condition, and then click View Global History.
  4. To view the history for a single step, click the More icon for a step and then click View History.

    Both the global and single step history display in the right side panel.

Viewing Sequence Errors

You can view the global errors encountered by all steps in the sequence, or you can view the errors encountered by a single step. Control Hub displays the errors for the last run of the sequence.

To view more detailed errors for a job run, monitor the job from the Job Instances view.

  1. In the Navigation panel, click Run > Sequences.
  2. Click the name of a sequence to display the sequence details.
  3. To view the global errors encountered by all steps in the sequence, click the More icon to the right of the start condition, and then click View Global Errors.
  4. To view the errors encountered by a single step, click the More icon for a step and then click View Errors.

    Both the global and single step errors display in the right side panel.

Troubleshooting

Use the following tips for help with sequence management:

I’ve enabled my sequence, but it remains in an INACTIVE status indefinitely.

After you enable a sequence, the sequence remains in an INACTIVE status until the start condition is met. If the sequence does not have a start condition, then the sequence remains inactive indefinitely.

You can manually run the sequence. Alternatively, edit the sequence to add a start condition.

My sequence runs one job continuously and does not start the job in the next step.

The job is likely a streaming job, which includes a pipeline that maintains a connection to the origin system and processes data as it becomes available. The pipeline runs continuously until you manually stop it.

If you add a streaming job to a sequence and want the sequence to start the jobs in the next step, manually stop the streaming job in the Job Instances view.

Alternatively, you can redesign the pipeline so that the pipeline stops after processing all available data. For example, for Data Collector pipelines, you can use the Pipeline Finisher executor to stop the pipeline when all data is processed.