Pipeline Maintenance
Understanding Pipeline States
A pipeline state is the current condition of the pipeline, such as "running" or "stopped". The pipeline state displays in the All Pipelines list. The state of a pipeline can also appear in the Data Collector log.
- EDITED - The pipeline has been created or modified, and has not run since the last modification.
- FINISHED - The pipeline has completed all expected processing and has stopped running.
- RUN_ERROR - The pipeline encountered an error while running and stopped.
- RUNNING - The pipeline is running.
- STOPPED - The pipeline was manually stopped.
- START_ERROR - The pipeline encountered an error while starting and failed to start.
- STOP_ERROR - The pipeline encountered an error while stopping.
- CONNECT_ERROR - When running a cluster-mode pipeline, Data Collector cannot connect to the underlying cluster manager.
- CONNECTING - The pipeline is preparing to restart after a Data Collector restart.
- DISCONNECTED - The pipeline is disconnected from external systems, typically because Data Collector is restarting or shutting down.
- DISCONNECTING - The pipeline is in the process of disconnecting from external systems, typically because Data Collector is restarting or shutting down.
- FINISHING - The pipeline is in the process of finishing all expected processing.
- RETRY - The pipeline is trying to run after encountering an error while running. This occurs only when the pipeline is configured for a retry upon error.
- RUNNING_ERROR - The pipeline encounters errors while running.
- STARTING - The pipeline is initializing, but hasn't started yet.
- STARTING_ERROR - The pipeline encounters errors while starting.
- STOPPING - The pipeline is in the process of stopping after a manual request to stop.
- STOPPING_ERROR - The pipeline encounters errors while stopping.
State Transition Examples
- Starting a pipeline - When you successfully start a pipeline for the first time, the pipeline transitions through the following states: (EDITED)... STARTING... RUNNING
- Stopping or restarting Data Collector - When Data Collector shuts down, running pipelines transition through the following states: (RUNNING)... DISCONNECTING... DISCONNECTED
- Retrying a pipeline - When a pipeline is configured to retry upon error, Data Collector performs the specified number of retries when the pipeline encounters errors while running.
- Stopping a pipeline - When you successfully stop a pipeline, the pipeline transitions through the following states: (RUNNING)... STOPPING... STOPPED
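Transient states such as STARTING or STOPPING can pass quickly, so observing transitions is easiest with a script that polls the pipeline status. The following is a minimal Python sketch, assuming a Data Collector at localhost:18630 with default admin/admin credentials and a /rest/v1/pipeline/{pipelineId}/status endpoint; the endpoint path, the response field names, and the pipeline ID shown are assumptions that may vary by Data Collector version.

```python
import time
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

def watch_states(pipeline_id, interval=1.0):
    """Poll the assumed status endpoint and print each state transition."""
    last_state = None
    while True:
        resp = requests.get(
            f"{SDC_URL}/rest/v1/pipeline/{pipeline_id}/status", auth=AUTH)
        resp.raise_for_status()
        state = resp.json()["status"]
        if state != last_state:
            print(f"{time.strftime('%H:%M:%S')}  {last_state} -> {state}")
            last_state = state
        # Stop watching once the pipeline reaches a final state.
        if state in ("FINISHED", "STOPPED", "RUN_ERROR",
                     "START_ERROR", "STOP_ERROR"):
            break
        time.sleep(interval)

watch_states(PIPELINE_ID)
```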
Starting Pipelines
You can start Data Collector pipelines when they are valid. When you start a pipeline, Data Collector runs the pipeline until you stop the pipeline or shut down Data Collector.
For most origins, when you restart a pipeline, Data Collector starts the pipeline from where it last stopped by default. You can reset the origin to read all available data.
A Kafka Consumer origin starts processing data based on the offset stored in Kafka or ZooKeeper.
You can start pipelines from the following locations:
- From the Home page, select pipelines in the list and then click the Start icon.
- From the pipeline canvas, click the Start icon.
If the Start icon is not enabled, the pipeline is not valid.
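If you need to start pipelines outside the UI, Data Collector also exposes a REST API. The following is a minimal sketch, assuming a Data Collector at localhost:18630, default admin/admin credentials, a hypothetical pipeline ID, and a /rest/v1/pipeline/{pipelineId}/start endpoint; verify the endpoint against the REST API documentation for your Data Collector version.

```python
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

# Data Collector expects an X-Requested-By header on state-changing requests.
resp = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/start",
    auth=AUTH,
    headers={"X-Requested-By": "sdc"},
)
resp.raise_for_status()
print(resp.json().get("status"))      # e.g. STARTING
```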
Starting Pipelines with Parameters
If you defined runtime parameters for a pipeline, you can specify the parameter values to use when you start the pipeline.
For more information, see Runtime Parameters.
- From the pipeline canvas, click the More icon, and then click Start with Parameters.
If Start with Parameters is not enabled, the pipeline is not valid.
The Start with Parameters dialog box lists all parameters defined for the pipeline and their default values.
- Override any default values with the values you want to use for this pipeline run.
- Click Start.
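When starting through the REST API instead of the UI, parameter values can be supplied as a JSON object in the request body. A sketch under the same assumptions as the earlier REST example; the parameter names shown are hypothetical, and passing overrides in the start request body is an assumption to verify against your version:

```python
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

# Hypothetical runtime parameters defined for the pipeline; these values
# override the defaults for this run only.
runtime_parameters = {
    "BATCH_SIZE": 500,
    "ORIGIN_DIR": "/data/incoming",
}

resp = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/start",
    auth=AUTH,
    headers={"X-Requested-By": "sdc"},
    json=runtime_parameters,          # assumed request-body format
)
resp.raise_for_status()
```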
Resetting the Origin
You can reset the origin for the following origin stages:
- Amazon S3
- Aurora PostgreSQL CDC Client
- Azure Blob Storage
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure Data Lake Storage Gen2 (Legacy)
- Directory
- Elasticsearch
- File Tail
- Google Cloud Storage
- Groovy Scripting
- Hadoop FS Standalone
- HTTP Client
- JavaScript Scripting
- JDBC Multitable Consumer
- JDBC Query Consumer
- Jython Scripting
- Kinesis Consumer
- MapR DB JSON
- MapR FS Standalone
- MongoDB
- MongoDB Atlas
- MongoDB Atlas CDC
- MongoDB Oplog
- MySQL Binary Log
- Oracle CDC
- Oracle CDC Client
- Oracle Multitable Consumer
- PostgreSQL CDC Client
- Salesforce
- Salesforce Bulk API 2.0
- SAP HANA Query Consumer
- SFTP/FTP/FTPS Client
- SQL Server 2019 BDC Multitable Consumer
- SQL Server CDC Client
- SQL Server Change Tracking
- Teradata Consumer
- Windows Event Log
For these origins, when you stop the pipeline, the Data Collector notes where it stopped processing data. When you restart the pipeline, it continues from where it left off by default. When you want the Data Collector to process all available data instead of continuing from where it stopped, reset the origin. For unique details about resetting the Kinesis Consumer origin, see Resetting the Kinesis Consumer Origin.
You can configure the Kafka and MapR Streams Consumer origins to process all available data by specifying an additional Kafka configuration property. You can reset the Azure IoT/Event Hub Consumer origin by deleting offset details in the Microsoft Azure portal. The remaining origin stages process transient data where resetting the origin has no effect.
You can reset the origin for multiple pipelines at the same time from the Home page. Or, you can reset the origin for a single pipeline from the pipeline canvas.
To reset the origin:
- Select multiple pipelines from the Home page, or view a single pipeline in the pipeline canvas.
- Click the More icon, and then click Reset Origin.
- In the Reset Origin Confirmation dialog box, click Yes to reset the origin.
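The origin reset can also be scripted. A minimal sketch, assuming a /rest/v1/pipeline/{pipelineId}/resetOffset endpoint and the same Data Collector URL and credentials as the earlier examples; the pipeline must be stopped before you reset the origin:

```python
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

# Reset the stored offset so the next run reads all available data.
resp = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/resetOffset",
    auth=AUTH,
    headers={"X-Requested-By": "sdc"},
)
resp.raise_for_status()
```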
Stopping Pipelines
Stop pipelines when you want Data Collector to stop processing data for the pipelines.
When stopping a pipeline, Data Collector waits for the pipeline to gracefully complete all tasks for the in-progress batch. In some situations, this can take several minutes.
For example, if a scripting processor includes code with a timed wait, Data Collector waits for the scripting processor to complete its task. Then, Data Collector waits for the rest of the pipeline to complete all tasks before stopping the pipeline.
When Data Collector runs a pipeline, the pipeline displays in the Data Collector UI in Monitor mode by default.
- From the Home page, select the pipelines in the list and then click the Stop icon. Or to stop a pipeline in the pipeline canvas, click the Stop icon.
The Stop Pipeline Confirmation dialog box appears.
- To stop the pipelines, click Yes.
Depending on the pipeline complexity, the pipeline might take some time to stop.
When a pipeline remains in a Stopping state for an unexpectedly long period of time, you can force the pipeline to stop.
Forcing a Pipeline to Stop
When necessary, you can force Data Collector to stop a pipeline.
When forcing a pipeline to stop, Data Collector often stops processes before they complete, which can lead to unexpected results.
- To force a pipeline to stop from the Home page, click the More icon for the pipeline, and then click Force Stop. Or to force a pipeline to stop from the pipeline canvas, click Force Stop.
The Force Stop Pipeline Confirmation dialog box appears.
- To force the pipelines to stop, click Yes.
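When scripting a shutdown, the same pattern applies: request a graceful stop, wait for the STOPPED state, and force the stop only after a timeout. A sketch assuming /rest/v1/pipeline/{pipelineId}/stop and /rest/v1/pipeline/{pipelineId}/forceStop endpoints, which should be verified against your Data Collector version:

```python
import time
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
HEADERS = {"X-Requested-By": "sdc"}
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

def stop_pipeline(pipeline_id, timeout=300):
    """Request a graceful stop; force the stop if still running after timeout."""
    requests.post(f"{SDC_URL}/rest/v1/pipeline/{pipeline_id}/stop",
                  auth=AUTH, headers=HEADERS).raise_for_status()
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{SDC_URL}/rest/v1/pipeline/{pipeline_id}/status",
                            auth=AUTH, headers=HEADERS)
        if resp.json()["status"] == "STOPPED":
            return
        time.sleep(5)
    # The graceful stop timed out. Forcing a stop can interrupt in-progress
    # work and lead to unexpected results, so it is a last resort.
    requests.post(f"{SDC_URL}/rest/v1/pipeline/{pipeline_id}/forceStop",
                  auth=AUTH, headers=HEADERS).raise_for_status()

stop_pipeline(PIPELINE_ID)
```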
Importing Pipelines
Import pipelines to use pipelines developed on a different Data Collector or to restore backup files.
You can import pipelines that were developed on the same version of Data Collector or on an earlier version of Data Collector. Data Collector does not support importing a pipeline developed on a later version of Data Collector.
You can import pipelines from individual pipeline files, from a ZIP file containing multiple pipeline files, or from an external HTTP URL. Pipeline files are JSON files exported from a Data Collector.
Importing a Pipeline
- To import a single pipeline, from the Home page, click the Import Pipeline icon.
- In the Import Pipeline dialog box, enter a pipeline title and optional description.
- Browse and select the pipeline file, and then click Open.
- Click Import.
Importing a Set of Pipelines from an Archive File
You can import a set of pipelines from a ZIP file that contains multiple pipeline JSON files. When you import a set of pipelines, Data Collector imports the pipelines with their existing names. If necessary, you can rename the pipelines after the import.
- To import a set of pipelines, from the Home page, click the Import Pipelines from Archive icon.
- In the Import Pipelines from Archive dialog box, browse and select the ZIP file that contains the pipeline files, and then click Open.
- To import all pipelines in the file, click Import.
Importing a Pipeline from an HTTP URL
When you import a pipeline from an HTTP URL, you can rename the pipeline during the import.
- To import a pipeline from an HTTP URL, from the Home page, click the Import Pipeline from HTTP URL icon.
- In the Import Pipeline from HTTP URL dialog box, enter a pipeline title and optional description.
- Enter the HTTP URL for the pipeline.
For example, to import the sample edge pipeline named MQTT to HTTP included in the StreamSets GitHub repository, enter the following URL:
https://raw.githubusercontent.com/streamsets/datacollector-edge/master/resources/samplePipelines/mqttToHttp/pipeline.json
- Click Import.
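The same import can be scripted by downloading the pipeline JSON and posting it to Data Collector. A sketch assuming an import endpoint of the form /rest/v1/pipeline/{title}/import with an autoGeneratePipelineId query parameter; both are assumptions to verify against your version's REST API:

```python
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials

# Sample edge pipeline from the StreamSets GitHub repository.
PIPELINE_URL = ("https://raw.githubusercontent.com/streamsets/datacollector-edge/"
                "master/resources/samplePipelines/mqttToHttp/pipeline.json")

pipeline_json = requests.get(PIPELINE_URL).json()

# Import the downloaded definition under a new title; requests percent-encodes
# the spaces in the URL path automatically.
resp = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/MQTT to HTTP/import",
    params={"autoGeneratePipelineId": "true"},
    auth=AUTH,
    headers={"X-Requested-By": "sdc"},
    json=pipeline_json,
)
resp.raise_for_status()
```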
Sharing Pipelines
When you create a pipeline, you become the owner of the pipeline. As the owner of a pipeline, you have all permissions for the pipeline, you can configure pipeline sharing, and you can change the owner of the pipeline. A pipeline can have a single user as the owner.
Like the pipeline owner, a user with the Admin role also has all permissions for all pipelines, can configure pipeline sharing, and can change the pipeline owner.
By default, all other users have no access to pipelines. To allow other users to work with a pipeline, you must share the pipeline with the users or their groups, and configure pipeline permissions.
| Permission | Description |
| --- | --- |
| Read | View and monitor the pipeline, and see alerts. View existing snapshot data. |
| Write | Edit the pipeline and alerts. |
| Execute | Start and stop the pipeline. Preview data and take a snapshot. |
When someone shares a pipeline with you, it displays in the pipeline library under the Shared With Me label.
For more information about roles and permissions, see Roles and Permissions.
Sharing a Pipeline
You can share a pipeline if you are the owner of the pipeline or a user with the Admin role.
You can configure pipeline sharing at any time, but pipeline permissions are enforced only when Data Collector is enabled to use pipeline access control. If you configure sharing while access control is disabled, the configuration takes effect when access control is enabled.
- You can share a pipeline from either of the following locations:
- From the Home page, select the pipeline, click the More icon, and click Share.
- From the pipeline canvas, click the Share icon.
- In the Sharing Settings dialog box, click in the Select Users and Groups window, select the users and groups that you want to share with, and click Add.
- Configure the permissions that you want each user and group to have and click Save.
Changing the Pipeline Owner
The pipeline owner has all permissions for the pipeline and can configure sharing for other users and groups. There can only be one pipeline owner.
- You can configure pipeline permissions from the following locations:
- From the Home page, select the pipeline, click the More icon, and click Share.
- From the pipeline canvas, click the Share icon.
- In the Sharing Settings dialog box, if necessary, add the user that you want to use as the owner.
- To select a new owner, click the More icon for the user and click Is Owner.
- Click Save to save the change.
Adding Labels to Pipelines
You can add labels to pipelines to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.
You can create a hierarchy of labels using the following format:
<label1>/<label2>/<label3>
For example, to group pipelines in the test environment by the origin system, you might add the labels Test/HDFS and Test/Elasticsearch to the appropriate pipelines.
You can add labels to pipelines from the following locations:
- From the Home page, select pipelines in the list, click the More icon, and then click Add Labels. Enter labels and then click Save.
Note: Labels that have already been added to the pipeline are ignored.
- From the pipeline canvas, click the General tab and then enter labels for the Labels property.
Exporting Pipelines
Export pipelines to create backups or to use the pipelines with another Data Collector. You can export pipelines with or without plain text credentials configured in the pipeline. You can export a single pipeline or a set of pipelines.
- From the Home page, select one or more pipelines to export.
Alternatively, to export a single pipeline, you can open the pipeline.
- Click the More icon, and choose whether to export the pipeline with or without configured plain text credentials:
- Export - Data Collector removes any configured plain text credentials from the exported pipelines.
- Export with Plain Text Credentials - Data Collector includes any configured plain text credentials in the exported pipelines.
Data Collector writes a file containing the exported pipelines to your default downloads directory:
- When you export a single pipeline, Data Collector generates a JSON file named after the pipeline, as follows: <pipeline name>.json.
- When you export a set of pipelines, Data Collector creates a ZIP file named pipelines.zip.
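Exports can also be automated, for example for scheduled backups. A minimal sketch, assuming a /rest/v1/pipeline/{pipelineId}/export endpoint that returns the pipeline definition as JSON; whether credentials are included may depend on your version and query parameters, so verify before relying on this for backups:

```python
import json
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

# Fetch the pipeline definition and save it as a backup file,
# mirroring the <pipeline name>.json naming used by the UI export.
resp = requests.get(f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/export", auth=AUTH)
resp.raise_for_status()
with open(f"{PIPELINE_ID}.json", "w") as f:
    json.dump(resp.json(), f, indent=2)
```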
Exporting Pipelines for Control Hub
If you develop pipelines in a Data Collector that is not registered with Control Hub, export valid pipelines for use in Control Hub.
If you develop pipelines in a Data Collector that is registered with Control Hub, publish the pipelines directly to Control Hub.
You can export a single pipeline or a set of pipelines. When you export pipelines for Control Hub, Data Collector exports the pipelines without plain text credentials.
- From the Home page, select one or more pipelines to export.
Alternatively, to export a single pipeline, you can open the pipeline.
- Click the More icon, and then click Export for Control Hub.
Data Collector exports the pipelines without any plain text credentials and writes a file containing the exported pipelines to your default downloads directory:
- When you export a single pipeline, Data Collector generates a JSON file named after the pipeline, as follows: <pipeline name>.json. The generated JSON file includes the definition of each stage library used in the pipeline.
- When you export a set of pipelines, Data Collector creates a ZIP file named pipelines.zip.
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
When you duplicate a pipeline, you can rename the pipeline and specify the number of copies to make.
- From the Home page, select a pipeline in the list view and then click the Duplicate icon. Or to duplicate a pipeline in the pipeline canvas, click the More icon for the pipeline and then click Duplicate.
- In the Duplicate Pipeline Definition dialog box, enter a name for the duplicate pipeline and the number of copies to make.
When you create multiple copies, Data Collector appends an integer after the pipeline name. For example, if you enter the name "test" and create two copies of the pipeline, Data Collector names the duplicate pipelines "test1" and "test2".
- Click Duplicate.
The duplicate pipelines display.
Deleting Pipelines
- From the Home page, select pipelines in the list and then click the Delete icon. Or to delete a pipeline in the pipeline canvas, click the More icon for the pipeline and then click Delete.
A confirmation window appears.
- To delete the pipelines, click Yes.
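Deletion can likewise be scripted. A sketch assuming a DELETE request on /rest/v1/pipeline/{pipelineId}; because deletion is permanent, a script should confirm the target pipeline before calling this:

```python
import requests

SDC_URL = "http://localhost:18630"    # assumed Data Collector URL
AUTH = ("admin", "admin")             # assumed default credentials
PIPELINE_ID = "MyPipelineabcd1234"    # hypothetical pipeline ID

# Permanently delete the pipeline. There is no undo.
resp = requests.delete(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}",
    auth=AUTH,
    headers={"X-Requested-By": "sdc"},
)
resp.raise_for_status()
```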