Glossary

CDC-enabled origin
An origin that can process changed data and place CRUD operation information in the sdc.operation.type record header attribute.
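For illustration only, the following Python sketch shows how a downstream consumer might interpret the numeric codes in the sdc.operation.type header attribute. The Record class is a hypothetical stand-in for an engine record; the 1 = INSERT, 2 = DELETE, 3 = UPDATE, 4 = UPSERT mapping reflects the codes documented for Data Collector, and any other code is simply treated as unsupported here.

    # Illustrative sketch: interpreting the sdc.operation.type header attribute.
    # The Record class is hypothetical; only the numeric CRUD codes
    # (1=INSERT, 2=DELETE, 3=UPDATE, 4=UPSERT) come from the Data Collector docs.
    from dataclasses import dataclass, field

    CRUD_OPERATIONS = {1: "INSERT", 2: "DELETE", 3: "UPDATE", 4: "UPSERT"}

    @dataclass
    class Record:
        headers: dict = field(default_factory=dict)
        value: dict = field(default_factory=dict)

    def crud_operation(record: Record) -> str:
        """Return the CRUD operation named by the record's header attribute."""
        code = int(record.headers.get("sdc.operation.type", 1))
        return CRUD_OPERATIONS.get(code, "UNSUPPORTED")

    record = Record(headers={"sdc.operation.type": "3"}, value={"id": 42, "status": "shipped"})
    print(crud_operation(record))  # UPDATE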
connection
An object that defines the information required to connect to an external system. Create a connection once and then reuse that connection for multiple pipeline stages.
CRUD-enabled stage
A processor or destination that can use the CRUD operation written in the sdc.operation.type header attribute to write changed data.
Data Collector engine
An engine that runs data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, CDC, or batch modes.
data SLA
A service level agreement that defines the data processing rates that jobs within a topology must meet.
data preview
Preview of data as it moves through a pipeline. Use to develop and test pipelines.
deployment
A group of identical engine instances deployed within an environment. A deployment defines the StreamSets engine type, version, and configuration to use. You can deploy and launch multiple instances of the configured engine.
destination
A stage type used in a pipeline to represent where the engine writes processed data.
development stages, dev stages
Stages such as the Dev Raw Data Source origin and the Dev Random Error processor that enable pipeline development and testing. Not meant for use in production pipelines.
draft run
The execution of a draft pipeline, used for development purposes only. While editing a pipeline in the pipeline canvas, you can start a draft run to quickly test the pipeline logic. A draft run can run for as long as you like, so you can monitor it over the course of hours or days before you publish the pipeline for use in a production job.
engine
A component of the StreamSets platform that resides in your corporate network, which can be on-premises or on a protected cloud computing platform. An engine runs pipelines headlessly, without a UI. StreamSets provides two data plane engines, Data Collector and Transformer, which are deployed independently but managed together in Control Hub.
environment
A representation of the resources needed to run StreamSets engines. An environment defines where to deploy the engines.
event record
A record created by an event-generating stage when a stage-related event occurs, such as when an origin starts reading a new file or a destination closes an output file.
executor
A Data Collector stage type used to perform tasks in external systems upon receiving an event record.
explicit validation
A semantic validation that checks all configured values for validity and verifies whether the pipeline can run as configured. Occurs when you click the Validate icon, request data preview, or start the pipeline.
external resources
External files and libraries that an engine requires to run pipelines. For example, JDBC stages require a JDBC driver to access the database. When you use a JDBC stage, you must make the driver available as an external resource.
field path
The path to a field in a record. Use to reference a field.
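For example, /order/items[0]/sku points to the sku field of the first element in the items list field. The sketch below is a simplified, illustrative resolver for that slash-and-index style of path against plain Python structures; it is not the engine's field-path parser and ignores features such as quoted field names and wildcards.

    # Illustrative sketch: resolving a simplified field path such as
    # "/order/items[0]/sku" against nested dicts and lists.
    # Not Data Collector's parser; quoted names, escapes, and wildcards are ignored.
    import re

    TOKEN = re.compile(r"([^/\[\]]+)|\[(\d+)\]")

    def resolve(data, path):
        current = data
        for name, index in TOKEN.findall(path):
            # A name token steps into a map field; an index token steps into a list field.
            current = current[name] if name else current[int(index)]
        return current

    record_value = {"order": {"items": [{"sku": "A-100", "qty": 2}]}}
    print(resolve(record_value, "/order/items[0]/sku"))  # A-100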
implicit validation
Validation that lists missing or incomplete configuration. Occurs by default as your changes are saved in the pipeline canvas.
job instance
The execution of a pipeline. You can create a job instance from a pipeline or from a job template. When you create a job instance from a pipeline, you configure all of the job details. When you create a job instance from a job template, you configure pipeline parameter values only since other job details are already defined.
job template
A definition of a job that you can use to create and start multiple job instances. A job template defines the pipeline to run, the StreamSets engine that runs the pipeline, and advanced job runtime details. Job templates allow data engineers to define the job details, while analysts can start job instances from the templates by specifying pipeline parameter values only.
label
A means of grouping deployments, engines, and jobs. Assign labels to deployments and engines so that they function as a group. When you create a job, you assign labels to the job so that Control Hub knows on which group of engines the job should start.
microservice pipeline
A Data Collector pipeline that creates a fine-grained service to perform a specific task.
multithreaded pipeline
A pipeline with an origin that generates multiple threads, enabling the processing of high volumes of data in a single pipeline on one Data Collector.
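As a conceptual sketch only (not the engine's implementation), the following Python snippet shows the idea behind multithreaded processing: several threads each push their own batches through the same processing logic in parallel. The batches and the process_batch function are hypothetical stand-ins.

    # Conceptual sketch: several threads processing independent batches
    # through the same pipeline logic, as a multithreaded origin enables.
    # The batch data and process_batch function are illustrative only.
    from concurrent.futures import ThreadPoolExecutor

    def process_batch(batch):
        # Stand-in for the pipeline's processors and destinations.
        return [record.upper() for record in batch]

    batches = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_batch, batches))
    print(results)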
organization
A secure space provided to a set of users. All engines, pipelines, jobs, and other objects added by any user in the organization belong to that organization. A user logs in to Control Hub as a member of an organization and can only access data that belongs to that organization.
organization administrator
A user account that has the Organization Administrator role, allowing the user to perform administrative tasks for the organization.
origin
A stage type used in a pipeline to represent the source of data.
pipeline
A representation of a stream of data that is processed by an engine.
pipeline fragment
A stage or set of connected stages that you can reuse in pipelines. Use pipeline fragments to easily add the same processing logic to multiple pipelines.
pipeline label
A label that enables grouping similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the Pipelines or Fragments views.
pipeline tag
A pointer to a specific pipeline commit or version.
preconditions
Conditions that a record must satisfy to enter a Data Collector stage for processing. Records that don't meet all preconditions are processed based on stage error handling.
processor
A stage type that performs specific processing on pipeline data.
published pipeline
A pipeline that has a completed design, has been checked in, and is available to be added to a job.
required fields
A field that must exist in a record to allow the record into a Data Collector stage for processing. Records that don't include all required fields are processed based on pipeline error handling.
runtime parameters
Parameters that you define for the pipeline and call from within that same pipeline. Use to specify values for pipeline properties when you start the pipeline.
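A runtime parameter is called with the ${parameterName} syntax; for example, a directory property might be set to ${InputDir}. The sketch below only illustrates the idea of that substitution using a plain Python string template; the parameter names are hypothetical, and this is not how the engine evaluates its expression language.

    # Illustrative sketch: how ${parameter} placeholders in pipeline properties
    # are conceptually replaced by runtime parameter values at start time.
    # Uses Python's string.Template; not the engine's expression language.
    from string import Template

    stage_config = {"files_directory": "${InputDir}", "file_pattern": "${FilePattern}"}
    runtime_parameters = {"InputDir": "/data/incoming", "FilePattern": "*.csv"}

    resolved = {
        key: Template(value).substitute(runtime_parameters)
        for key, value in stage_config.items()
    }
    print(resolved)  # {'files_directory': '/data/incoming', 'file_pattern': '*.csv'}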
runtime properties
Properties that you define in a file local to the engine and call from within a pipeline. Use to define different sets of values for different engine instances.
runtime resources
Values that you define in a restricted file local to the engine and call from within a pipeline. Use to load sensitive information from files at runtime.
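As an illustration under assumed conventions (the file location and the owner-only permission check are not prescribed by the engine), the sketch below shows what loading a sensitive value from a restricted local file at runtime might involve: read the file only when the value is needed and refuse to use it unless access is limited to the file owner.

    # Illustrative sketch: loading a sensitive value from a restricted local file
    # at runtime. The path and the owner-only permission check are assumptions
    # for illustration, not Data Collector behavior.
    import os
    import stat

    def load_restricted_resource(path: str) -> str:
        mode = os.stat(path).st_mode
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            raise PermissionError(f"{path} must be accessible by the owner only")
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()

    # Example call with a hypothetical path:
    # db_password = load_restricted_resource("/path/to/resources/db-password.txt")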
scheduled task
A long-running task that periodically triggers an action on other Control Hub objects, such as jobs, at the specified frequency. For example, a scheduled task can start or stop a job on a weekly or monthly basis.
subscription
An object that listens for Control Hub events and then completes an action when those events occur.
tag
An identifier used to group Control Hub objects such as environments, deployments, and connections. Use tags to easily search and filter objects in one of the Control Hub views.
topology
An interactive end-to-end view of data as it traverses multiple pipelines that work together. You can map all data flow activities that serve the needs of one business function in a single topology.
Transformer engine
An engine that runs data processing pipelines on Apache Spark. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform set-based transformations such as joins, aggregates, and sorts on the entire data set.