Glossary of Terms
- batch
- A set of records that passes through a pipeline. Data Collector processes data in batches.
- CDC-enabled origin
- An origin that can process changed data and place CRUD operation information in the sdc.operation.type record header attribute.
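For reference, the sdc.operation.type attribute stores an integer code for the CRUD operation. The codes below follow the Data Collector documentation; verify them against your Data Collector version:
```
sdc.operation.type = 1   (INSERT)
sdc.operation.type = 2   (DELETE)
sdc.operation.type = 3   (UPDATE)
sdc.operation.type = 4   (UPSERT)
```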
- cluster execution mode
- Pipeline execution mode that allows you to process large volumes of data from Kafka or HDFS.
- cluster pipeline, cluster mode pipeline
- A pipeline configured to run in cluster execution mode.
- control character
- A non-printing character in a character set, such as the acknowledgement or escape characters.
- CRUD-enabled stage
- A processor or destination that can use the CRUD operation written in the sdc.operation.type header attribute to write changed data.
- data alerts
- Alerts based on rules that gather information about the data that passes between two stages.
- Data Collector configuration file (sdc.properties)
- Configuration file with most Data Collector properties. Found in the following location: $SDC_CONF/sdc.properties
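A few representative entries, for illustration only (property names and defaults vary by Data Collector version; check the file shipped with your installation):
```
# Excerpt from $SDC_CONF/sdc.properties
http.port=18630
production.maxBatchSize=1000
```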
- Data Collector Edge (SDC Edge)
- A lightweight agent without a UI that runs pipelines in edge execution mode on edge devices.
- data drift alerts
- Alerts based on data drift functions that gather information about the structure of data that passes between two stages.
- data preview
- Preview of data as it moves through a pipeline. Use to develop and test pipelines.
- dataflow triggers
- Instructions for the pipeline to kick off asynchronous tasks in external systems in response to events that occur in the pipeline. For more information, see Dataflow Triggers Overview.
- delivery guarantee
- Pipeline property that determines how Data Collector handles data when the pipeline stops unexpectedly: at least once or at most once.
- destination
- A stage type used in a pipeline to represent where Data Collector writes processed data.
- development stages, dev stages
- Stages such as the Dev Data Generator origin and the Dev Random Error processor that enable pipeline development and testing. Not meant for use in production pipelines.
- edge pipeline, edge mode pipeline
- A pipeline that runs in edge execution mode on a Data Collector Edge (SDC Edge) installed on an edge device. Use edge pipelines to read data from the edge device or to receive data from another pipeline and then act on that data to control the edge device.
- event framework
- The event framework enables the pipeline to trigger tasks in external systems based on actions that occur in the pipeline, such as running a MapReduce job after the pipeline writes a file to HDFS. You can also use the event framework to store event information, such as when an origin starts or completes reading a file.
- event record
- A record created by an event-generating stage when a stage-related event occurs, such as when an origin starts reading a new file or a destination closes an output file.
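Event records describe the event in record header attributes. A sketch for illustration; the attribute names follow the Data Collector documentation, and the values here are made up:
```
sdc.event.type = new-file
sdc.event.version = 1
sdc.event.creation_timestamp = 1533244722756
```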
- executor
- A stage type used to perform tasks in external systems upon receiving an event record.
- explicit validation
- A semantic validation that checks all configured values for validity and verifies whether the pipeline can run as configured. Occurs when you click the Validate icon, request data preview, or start the pipeline.
- field path
- The path to a field in a record. Use to reference a field.
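For example, assuming a record with a map field named address and a list field named items (hypothetical names):
```
/address/city                        # city field nested in the address map field
/items[0]/sku                        # sku field of the first element in the items list
${record:value('/address/city')}     # referencing the field in an expression
```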
- implicit validation
- Lists missing or incomplete configuration. Occurs by default as changes are saved in the pipeline canvas.
- late directories
- Origin directories that appear after a pipeline starts.
- metric alerts
- Alerts based on stage or pipeline metrics.
- microservice pipeline
- A pipeline that creates a fine-grained service to perform a specific task.
- multithreaded pipeline
- A pipeline with an origin that generates multiple threads, enabling the processing of high volumes of data in a single pipeline.
- orchestration pipeline
- A pipeline that can schedule and perform a variety of tasks to complete an integrated workflow across the StreamSets ecosystem.
- origin
- A stage type used in a pipeline to represent the source of data.
- pipeline
- A representation of a stream of data processing.
- pipeline runner
- Used in multithreaded pipelines to run a sourceless instance of a pipeline.
- preconditions
- Conditions that a record must satisfy to enter the stage for processing. Records that don't meet all preconditions are processed based on stage error handling.
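For example, a precondition is an expression that must evaluate to true for the record to enter the stage. A sketch using the record:value() function with a hypothetical field name:
```
${record:value('/status') == 'OK'}
```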
- processors
- A stage type that performs specific processing on pipeline data.
- required fields
- A field that must exist in a record to allow the record into the stage for processing. Records that don't have all required fields are processed based on pipeline error handling.
- runtime parameters
- Parameters that you define for the pipeline and call from within that same pipeline.
- runtime properties
- Properties that you define in a file local to Data Collector and call from within a pipeline.
- runtime resources
- Values that you define in a restricted file local to Data Collector and call from within a pipeline.
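The three runtime mechanisms above are referenced with different expression-language forms. A sketch with placeholder names:
```
${JDBC_URL}                                   # runtime parameter defined in the pipeline
${runtime:conf('regionName')}                 # runtime property defined in a local file
${runtime:loadResource('creds.txt', true)}    # runtime resource loaded from a restricted file
```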
- SDC Record data format
- A data format used for Data Collector error records and an optional format to use for output records.
- SDC RPC pipelines
- A set of pipelines that use the SDC RPC destination and SDC RPC origin to pass data from one pipeline to another without writing to an intermediary system.
- sourceless pipeline instance
- An instance of the pipeline that includes all of the processors and destinations in the pipeline and represents all pipeline processing after the origin. Used in multithreaded pipelines.
- snapshot
- A set of data captured as a pipeline runs. You can step through a snapshot as you do with data preview, and you can also use a snapshot as a source for data preview.
- standalone pipeline, standalone mode pipeline
- A pipeline configured to run in the default standalone execution mode.