Glossary

authoring Data Collector
A Data Collector dedicated to pipeline design. You can design pipelines in Control Hub after selecting an available authoring Data Collector to use. The selected authoring Data Collector determines the stages and functionality that display in the pipeline canvas.

Or, you can directly log into an authoring Data Collector to design pipelines.

batch
A set of records that passes through a pipeline. Data Collector processes data in batches.
CDC-enabled origin
An origin that can process changed data and place CRUD operation information in the sdc.operation.type record header attribute.
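For example, downstream stages can read the attribute with an expression such as the following; the attribute typically holds a numeric operation code, such as 1 for INSERT, 2 for DELETE, 3 for UPDATE, or 4 for UPSERT:
${record:attribute('sdc.operation.type')}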
cluster execution mode
Pipeline execution mode that allows you to process large volumes of data from Kafka or HDFS.
cluster pipeline, cluster mode pipeline
A pipeline configured to run in cluster execution mode.
connection
An object that defines the information required to connect to an external system. Create a connection once and then reuse that connection for multiple pipeline stages.
control character
A non-printing character in a character set, such as the acknowledgement or escape characters.
Control Hub controlled pipeline

A pipeline that is managed by Control Hub and run remotely on execution engines. Control Hub controlled pipelines include published and system pipelines run from jobs.

CRUD-enabled stage
A processor or destination that can use the CRUD operation written in the sdc.operation.type header attribute to write changed data.
data alerts
Alerts based on rules that gather information about the data that passes between two stages.
Data Collector configuration file (sdc.properties)
Configuration file with most Data Collector properties. Found in the following location:
$SDC_CONF/sdc.properties
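For illustration only, entries in this file resemble the following; actual property names and values depend on your installation:
http.port=18630
https.port=-1
production.maxBatchSize=1000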
data delivery report
A report that presents data ingestion metrics for a given job or topology.
data drift alerts
Alerts based on data drift functions that gather information about the structure of data that passes between two stages.
data preview
Preview of data as it moves through a pipeline. Use to develop and test pipelines.
data SLA
A service level agreement that defines the data processing rates that jobs within a topology must meet.
dataflow triggers
Instructions for the pipeline to kick off asynchronous tasks in external systems in response to events that occur in the pipeline. For more information, see Dataflow Triggers Overview.
delivery guarantee
Pipeline property that determines how Data Collector handles data when the pipeline stops unexpectedly.
deployment

A logical grouping of Data Collector containers deployed by a Provisioning Agent to a container orchestration system, such as Kubernetes. All Data Collector containers in a deployment are identical and highly available.

destination
A stage type used in a pipeline to represent where the engine writes processed data.
development stages, dev stages
Stages such as the Dev Raw Data Source origin and the Dev Identity and Dev Random Error processors that enable pipeline development and testing. Not meant for use in production pipelines.
event framework

The event framework enables the pipeline to trigger tasks in external systems based on actions that occur in the pipeline, such as running a MapReduce job after the pipeline writes a file to HDFS. You can also use the event framework to store event information, such as when an origin starts or completes reading a file.

event record
A record created by an event-generating stage when a stage-related event occurs, like when an origin starts reading a new file or a destination closes an output file.
execution Data Collector
An execution engine that runs pipelines that can read from and write to a large number of heterogeneous origins and destinations. These pipelines perform lightweight data transformations that are applied to individual records or a single batch.
executor
A Data Collector stage type used to perform tasks in external systems upon receiving an event record.
explicit validation
A semantic validation that checks all configured values for validity and verifies whether the pipeline can run as configured. Occurs when you click the Validate icon, request data preview, or start the pipeline.
field path
The path to a field in a record. Use to reference a field.
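For example, field paths identify fields in a hierarchical record, as in these illustrative paths (the field names are hypothetical):
/FirstName
/Address/City
/Transactions[0]/Amount
Field names that include special characters can be quoted, as in /'Zip Code'.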
implicit validation
Lists missing or incomplete configuration. Occurs by default as your changes are saved in the pipeline canvas.
job

The execution of a dataflow. A job defines the pipeline to run and the engines that run the pipeline.

job template
A job definition that lets you run multiple job instances with different runtime parameter values. When creating a job, you can enable the job to work as a job template if the job includes a pipeline that uses runtime parameters.
label

A grouping of execution engines registered with Control Hub. You assign labels to each execution engine, using the same label for the same type of execution engine that you want to function as a group. When you create a job, you assign labels to the job so that Control Hub knows on which group of execution engines the job should start.

late directories
Origin directories that appear after a pipeline starts.
local pipeline

A pipeline that is managed by Data Collector or Transformer and run locally on that Data Collector or Transformer. Use a Data Collector or Transformer to start, stop, and monitor local pipelines.

metric alerts
Monitoring or email alerts based on stage or pipeline metrics.
microservice pipeline
A Data Collector pipeline that creates a fine-grained service to perform a specific task.
multithreaded pipelines
A pipeline with an origin that generates multiple threads, enabling the processing of high volumes of data in a single pipeline on one Data Collector.
organization

A secure space provided to a set of users. All engines, pipelines, jobs, and other objects added by any user in the organization belong to that organization. A user logs in to Control Hub as a member of an organization and can only access data that belongs to that organization.

organization administrator

A user account that has the Organization Administrator role, allowing the user to perform administrative tasks for the organization.

origin
A stage type used in a pipeline to represent the source of data.
pipeline
A representation of a stream of data that is processed by an engine.
Pipeline Designer
The Control Hub pipeline development tool. Use Pipeline Designer to design and publish Data Collector and Transformer pipelines and fragments.
pipeline fragment

A stage or set of connected stages that you can reuse in pipelines. Use pipeline fragments to easily add the same processing logic to multiple pipelines.

pipeline label

A label that enables grouping similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the Pipelines or Fragments views.

pipeline repository

The Control Hub repository that stores all pipelines and fragments designed in the Control Hub Pipeline Designer and all pipelines published or imported from an authoring Data Collector. The pipeline repository maintains a version history of all published and imported pipelines and fragments.

pipeline runner
Used in multithreaded pipelines to run a sourceless instance of a pipeline.
pipeline tag

A pointer to a specific pipeline commit or version.

preconditions
Conditions that a record must satisfy to enter the Data Collector stage for processing. Records that don't meet all preconditions are processed based on stage error handling.
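For example, a precondition might be an expression such as the following, which passes only records whose /status field (a field name used here for illustration) equals ACTIVE:
${record:value('/status') == 'ACTIVE'}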
processors
A stage type that performs specific processing on pipeline data.
Provisioning Agent

A containerized application that runs in a container orchestration framework, such as Kubernetes. The agent communicates with Control Hub to automatically provision Data Collector containers in the Kubernetes cluster in which it runs. Provisioning includes deploying, registering, starting, scaling, and stopping the Data Collector containers.

published pipeline

A pipeline that has a completed design, has been checked in, and is available to be added to a job.

required fields
A field that must exist in a record for the record to enter the Data Collector stage for processing. Records that don't include all required fields are processed based on pipeline error handling.
RPC ID
A user-defined identifier configured in the SDC RPC origin and destination to allow the destination to write to the origin.
runtime parameters
Parameters that you define for the pipeline and call from within that same pipeline. Use to specify values for pipeline properties when you start the pipeline.
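For example, if a pipeline defines a runtime parameter named JDBCConnectionString (a parameter name chosen here for illustration), a stage property can call it as follows, and you can pass a different value each time you start the pipeline:
${JDBCConnectionString}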
runtime properties
Properties that you define in a file local to the engine and call from within a pipeline. Use to define different sets of values for different engine instances.
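For illustration, when runtime properties are defined directly in the Data Collector configuration file, they typically use the runtime.conf_ prefix, such as runtime.conf_HDFSDirTemplate=/sdc/data, and a pipeline calls the property without the prefix:
${runtime:conf('HDFSDirTemplate')}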
runtime resources
Values that you define in a restricted file local to the engine and call from within a pipeline. Use to load sensitive information from files at runtime.
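For example, a sensitive value stored in a file in the engine resources directory ($SDC_RESOURCES for Data Collector; the file name here is hypothetical) can be loaded with:
${runtime:loadResource('jdbc-password.txt', true)}
Setting the second argument to true requires the file to have restricted permissions.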
scheduled task
A long-running task that periodically triggers an action on other Control Hub tasks at the specified frequency. For example, a scheduled task can start or stop a job or generate a data delivery report on a weekly or monthly basis.
SDC Record data format
A data format used for Data Collector error records and an optional format to use for output records.
SDC RPC pipelines
A set of pipelines that use the SDC RPC destination and SDC RPC origin to pass data from one pipeline to another without writing to an intermediary system.
sourceless pipeline instance
An instance of the pipeline that includes all of the processors and destinations in the pipeline and represents all pipeline processing after the origin. Used in multithreaded pipelines.
standalone pipeline, standalone mode pipeline
A pipeline configured to run in the default standalone execution mode.
subscription
An object that listens for Control Hub events and then completes an action when those events occur.
system Data Collector
An authoring Data Collector provided with Control Hub to use in Pipeline Designer for exploration and light development. The system Data Collector cannot be used for data preview or explicit pipeline validation. Administrators can enable or disable the system Data Collector as an authoring Data Collector for users in an organization.
system pipeline
A pipeline that Control Hub automatically creates when the published pipeline included in a job is configured to aggregate statistics. System pipelines collect, aggregate, and push metrics for all of the remote pipeline instances to Control Hub.
tag

An identifier used to easily search and filter objects in one of the Control Hub views. You can add tags to most Control Hub objects including environments, deployments, and connections. For example, you might want to filter connections in the Connections view by the origin system or by the test or production project.

topology
An interactive end-to-end view of data as it traverses multiple pipelines that work together. You can map all data flow activities that serve the needs of one business function in a single topology.
Transformer
An execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform heavyweight transformations such as joins, aggregates, and sorts on the entire data set.
Transformer configuration file (transformer.properties)
Configuration file with most Transformer properties. Found in the following location:
$TRANSFORMER_CONF/transformer.properties