Authoring Data Collectors

You use authoring Data Collectors to design pipelines and to create connections. You install and register authoring Data Collectors just as you do execution Data Collectors.

You can design pipelines in Control Hub after selecting an available authoring Data Collector. The selected authoring Data Collector determines the stages, stage libraries, and functionality that display in Pipeline Designer. When you create connections, the selected authoring Data Collector determines the connection types that you can create.

Use an authoring Data Collector that is the same version as the execution Data Collectors that you intend to use to run the pipeline. Using a different Data Collector version can result in pipelines that are invalid for the execution Data Collectors.

For example, if the authoring Data Collector is a more recent version than the execution Data Collector, pipelines might include a stage, stage library, or stage functionality that does not exist in the execution Data Collector.
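The version rule above can be checked mechanically before a pipeline is handed off for execution. The following is a minimal sketch, not part of Control Hub: the function names are hypothetical, and it assumes versions are plain dotted numbers such as 3.19.0.

```python
def parse_version(text):
    """Turn a dotted version string such as '3.19.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pad(version, length):
    """Right-pad a version tuple with zeros so '3.19.0' and '3.19.0.0' compare equal."""
    return version + (0,) * (length - len(version))


def compare_versions(authoring, execution):
    """Return 'same', 'authoring_newer', or 'authoring_older'."""
    a, e = parse_version(authoring), parse_version(execution)
    width = max(len(a), len(e))
    a, e = pad(a, width), pad(e, width)
    if a == e:
        return "same"
    return "authoring_newer" if a > e else "authoring_older"


def risky_execution_versions(authoring, executions):
    """List the execution versions that differ from the authoring version."""
    return [v for v in executions if compare_versions(authoring, v) != "same"]
```

For example, `risky_execution_versions("3.19.0", ["3.19.0", "3.18.1"])` flags only `3.18.1`, the case where the authoring Data Collector is newer and might expose stages that the execution Data Collector lacks.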

When using Pipeline Designer, select one of the following types of Data Collectors to use as the authoring Data Collector:
System Data Collector

Control Hub includes a system Data Collector for exploration and light development. Administrators can enable or disable the system Data Collector for use as the default authoring Data Collector in Control Hub.

When you select the system Data Collector, Control Hub displays the latest version of all stage libraries available with the latest version of Data Collector.
Use the system Data Collector to design pipelines only. You cannot use it for data preview or explicit pipeline validation, or to configure a pipeline that uses connections.
Registered Data Collector
You can select a registered Data Collector that meets all of the following requirements:
  • The Data Collector is a supported version. StreamSets recommends using the latest version of Data Collector.

    The minimum supported Data Collector version is 3.0.0.0. To design pipeline fragments, the minimum supported Data Collector version is 3.2.0.0. To create and use connections, the minimum supported Data Collector version is 3.19.0.

  • The Data Collector uses the HTTPS protocol because Control Hub also uses the HTTPS protocol.
    Note: StreamSets recommends using a certificate signed by a certificate authority for a Data Collector that uses the HTTPS protocol. If you use a self-signed certificate, you must first use a browser to access the Data Collector URL and accept the browser warning about the self-signed certificate before users can select the Data Collector as the authoring Data Collector.
  • The Data Collector URL is reachable from the web browser that accesses Control Hub.
When you select a registered Data Collector, Control Hub displays the stage libraries and custom stage libraries installed in the registered Data Collector. Use a registered Data Collector to design, preview, and explicitly validate pipelines.
Tip: Use labels to clearly designate which Data Collectors are dedicated to pipeline design. For example, assign an Authoring label to the authoring Data Collectors. That way, data engineers can easily determine which Data Collectors are authoring Data Collectors when they use Control Hub.
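The minimum versions and the labeling tip can be combined into a simple filter. This is a sketch under stated assumptions: the dictionary layout, task names, and function names are hypothetical and do not reflect the Control Hub data model. Only the version thresholds come from the requirements above, and the HTTPS and reachability requirements are not checked here.

```python
# Minimum Data Collector versions quoted in the requirements above.
MINIMUM_VERSIONS = {
    "design_pipelines": (3, 0, 0, 0),
    "design_fragments": (3, 2, 0, 0),
    "use_connections": (3, 19, 0, 0),
}


def version_tuple(text, width=4):
    """Parse '3.19.0' into (3, 19, 0, 0), padding with zeros to a fixed width."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def supported_tasks(version):
    """Return the set of tasks whose minimum version this Data Collector meets."""
    current = version_tuple(version)
    return {task for task, minimum in MINIMUM_VERSIONS.items() if current >= minimum}


def authoring_candidates(collectors, task, label="Authoring"):
    """Names of collectors that carry the label and meet the minimum version for the task.

    Each collector is a hypothetical dict with 'name', 'version', and 'labels' keys.
    """
    return [
        c["name"]
        for c in collectors
        if label in c["labels"] and task in supported_tasks(c["version"])
    ]
```

With this sketch, a Data Collector at version 3.2.0.0 qualifies for pipeline and fragment design but not for connections, and a Data Collector without the Authoring label is never offered regardless of its version.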

System Data Collector

Administrators can enable or disable the system Data Collector for use as the default authoring Data Collector in Control Hub. The system Data Collector runs on the public cloud service hosted by StreamSets. When available, you can use the system Data Collector as the authoring Data Collector for exploration and light development to design pipelines and fragments. You cannot use the system Data Collector for data preview or explicit pipeline validation. It also cannot be used to configure a pipeline that uses connections.

The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub applications and the system Data Collector running on the StreamSets cloud service. The web browser initiates outbound connections to Control Hub over HTTPS on port number 443.

When you use the system Data Collector to design pipelines and fragments, the web browser sends requests to the Control Hub pipeline store application to save and retrieve pipelines and fragments from the Control Hub pipeline repository. The system Data Collector is stateless, meaning that pipeline and fragment definitions are not saved with the Data Collector. Instead, all definitions are saved in the Control Hub pipeline repository.

As you design a pipeline or fragment, the Control Hub pipeline store application sends requests to the system Data Collector to display the requested stage definitions. Similarly, as the web browser saves your changes, the Control Hub pipeline store application sends requests to the system Data Collector to perform implicit validation. Implicit validation lists missing or incomplete configuration, such as an unconnected stage or a required property that has not been configured.

The following image shows how Pipeline Designer interacts with the system Data Collector when you design pipelines and fragments:

Registered Data Collector

Registered Data Collectors run in your corporate network, either on-premises or on a protected cloud computing platform where you installed them. Use a registered Data Collector as the authoring Data Collector to design pipelines and fragments, and to preview and explicitly validate pipelines.

The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub applications and the registered Data Collector selected as the authoring Data Collector. The web browser initiates outbound connections to Control Hub over HTTPS on port number 443.

The registered Data Collector selected as the authoring Data Collector accepts inbound connections from the web browser on the port number configured for the Data Collector. The connection must be HTTPS.
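Because the browser connects directly to the registered Data Collector, it can help to confirm that the Data Collector URL is an HTTPS URL and to see which port must accept inbound connections. The following sketch uses only the Python standard library; the function name and the example host are hypothetical. It inspects the form of the URL only and does not test whether the browser can actually reach it.

```python
from urllib.parse import urlsplit


def authoring_endpoint(url):
    """Return (host, port) for a Data Collector URL, or raise ValueError if it is not HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError(f"authoring Data Collector must use HTTPS, got {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # urlsplit reports port None when the URL omits it; HTTPS then implies port 443.
    return parts.hostname, parts.port or 443
```

For a URL such as `https://sdc.example.com:18630`, the function returns the host and port 18630, which is the port that must accept inbound connections from the browser. An `http://` URL raises `ValueError`.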

When you use a registered Data Collector as the authoring Data Collector for Pipeline Designer, you can complete the following tasks:

Pipeline and fragment design

When you design pipelines and fragments, the web browser sends requests to the Control Hub pipeline store application to save and retrieve pipelines and fragments from the Control Hub pipeline repository. A registered Data Collector used by Pipeline Designer to design pipelines and fragments is stateless, meaning that no pipeline or fragment definitions are saved with the Data Collector. Instead, all definitions are stored in the Control Hub pipeline repository.

As you design the pipeline or fragment, the web browser sends requests directly to the registered Data Collector to display the requested stage definitions. Similarly, as the web browser saves your changes, the browser sends requests to the registered Data Collector to perform implicit validation. Implicit validation lists missing or incomplete configuration, such as an unconnected stage or a required property that has not been configured.

Data preview

When you preview data in a pipeline, the web browser sends the data preview request directly to the registered Data Collector. No pipeline data is sent through the Control Hub cloud service.

When you preview data, the pipeline definition is temporarily saved with the registered Data Collector to perform the preview. The Data Collector then deletes the pipeline definition when the preview is finished.

Explicit validation

When you click the Validate icon to explicitly validate a pipeline, the web browser sends the validate request directly to the registered Data Collector. No explicit validation requests are sent through the Control Hub cloud service.

When you explicitly validate the pipeline, the pipeline definition is temporarily saved with the registered Data Collector to perform the validation. The Data Collector deletes the temporary pipeline definition when the validation is finished.
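The request paths described for the system and registered Data Collectors can be summarized as a lookup table. This is an illustrative model only: the task names, constants, and functions are hypothetical, and the table restates the prose in this section rather than any Control Hub API. For the system Data Collector, the browser hop in front of the pipeline store is inferred from the description that the browser communicates with Control Hub.

```python
BROWSER = "web browser"
STORE = "Control Hub pipeline store"
SYSTEM_SDC = "system Data Collector"
REGISTERED_SDC = "registered Data Collector"

# (collector type, task) -> hops the request takes; None means the task is not supported.
ROUTES = {
    ("system", "save_or_retrieve"): (BROWSER, STORE),
    ("system", "stage_definitions"): (BROWSER, STORE, SYSTEM_SDC),
    ("system", "implicit_validation"): (BROWSER, STORE, SYSTEM_SDC),
    ("system", "data_preview"): None,
    ("system", "explicit_validation"): None,
    ("registered", "save_or_retrieve"): (BROWSER, STORE),
    ("registered", "stage_definitions"): (BROWSER, REGISTERED_SDC),
    ("registered", "implicit_validation"): (BROWSER, REGISTERED_SDC),
    ("registered", "data_preview"): (BROWSER, REGISTERED_SDC),
    ("registered", "explicit_validation"): (BROWSER, REGISTERED_SDC),
}


def route(collector_type, task):
    """Return the hops for a task, or raise ValueError if the task is unsupported."""
    hops = ROUTES[(collector_type, task)]
    if hops is None:
        raise ValueError(f"{task} is not available with the {collector_type} Data Collector")
    return hops


def bypasses_control_hub(collector_type, task):
    """True when the request goes from the browser to the Data Collector directly."""
    return STORE not in route(collector_type, task)
```

In this model, `bypasses_control_hub("registered", "data_preview")` is true, which matches the statement that no pipeline data is sent through the Control Hub cloud service during preview.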

The following image shows how Pipeline Designer interacts with a registered Data Collector when you design pipelines and fragments, and preview and validate pipelines: