Authoring Data Collectors
You use authoring Data Collectors to design pipelines and to create connections. You install and register authoring Data Collectors just as you do execution Data Collectors.
You can design pipelines in Control Hub after selecting an available authoring Data Collector. The selected authoring Data Collector determines the stages, stage libraries, and functionality that are displayed in Pipeline Designer. When you create connections, the selected authoring Data Collector determines the connection types that you can create.
Use an authoring Data Collector that is the same version as the execution Data Collectors that you intend to use to run the pipeline. Using a different Data Collector version can result in pipelines that are invalid for the execution Data Collectors.
For example, if the authoring Data Collector is a more recent version than the execution Data Collector, pipelines might include a stage, stage library, or stage functionality that does not exist in the execution Data Collector.
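The version check described above can be sketched in a few lines. This is only an illustration: the function names `parse_version` and `check_authoring_version` are hypothetical helpers, not part of any Control Hub or Data Collector API, and the version strings are assumed to be dot-separated integers.

```python
def parse_version(version: str, width: int = 4) -> tuple:
    """Turn '3.19.0' into (3, 19, 0, 0) so versions of different lengths compare correctly."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def check_authoring_version(authoring: str, execution_versions: list) -> list:
    """Return one warning per execution version that differs from the authoring version."""
    authoring_v = parse_version(authoring)
    warnings = []
    for version in execution_versions:
        execution_v = parse_version(version)
        if authoring_v > execution_v:
            # The case called out in the text: the authoring Data Collector is newer.
            warnings.append(
                f"{version}: older than authoring {authoring}; "
                "pipelines might use stages it lacks"
            )
        elif authoring_v < execution_v:
            warnings.append(
                f"{version}: newer than authoring {authoring}; versions should match"
            )
    return warnings
```

For instance, `check_authoring_version("3.19.0", ["3.19.0", "3.18.1"])` returns a single warning for `3.18.1`, because that execution Data Collector is older than the authoring one.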
- System Data Collector
- Control Hub includes a system Data Collector for exploration and light development. Administrators can enable or disable the system Data Collector for use as the default authoring Data Collector in Control Hub.
- Registered Data Collector
- You can select a registered Data Collector that meets all of the following requirements:
- The Data Collector version is supported. The minimum supported Data Collector version is 3.0.0.0. To design pipeline fragments, the minimum supported Data Collector version is 3.2.0.0. To create and use connections, the minimum supported Data Collector version is 3.19.0. StreamSets recommends using the latest version of Data Collector.
- The Data Collector uses the HTTPS protocol because Control Hub also uses the HTTPS protocol. Note: StreamSets recommends using a certificate signed by a certificate authority for a Data Collector that uses the HTTPS protocol. If you use a self-signed certificate, you must first open the Data Collector URL in a browser and accept the browser warning about the self-signed certificate before users can select the Data Collector as the authoring Data Collector.
- The Data Collector URL is reachable from the web browser that you use to access Control Hub.
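The requirements above can be expressed as a small pre-flight check. The sketch below is an assumption-laden illustration: `authoring_problems` and `MINIMUM_VERSIONS` are hypothetical names, the host name in the usage note is invented, and the check covers only the version and HTTPS requirements. Whether the URL is reachable from a given browser cannot be verified from a script like this.

```python
from urllib.parse import urlparse

# Minimum versions quoted in the requirements above, keyed by design task.
MINIMUM_VERSIONS = {
    "pipelines": "3.0.0.0",
    "fragments": "3.2.0.0",
    "connections": "3.19.0",
}


def _as_tuple(version: str) -> tuple:
    """Pad to four components so '3.19.0' and '3.0.0.0' compare correctly."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def authoring_problems(url: str, version: str, task: str = "pipelines") -> list:
    """List the requirements that a registered Data Collector does not meet."""
    problems = []
    if urlparse(url).scheme != "https":
        problems.append("URL must use HTTPS")
    minimum = MINIMUM_VERSIONS[task]
    if _as_tuple(version) < _as_tuple(minimum):
        problems.append(f"version {version} is below the minimum {minimum} for {task}")
    return problems
```

With an illustrative URL, `authoring_problems("http://sdc.example.com:18630", "3.18.0", "connections")` reports both a protocol problem and a version problem, while the same call with an `https` URL and version `3.19.0` returns an empty list.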
Consider adding an Authoring label to the authoring Data Collectors. That way, data engineers can easily determine which Data Collectors are authoring Data Collectors when they use Control Hub.
System Data Collector
Administrators can enable or disable the system Data Collector for use as the default authoring Data Collector in Control Hub. The system Data Collector runs on the public cloud service hosted by StreamSets. When available, you can use the system Data Collector as the authoring Data Collector for exploration and light development to design pipelines and fragments. You cannot use the system Data Collector for data preview or explicit pipeline validation. It also cannot be used to configure a pipeline that uses connections.
The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub applications and the system Data Collector running on the StreamSets cloud service. The web browser initiates outbound connections to Control Hub over HTTPS on port number 443.
When you use the system Data Collector to design pipelines and fragments, the web browser sends requests to the Control Hub pipeline store application to save and retrieve pipelines and fragments from the Control Hub pipeline repository. The system Data Collector is stateless, meaning that pipeline and fragment definitions are not saved with the Data Collector. Instead, all definitions are saved in the Control Hub pipeline repository.
As you design a pipeline or fragment, the Control Hub pipeline store application sends requests to the system Data Collector to display the requested stage definitions. Similarly, as the web browser saves your changes, the Control Hub pipeline store application sends requests to the system Data Collector to perform implicit validation. Implicit validation lists missing or incomplete configuration, such as an unconnected stage or a required property that has not been configured.
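To make the two kinds of findings mentioned above concrete, here is a minimal sketch of an implicit-validation style check. It is not how Data Collector implements validation. The pipeline model (a dictionary of stages plus a list of links) and the function name `implicit_validation` are assumptions made purely for illustration.

```python
def implicit_validation(stages: dict, links: list) -> list:
    """Report unconnected stages and unset required properties.

    stages maps a stage name to {"required": [...], "config": {...}};
    links is a list of (from_stage, to_stage) pairs.
    """
    issues = []
    # Every stage name that appears on either end of a link.
    connected = {name for link in links for name in link}
    for name, stage in stages.items():
        if len(stages) > 1 and name not in connected:
            issues.append(f"{name}: stage is not connected")
        for prop in stage["required"]:
            if stage["config"].get(prop) in (None, ""):
                issues.append(f"{name}: required property '{prop}' is not configured")
    return issues
```

Given an origin with a required but unset `path` property, a connected destination, and a third stage with no links, the function reports one issue for the missing property and one for the unconnected stage.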
The following image shows how Pipeline Designer interacts with the system Data Collector when you design pipelines and fragments:
Registered Data Collector
Registered Data Collectors run in your corporate network, either on-premises or on a protected cloud computing platform where you installed them. Use a registered Data Collector as the authoring Data Collector to design pipelines and fragments, and to preview and explicitly validate pipelines.
The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub applications and the registered Data Collector selected as the authoring Data Collector. The web browser initiates outbound connections to Control Hub over HTTPS on port number 443.
The registered Data Collector selected as the authoring Data Collector accepts inbound connections from the web browser on the port number configured for the Data Collector. The connection must be HTTPS.
When you use a registered Data Collector as the authoring Data Collector for Pipeline Designer, you can complete the following tasks:
- Pipeline and fragment design
- When you design pipelines and fragments, the web browser sends requests to the Control Hub pipeline store application to save and retrieve pipelines and fragments from the Control Hub pipeline repository. A registered Data Collector used by Pipeline Designer to design pipelines and fragments is stateless, meaning that no pipeline or fragment definitions are saved with the Data Collector. Instead, all definitions are stored in the Control Hub pipeline repository.
As you design the pipeline or fragment, the web browser sends requests directly to the registered Data Collector to display the requested stage definitions. Similarly, as the web browser saves your changes, the browser sends requests to the registered Data Collector to perform implicit validation. Implicit validation lists missing or incomplete configuration, such as an unconnected stage or a required property that has not been configured.
- Data preview
- When you preview data in a pipeline, the web browser sends the data preview request directly to the registered Data Collector. No pipeline data is sent through the Control Hub cloud service.
- Explicit validation
- When you click the Validate icon to explicitly validate a pipeline, the web browser sends the validate request directly to the registered Data Collector. No explicit validation requests are sent through the Control Hub cloud service.
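The differences between the two kinds of authoring Data Collector, as described in this topic, can be summarized in a small lookup table. The names `CAPABILITIES`, `REQUEST_SENDER`, `can_use`, and `usable_kinds` are hypothetical and exist only for this sketch. The `connections` entry for a registered Data Collector assumes version 3.19.0 or later.

```python
# Tasks that each kind of authoring Data Collector supports, per this topic.
CAPABILITIES = {
    "system": {
        "design": True,
        "data_preview": False,
        "explicit_validation": False,
        "connections": False,
    },
    "registered": {
        "design": True,
        "data_preview": True,
        "explicit_validation": True,
        "connections": True,  # assumes Data Collector 3.19.0 or later
    },
}

# Component that sends stage-definition and implicit-validation requests
# to the authoring Data Collector.
REQUEST_SENDER = {
    "system": "Control Hub pipeline store application",
    "registered": "web browser",
}


def can_use(kind: str, task: str) -> bool:
    """Return True if the given kind of authoring Data Collector supports the task."""
    return CAPABILITIES[kind][task]


def usable_kinds(task: str) -> list:
    """Return the kinds of authoring Data Collector that support the task."""
    return [kind for kind in CAPABILITIES if can_use(kind, task)]
```

For example, `usable_kinds("data_preview")` yields only the registered kind, which matches the statement that the system Data Collector cannot be used for data preview.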
The following image shows how Pipeline Designer interacts with a registered Data Collector when you design pipelines and fragments, and preview and validate pipelines: