Getting Started

What is StreamSets Data Collector?

StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.

To define the flow of data, you design a pipeline in Data Collector. A pipeline consists of stages that represent the origin and destination of the pipeline, and any additional processing that you want to perform. After you design the pipeline, you click Start and Data Collector goes to work.

Data Collector processes data when it arrives at the origin and waits quietly when not needed. You can view real-time statistics about your data, inspect data as it passes through the pipeline, or take a close look at a snapshot of data.

How should I use Data Collector?

Use StreamSets Data Collector like a pipe for a data stream. Throughout your enterprise data topology, you have streams of data that you need to move, collect, and process on the way to their destinations. Data Collector provides the crucial connection between hops in the stream.

To address your data ingestion needs, you can use a single Data Collector to run one or more pipelines. Or you might install a series of Data Collectors to stream data across your enterprise data topology.

How does this really work?

Let's walk through it...

After you install Data Collector, you register Data Collector to work with StreamSets Control Hub™ and then create your first Data Collector pipeline in Control Hub.

What do you want it to do? Let's say you want to read XML files from a directory and remove the newline characters before moving the data into HDFS. To do this, you start with a Directory origin stage and configure it to point to the source file directory. (You can also have the stage archive processed files and write files that were not fully processed to a separate directory for review.)
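
As a rough sketch, the Directory origin configuration for this scenario might include properties along these lines (the paths and pattern shown here are placeholders, not values from this guide):

    Files Directory:    /data/xml/incoming
    File Name Pattern:  *.xml
    Data Format:        XML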

To remove the newline characters, connect the Directory origin to an Expression Evaluator processor and configure it to remove the newline character from the last field in the record.
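
The Expression Evaluator uses the Data Collector expression language. A minimal sketch of the replacement expression, assuming the last field in the record is named /text (a hypothetical field name), could look like this:

    ${str:replaceAll(record:value('/text'), '[\\r\\n]', '')}

This reads the current value of the /text field and strips carriage return and newline characters from it.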

To make the data available to HDFS, you connect the Expression Evaluator to a Hadoop FS destination stage. You configure the stage to write the data as a JSON object (though you can use other data formats as well).
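
With the JSON data format, each record can be written as a separate JSON object. A record from this pipeline might look roughly like this (the field names are illustrative):

    {"id": "f-1024", "title": "Sample document", "text": "All on one line now"}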

You preview data to see how source data moves through the pipeline and notice that some fields have missing data. So you add a Field Replacer processor to replace the null values in those fields.

Now that the data flow is complete, you configure pipeline error record handling to write error records to a file, create a data drift alert to notify you when field names change, and configure an email alert to let you know when the pipeline generates more than 100 error records. Then, you start the pipeline and Data Collector goes to work.
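
Both kinds of rules are driven by conditions written in the expression language. As a rough sketch, a data drift rule can use the drift functions to detect changing field names, and a metric rule can compare a metric value against a threshold:

    ${drift:names('/', false)}    (data drift rule: trigger when field names change)
    ${value() > 100}              (metric rule: trigger when the error record count passes 100)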

Data Collector goes into Monitor mode and displays summary and error statistics immediately. To get a closer look at the activity, you take a snapshot of the pipeline so you can examine how a set of data passed through it. You see some unexpected data in the pipeline, so you create a data rule for a link between two stages to gather information about similar data and set an alert to notify you when the numbers get too high.

And what about those error records being written to a file? They're saved with error details, so you can create an error pipeline to reprocess that data. Et voilà!

StreamSets Data Collector is a powerful tool, but we've made it as simple as possible to use. So give it a try, click the Help icon for information, and contact StreamSets if you need a hand.

What is StreamSets Control Hub?

StreamSets Control Hub is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.

Teams of data engineers use the shared repository provided with Control Hub to collaboratively build pipelines. Control Hub provides full lifecycle management of the pipelines, allowing you to track the version history and giving you full control of the evolving development process.

Control Hub lets you deploy and execute dataflows at scale on manually administered or automatically provisioned Data Collectors.

You can map multiple dataflows in a single visual topology and view real-time statistics to measure dataflow performance across each topology, from end to end or point to point. You can also monitor alerts to ensure that incoming data meets business requirements for availability and accuracy.

To use Control Hub, you need a user account within an organization. If you already have an organization defined for your enterprise, ask the organization administrator for a user account. If you don’t have access to an organization, contact StreamSets with a request for a new organization.

For more information about using Control Hub, see Meet StreamSets Control Hub.

Tutorials and Sample Pipelines

StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.

This documentation includes basic and advanced tutorials in the Tutorials chapter. StreamSets provides additional tutorials on GitHub. You can also find a terrific set of short how-to videos on the StreamSets YouTube channel.

Data Collector includes several sample pipelines. You can use these pipelines to walk through the tutorials or as a starting point for new development.