Try StreamSets
This tutorial covers the steps needed to try StreamSets DataOps Platform. Although the tutorial provides a simple use case, keep in mind that StreamSets is a powerful platform that enables you to build, run, and monitor large numbers of complex pipelines.
Set up a Deployment
Before you can build a pipeline, you must set up a deployment of StreamSets engine instances.
A deployment is a group of identical engine instances. It defines the engine type, version, and configuration to use, and you can deploy and launch multiple instances of the configured engine.
The simplest way to deploy your first engine is to create a self-managed deployment that launches a single Data Collector engine instance on a local on-premises machine. After creating the deployment, you set up the machine and complete the installation requirements for the engine. You then run a command that installs and launches the engine on the machine that you have set up.
These instructions provide steps to create a deployment that launches Data Collector using a tarball. If you'd prefer to deploy Data Collector using a Docker image, simply select Docker image for the engine installation type as you follow the steps below. Or, to quickly deploy Data Collector, click the icon in the top toolbar.
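If you prefer to script this step, the StreamSets Platform SDK for Python can also create and start a self-managed deployment. The sketch below is illustrative only and is not part of the tutorial: the credential values, environment name, and engine version are placeholders, and the builder and method names reflect the SDK's general shape rather than verified tutorial steps.

```python
# Illustrative sketch: create and start a self-managed Data Collector
# deployment with the StreamSets Platform SDK for Python. All names and
# versions are placeholders; the tutorial performs these steps in the
# Control Hub UI instead.
from streamsets.sdk import ControlHub

# Assumption: API credentials generated in Control Hub.
sch = ControlHub(credential_id='<credential id>', token='<token>')

# Assumption: a self-managed environment with this name already exists.
environment = sch.environments.get(environment_name='default-environment')

deployment_builder = sch.get_deployment_builder(deployment_type='SELF')
deployment = deployment_builder.build(
    deployment_name='tutorial-deployment',
    environment=environment,
    engine_type='DC',          # Data Collector
    engine_version='5.10.0',   # placeholder engine version
)
sch.add_deployment(deployment)
sch.start_deployment(deployment)

# Run the returned command on the machine you set up; it installs and
# launches the engine (tarball installation).
print(deployment.install_script())
```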
Build a Pipeline
Build a pipeline to define how data flows from origin to destination systems and how the data is processed along the way.
This tutorial builds a pipeline that reads a sample CSV file from an HTTP resource URL, processes the data to convert the data types of several fields, and then writes the data to a JSON file on your local machine.
The sample CSV file includes some invalid data, so you'll also see how StreamSets handles errors when you preview the pipeline.
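To make the data flow concrete, here is a rough standalone Python sketch of what this pipeline does, independent of Data Collector. The URL, field names, and type conversions are hypothetical stand-ins for the tutorial's sample data, and the error handling only hints at what you'll see in the pipeline preview.

```python
# Standalone sketch of the tutorial dataflow: read a CSV file over HTTP,
# convert the types of a few fields, and write the records to a local JSON
# file. The URL and field names below are placeholders.
import csv
import io
import json
import urllib.request

SAMPLE_CSV_URL = "https://example.com/sample.csv"  # placeholder HTTP resource URL


def fetch_convert_write(url: str, output_path: str) -> None:
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")

    records = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            # Hypothetical fields: convert string values to typed values.
            row["id"] = int(row["id"])
            row["amount"] = float(row["amount"])
        except (KeyError, ValueError):
            # Invalid rows are skipped here; Data Collector would route such
            # records to its error handling instead.
            continue
        records.append(row)

    with open(output_path, "w", encoding="utf-8") as out:
        json.dump(records, out, indent=2)


if __name__ == "__main__":
    fetch_convert_write(SAMPLE_CSV_URL, "output.json")
```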
Run a Job
Next, you'll check in the pipeline to indicate that your design is complete and the pipeline is ready to be added to a job and run. When you check in a pipeline, you enter a commit message. StreamSets maintains the commit history of each pipeline.
A job is the execution of a dataflow. Jobs enable you to manage and orchestrate large-scale dataflows that run across multiple engines.
Since this pipeline processes one file, there's no need to enable the job to start on multiple engines or to increase the number of pipeline instances that run for the job. As a result, you can simply use the default values when creating the job. As you continue to use StreamSets, you can explore how to run pipelines at scale.
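For reference, checking in the pipeline and running it as a job can also be scripted with the Platform SDK for Python. This is a hedged sketch, not a tutorial step; the pipeline and job names are placeholders, and the method names are assumptions based on the SDK's general shape.

```python
# Illustrative sketch: check in (publish) a pipeline and run it as a job
# with the StreamSets Platform SDK for Python. Names are placeholders.
from streamsets.sdk import ControlHub

sch = ControlHub(credential_id='<credential id>', token='<token>')

# Assumption: the pipeline built in this tutorial is named 'Tutorial Pipeline'.
pipeline = sch.pipelines.get(name='Tutorial Pipeline')

# Check in the pipeline with a commit message; Control Hub keeps the history.
sch.publish_pipeline(pipeline, commit_message='Initial tutorial pipeline')

# Create the job with default settings; a single pipeline instance on a
# single engine is enough for one input file.
job_builder = sch.get_job_builder()
job = job_builder.build(job_name='Tutorial Job', pipeline=pipeline)
sch.add_job(job)

# Start the job; Control Hub sends the pipeline to an available engine.
sch.start_job(job)
```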
Monitor the Job
Next, you'll monitor the progress of the job. When you start a job, Control Hub sends the pipeline to the Data Collector engine. The engine runs the pipeline, sending status updates and metrics back to Control Hub.
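If you scripted the run as sketched above, you can also poll the job from the SDK. Again, this is an assumption-laden sketch; attribute names such as job.status are not confirmed by this tutorial, which monitors the job in the Control Hub UI.

```python
# Illustrative sketch: poll a job's status while the engine reports progress
# back to Control Hub. The job name and the job.status attribute are
# assumptions; the tutorial uses the Control Hub UI for monitoring.
import time

from streamsets.sdk import ControlHub

sch = ControlHub(credential_id='<credential id>', token='<token>')

for _ in range(10):
    # Re-fetch the job to pick up the latest status reported by the engine.
    job = sch.jobs.get(job_name='Tutorial Job')
    print('Job status:', job.status)
    time.sleep(10)
```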
Next Steps
- Invite others to join
  - Invite other users to join your organization and collaboratively manage pipelines as a team.
- Modify your first pipeline
  - Modify your first pipeline to add a different Data Collector destination to write to another external system. You can also add additional processors to explore the other types of processing available with Data Collector pipelines.
- Explore sample pipelines
  - Explore the sample pipelines included with Control Hub.
- Explore engines
  - Compare the StreamSets engines to learn about their differences and similarities.
  - Set up and deploy an engine in your cloud service provider account, such as Amazon Web Services (AWS) or Google Cloud Platform (GCP).
  - Learn how engines communicate with Control Hub to securely process your data.
- Explore team-based features
  - Learn how teams of data engineers can use Control Hub to collaboratively build pipelines. Control Hub provides full lifecycle management of pipelines, allowing you to track version history and giving you full control of the evolving development process.
  - To create a multitenant environment within your organization, create groups of users. Grant roles to these groups and share objects within the groups to grant each group access to the appropriate objects.
  - Use connections to limit the number of users who need to know the security credentials for external systems. Connections also provide reusability: you create a connection once, and other users can then reuse it in multiple pipelines.
  - Use job templates to hide the complexity of job details from business analysts.
- Explore advanced features
  - Use topologies to map multiple related jobs into a single view. A topology provides interactive end-to-end views of data as it traverses multiple pipelines.
  - Create a subscription to listen for Control Hub events and then complete an action when those events occur. For example, you can create a subscription that sends a message to a Slack channel or emails an administrator each time a job status changes.
  - Schedule jobs to start or stop on a weekly or monthly basis.