Try StreamSets

This tutorial covers the steps needed to try StreamSets DataOps Platform. Although the tutorial provides a simple use case, keep in mind that StreamSets is a powerful platform that enables you to build, run, and monitor large numbers of complex pipelines.

To complete this tutorial, you must have an existing StreamSets account. If you do not have one, use the following URL to sign up for a free account:

Note: When you sign up, StreamSets grants your user account all of the roles required to complete the tasks in this tutorial. If you are invited to join an existing organization, your user account might not be granted the required roles.

To try StreamSets, complete the following steps:
  1. Set up a Deployment
  2. Build a Pipeline
  3. Run a Job
  4. Monitor the Job

Set up a Deployment

Before you can build a pipeline, you must set up a deployment of StreamSets engine instances.

A deployment is a group of identical engine instances. A deployment defines the engine type, version, and configuration to use. You can deploy and launch multiple instances of the configured engine.

The simplest way to deploy your first engine is to create a self-managed deployment that launches a single Data Collector engine instance on a local on-premises machine. After creating the deployment, you set up the machine and complete the installation requirements for the engine. You then run a command that installs and launches the engine on the machine that you have set up.

These instructions provide steps to create a deployment that launches Data Collector using a tarball. If you'd prefer to deploy Data Collector using a Docker image, simply select Docker image for the engine installation type as you follow the steps below. Or, to quickly deploy Data Collector, click Quick Start > Deploy Data Collector using Docker in the top toolbar.

Tip: After getting started, you might consider using one of the cloud service provider integrations that StreamSets provides, such as the Amazon Web Services (AWS) and Google Cloud Platform (GCP) environments and deployments. With these integrations, Control Hub automatically provisions the resources needed to run the engine type in your cloud service provider account, and then deploys engine instances to those resources.
  1. Use the following URL to log in to StreamSets:

    Control Hub displays the Getting Started view.

  2. In the Control Hub Navigation panel, click Set Up > Deployments.

  3. Click the Create Deployment icon.
  4. For the deployment name, enter Tutorial.
  5. Use the default values for the remaining properties, and then click Save & Next.
  6. In the Configure Engine section, click 3 stage libraries selected.

    By default, each deployment includes 3 common stage libraries. Leave these stage libraries selected as you'll need them to complete this tutorial.

    You can optionally select additional stage libraries to install. Stage libraries determine the stages that display in the pipeline canvas when you design pipelines. For example, if you plan to design pipelines that process data from Elasticsearch and Google, select one of the Elasticsearch stage libraries and the Google Cloud stage library.

  7. Click Ok when you've finished selecting stage libraries.
  8. Use the default values for the remaining properties in the Configure Engine section.

    Notice that the default value for Engine Labels is the name of the deployment, Tutorial.

    Note: Labels determine the group of engine instances that run a job. When you have a single engine instance, you simply run all jobs on this instance. However, as you scale out your pipeline processing, you might create multiple groups of engine instances. Labels let you dedicate one group of engine instances to one set of jobs, and another group to another set of jobs.
  9. Click Save & Next.
  10. In the Configure Install Type section, use the default value of Tarball for the installation type.
  11. Click Save & Next.
  12. In the Share Deployment section, click Save & Next.
    Note: When additional users join your organization, you must share the deployment with other users to grant them access to it.
  13. In the Review & Launch section, click Start & Generate Install Script.
  14. Click the Copy to Clipboard icon to copy the generated command, and then click Check Engine Status after Running the Script.

    Control Hub displays an Engine Status window. Before you can view the engine status, you need to set up your machine and run the installation script command.

  15. Verify that the machine meets the minimum requirements for a Data Collector engine.
  16. Download and install one of the supported Java versions.
  17. Open a command prompt and set the file descriptor limit to at least 32768.
  18. Paste and then run the installation script command that you copied from the self-managed deployment. Respond to the command prompts to enter download and installation directories for the Data Collector engine.

    You can check the engine status in the command prompt or in the Control Hub UI. When the engine installation is complete, Control Hub informs you that the engine is successfully running.
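The file descriptor requirement in the steps above can be checked before you run the installation script. The following is a minimal sketch, assuming a Linux or macOS machine where Python's standard resource module is available. The helper name limit_is_sufficient and the constant REQUIRED_OPEN_FILES are illustrative and not part of StreamSets.

```python
import resource

# Minimum open-files limit named in the installation steps.
REQUIRED_OPEN_FILES = 32768


def limit_is_sufficient(soft_limit, required=REQUIRED_OPEN_FILES):
    """Return True when a soft open-files limit meets the requirement."""
    # RLIM_INFINITY means "no limit", which always satisfies the requirement.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


if __name__ == "__main__":
    # Reads the limit that this process inherited from the shell that started it.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    if limit_is_sufficient(soft):
        print("File descriptor limit is high enough.")
    else:
        print(f"Raise the limit to at least {REQUIRED_OPEN_FILES} before installing.")
```

Run the check from the same command prompt that you use for the installation script, because the limit usually applies per shell session.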

Build a Pipeline

Build a pipeline to define how data flows from origin to destination systems and how the data is processed along the way.

This tutorial builds a pipeline that reads a sample CSV file from an HTTP resource URL, processes the data to convert the data type of several fields, and then writes the data to a JSON file on your local machine.

The sample CSV file includes some invalid data, so you'll also see how StreamSets handles errors when you preview the pipeline.

  1. In the Engine Status window, click Create a pipeline.

    Or, if you already closed the Engine Status window, click Quick Start > Create a pipeline.

  2. Enter the following name: Tutorial.
  3. Use the defaults to create a blank Data Collector pipeline, and then click Next.

    In the Configure Pipeline section, the engine that you deployed is selected as the default authoring engine.

  4. Click Save & Open in Canvas.

    A blank pipeline opens in the canvas.

  5. In the canvas, click the Add Stage icon to open the stage selector.
  6. Click Origins, enter http in the search field, and then select the HTTP Client origin.

    The origin is added to the canvas.

  7. In the properties panel below the canvas, click the HTTP tab.
  8. Configure the HTTP properties as follows:
    HTTP Property       Value
    Resource URL
    Mode                Polling
    Polling Interval    600000
    Note: In most cases, you would configure the origin to use batch mode when reading a single file. In batch mode, the origin processes all available data and then stops the pipeline and job. However, when the origin uses batch mode to read a small amount of data, the Data Collector engine runs and stops the pipeline before you have a chance to monitor the data in real time. Setting the mode to polling with a 10-minute (600000 millisecond) interval causes the origin to read the full contents of the file, wait 10 minutes, and then read the contents of the file again.
  9. Use the default values for the remaining properties.

  10. Click the Data Format tab.
  11. For the Data Format property, select Delimited.
  12. Click Show Advanced Options, and then for the Header Line property, select With Header Line.
  13. Use the default values for the remaining properties.

    Because the sample data is read from a file, the origin reads all fields as String. Next, you'll add a Field Type Converter processor to the pipeline to convert the datetime fields to the Datetime data type.

  14. Click the Add Stage icon to open the stage selector.
  15. Click Processors, enter type in the search field, and then select the Field Type Converter processor.
  16. Click the Conversions tab.
  17. Convert fields with datetime data to Datetime as follows:
    Conversion Property    Value
    Conversion Method      By Field Name
    Fields to Convert      • /dropoff_datetime
                           • /pickup_datetime
    Convert to Type        Datetime
    Date Format            yyyy-MM-dd HH:mm:ss
    Note: To reference a field, you enter the path of the field. For simple records of data such as the sample CSV file, you reference a field as follows: /<field name>.
  18. In the toolbar above the pipeline canvas, click the Preview icon.

    When you preview the pipeline, you can view several records of source data.

  19. In the Preview Configuration dialog box, use the default values and then click Run Preview.

    The HTTP Client origin is selected in the pipeline canvas, and preview displays several records of output data read by the origin. Since this is the origin of the pipeline, no input data displays.

    Notice how the Field Type Converter processor displays a red square with a counter of 1, indicating that the stage has encountered an error.

  20. Select the Field Type Converter processor in the canvas.

    Preview highlights the first record in red and displays an error message indicating that the first record has an unparsable date. The date data includes invalid characters at the end.

    By default, the stage passes error records to the pipeline for error handling, and then the pipeline discards the error records. Since this is sample data, you can leave the default error record handling. When you run the pipeline, this first record will not be passed to the next stage for processing.

  21. With the processor still selected in the canvas, scroll down in the preview panel to display the input and output data of the second record.

    You can see that the date fields in the second record were successfully converted to the Datetime data type.

  22. Click Close Preview to close the preview.
  23. Click the Add Stage icon to open the stage selector.
  24. Enter local in the search field, and then select the Local FS destination.

    The Local FS destination writes to files in a local file system.

  25. Click the Output Files tab, and configure the following property.

    Use the defaults for the advanced options:

    Output Files Property    Description
    Directory Template       By default, the directory template includes datetime variables to create a directory structure for output files. This is intended for writing large volumes of data.

    Since you are only processing the sample file, you don't need the datetime variables. Go ahead and delete the default and enter a local directory where you want the files to be written.

    For example: /<base directory>/tutorial/destination

  26. Click the Data Format tab, and select JSON to write the data using the JSON format.

    Use the defaults for the remaining properties.

Run a Job

Next, you'll check in the pipeline to indicate that your design is complete and the pipeline is ready to be added to a job and run. When you check in a pipeline, you enter a commit message. StreamSets maintains the commit history of each pipeline.

Jobs are the execution of the dataflow. Jobs enable you to manage and orchestrate large scale dataflows that run across multiple engines.

Since this pipeline processes one file, there's no need to enable the job to start on multiple engines or to increase the number of pipeline instances that run for the job. As a result, you can simply use the default values when creating the job. As you continue to use StreamSets, you can explore how to run pipelines at scale.

  1. With the pipeline open in the canvas, click the Check In icon.
  2. Enter a commit message. For now, you can simply use the default: New Pipeline.
    As a best practice, state what changed in this pipeline version so that you can track the commit history of the pipeline.
  3. Click Publish & Next.

    The Share Pipeline step displays. You can skip this step for now. When additional users join your organization, you must share the pipeline to grant them access to it.

  4. Click Save & Next.
  5. In the Update Jobs step, click Skip and Create New Job.

    The Create Job Instances wizard appears.

  6. Use the defaults in the Define Job step, and click Next.
  7. In the Select Pipeline step, click Next.
  8. In the Configure Job step, select Tutorial (Self-Managed) for the Deployment property.

    Notice how the Engine Labels property is automatically populated with the default Tutorial label assigned to the deployment.

  9. Use the defaults for the remaining properties and click Save & Next.
  10. Click Start & Monitor Job.

    The job displays in the canvas, and Control Hub indicates that the job is active.
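The role of the Engine Labels property in the steps above can be pictured as a set comparison between the labels on a job and the labels on each engine. The following sketch assumes that an engine is eligible to run a job when it carries every label assigned to the job. That matching rule, the Engine class, and the engine names are illustrative assumptions, not the Control Hub implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """Hypothetical stand-in for a registered engine instance."""
    name: str
    labels: frozenset


def eligible_engines(job_labels, engines):
    """Return the engines that carry every label on the job (assumed matching rule)."""
    wanted = frozenset(job_labels)
    # Subset test: all job labels must appear among the engine's labels.
    return [engine for engine in engines if wanted <= engine.labels]


if __name__ == "__main__":
    engines = [
        Engine("engine-1", frozenset({"Tutorial"})),
        Engine("engine-2", frozenset({"Production"})),
    ]
    for engine in eligible_engines(["Tutorial"], engines):
        print(engine.name)
```

With a single deployment labeled Tutorial, every job that uses the Tutorial label lands on the one engine you installed. The comparison only starts to matter once you create several groups of engines.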

Monitor the Job

Next, you'll monitor the progress of the job. When you start a job, Control Hub sends the pipeline to the Data Collector engine. The engine runs the pipeline, sending status updates and metrics back to Control Hub.

  1. As the job runs, click the Realtime Summary tab in the monitor panel to view the real-time statistics for the job.

    Notice how the Record Count chart displays more input records than output records. That's because the pipeline is configured to discard the error records encountered by the Field Type Converter processor.

  2. Select the Field Type Converter processor in the canvas, and then click the Errors tab.

    The tab displays an error message for each error record.

  3. Expand one of the error records, and you'll see the same invalid data causing the error that you saw during the preview of the pipeline.

    Notice how the HTTP Client origin has displayed the running icon for the last several minutes although the input record count has not increased. That's because you configured the origin to poll the specified URL every 10 minutes. The origin waits for that interval and then reads the file again.

    When you've finished monitoring the data, stop the job so that the pipeline doesn't run indefinitely.

  4. Click the Stop icon.
    The Confirmation dialog box appears.
  5. To stop the job, click OK.
    The job transitions from a deactivating to an inactive state.
  6. After the job successfully stops, click Close.
  7. Locate the local directory configured for the destination, /<base directory>/tutorial/destination, and open the file to verify that the data was written in JSON format.
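The check in the last step can also be scripted. The sketch below counts the JSON records in each file of the destination directory. It assumes that each output file holds one or more JSON objects written back to back, and the function names are hypothetical. Replace the placeholder path with the directory that you entered for the Directory Template property.

```python
import json
from pathlib import Path


def parse_json_objects(text):
    """Parse text that holds zero or more JSON values written back to back."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def summarize_output(directory):
    """Map each regular file in the directory to its number of parsed records."""
    return {
        path.name: len(parse_json_objects(path.read_text(encoding="utf-8")))
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


if __name__ == "__main__":
    # Placeholder path from the tutorial; substitute your own directory.
    for name, count in summarize_output("/<base directory>/tutorial/destination").items():
        print(f"{name}: {count} record(s)")
```

If a file does not contain valid JSON, json.JSONDecodeError is raised, which is itself a useful signal that the data was not written as expected.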
That's it! You've finished building, running, and monitoring your first pipeline.

Next Steps

Now that you're familiar with the main tasks of managing pipelines, here are some next steps you can take to further explore StreamSets DataOps Platform:

Invite others to join
  Invite other users to join your organization and collaboratively manage pipelines as a team.

Modify your first pipeline
  Modify your first pipeline to add a different Data Collector destination to write to another external system. You can also add more processors to explore the other types of processing available with Data Collector pipelines.
  Or, if you have sample data in another source system, add a different origin to read data from that system.

Explore sample pipelines
  Explore the sample pipelines included with Control Hub.

Explore engines

Explore team-based features
  • Learn how teams of data engineers can use Control Hub to collaboratively build pipelines. Control Hub provides full lifecycle management of the pipelines, allowing you to track the version history and giving you full control of the evolving development process.
  • To create a multitenant environment within your organization, create groups of users. Grant roles to these groups and share objects within the groups to grant each group access to the appropriate objects.
  • Use connections to limit the number of users that need to know the security credentials for external systems. Connections also provide reusability: you create a connection once, and then other users can reuse that connection in multiple pipelines.
  • Use job templates to hide the complexity of job details from business analysts.

Explore advanced features
  • Use topologies to map multiple related jobs into a single view. A topology provides interactive end-to-end views of data as it traverses multiple pipelines.
  • Create a subscription to listen for Control Hub events and then complete an action when those events occur. For example, you can create a subscription that sends a message to a Slack channel or emails an administrator each time a job status changes.
  • Schedule jobs to start or stop on a weekly or monthly basis.