Installation Overview
To set up and deploy a Data Collector engine in your corporate network, you create environments and deployments in Control Hub.
Requirements for Self-Managed Deployments
Stage Libraries
Supported Systems and Versions
MapR Prerequisites
Retrying the Pipeline
Rate Limit
Advanced Options
Pipelines and most pipeline stages include advanced options with default values that should work in most cases. By default, each pipeline and stage hides its advanced options, which can include individual properties or complete tabs.
Simple and Bulk Edit Mode
Runtime Values
Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
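For example, as a hedged sketch, a stage's directory property could reference a runtime property with the runtime:conf function, so the value can change between pipeline runs without editing the pipeline (the property name spooldir.path is hypothetical):

    ${runtime:conf('spooldir.path')}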
Event Generation
Webhooks
Notifications
SSL/TLS Encryption
Many stages can use SSL/TLS encryption to connect securely to external systems.
Security in Amazon Stages
Security in Google Cloud Stages
Security in Kafka Stages
Kafka Message Keys
Authentication in Salesforce Stages
Implicit and Explicit Validation
Expression Configuration
Configuring a Pipeline
Data Formats Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
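For example, an origin configured for the Common Log Format would parse an entry like the following illustrative line into fields such as the client address, timestamp, request, and status code:

    127.0.0.1 - frank [10/Oct/2022:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326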
NetFlow Data Processing
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
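As a rough sketch, a whole file record does not contain the parsed file contents; it typically carries a reference to the file plus a map of file metadata, along these lines:

    /fileRef  - reference used to stream the file contents to the destination
    /fileInfo - map of metadata such as the file name, path, and size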
Reading and Processing XML Data
Writing XML Data
Solutions Overview
Converting Data to the Parquet Data Format
This solution describes how to convert Avro files to Parquet, a columnar data format.
Automating Impala Metadata Updates for Drift Synchronization for Hive
This solution describes how to configure a Drift Synchronization Solution for Hive pipeline to automatically refresh the Impala metadata cache each time changes occur in the Hive metastore.
Managing Output Files
This solution describes how to design a pipeline that writes output files to a destination, moves the files to a different location, and then changes the permissions for the files.
Stopping a Pipeline After Processing All Available Data
This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
Offloading Data from Relational Sources to Hadoop
This solution describes how to offload data from relational database tables to Hadoop.
Sending Email During Pipeline Processing
This solution describes how to design a pipeline that sends email notifications at different points during pipeline processing.
Preserving an Audit Trail of Events
This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events.
Loading Data into Databricks Delta Lake
You can use several solutions to load data into a Delta Lake table on Databricks.
Drift Synchronization Solution for Hive
The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables.
Drift Synchronization Solution for PostgreSQL
The Drift Synchronization Solution for PostgreSQL detects drift in incoming data and automatically creates or alters corresponding PostgreSQL tables as needed before the data is written.
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
Creating a Microservice Pipeline
Data Preview Overview
Preview Codes
Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
Input and Output Schema for Stages
After running preview for a pipeline, you can view the input and output schema for each stage on the Schema tab in the pipeline properties panel. The schema includes each field path and data type.
Previewing a Single Stage
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Tutorial Overview
Before You Begin
Basic Tutorial
The basic tutorial creates a pipeline that reads a file from an HTTP resource URL, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, then create a data alert and run the pipeline.
Extended Tutorial
The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash destination. You'll also use data preview to test stage configuration.