Origins
An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.
To help create or test pipelines, you can use development origins.
Standalone Pipelines
- Amazon S3 - Reads objects from Amazon S3. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Amazon SQS Consumer - Reads data from queues in Amazon Simple Queue Service (SQS). Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Aurora PostgreSQL CDC Client - Reads Amazon Aurora PostgreSQL WAL data to generate change data capture records.
- Azure Blob Storage - Reads data from Microsoft Azure Blob Storage. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Azure Data Lake Storage Gen2 - Reads data from Microsoft Azure Data Lake Storage Gen2. Creates multiple threads to enable parallel processing in a multithreaded pipeline. Use this origin for new development.
- Azure Data Lake Storage Gen2 (Legacy) - Reads data from Microsoft Azure Data Lake Storage Gen2. Creates multiple threads to enable parallel processing in a multithreaded pipeline. Do not use this origin for new development.
- Azure IoT/Event Hub Consumer - Reads data from Microsoft Azure Event Hub. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- CoAP Server - Listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- CONNX - Uses a SQL query to read mainframe data from a CONNX server.
- CONNX CDC - Reads mainframe change data provided by the CONNX data synchronization tool, DataSync.
- Couchbase - Reads JSON data from Couchbase Server. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Cron Scheduler - Generates a record with the current datetime as scheduled by a cron expression. This is an orchestration stage.
- Directory - Reads fully-written files from a directory. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Elasticsearch - Reads data from an Elasticsearch cluster. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- File Tail - Reads lines of data from an active file after reading related archived files in the directory.
- Google BigQuery - Executes a query job and reads the result from Google BigQuery.
- Google Cloud Storage - Reads fully written objects from Google Cloud Storage.
- Google Pub/Sub Subscriber - Consumes messages from a Google Pub/Sub subscription. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Groovy Scripting - Runs a Groovy script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded pipeline.
- Hadoop FS Standalone - Reads fully-written files from HDFS or Azure Blob storage. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- HTTP Client - Reads data from a streaming HTTP resource URL.
- HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- JavaScript Scripting - Runs a JavaScript script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded pipeline.
- JDBC Multitable Consumer - Reads database data from multiple tables through a JDBC connection. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- JDBC Query Consumer - Reads database data using a user-defined SQL query through a JDBC connection.
- Jira - Reads data from a Jira instance.
- JMS Consumer - Reads messages from JMS.
- Jython Scripting - Runs a Jython script to create Data Collector records. Can create multiple threads to enable parallel processing in a multithreaded pipeline.
- Kafka Multitopic Consumer - Reads messages from multiple Kafka topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Kinesis Consumer - Reads data from Kinesis Streams, DynamoDB, and CloudWatch. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- MapR DB CDC - Reads changed MapR DB data that has been written to MapR Streams. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- MapR DB JSON - Reads JSON documents from MapR DB JSON tables.
- MapR FS Standalone - Reads fully-written files from MapR FS. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- MapR Multitopic Streams Consumer - Reads messages from multiple MapR Streams topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- MapR Streams Consumer - Reads messages from MapR Streams.
- MongoDB - Reads documents from MongoDB.
- MongoDB Atlas - Reads documents from MongoDB Atlas or MongoDB Enterprise Server.
- MongoDB Atlas CDC - Reads changes from a MongoDB Change Stream or Oplog.
- MongoDB Oplog - Reads entries from a MongoDB Oplog.
- MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
- MySQL Binary Log - Reads MySQL binary logs to generate change data capture records.
- OPC UA Client - Reads data from an OPC UA server.
- Oracle Bulkload - Reads data from multiple Oracle database tables, then stops the pipeline. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Oracle CDC - Processes change data capture information stored in redo logs using LogMiner. Use this origin for new development.
- Oracle CDC Client - Processes change data capture information stored in redo logs using LogMiner. This is the older Oracle origin. Use the Oracle CDC origin for new development.
- Oracle Multitable Consumer - Reads data from multiple Oracle database tables.
- PostgreSQL CDC Client - Reads PostgreSQL WAL data to generate change data capture records.
- Pulsar Consumer - Reads messages from Apache Pulsar topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Pulsar Consumer (Legacy) - Reads messages from Apache Pulsar topics.
- RabbitMQ Consumer - Reads messages from RabbitMQ.
- Redis Consumer - Reads messages from Redis.
- REST Service - Listens on an HTTP endpoint, parses the contents of all authorized requests, and sends responses back to the originating REST API. Creates multiple threads to enable parallel processing in a multithreaded pipeline. Use only in microservice pipelines.
- Salesforce - Reads data from Salesforce using the SOAP or Bulk API.
- Salesforce Bulk API 2.0 - Reads data from Salesforce using Salesforce Bulk API 2.0. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- SAP HANA Query Consumer - Reads data from an SAP HANA database using a user-defined SQL query.
- SFTP/FTP/FTPS Client - Reads files from an SFTP, FTP, or FTPS server.
- Snowflake Bulk - Reads data from Snowflake tables, then stops the pipeline. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- SQL Server CDC Client - Reads data from Microsoft SQL Server CDC tables. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- SQL Server Change Tracking - Reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
- Start Jobs - Starts one or more Control Hub jobs in parallel. This is an orchestration stage.
- TCP Server - Listens at the specified ports and processes incoming data over TCP/IP connections. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Development Origins
- Dev Data Generator
- Dev Random Source
- Dev Raw Data Source
- Dev Snapshot Replaying
For more information, see Development Stages.
Comparing Azure Storage Origins
We have several Azure storage origins. Make sure to use the best one for your needs. Here's a quick breakdown of some key differences:
Origin | Description |
---|---|
Azure Blob Storage |
|
Azure Data Lake Storage Gen2 |
|
Azure Data Lake Storage Gen2 (Legacy) |
|
Comparing HTTP Origins
Origin | Description |
---|---|
HTTP Client |
|
HTTP Server |
|
Web Client |
|
Comparing MapR Origins
Origin | Description |
---|---|
MapR DB CDC |
|
MapR DB JSON |
|
MapR FS Standalone |
|
MapR Multitopic Streams Consumer |
|
MapR Streams Consumer |
|
Comparing UDP Source Origins
The UDP Source and UDP Multithreaded Source origins are very similar. The main differentiator is that the UDP Multithreaded Source can use multiple threads to process data within the pipeline.
The UDP Multithreaded Source has a processing queue that aids multithreaded processing. However, using this queue can slow processing under certain circumstances.
Origin | Ideally Used When |
---|---|
UDP Multithreaded Source |
or
|
UDP Source |
|
Comparing WebSocket Origins
We have two WebSocket origins. Make sure to use the best one for your needs. Here's a quick breakdown of some key differences:
Origin | Description |
---|---|
WebSocket Client |
|
WebSocket Server |
|
Batch Size and Wait Time
For origin stages, the batch size determines the maximum number of records sent through the pipeline at one time. The batch wait time determines the time that the origin waits for data before sending a batch. At the end of the wait time, it sends the batch regardless of how many records the batch contains.
For example, a File Tail origin is configured for a batch size of 20 records and a batch wait time of 240 seconds. When data arrives quickly, File Tail fills a batch with 20 records and sends it through the pipeline immediately, creating a new batch and sending it again as soon as it is full. As incoming data slows, a remaining batch contains a few records, gaining an extra record periodically. 240 seconds after creating the batch, File Tail sends the partially-full batch through the pipeline. It immediately creates a new batch and starts a new countdown.
Configure the batch wait time based on your processing needs. You might reduce the batch wait time to ensure all data is processed within a specified time frame or to make regular contact with pipeline destinations. Use the default or increase the wait time if you prefer not to process partial or empty batches.
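The interaction between batch size and batch wait time can be sketched as a small simulation. This is only an illustration of the behavior described above, not Data Collector code: the function name `simulate_batches`, the `(arrival_time, record)` input shape, and the assumption that the first batch is created at time 0 are all hypothetical.

```python
from typing import Any, Iterable, List, Tuple


def simulate_batches(
    arrivals: Iterable[Tuple[float, Any]],
    batch_size: int,
    batch_wait_time: float,
) -> List[Tuple[float, List[Any]]]:
    """Return (send_time, batch) pairs for records arriving over time.

    `arrivals` must be sorted by arrival time (in seconds).
    Hypothetical model: the first batch is created at time 0, and a new
    batch is created the moment the previous one is sent.
    """
    sent: List[Tuple[float, List[Any]]] = []
    batch: List[Any] = []
    deadline = batch_wait_time  # wait time counted from batch creation

    for arrival_time, record in arrivals:
        # The wait time expired before this record arrived: send whatever
        # the batch holds (possibly nothing) and start a new countdown.
        while arrival_time >= deadline:
            sent.append((deadline, batch))
            batch = []
            deadline += batch_wait_time

        batch.append(record)

        # A full batch is sent immediately, without waiting.
        if len(batch) == batch_size:
            sent.append((arrival_time, batch))
            batch = []
            deadline = arrival_time + batch_wait_time

    # The last batch goes out when its wait time ends.
    sent.append((deadline, batch))
    return sent


arrivals = [(1, "a"), (2, "b"), (3, "c"), (5, "d"), (20, "e")]
for send_time, batch in simulate_batches(arrivals, batch_size=3, batch_wait_time=10):
    print(send_time, batch)
# 3 ['a', 'b', 'c']
# 13 ['d']
# 23 ['e']
```

In this sketch, a shorter wait time sends partial batches sooner, while a longer wait time gives slow data more time to fill a batch. A gap longer than the wait time yields an empty batch.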
Maximum Record Size
Most data formats have a property that limits the maximum size of the record that an origin can parse. For example, the delimited data format has a Max Record Length property, the JSON data format has Max Object Length, and the text data format has Max Line Length.
When the origin processes data that is larger than the specified length, the behavior differs based on the origin and the data format. For example, with some data formats, oversized records are handled based on the record error handling configured for the origin. With other data formats, the origin might truncate the data. For details on how an origin handles size overruns for each data format, see the "Data Formats" section of the origin documentation.
When available, the maximum record size properties are limited by the Data Collector parser buffer size, which is 1048576 bytes by default. So, when raising the maximum record size property in the origin does not change the origin's behavior, you might need to increase the Data Collector parser buffer size by configuring the parser.limit property in the Data Collector configuration properties.
Note that most of the maximum record size properties are specified in characters, while the Data Collector limit is defined in bytes.
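The difference between characters and bytes matters most for data with multibyte characters. The following sketch compares the two measures for a sample record. It assumes UTF-8 encoding and a simple byte-length comparison. The helper name `fits_parser_buffer` is hypothetical, and the way Data Collector actually measures data against the parser buffer is not described here.

```python
# Default parser buffer size stated above, in bytes.
DEFAULT_PARSER_LIMIT_BYTES = 1048576


def fits_parser_buffer(
    record: str,
    limit_bytes: int = DEFAULT_PARSER_LIMIT_BYTES,
    encoding: str = "utf-8",  # assumption: the data is UTF-8 encoded
) -> bool:
    """Return True if the encoded record is no larger than the byte limit."""
    return len(record.encode(encoding)) <= limit_bytes


ascii_record = "a" * 600_000          # 1 byte per character in UTF-8
accented_record = "\u00e9" * 600_000  # "é" takes 2 bytes per character in UTF-8

for name, record in [("ascii", ascii_record), ("accented", accented_record)]:
    print(
        name,
        len(record), "characters,",
        len(record.encode("utf-8")), "bytes,",
        "fits" if fits_parser_buffer(record) else "exceeds the default buffer",
    )
```

Both records have 600,000 characters, so both would pass a character-based maximum of, say, 1,000,000. Only the ASCII record stays under the default byte limit, which is the kind of case where raising parser.limit might be needed.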
Previewing Raw Source Data
Some origins allow you to preview raw source data. Preview the raw source data when reviewing it might help you configure the origin.
When you preview file data, you can use the real directory and actual source file. Or when appropriate, you might use a different file that is similar to the source.
When you preview Kafka data, you enter the connection information for the Kafka cluster.
The data used for the raw source preview in an origin stage is not used when previewing data for the pipeline.
- In the Properties panel for the origin stage, click the Raw Preview tab.
- For a Directory or File Tail origin, enter a directory and file name.
- For a Kafka Multitopic Consumer, enter the following information:
Kafka Raw Preview Property | Description |
---|---|
Topic | Kafka topic to read. |
Partition | Partition to read. |
Broker Host | Broker host name. Use any broker associated with the partition. |
Broker Port | Broker port number. |
Max Wait Time (secs) | Maximum amount of time the preview waits to receive data from Kafka. |
- Click Preview.