Previously, this property was a hard limit. When the directory contained more files, the pipeline failed.
The property was added in version 3.1.0.0 and is enabled by default, for new pipelines as well as for pipelines upgraded from versions earlier than 3.1.0.0.
The Field Replacer processor replaces the Value Replacer processor, which has been deprecated. The Field Replacer lets you define more complex conditions for replacing values. For example, it can replace values that fall within a specified range, which the Value Replacer cannot do.
StreamSets recommends that you update Value Replacer pipelines as soon as possible.
By default, the origin uses Keep Alive threads with an interval of one minute. Upgraded pipelines also use the new defaults.
Previously, StreamSets provided a single RPM package used to install Data Collector on any of these operating systems.
This property replaces the Query Interval property. For information about possible upgrade impact, see JDBC Multitable Consumer Query Interval Change.
Apache Kafka 0.9
CDH Kafka 2.0 (0.9.0) and 2.1 (0.9.0)
HDP 2.5 and 2.6
In addition, a new Capture Instance Name property replaces the Schema and Table Name Pattern properties from earlier releases.
You can simply use the schema name and table name pattern for the capture instance name. Or, you can specify the schema name and a capture instance name pattern, which lets you select specific CDC tables to process when a single data table has multiple CDC tables.
Upgraded pipelines require no changes.
For information about upgrading existing upsert pipelines, see Update MongoDB Destination Upsert Pipelines.
For information about upgrading a version of Data Collector with Cloudera Navigator integration enabled, see Disable Cloudera Navigator Integration.
DPM now refers to the performance management functions that reside in the cloud, such as live metrics and data SLAs. Customers who have purchased the StreamSets Enterprise Edition gain access to all SCH functionality and continue to have access to all DPM functionality as before.
To understand the end-to-end StreamSets Data Operations Platform and how the products fit together, visit https://streamsets.com/products/.
If you have pipelines that use these legacy stage libraries, you will need to update the pipelines to use a more current stage library or install the legacy stage library manually. For more information, see Update Pipelines using Legacy Stage Libraries.
Data Collector version 2.7.0.0 includes the following new features and enhancements:
You define the credentials required by external systems - user names, passwords, or access keys - in a Java keystore file or in Vault. Then you use credential expression language functions in JDBC stage properties to retrieve those values, instead of directly entering credential values in stage properties.
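For example, a JDBC stage might retrieve a password with a credential expression like the following. This is a sketch only; the store ID, group, and credential name shown here ("jks", "all", "jdbc/password") are illustrative values that depend on how your credential store is configured:

    ${credential:get("jks", "all", "jdbc/password")}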
Data Collector now provides beta support for publishing metadata about running pipelines to Cloudera Navigator. You can then use Cloudera Navigator to explore the pipeline metadata, including viewing lineage diagrams of the metadata.
By default, all new pipelines use partition processing when possible. Upgraded pipelines use multithreaded table processing to preserve previous behavior.
Hortonworks version 2.6 distribution of Apache Hadoop
For example, the Elasticsearch origin icon has been updated.
When you run a job on multiple Data Collectors, a remote pipeline instance runs on each of the Data Collectors. To view aggregated statistics for the job within DPM, you must configure the pipeline to write the statistics to a Kafka cluster, Amazon Kinesis Streams, or SDC RPC.
Since the Hive Metastore previously supported only Avro data, there is no upgrade impact.
Kudu destination enhancement - You can use the new Mutation Buffer Space property to set the buffer size that the Kudu client uses to write each batch.
New Shell executor - Use to execute shell scripts upon receiving an event.
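As a sketch of a script the Shell executor might run, the following compresses and archives a file after a file-closure event. It assumes the executor is configured to pass the closed file's path to the script in an environment variable named FILE_PATH; the variable name and archive location are illustrative only:

    #!/bin/sh
    # Illustrative Shell executor script: compress a closed file and move it
    # to an archive directory. FILE_PATH is assumed to be supplied through
    # the executor's environment variable configuration.
    gzip "$FILE_PATH" && mv "$FILE_PATH.gz" /archive/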
JDBC Query executor enhancement - A new Batch Commit property allows the executor to commit to the database after each batch. Previously, the executor did not call commits by default.
For new pipelines, the property is enabled by default. For upgraded pipelines, the property is disabled to prevent changes in pipeline behavior.
You upload the generated file to the StreamSets support team so that we can use the information to troubleshoot your support tickets.
Cluster mode enhancement - Cluster streaming mode now supports Spark 2.x. For information about using Spark 2.x stages with cluster mode, see Cluster Pipeline Limitations.
Upgrading to this version can require updating existing pipelines. For details, see Working with Cloudera CDH 5.11 or Later.
Data Collector version 2.5.0.0 includes the following new features and enhancements:
In previous versions, pipeline runtime parameters were named pipeline constants. You defined the constant values in the pipeline, and could not pass different values when you started the pipeline.
Data Collector now supports the Apache Kudu version 1.3.x stage library.
You can also configure the quote character to use around table, schema, and column names in the query. And you can configure the number of times a thread tries to read a batch of data after receiving an SQL error.
To handle transient connection or network errors, you can now specify how many times the origin should retry a query before stopping the pipeline.
You can also develop a destroy script that the processor runs once when the pipeline stops. Use a destroy script to close any connections or resources opened by the processor.
The processor also now provides beta support of cluster mode pipelines. In a development or test environment, you can use the processor in pipelines that process data from a Kafka or MapR cluster in cluster streaming mode. Do not use the Spark Evaluator processor in cluster mode pipelines in a production environment.
You can also use the Enclose Object Name property to enclose the database/schema, table, and column names in quotation marks when writing to the database.
Data Collector version 2.4.0.0 includes the following new features and enhancements:
Cloudera CDH version 5.10 distribution of Hadoop
The new multithreaded framework includes the following changes:
HTTP Server origin - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST requests. Use the HTTP Server origin to receive high volumes of HTTP POST requests using multiple threads.
Enhanced Dev Data Generator origin - Can create multiple threads for testing multithreaded pipelines.
Enhanced runtime statistics - Monitoring a pipeline displays aggregated runtime statistics for all threads in the pipeline. You can also view the number of runners (threads and pipeline instances) in use.
The MongoDB Oplog and Salesforce origins are now enabled for processing changed data by including the CRUD operation type in the sdc.operation.type record header attribute.
The Oracle CDC Client and the JDBC Query Consumer for Microsoft SQL Server, which were already CDC-enabled, now also include the CRUD operation type in the sdc.operation.type record header attribute.
Previous operation type header attributes are still supported for backward-compatibility.
The JDBC Tee processor and JDBC Producer can now process changed data based on CRUD operations in record headers. The stages also include a default operation and unsupported operation handling.
The MongoDB and Elasticsearch destinations now look for the CRUD operation in the sdc.operation.type record header attribute. The Elasticsearch destination includes a default operation and unsupported operation handling.
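For example, you can read the sdc.operation.type record header attribute described above with the record:attribute function, such as in a Stream Selector condition or an Expression Evaluator:

    ${record:attribute('sdc.operation.type')}

The attribute value identifies the CRUD operation (for example, an insert, update, or delete), so a condition can compare it against the value for the operation of interest.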
If you use file-based authentication, you can also now view all user accounts granted access to the Data Collector, including the roles and groups assigned to each user.
LDAP authentication enhancements - You can now configure Data Collector to use StartTLS to make secure connections to an LDAP server. You can also configure the userFilter property to define the LDAP user attribute used to log in to Data Collector, such as a username, uid, or email address.
Proxy configuration for outbound requests - You can now configure Data Collector to use an authenticated HTTP proxy for outbound requests to Dataflow Performance Manager (DPM).
Java garbage collector logging - Data Collector now enables logging for the Java garbage collector by default. Logs are written to $SDC_LOG/gc.log. You can disable the logging if needed.
Field attributes - Data Collector now supports field-level attributes. Use the Expression Evaluator to add field attributes.
New HTTP to Kafka origin - Listens on an HTTP endpoint and writes the contents of all authorized HTTP POST requests directly to Kafka. Use to read high volumes of HTTP POST requests and write them to Kafka.
New MapR DB JSON origin - Reads JSON documents from MapR DB JSON tables.
New MongoDB Oplog origin - Reads entries from a MongoDB Oplog. Use to process change information for data or database operations.
Directory origin enhancement - You can use regular expressions in addition to glob patterns to define the file name pattern to process files.
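For example, a glob pattern and a roughly equivalent, but stricter, regular expression for daily web log files might look like the following (the file names are illustrative):

    Glob pattern:       weblog-*.log
    Regular expression: weblog-\d{8}\.log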
HTTP Client origin enhancement - You can now configure the origin to use the OAuth 2 protocol to connect to an HTTP service.
HTTP Client processor enhancements - You can now configure the processor to use the OAuth 2 protocol to connect to an HTTP service. You can also configure a rate limit for the processor, which defines the maximum number of requests to make per second.
JDBC Lookup processor enhancements - You can now configure the processor to enable auto-commit mode for the JDBC connection. You can also configure the processor to use a default value if the database does not return a lookup value for a column.
Salesforce Lookup processor enhancement - You can now configure the processor to use a default value if Salesforce does not return a lookup value for a field.
XML Parser enhancement - A new Multiple Values Behavior property allows you to specify the behavior when you define a delimiter element and the document includes more than one value: Return the first value as a record, return one record with a list field for each value, or return all values as records.
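As an illustration, suppose the delimiter element is order and a document contains two order elements (sample data only):

    <orders>
      <order id="1"/>
      <order id="2"/>
    </orders>

With the first-value behavior, the parser returns one record based on the first order; with the list-field behavior, it returns a single record containing both orders in a list field; and with the all-values behavior, it returns two records, one per order.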
Azure Data Lake Store destination enhancement - You can now use the destination in cluster batch pipelines. You can also process binary and protobuf data, use record header attributes to write records to files and roll files, and configure a file suffix and the maximum number of records that can be written to a file.
Elasticsearch destination enhancement - The destination now uses the Elasticsearch HTTP API. With this API, the Elasticsearch version 5 stage library is compatible with all versions of Elasticsearch, and earlier stage library versions have been removed. Elasticsearch is no longer supported on Java 7, so verify that Java 8 is installed on the Data Collector machine and remove the stage library from the blacklist property in $SDC_CONF/sdc.properties before you use the destination.
You can also now configure the destination to perform any of the following CRUD operations: create, update, delete, or index.
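As a sketch of the blacklist change mentioned above, remove the Elasticsearch entry from the comma-separated blacklist value in $SDC_CONF/sdc.properties. The property name and stage library name shown here are assumptions; verify them against your installation:

    # Fragment of $SDC_CONF/sdc.properties (names shown are assumptions)
    # Before: the Elasticsearch stage library is blacklisted
    system.stagelibs.blacklist=streamsets-datacollector-elasticsearch_5-lib
    # After: the Elasticsearch entry is removed from the list
    system.stagelibs.blacklist=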
Hive Metastore destination enhancement - New table events now include information about columns and partitions in the table.
Hadoop FS, Local FS, and MapR FS destination enhancement - The destinations now support recovery after an unexpected stop of the pipeline by renaming temporary files when the pipeline restarts.
JDBC Query executor enhancement - You can now configure the executor to enable auto-commit mode for the JDBC connection.
pipeline:title() - Returns the pipeline title or name.
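For example, an Expression Evaluator can add the pipeline title to records with an expression such as:

    ${pipeline:title()}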
Previously, the core installation also included the Groovy, Jython, and statistics stage libraries.
Related properties, such as Charset, Compression Format, and Ignore Control Characters, now appear on the Data Format tab as well.