Pre Upgrade Tasks
In some situations, you must complete tasks before you upgrade.
Verify Installation Requirements
The minimum requirements for Data Collector can change with each version. Before you upgrade to a new Data Collector version, verify that the machine meets the latest minimum requirements as described in Installation Requirements.
Complete Control Hub On-premises Prerequisite
If you use Data Collector with Control Hub On-premises, you must complete a prerequisite task before you upgrade to Data Collector version 4.0.x or later.
For details, see the StreamSets Support portal.
Upgrading Full and Core Tarball Installations
Starting with version 3.19.0, the full and core tarball installation methods are available only to users with an enterprise account.
If you have an enterprise account, you can download the full or core installation packages from the StreamSets Support portal. After you perform the upgrade, you can install or remove stage libraries as needed.
If you do not have an enterprise account, go to the StreamSets website to upgrade from a full or core installation.
Upgrade to Spark 2.1 or Later
Data Collector version 3.3.0 introduces cluster streaming mode with support for Kafka security features such as SSL/TLS and Kerberos authentication using Spark 2.1 or later and Kafka 0.10.0.0 or later.
Category | Stage Libraries |
---|---|
New stage libraries | The following new stage libraries include the Kafka Consumer origin for cluster mode pipelines: |
Changed stage libraries | The following stage library no longer includes the Kafka Consumer origin for cluster mode pipelines: The following stage libraries were upgraded to use Spark 2.1: |
Removed stage libraries | The following stage libraries are removed: During the upgrade process, these removed stage libraries are replaced with the new streamsets-datacollector-cdh-spark_2_1-lib stage library. |
Removed legacy stage libraries | The following legacy stage libraries are removed: |
Changed legacy stage libraries | The following legacy stage libraries no longer include the Spark Evaluator processor: |
- Cloudera - Cloudera Distribution of Spark 2.1 release 1 or later is supported. For more information, see Spark 2 Requirements.
- Hortonworks - Hortonworks Data Platform (HDP) 2.6 or later includes Spark 2.2.0. For more information, see the HDP 2.6 Release Notes.
- MapR - MapR with MapR Expansion Pack 3.0 or later is supported. For more information, see the HPE Ezmeral Data Fabric documentation.
Then, you must configure upgraded pipelines to work with the upgraded system, as described in Working with Upgraded External Systems.
- Spark Evaluator processor - If the Spark application was previously built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1. Likewise, if you wrote the custom Spark class in Scala and compiled the application with Scala 2.10, you must recompile it with Scala 2.11.
- Spark executor - If the Spark application was previously built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1 and Scala 2.11.
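The rebuild rules for the two stages can be expressed as a small check. The following Python sketch is illustrative only: the function names and the returned action strings are hypothetical and are not part of Data Collector. It assumes that you already know the Spark and Scala versions that the application was built with.

```python
def _major_minor(version):
    """Return the first two numeric components of a version such as '2.0.2'."""
    return tuple(int(part) for part in version.split(".")[:2])


def spark_evaluator_actions(spark_version, scala_version=None):
    """List the rebuild actions for a Spark Evaluator application (hypothetical helper).

    scala_version is None when the custom Spark class was not written in Scala.
    """
    actions = []
    if _major_minor(spark_version) < (2, 1):
        actions.append("rebuild with Spark 2.1")
    if scala_version is not None and _major_minor(scala_version) == (2, 10):
        actions.append("recompile with Scala 2.11")
    return actions


def spark_executor_actions(spark_version):
    """List the rebuild actions for a Spark executor application (hypothetical helper)."""
    if _major_minor(spark_version) < (2, 1):
        return ["rebuild with Spark 2.1 and Scala 2.11"]
    return []
```

For example, `spark_evaluator_actions("2.0.2", "2.10.6")` reports both actions, while an application already built with Spark 2.1 and Scala 2.11 yields an empty list.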
Migrate to Java 8
Starting with version 2.5.0.0, Data Collector requires Java 8. If your previous Data Collector version ran on Java 7, you must migrate to Java 8 before upgrading to the latest Data Collector version.
All services that use Data Collector JAR files also must run on Java 8. This means that your Hadoop cluster must run on Java 8 if you are using cluster pipelines, the Spark Executor, or the MapReduce Executor.
To migrate to Java 8, complete the following steps before upgrading to the latest Data Collector version:
- Shut down Data Collector.
- Install Java 8 on the Data Collector machine.
- If you customized Java configuration options in the SDC_JAVA7_OPTS environment variable and if those options are valid in Java 8, migrate those customizations to the SDC_JAVA8_OPTS environment variable.
- Restart Data Collector and verify that it works as expected.
- If any pipelines include the JavaScript Evaluator processor, open the pipelines and validate the scripts on Java 8.
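Two of the steps above lend themselves to a quick scripted check: confirming which Java version is installed, and carrying options over from SDC_JAVA7_OPTS to SDC_JAVA8_OPTS. The following Python sketch is a rough aid, not a StreamSets tool. Both function names are hypothetical. The only options it filters out are the PermGen flags, which Java 8 no longer uses; treating those as the only invalid options is an assumption, so review the result against your own configuration.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) is not None else 0
    # Java 8 and earlier report themselves as 1.x, for example 1.8.0_292.
    return second if first == 1 else first


# Assumption: only the PermGen flags are dropped. Java 8 removed PermGen,
# so these Java 7 options no longer have any effect.
PERMGEN_PREFIXES = ("-XX:PermSize=", "-XX:MaxPermSize=")


def migrate_java_opts(java7_opts):
    """Return the SDC_JAVA7_OPTS value with the PermGen flags removed."""
    kept = [opt for opt in java7_opts.split() if not opt.startswith(PERMGEN_PREFIXES)]
    return " ".join(kept)
```

You might pass the output of `java -version` to `java_major_version` and proceed only when it returns 8, then use the string returned by `migrate_java_opts` as a starting point for SDC_JAVA8_OPTS.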
Upgrade Cluster Streaming Pipelines
If you use cluster pipelines that run in cluster streaming mode and you are upgrading from a version earlier than 2.3.0.0, you must upgrade to Data Collector version 2.3.0.0 before upgrading to the latest version.
- Upgrade to Data Collector version 2.3.0.0.
- Start the upgraded Data Collector version 2.3.0.0 and run the cluster pipelines so that they process some data.
After verifying that the upgrade to Data Collector version 2.3.0.0 was successful, upgrade to the latest version.
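The two-step path described above can be sketched as a version comparison. The following Python example is a minimal illustration: `upgrade_path` and `parse_version` are hypothetical names, and the target version passed in is whatever release you are upgrading to. It assumes purely numeric, dot-separated version strings.

```python
INTERMEDIATE_VERSION = "2.3.0.0"


def parse_version(version):
    """Convert a version such as '2.2.1.0' to a tuple, padded to four components."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (4 - len(parts))
    return tuple(parts)


def upgrade_path(current, target, uses_cluster_streaming):
    """Return the ordered list of versions to upgrade to (hypothetical helper)."""
    path = []
    needs_intermediate = (
        uses_cluster_streaming
        and parse_version(current) < parse_version(INTERMEDIATE_VERSION)
    )
    if needs_intermediate:
        # Run the cluster pipelines on 2.3.0.0 before moving on.
        path.append(INTERMEDIATE_VERSION)
    if not path or parse_version(target) != parse_version(path[-1]):
        path.append(target)
    return path
```

For example, `upgrade_path("2.2.1.0", "4.0.0", True)` returns `["2.3.0.0", "4.0.0"]`, while the same call with `uses_cluster_streaming=False` returns only `["4.0.0"]`.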