Pre-Upgrade Tasks

In some situations, you must complete tasks before you upgrade.

Verify Installation Requirements

The minimum requirements for Data Collector can change with each version. Before you upgrade to a new Data Collector version, verify that the machine meets the latest minimum requirements as described in Installation Requirements.
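The exact minimums vary by Data Collector version, so check the Installation Requirements page for the real numbers. As a minimal sketch, a helper like the following can compare a machine value against a required minimum; the thresholds shown (such as 32768 open files) are placeholders, not the documented requirements.

```shell
#!/bin/sh
# Hedged sketch: compare an actual machine value against a required minimum.
# The threshold used in the example below is a placeholder -- take the real
# values from the Installation Requirements documentation.
meets_minimum() {
  actual="$1"; required="$2"
  # "ulimit -n" can report "unlimited", which always satisfies the minimum.
  if [ "$actual" = "unlimited" ] || [ "$actual" -ge "$required" ]; then
    echo yes
  else
    echo no
  fi
}

# Example: check this machine's open file descriptor limit
# against a placeholder minimum of 32768.
meets_minimum "$(ulimit -n)" 32768
```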

Complete Control Hub On-premises Prerequisite

If you use Data Collector with Control Hub On-premises, you must complete a prerequisite task before you upgrade to Data Collector version 4.0.x or later.

For details, see the StreamSets Support portal.

Upgrading Full and Core Tarball Installations

Starting with version 3.19.0, the full and core tarball installation methods are available only to users with an enterprise account.

If you have an enterprise account, you can download the full or core installation packages from the StreamSets Support portal. After you perform the upgrade, you can install or remove stage libraries as needed.

If you do not have an enterprise account, go to the StreamSets website to upgrade from a full or core installation.
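When deciding which stage libraries to install or remove after a tarball upgrade, it can help to compare the libraries present in the old and new installations. The sketch below assumes the usual tarball layout, where stage libraries live under `streamsets-libs` inside the installation directory; adjust the path if your layout differs.

```shell
#!/bin/sh
# Hedged sketch: list and compare stage libraries between two tarball
# installations. Assumes stage libraries are directories under
# <install-dir>/streamsets-libs, which is the typical tarball layout.
list_stage_libs() {
  libs_dir="$1/streamsets-libs"
  [ -d "$libs_dir" ] && ls -1 "$libs_dir" | sort
}

# Print libraries present in the first installation but missing from the
# second; comm -23 keeps lines unique to its first input.
diff_stage_libs() {
  t1=$(mktemp); t2=$(mktemp)
  list_stage_libs "$1" > "$t1"
  list_stage_libs "$2" > "$t2"
  comm -23 "$t1" "$t2"
  rm -f "$t1" "$t2"
}
```

Running `diff_stage_libs /opt/sdc-old /opt/sdc-new` would then show which libraries you still need to install in the new installation.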

Upgrade to Spark 2.1 or Later

Data Collector version 3.3.0 introduces cluster streaming mode with support for Kafka security features such as SSL/TLS and Kerberos authentication, using Spark 2.1 or later and a supported Kafka version.

As a result, using Spark 1.x for cluster streaming mode, the Spark Evaluator processor, and the Spark executor was deprecated in an earlier version, and support for Spark 1.x is removed in version 3.3.0. If you use cluster streaming mode, the Spark Evaluator processor, or the Spark executor, you must upgrade to Spark 2.1 or later. In addition, if you use cluster streaming mode for Kafka, you must also upgrade to a supported Kafka version.
Note: You can continue to use earlier Kafka versions in standalone pipelines. Or you can continue to use an earlier version of Data Collector to run Kafka cluster pipelines until you can upgrade Kafka.
Because Spark 1.x is no longer supported and earlier Kafka versions are no longer supported in cluster pipelines, the following stage libraries have changed:
New stage libraries - The following new stage libraries include the Kafka Consumer origin for cluster mode pipelines:
  • streamsets-datacollector-cdh-spark_2_1-lib
  • streamsets-datacollector-cdh-spark_2_2-lib
  • streamsets-datacollector-cdh-spark_2_3-lib
Changed stage libraries - The following stage library no longer includes the Kafka Consumer origin for cluster mode pipelines:
  • streamsets-datacollector-hdp_2_4-lib
The following stage libraries were upgraded to use Spark 2.1:
  • streamsets-datacollector-hdp_2_6-lib
  • streamsets-datacollector-mapr_5_2-lib
  • streamsets-datacollector-mapr_6_0-mep4-lib
Removed stage libraries - The following stage libraries are removed:
  • streamsets-datacollector-cdh_5_8-cluster-cdh_kafka_2_0-lib
  • streamsets-datacollector-cdh_5_9-cluster-cdh_kafka_2_0-lib
  • streamsets-datacollector-cdh_5_10-cluster-cdh_kafka_2_1-lib
  • streamsets-datacollector-cdh_5_11-cluster-cdh_kafka_2_1-lib
  • streamsets-datacollector-cdh_5_12-cluster-cdh_kafka_2_1-lib
  • streamsets-datacollector-cdh_5_13-cluster-cdh_kafka_2_1-lib
  • streamsets-datacollector-cdh_5_14-cluster-cdh_kafka_2_1-lib

During the upgrade process, these removed stage libraries are replaced with the new streamsets-datacollector-cdh-spark_2_1-lib stage library.
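The replacement described above is mechanical: each of the removed cdh_5_8 through cdh_5_14 cluster Kafka libraries maps to the single new streamsets-datacollector-cdh-spark_2_1-lib library, while other libraries (including the removed legacy libraries below, which have no replacement) are unaffected. A small sketch of that mapping:

```shell
#!/bin/sh
# Hedged sketch: map a removed cdh_5_8 .. cdh_5_14 cluster Kafka stage
# library to its replacement, streamsets-datacollector-cdh-spark_2_1-lib.
# Any other library name (including removed legacy libraries, which have
# no replacement) is returned unchanged.
replacement_for() {
  case "$1" in
    streamsets-datacollector-cdh_5_8-cluster-cdh_kafka_2_0-lib|\
    streamsets-datacollector-cdh_5_9-cluster-cdh_kafka_2_0-lib|\
    streamsets-datacollector-cdh_5_1[0-4]-cluster-cdh_kafka_2_1-lib)
      echo streamsets-datacollector-cdh-spark_2_1-lib ;;
    *)
      echo "$1" ;;
  esac
}
```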

Removed legacy stage libraries - The following legacy stage libraries are removed:
  • streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_2-lib
  • streamsets-datacollector-cdh_5_4-cluster-cdh_kafka_1_3-lib
  • streamsets-datacollector-cdh_5_5-cluster-cdh_kafka_1_3-lib
  • streamsets-datacollector-cdh_5_7-cluster-cdh_kafka_2_0-lib
Changed legacy stage libraries - The following legacy stage libraries no longer include the Spark Evaluator processor:
  • streamsets-datacollector-cdh_5_4-lib
  • streamsets-datacollector-cdh_5_5-lib
To continue to use cluster streaming mode, you must upgrade to a newer Cloudera CDH or Hortonworks Hadoop distribution and to a supported Kafka version. The major Hadoop distribution vendors provide a means for Spark 1.x and Spark 2.x to coexist on the same cluster, so you can use both versions in your clusters. Data Collector supports specific Spark 2.x versions for each Hadoop distribution vendor.

Then, you must configure upgraded pipelines to work with the upgraded system, as described in Working with Upgraded External Systems.

In addition to selecting the upgraded stage library version for each stage that connects to the upgraded CDH, HDP, or Kafka system, you might need to perform additional tasks for the following stages:
  • Spark Evaluator processor - If the Spark application was previously built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1. Or if you used Scala to write the custom Spark class, and the application was compiled with Scala 2.10, you must recompile it with Scala 2.11.
  • Spark executor - If the Spark application was previously built with Spark 2.0 or earlier, you must rebuild it with Spark 2.1 and Scala 2.11.
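Before rebuilding, it can be useful to flag which applications still target the old versions. The sketch below assumes an sbt build definition with conventional `scalaVersion` and `spark-core` dependency lines; it is only a heuristic check, not a substitute for reviewing the build.

```shell
#!/bin/sh
# Hedged sketch: scan an sbt build file (layout assumed) for the rebuild
# requirements above -- Scala 2.10 builds must move to Scala 2.11, and
# Spark 1.x dependencies must move to Spark 2.1 or later.
check_spark_build() {
  if grep -q 'scalaVersion.*2\.10' "$1"; then
    echo "recompile with Scala 2.11"
  fi
  # Matches a spark-core dependency pinned to a 1.x version string.
  if grep -q 'spark-core.*"1\.' "$1"; then
    echo "rebuild with Spark 2.1 or later"
  fi
}
```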

Migrate to Java 8

The latest Data Collector versions require Java 8. If your previous Data Collector version ran on Java 7, you must migrate to Java 8 before upgrading to the latest Data Collector version.

All services that use Data Collector JAR files also must run on Java 8. This means that your Hadoop cluster must run on Java 8 if you are using cluster pipelines, the Spark Executor, or the MapReduce Executor.

To migrate to Java 8, complete the following steps before upgrading to the latest Data Collector version:

  1. Shut down Data Collector.
  2. Install Java 8 on the Data Collector machine.
  3. If you customized Java configuration options in the SDC_JAVA7_OPTS environment variable and if those options are valid in Java 8, migrate those customizations to the SDC_JAVA8_OPTS environment variable.
  4. Restart Data Collector and verify that it works as expected.
  5. If any pipelines include the JavaScript Evaluator processor, open the pipelines and validate the scripts on Java 8.
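Step 3 above can be sketched as a small script. It assumes the environment variables are set with `export NAME="value"` lines in the sdc-env.sh file (tarball installs typically keep it under the installation's libexec directory); since only options that remain valid on Java 8 should be carried over, review the result by hand.

```shell
#!/bin/sh
# Hedged sketch of step 3: copy the customizations assigned to
# SDC_JAVA7_OPTS into a new SDC_JAVA8_OPTS line in an sdc-env.sh file.
# Assumes lines of the form: export SDC_JAVA7_OPTS="..."
# Review the result by hand -- only Java 8-valid options should survive.
migrate_java_opts() {
  env_file="$1"
  # Extract the quoted value assigned to SDC_JAVA7_OPTS.
  opts=$(sed -n 's/^export SDC_JAVA7_OPTS="\(.*\)"$/\1/p' "$env_file")
  if [ -n "$opts" ]; then
    # Append the same options under the Java 8 variable name.
    echo "export SDC_JAVA8_OPTS=\"$opts\"" >> "$env_file"
  fi
}
```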

Upgrade Cluster Streaming Pipelines

If you use cluster pipelines that run in cluster streaming mode and you are upgrading from a version earlier than the one that changed the pipeline recovery mechanism, you must first upgrade to that intermediate Data Collector version before upgrading to the latest version.

Previously, Data Collector used the Spark checkpoint mechanism to recover cluster pipelines after a failure. In later versions, Data Collector maintains the state of cluster pipelines without relying on Spark checkpoints.
Warning: If you upgrade from an earlier version directly to the latest version, without first upgrading to the intermediate version, cluster pipelines fail when starting.
Before you upgrade to the latest version, complete the following general tasks:
  1. Upgrade to the intermediate Data Collector version.
  2. Start the upgraded Data Collector version and run the cluster pipelines so that they process some data.

After verifying that the intermediate upgrade was successful, upgrade to the latest version.