Installation Requirements

Install StreamSets Control Hub on a machine that meets the following minimum requirements:

Component and minimum requirement:

Operating system
Use one of the following operating systems and versions:
  • CentOS 6.x or 7.x
  • Oracle Linux 6.x - 8.x
  • Red Hat Enterprise Linux 6.x - 8.x
  • Ubuntu 14.04 LTS

Java
Use one of the following Java versions:
  • Oracle Java 8 or OpenJDK 8
  • Oracle Java 11 or OpenJDK 11
Note: Java 8u161 or earlier also requires that you download Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files 8.
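
The Java note above can be turned into a quick pre-install check. The following Python sketch is only an illustration: the function name `needs_jce_policy_files` is hypothetical, it assumes the legacy `1.8.0_NNN` version format printed by `java -version`, and the update 161 cutoff is copied from the note rather than verified independently.

```python
import re

# Cutoff copied from the note above: Java 8 update 161 or earlier.
JCE_MAX_UPDATE = 161


def needs_jce_policy_files(version: str) -> bool:
    """Return True if the version string is Java 8u161 or earlier.

    Accepts legacy Java 8 strings such as "1.8.0_151" or "1.8.0_292-b10".
    Any other format, for example "11.0.2", returns False.
    """
    match = re.fullmatch(r"1\.8\.0(?:_(\d+))?(?:-.*)?", version.strip())
    if match is None:
        return False  # not a Java 8 version string in the legacy format
    update = int(match.group(1) or 0)  # "1.8.0" has no update number
    return update <= JCE_MAX_UPDATE


if __name__ == "__main__":
    for candidate in ("1.8.0_151", "1.8.0_202", "11.0.2"):
        print(candidate, needs_jce_policy_files(candidate))
```

The version string itself would come from the output of `java -version` on the target machine.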

The remaining requirements depend on whether you are installing a single Control Hub instance for a development environment or multiple Control Hub instances for a highly available production environment:

Component    Single Installation   Multiple Installations for High Availability
CPU (cores)  8                     4
RAM          15 GB                 7.5 GB
Disk space   50 GB                 30 GB
If installing on Amazon EC2 instances, install Control Hub on a separate volume instead of the root volume. Use the following instance types:
  • Single installation - c4.2xlarge
  • Multiple installations for high availability - c4.xlarge
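
As a rough way to compare a machine against the Control Hub minimums above, the following Python sketch checks CPU, RAM, and disk values against the table. The names `Minimums`, `shortfalls`, and `probe_host` are hypothetical. The sketch also assumes that the table values can be read as GiB, that free space on the install volume is the relevant disk figure, and that the RAM probe runs on Linux.

```python
import os
import shutil
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Minimums:
    cpu_cores: int
    ram_gb: float
    disk_gb: float


# Values copied from the table above.
SINGLE = Minimums(cpu_cores=8, ram_gb=15, disk_gb=50)
HIGH_AVAILABILITY = Minimums(cpu_cores=4, ram_gb=7.5, disk_gb=30)


def shortfalls(cpu_cores: int, ram_gb: float, disk_gb: float,
               required: Minimums) -> List[str]:
    """Return one message for each resource below the required minimum."""
    problems = []
    if cpu_cores < required.cpu_cores:
        problems.append(f"CPU: {cpu_cores} < {required.cpu_cores} cores")
    if ram_gb < required.ram_gb:
        problems.append(f"RAM: {ram_gb:.1f} < {required.ram_gb} GB")
    if disk_gb < required.disk_gb:
        problems.append(f"Disk: {disk_gb:.1f} < {required.disk_gb} GB")
    return problems


def probe_host(install_path: str = "/") -> Tuple[int, float, float]:
    """Best-effort probe of the local machine (RAM probe is Linux only)."""
    gib = 1024 ** 3
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(install_path).free
    return cores, ram_bytes / gib, disk_bytes / gib
```

For example, `shortfalls(*probe_host("/opt"), SINGLE)` would list any resources that fall short for a single installation, where `/opt` is a placeholder for the intended install volume. The operating system often reports slightly less memory than the nominal instance size, so a small RAM shortfall might not indicate a real problem.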

General Access Requirements

After installation, Control Hub requires access to the following components. These components can be local or remote to the Control Hub installations:

Component and minimum requirement:

SMTP account
SMTP account to send emails.

Load balancer
Load balancer to set up a highly available Control Hub system. We recommend using a Layer 7 load balancer such as HAProxy, NGINX, or F5. Required for a production environment, optional for a development environment.

Browser
Use the latest version of one of the following browsers:
  • Google Chrome
  • Firefox
  • Safari
Ensure that the browser can access registered Data Collectors and Transformers.

StreamSets Data Collector
StreamSets recommends using the latest version of Data Collector. The minimum supported Data Collector version depends on how you use Data Collector:
  • Version 2.1.0.0 or later is required to design pipelines in Data Collector and to run standalone and cluster pipelines from jobs.
  • Version 3.0.0.0 or later is required as the authoring Data Collector used to design pipelines in Control Hub.
  • Version 3.2.0.0 or later is required as the authoring Data Collector used to design pipeline fragments.
  • Version 3.4.0 or later is required to monitor the CPU load and memory usage of each Data Collector from within Control Hub.
  • Version 3.19.0 or later is required to create and use connections.
If needed, you can customize the supported Data Collector version range.

StreamSets Transformer
StreamSets recommends using the latest version of Transformer to design and execute Transformer pipelines from Control Hub. Version 3.16.0 or later is required to use connections.

Statistics aggregator
Use one of the following systems to aggregate pipeline statistics when jobs run on multiple Data Collectors:
  • Amazon Kinesis Streams
  • Kafka version supported by Data Collector
  • MapR Streams version supported by Data Collector
Note: In a development environment, you can also use SDC RPC to aggregate pipeline statistics. Using SDC RPC to aggregate statistics is not highly available and might cause the loss of some data. It should be used for development purposes only.
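
The Data Collector version thresholds listed earlier in this table lend themselves to a simple lookup. The following Python sketch, with the hypothetical names `FEATURE_MINIMUMS`, `parse_version`, and `supported_features`, shows one way to list the capabilities available for a given Data Collector version. It assumes purely numeric, dot-separated version strings and is not part of any StreamSets API.

```python
from typing import List, Tuple

# Minimum Data Collector versions copied from the list above.
FEATURE_MINIMUMS = [
    ((2, 1, 0, 0), "design pipelines in Data Collector and run pipelines from jobs"),
    ((3, 0, 0, 0), "authoring Data Collector for pipelines designed in Control Hub"),
    ((3, 2, 0, 0), "authoring Data Collector for pipeline fragments"),
    ((3, 4, 0, 0), "monitor CPU load and memory usage from Control Hub"),
    ((3, 19, 0, 0), "create and use connections"),
]


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "3.19.0" into (3, 19, 0, 0), padding to four components."""
    parts = [int(piece) for piece in text.split(".")]
    return tuple(parts + [0] * (4 - len(parts)))


def supported_features(sdc_version: str) -> List[str]:
    """Return the features whose minimum version is met by sdc_version."""
    version = parse_version(sdc_version)
    return [feature for minimum, feature in FEATURE_MINIMUMS if version >= minimum]


if __name__ == "__main__":
    for feature in supported_features("3.4.0"):
        print(feature)
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating 3.19 as older than 3.2.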

Relational Database Requirements

Control Hub supports MariaDB, MySQL, or PostgreSQL for the relational database instance.

MariaDB Requirements

For a single Control Hub instance, use MariaDB 10.x as the relational database. Control Hub is fully tested with MariaDB 10.11.

For a highly available Control Hub system, use MariaDB Galera Cluster 10.x as the relational database.

MariaDB installations must meet the following minimum requirements:

Component    Single Installation   Multiple Installations for High Availability
CPU (cores)  4                     4
RAM          30.5 GB               30.5 GB
Disk space   50 GB                 100 GB
If installing on Amazon EC2 instances, use the following instance types:
  • Single installation - db.r3.xlarge
  • Multiple installations for high availability - db.r3.xlarge

MySQL Requirements

For a single Control Hub instance, use MySQL 5.6, 5.7, or 8.x as the relational database. Control Hub is fully tested with MySQL 8.0.28.

For a highly available Control Hub system, use MySQL Enterprise High Availability 5.6, 5.7, or 8.x as the relational database.

Important: Although Control Hub continues to support MySQL 5.6 and 5.7, these earlier MySQL versions have reached end of life. After you upgrade Control Hub to version 3.56.x, consider upgrading the database to MySQL 8.x as a post-upgrade task.

MySQL installations must meet the following minimum requirements:

Component    Single Installation   Multiple Installations for High Availability
CPU (cores)  4                     4
RAM          30.5 GB               30.5 GB
Disk space   50 GB                 100 GB
If installing on Amazon EC2 instances, use the following instance types:
  • Single installation - db.r3.xlarge
  • Multiple installations for high availability - db.r3.xlarge

PostgreSQL Requirements

For a single Control Hub instance, use PostgreSQL 9.4, 9.6, 11.x, or 14.x as the relational database. Control Hub is fully tested with PostgreSQL 11.10 and 14.6.

For a highly available Control Hub system, use PostgreSQL 9.4, 9.6, 11.x, or 14.x with high availability enabled.

PostgreSQL installations must meet the following minimum requirements:

Component    Single Installation   Multiple Installations for High Availability
CPU (cores)  4                     4
RAM          30.5 GB               30.5 GB
Disk space   50 GB                 100 GB
If installing on Amazon EC2 instances, use the following instance types:
  • Single installation - db.r3.xlarge
  • Multiple installations for high availability - db.r3.xlarge
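
The supported version ranges for MariaDB, MySQL, and PostgreSQL described in the sections above can be captured in a small lookup. The following Python sketch is only an illustration: the names `SUPPORTED` and `is_supported` are hypothetical, and it treats "10.x" style ranges as "any version whose leading components match".

```python
from typing import Dict, List

# Supported version prefixes copied from the requirements above.
SUPPORTED: Dict[str, List[str]] = {
    "mariadb": ["10"],
    "mysql": ["5.6", "5.7", "8"],
    "postgresql": ["9.4", "9.6", "11", "14"],
}


def is_supported(database: str, version: str) -> bool:
    """Return True if the leading version components match a supported prefix."""
    prefixes = SUPPORTED.get(database.lower(), [])
    components = version.split(".")
    for prefix in prefixes:
        wanted = prefix.split(".")
        # Compare whole components so that "80.1" does not match prefix "8".
        if components[: len(wanted)] == wanted:
            return True
    return False


if __name__ == "__main__":
    print(is_supported("PostgreSQL", "14.6"))
    print(is_supported("MySQL", "5.5.62"))
```

The version string would typically come from the database itself, for example from a `SELECT version()` query, and might need trimming before it is passed in.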

Time Series Database Requirements

For a single Control Hub instance, use InfluxDB 1.3.x, 1.7.x, or 1.9.x as the time series database.

For a highly available Control Hub system, use InfluxDB Enterprise 1.3.x, 1.7.x, or 1.9.x with a minimum of 2 data nodes and 3 meta nodes in the cluster. A single data node and a single meta node can be deployed to the same server.

InfluxDB installations must meet the following minimum requirements:

Component    Single Installation   Multiple Installations for High Availability
CPU (cores)  4                     8
RAM          30.5 GB               61 GB
Disk space   250 GB                500 GB
If installing on Amazon EC2 instances, use the following instance types:
  • Single installation - r4.xlarge
  • Multiple installations for high availability - r3.2xlarge

Default Ports

The following table lists the default ports exposed to Control Hub clients and how they are used. Note that the default port numbers can be changed during installation.

In a development environment, configure network routes and firewalls so that web UI clients and registered Data Collectors and Provisioning Agents can reach the Control Hub IP addresses.

In a highly available production environment, configure network routes and firewalls so that the Control Hub instances, web UI clients, and registered Data Collectors and Provisioning Agents can reach the load balancer.

System: Control Hub
Default port:
  • HTTP - 18631
  • HTTPS - depends on dpm.properties configuration
Protocol: TCP
Usage: Access to the Control Hub web-based UI and API for a single Control Hub instance in a development environment. Used by developers and administrators to access the UI. Used by registered Data Collectors and Provisioning Agents to access the API.

System: Control Hub Admin tool
Default port:
  • HTTP - 18632
  • HTTPS - depends on dpm.properties configuration
Protocol: TCP
Usage: Access to the Control Hub Admin tool web-based UI for a single Control Hub instance in a development environment. Used by administrators to access the UI.

System: Load balancer
Default port: Depends on the chosen load balancer
Protocol: TCP
Usage: When using multiple Control Hub instances in a highly available production environment, both Control Hub and the Control Hub Admin tool are accessed through a load balancer.

The following table lists the default ports of the external systems that Control Hub depends on and how they are used. These systems might be configured to use different ports, so confirm the actual port numbers with your systems administrator.

External System  Default Port               Protocol  Usage
MariaDB          3306                       TCP       Relational database that stores Control Hub application data.
MySQL            3306                       TCP       Relational database that stores Control Hub application data.
PostgreSQL       5432                       TCP       Relational database that stores Control Hub application data.
InfluxDB         8086                       TCP       Time series database that stores metrics.
LDAP or LDAPS    389 (LDAP) or 636 (LDAPS)  TCP       Used when Control Hub is configured for LDAP or LDAPS authentication.
SMTP             465                        TCP       Used by Control Hub to send email notifications.
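
Before installation, it can help to confirm that the ports listed above are reachable from the Control Hub machine. The following Python sketch opens a plain TCP connection to each endpoint. The function names and all host names are hypothetical placeholders, the port numbers are the defaults from the tables above, and a successful TCP connection does not prove that the service behind the port is healthy.

```python
import socket
from typing import Dict, Tuple

# Default ports copied from the tables above. Host names are placeholders.
DEFAULT_ENDPOINTS: Dict[str, Tuple[str, int]] = {
    "Control Hub (HTTP)": ("sch.example.com", 18631),
    "Control Hub Admin tool (HTTP)": ("sch.example.com", 18632),
    "MariaDB or MySQL": ("db.example.com", 3306),
    "PostgreSQL": ("db.example.com", 5432),
    "InfluxDB": ("influx.example.com", 8086),
    "LDAP": ("ldap.example.com", 389),
    "LDAPS": ("ldap.example.com", 636),
    "SMTP": ("smtp.example.com", 465),
}


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_endpoints(endpoints: Dict[str, Tuple[str, int]],
                    timeout: float = 3.0) -> Dict[str, bool]:
    """Map each endpoint name to whether a TCP connection succeeded."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


if __name__ == "__main__":
    for name, reachable in check_endpoints(DEFAULT_ENDPOINTS).items():
        print(f"{name}: {'reachable' if reachable else 'NOT reachable'}")
```

Replace the placeholder hosts and any non-default ports with the values used in your environment, and run the check from the machine that hosts Control Hub so that it exercises the same network routes and firewalls.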

Browser Access to Data Collector and Transformer

The web browser used to access Control Hub must be able to reach the following components:
Authoring engines
Authoring Data Collectors and Transformers accept inbound connections from the web browser when you design pipelines using Pipeline Designer.
Execution engines
Execution Data Collectors and Transformers accept inbound connections from the web browser when you complete the following tasks:
  • Capture and view snapshots in an active Data Collector job.
  • Monitor real-time statistics on the Realtime Summary tab for an active Data Collector or Transformer job.
  • Monitor error records encountered by a pipeline stage in an active Data Collector job.
  • View the execution engine log when monitoring an active Data Collector or Transformer job.
  • View configuration properties, active Java threads, metric charts, logs, and directories when monitoring a Data Collector or Transformer from the Execute view.

Configure network routes and firewalls so that the web browser used to access Control Hub can reach the URLs of registered Data Collectors and Transformers.

If registered Data Collectors and Transformers are installed on a cloud computing platform such as Amazon Elastic Compute Cloud (EC2), configure them to use a publicly accessible URL as described in Publicly Accessible URL for Data Collector or Publicly Accessible URL for Transformer.

If Data Collector containers are provisioned on Kubernetes, you must expose the containers outside the cluster using a Kubernetes service as described in Defining a Deployment YAML Specification.
Tip: To verify that the web browser used to access Control Hub can reach registered engines, click Execute > Data Collectors or Execute > Transformers in the Navigation panel, and then click the URL for each registered Data Collector or Transformer.