Installation
- Full installation - Includes all stage libraries. When a stage library has multiple
versions, such as Kudu for 1.5.x, 1.6.x, and 1.7.x, all versions are included in the full
installation. As a result, the full installation is the largest version of Data Collector.
You can perform a full installation for a manual start or service start.
Available for users with an enterprise account.
- Common installation - Includes all core Data Collector functionality and commonly-used
stage libraries in a tarball installation. This installation allows you to create pipelines
easily while using less disk space than the full installation. You can install additional
stage libraries as needed.
Available for all users.
- Core installation - Includes a minimum version of Data Collector. This installation uses
the least amount of disk space, but requires most users to install additional stage
libraries to develop pipelines.
Available for users with an enterprise account.
- Cloudera Manager installation - If you use Cloudera Manager, you can install and
administer a full version of Data Collector through Cloudera Manager.
Available for users with an enterprise account.
- Docker image - If you use Docker, you can run the Data Collector image from Docker Hub.
Available for all users.
- Cloud service provider marketplace - You can install the full Data Collector as a
service on cloud service providers, such as Azure and Google Cloud, through their
marketplaces.
Available for all users.
Installation Requirements
Install Data Collector on a machine that meets the following minimum requirements. To run pipelines in cluster execution mode, each node in the cluster must meet the minimum requirements.
Component | Minimum Requirement |
---|---|
Operating system | Use one of the following operating systems and versions:
|
Cores | 2 |
RAM | 1 GB |
Disk space | 6 GB. Note: StreamSets does not recommend using NFS or NAS to store Data Collector files. |
File descriptors | 32768 |
Java |
One of the following Java versions:
Some Data Collector functionality is dependent on the Java version that you use. For more information, see Java Versions and Available Features. |
Browser | Use the latest version of one of the following browsers:
|
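As a rough illustration of checking a machine against these minimums, the following sketch compares the core count and the free disk space with the values from the table. The helper names `meets_min` and `check_host` are hypothetical, the script assumes `getconf`, `df -Pk`, and `awk` are available, and treating "disk space" as free space under a chosen directory is an assumption rather than something the requirements state.

```shell
#!/bin/sh
# Minimums copied from the requirements table.
MIN_CORES=2
MIN_DISK_GB=6

# meets_min LABEL ACTUAL MINIMUM (hypothetical helper)
# Prints one result line and returns non-zero when ACTUAL is below MINIMUM.
meets_min() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (minimum $3)"
  else
    echo "FAIL $1: $2 (minimum $3)"
    return 1
  fi
}

# check_host DIR (hypothetical helper)
# Checks the online core count and the free space on the file system holding DIR.
check_host() {
  cores=$(getconf _NPROCESSORS_ONLN)
  # df -Pk prints available space in 1024-byte blocks in column 4 of line 2.
  free_gb=$(df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }')
  status=0
  meets_min "cores" "$cores" "$MIN_CORES" || status=1
  meets_min "free disk (GB)" "$free_gb" "$MIN_DISK_GB" || status=1
  return "$status"
}
```

For example, `check_host /opt` would report on the file system that holds `/opt`. The directory is only a placeholder for wherever you plan to install.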
Java Versions and Available Features
The Java version installed on the Data Collector machine determines the Data Collector features that you can use.
All supported Java versions provide almost all Data Collector features. However, due to third-party requirements, some features require a particular Java version. For example, HPE Ezmeral Data Fabric 7.0.x requires JDK 11 or higher, so you must use Java 11 or 17 to work with MapR stages that connect to HPE Ezmeral Data Fabric 7.0.x.
Java Version | Available Features |
---|---|
Java 8 (Oracle Java 8 and Open JDK 8) | Provides access to all Data Collector features except for the following:
|
Java 11 (Oracle Java 11 and Open JDK 11) | Provides access to all Data Collector features except for the following:
|
Java 17 (Oracle Java 17 and Open JDK 17) |
Provides access to all Data Collector features except for the following:
|
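To see which row of the table applies to a machine, you can read the major version from `java -version`. The following is a minimal sketch; `java_major` is a hypothetical helper, and the parsing assumes the usual version string layouts such as `1.8.0_161`, `11.0.2`, or `17`.

```shell
#!/bin/sh
# java_major VERSION (hypothetical helper)
# Maps a version string to its major number: 1.8.0_161 -> 8, 11.0.2 -> 11, 17 -> 17.
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;                       # legacy 1.x numbering
    *)   echo "$1" | cut -d. -f1 | cut -d- -f1 | cut -d+ -f1 ;;
  esac
}

# Only query the JVM when one is on the PATH.
if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr; the version is the first quoted string.
  version=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
  echo "Java major version: $(java_major "$version")"
fi
```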
JCE for Oracle JVM
If you use AES-256 encryption with your Oracle JVM and use a version of JDK earlier than 1.8.0_161, then configure the JDK on the Data Collector machine to use the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy.
To configure the JDK to use unlimited cryptography, set the crypto.policy Java Security property in the java.security file included in your JDK installation to a value of unlimited. See the notes in the java.security file for more information.
After you configure unlimited cryptography, restart Data Collector.
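The property change can be sketched as a small shell function. `set_crypto_policy` is a hypothetical helper, not part of Data Collector or the JDK, and the location of `java.security` is an assumption: on a JDK 8 installation it is commonly `$JAVA_HOME/jre/lib/security/java.security`, but confirm the path for your JDK. Back up the file before editing it.

```shell
#!/bin/sh
# set_crypto_policy FILE (hypothetical helper)
# Ensures FILE contains the line crypto.policy=unlimited.
set_crypto_policy() {
  src=$1
  tmp="$src.tmp.$$"
  if grep -Eq '^[#[:space:]]*crypto\.policy[[:space:]]*=' "$src"; then
    # Rewrite the existing property line, whether active or commented out.
    sed -E 's/^[#[:space:]]*crypto\.policy[[:space:]]*=.*/crypto.policy=unlimited/' "$src" > "$tmp"
  else
    # Property not present: append it.
    cat "$src" > "$tmp"
    echo 'crypto.policy=unlimited' >> "$tmp"
  fi
  # Copy the contents back so the original file keeps its owner and mode.
  cat "$tmp" > "$src" && rm -f "$tmp"
}

# Example call (path is an assumption, and the file usually needs root to edit):
#   set_crypto_policy "$JAVA_HOME/jre/lib/security/java.security"
```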
Configuring the Open File Limit
Data Collector requires a large number of file descriptors to work correctly with all stages. Most operating systems provide a configuration to limit the number of files a process or a user can open. The default values are usually less than the Data Collector requirement of 32768 file descriptors.
To check the open file limit that applies to your current session, run the following command:
ulimit -n
Most operating systems use two ways of configuring the maximum number of open files - the soft limit and the hard limit. The hard limit is set by the system administrator. The soft limit can be set by the user, but only up to the hard limit.
Increasing the open file limit differs for each operating system. Consult your operating system documentation for the preferred method.
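The relationship between the two limits can be checked from a shell session. This sketch reads the soft and hard limits with `ulimit -Sn` and `ulimit -Hn` and compares them with the 32768 requirement; `limit_ok` is a hypothetical helper name.

```shell
#!/bin/sh
REQUIRED_FDS=32768

# limit_ok VALUE (hypothetical helper)
# Succeeds when VALUE is "unlimited" or a number of at least REQUIRED_FDS.
limit_ok() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$REQUIRED_FDS" ]
}

soft=$(ulimit -Sn)   # soft limit for this session
hard=$(ulimit -Hn)   # hard limit, the ceiling for the soft limit
echo "soft=$soft hard=$hard"

if limit_ok "$soft"; then
  echo "Soft limit already meets the requirement."
elif limit_ok "$hard"; then
  echo "Soft limit can be raised in this session: ulimit -n $REQUIRED_FDS"
else
  echo "Hard limit is too low; an administrator must raise it first."
fi
```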
Increase the Limit on Linux
To increase the open file limit on Linux, see the following solution: How to set ulimit values.
This solution should work on Red Hat Enterprise Linux, Oracle Linux, CentOS, and Ubuntu. However, refer to the administrator documentation for your operating system for the preferred method.
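On distributions that apply limits through PAM, one common approach is to add `nofile` entries to `/etc/security/limits.conf` or to a file under `/etc/security/limits.d/`. The sketch below only prints the two entries. `nofile_limits` is a hypothetical helper, the user name `sdc` and the file name in the usage comment are assumptions, and whether these entries take effect depends on how Data Collector is started on your system.

```shell
#!/bin/sh
# nofile_limits USER LIMIT (hypothetical helper)
# Prints soft and hard nofile entries in limits.conf format:
#   <domain> <type> <item> <value>
nofile_limits() {
  printf '%s soft nofile %s\n%s hard nofile %s\n' "$1" "$2" "$1" "$2"
}

# Example (user and file name are assumptions):
#   nofile_limits sdc 32768 | sudo tee /etc/security/limits.d/90-sdc.conf
```

After logging in again as that user, `ulimit -n` should report the new value if the entries were applied.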
Increase the Limit on MacOS
The method you use to increase the limit on MacOS can differ with each version. Refer to the documentation for your operating system version for the preferred method.
To increase the limit for the computer - so that the limits are retained after relaunching the terminal and restarting the computer - create a property list file. The following steps are valid for recent MacOS versions:
- Use the following command to create a property list file named limit.maxfiles.plist:
      sudo vim /Library/LaunchDaemons/limit.maxfiles.plist
- Add the following contents to the file, modifying the maxfiles attribute as needed.
  The maxfiles attribute defines the open file limit. The first value in the file is the soft limit. The second value is the hard limit.
  For example, in the following limit.maxfiles.plist file, both the soft and hard limit are set to 32,768:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
      <plist version="1.0">
        <dict>
          <key>Label</key>
          <string>limit.maxfiles</string>
          <key>ProgramArguments</key>
          <array>
            <string>launchctl</string>
            <string>limit</string>
            <string>maxfiles</string>
            <string>32768</string>
            <string>32768</string>
          </array>
          <key>RunAtLoad</key>
          <true/>
          <key>ServiceIPC</key>
          <false/>
        </dict>
      </plist>
- Use the following commands to load the new settings:
      sudo launchctl unload -w /Library/LaunchDaemons/limit.maxfiles.plist
      sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
- Use the following command to check that the system limits were modified:
      launchctl limit maxfiles
- Use the following command to set the session limit:
      ulimit -n 32768
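The check of the system limits can also be scripted. The sketch below reads the output of `launchctl limit maxfiles` on standard input and succeeds only when both the soft and hard values meet a required minimum. `maxfiles_ok` is a hypothetical helper, and the parsing assumes the output is a single line of the form `maxfiles <soft> <hard>`, where either value may be `unlimited`.

```shell
#!/bin/sh
# maxfiles_ok REQUIRED (hypothetical helper)
# Reads `launchctl limit maxfiles` output on stdin; exits 0 when both the
# soft limit (field 2) and the hard limit (field 3) are at least REQUIRED.
maxfiles_ok() {
  awk -v min="$1" '
    $1 == "maxfiles" {
      found = 1
      soft_ok = ($2 == "unlimited" || $2 + 0 >= min)
      hard_ok = ($3 == "unlimited" || $3 + 0 >= min)
    }
    END { exit !(found && soft_ok && hard_ok) }'
}

# Usage on macOS:
#   launchctl limit maxfiles | maxfiles_ok 32768 && echo "system limits OK"
```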
Default Ports
The following table lists the default ports exposed to Data Collector clients and how they are used. Note that the default port numbers can be changed during installation. Configure network routes and firewalls so that web UI clients can reach the Data Collector IP address.
System | Default Port | Protocol | Usage |
---|---|---|---|
Data Collector | | TCP | Access to the Data Collector web-based UI and API. |
The following table lists the default ports of the external systems that Data Collector depends on and how they are used. The default port numbers can change - confirm the actual numbers with your systems administrator.
External System | Default Port | Protocol | Usage |
---|---|---|---|
LDAP or LDAPS | 389 (LDAP), 636 (LDAPS) | TCP | Used when Data Collector is configured for LDAP or LDAPS authentication. |
SMTP | 465 | TCP | Used when Data Collector is configured to send email notifications. |
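Before starting Data Collector, it can help to confirm that these ports are reachable from the Data Collector machine. The following sketch assumes an `nc` (netcat) build that supports the `-z` and `-w` options. `check_port` and `report` are hypothetical helpers, and the host names in the usage comments are placeholders.

```shell
#!/bin/sh
# check_port HOST PORT (hypothetical helper)
# Succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
check_port() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# report HOST PORT LABEL (hypothetical helper)
# Prints one status line per target; always returns 0.
report() {
  if check_port "$1" "$2"; then
    echo "reachable    $3 $1:$2"
  else
    echo "unreachable  $3 $1:$2"
  fi
}

# Example calls (host names are placeholders; confirm ports with your administrator):
#   report ldap.example.com 389 LDAP
#   report ldap.example.com 636 LDAPS
#   report smtp.example.com 465 SMTP
```

A result of `unreachable` only shows that no TCP connection could be opened from this machine. It does not distinguish a firewall rule from a stopped service or a missing `nc` binary.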