Installation with Cloudera Manager
Users with a StreamSets enterprise account can use Cloudera Manager to install a full version of Data Collector across the cluster as an add-on service. Installation consists of the following steps:
- Install the StreamSets custom service descriptor (CSD).
- (Optional) Manually install the parcel and checksum files. This step is typically needed only when the Cloudera Manager Server does not have internet access.
- Download, distribute, and activate the StreamSets parcel.
- Configure the StreamSets service.
Afterwards, you can configure the Data Collector if necessary.
This documentation includes details about Cloudera Manager to simplify the installation and configuration process. For more information about using Cloudera Manager, see the Cloudera documentation.
Step 1. Install the StreamSets Custom Service Descriptor
Install the StreamSets custom service descriptor file (CSD), and then restart Cloudera Manager.
-
Download the CSD from the StreamSets Support portal.
Or, you can use the GNU Wget program to download the CSD from the command line by running the following commands:
export VERSION="5.11.0"
wget https://archives.streamsets.com/datacollector/$VERSION/csd/STREAMSETS-$VERSION.jar
-
Copy the Data Collector CSD file to the Local Descriptor Repository Path. By default, the path is /opt/cloudera/csd.
To verify the path to use, in Cloudera Manager, select the Custom Service Descriptors category in the navigation panel. Place the CSD file in the path configured for Local Descriptor Repository Path.
-
Set the file ownership to cloudera-scm:cloudera-scm with permission 644.
For example:
chown cloudera-scm:cloudera-scm /opt/cloudera/csd/STREAMSETS*.jar
chmod 644 /opt/cloudera/csd/STREAMSETS*.jar
-
Use one of the following commands to restart Cloudera Manager Server:
For Ubuntu 14.04, CentOS 6, Red Hat Enterprise Linux 6, or Oracle Linux 6:
service cloudera-scm-server restart
For Ubuntu 16.04, CentOS 7, Red Hat Enterprise Linux 7, or Oracle Linux 7:
systemctl restart cloudera-scm-server
- In Cloudera Manager, to restart the Cloudera Management Service, click the Menu icon to the right of Cloudera Management Service and select Restart.
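The command-line parts of Step 1 can be sketched as a few shell functions. This is a minimal illustration, not an official StreamSets script: the function names csd_url, install_csd, and restart_command are hypothetical, and choosing the restart command by checking for systemctl is an assumption that stands in for the operating system lists above.

```shell
#!/usr/bin/env bash
# Sketch of Step 1 as reusable functions. Nothing runs until a function is called.

# Build the CSD download URL for a given Data Collector version.
csd_url() {
  local version="$1"
  printf 'https://archives.streamsets.com/datacollector/%s/csd/STREAMSETS-%s.jar\n' \
    "$version" "$version"
}

# Copy a downloaded CSD into the Local Descriptor Repository Path and set mode 644.
# Ownership changes only when running as root and the cloudera-scm user exists.
install_csd() {
  local jar="$1"
  local csd_dir="${2:-/opt/cloudera/csd}"
  local target
  target="$csd_dir/$(basename "$jar")"
  cp "$jar" "$target" || return 1
  chmod 644 "$target" || return 1
  if [ "$(id -u)" -eq 0 ] && id cloudera-scm >/dev/null 2>&1; then
    chown cloudera-scm:cloudera-scm "$target" || return 1
  fi
  printf '%s\n' "$target"
}

# Print (do not run) the restart command that matches the available init tooling.
# Assumption: a systemctl binary on PATH means the systemd form applies.
restart_command() {
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl restart cloudera-scm-server"
  else
    echo "service cloudera-scm-server restart"
  fi
}
```

For example, after downloading the file with wget "$(csd_url 5.11.0)", calling install_csd STREAMSETS-5.11.0.jar copies it to the default path. Pass a second argument if Cloudera Manager is configured with a different Local Descriptor Repository Path.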
Step 2. Manually Install the Parcel and Checksum Files (Optional)
You can manually install the StreamSets parcel and related checksum files. Manually install the files when the Cloudera Manager Server does not have internet access.
When working with multiple clusters, perform the following steps for each cluster.
- Download the StreamSets parcel and related checksum file for the Cloudera Manager Server operating system.
-
Copy the StreamSets parcel and checksum file to the Cloudera Manager Local Parcel Repository Path.
By default, the path is /opt/cloudera/parcel-repo.
To verify the path to use, in the navigation panel, select the Parcels category. Place the StreamSets parcel file in the path configured for Local Parcel Repository Path.
-
Change ownership on the parcel and checksum file to the user that runs the
Cloudera Manager process.
For example, if the Cloudera Manager process runs as the cloudera-scm user, use the following command to change ownership to cloudera-scm:
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/STREAMSETS_DATACOLLECTOR*
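Before copying the files, it can help to confirm that the downloaded parcel matches its checksum file. The following is a minimal sketch under stated assumptions: verify_parcel is a hypothetical helper, the checksum file is assumed to hold a hex digest as its first field, and the hash algorithm is guessed from the digest length (40 characters for SHA-1, 64 for SHA-256). It relies on the sha1sum and sha256sum tools from GNU coreutils.

```shell
#!/usr/bin/env bash
# Compare a parcel's digest with the first field of its checksum file.
# Returns 0 on match, 1 on mismatch, 2 if the checksum cannot be interpreted.
verify_parcel() {
  local parcel="$1"
  local checksum_file="$2"
  local expected actual
  expected=$(awk 'NR==1 {print tolower($1)}' "$checksum_file") || return 2
  case "${#expected}" in
    40) actual=$(sha1sum "$parcel" | awk '{print $1}') ;;    # assumed SHA-1
    64) actual=$(sha256sum "$parcel" | awk '{print $1}') ;;  # assumed SHA-256
    *)  echo "unrecognized checksum length: ${#expected}" >&2
        return 2 ;;
  esac
  [ "$expected" = "$actual" ]
}
```

A typical call passes the parcel and its checksum file, and copies them to the repository path only when the function succeeds.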
Step 3. Distribute and Activate the StreamSets Parcel
After you add the StreamSets repository to Cloudera Manager, you can download, distribute, and activate the StreamSets parcel across the cluster.
When working with multiple clusters, perform the following steps for each cluster.
-
To view the list of available parcels, in the menu bar, click the
Parcels icon.
The StreamSets parcel displays in the list of available parcels. If it doesn't display, click Check for New Parcels.
-
To download the StreamSets parcel to the local repository, click
Download.
After the parcel is downloaded, the Download button becomes the Distribute button.
-
To distribute the StreamSets parcel to the cluster, click
Distribute.
After distribution, the Distribute button becomes the Activate button.
- To activate the StreamSets parcel, click Activate.
Step 4. Configure the StreamSets Service
When you configure the service, you assign Data Collector to the hosts where you want it to run.
To run Data Collector in cluster streaming mode, colocate Data Collector on a node with the Spark Gateway role. To run Data Collector in cluster batch mode, colocate Data Collector on a node with the YARN Gateway role.
To write to HDFS, colocate Data Collector on a node with the HDFS Gateway role. Similarly, to write to HBase or Hive, colocate Data Collector on nodes with the HBase or Hive Gateway roles, respectively.
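The colocation rules above amount to a small lookup from intended use to required Gateway role. As a sketch only, the function below encodes that lookup. The name required_gateway_role and the use-case keywords are hypothetical labels chosen for this illustration and are not Cloudera Manager or Data Collector identifiers.

```shell
#!/usr/bin/env bash
# Map an intended Data Collector use to the Gateway role its host needs.
# The keywords on the left are illustrative labels, not product settings.
required_gateway_role() {
  case "$1" in
    cluster-streaming) echo "Spark Gateway" ;;
    cluster-batch)     echo "YARN Gateway" ;;
    hdfs)              echo "HDFS Gateway" ;;
    hbase)             echo "HBase Gateway" ;;
    hive)              echo "Hive Gateway" ;;
    *) echo "unknown use case: $1" >&2
       return 1 ;;
  esac
}
```

A host that serves several uses needs each corresponding role, so calling the function once per use gives the list of roles to check before selecting hosts.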
When working with multiple clusters, perform the following steps for each cluster.
- In Cloudera Manager, click the menu for the cluster you want to use, then click Add a Service.
- In the Service Types list, select StreamSets, then click Continue.
- To select the hosts where you want to install StreamSets, on the Customize Role Assignments for StreamSets page, click Select Hosts to open a list of available hosts.
-
Select one or more hosts, then click OK. Click
Continue.
The Review Changes page displays the Data and Resource directories for the Data Collector.
-
Optionally change the directories, then click
Continue.
The First Run Command page displays status updates as Cloudera Manager starts Data Collector on the selected hosts.
- Click Continue, then click Finish.
Configuring Data Collector with Cloudera Manager
When administering Data Collector with Cloudera Manager, configure all Data Collector configuration properties and environment variables through Cloudera Manager.
Manual changes to Data Collector configuration files can be overwritten by Cloudera Manager.
-
In Cloudera Manager, select the StreamSets service, then
click Configuration.
The Configuration page displays Data Collector configuration properties.
-
On the Configuration page, in the navigation panel, you
can select a category to configure related properties.
For a description of each property, see Configuring Data Collector.