MapR Prerequisites

Due to licensing restrictions, StreamSets cannot distribute MapR libraries with Data Collector. As a result, you must perform additional steps to enable the Data Collector machine to connect to MapR. Data Collector does not display MapR origins and destinations in stage library lists until you perform these prerequisites.

The prerequisites include installing the required client libraries and the MapR stage libraries on the Data Collector machine, and then running the setup-mapr command.

If the MapR cluster is enabled with built-in security, you also must configure Data Collector to connect to a secure MapR cluster and ensure that a valid ticket exists for the Data Collector user.

Supported Versions

This release supports the following MapR and MapR Ecosystem Pack (MEP) version:
  • MapR 6.1.x with optional MEP 6.x

To view the complete list of MEPs supported by MapR core versions, see MEP Support by MapR Core Version in the MapR documentation.

Step 1. Install Client Libraries

Install Data Collector on a node in the MapR cluster or on a client machine.

To run Data Collector on a client machine, whether outside the cluster or on your local machine, the MapR client package must be installed and configured on that machine. If the MapR client package is not installed on the machine, download and install the following files:
  • MapR client library - Typically named mapr-client_<version>.<ext>.
    You can download the files for your operating system here:
    http://package.mapr.com/releases/<version>/
  • Kafka client library - Typically named mapr-kafka-<version>.<ext>.
    You can download the files for your operating system here:
    http://package.mapr.com/releases/MEP/MEP-<version>/
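Before moving on to the next step, it can help to confirm that a MapR client is actually present on the machine. The following is a minimal sketch and not part of the StreamSets or MapR tooling: check_mapr_client is a hypothetical helper, and it assumes that a configured client has a conf/mapr-clusters.conf file under the MapR home directory.

```shell
# Hypothetical helper: report whether a directory looks like a configured MapR client.
# Assumption: a configured client has conf/mapr-clusters.conf under the MapR home.
check_mapr_client() {
    mapr_home="${1:-/opt/mapr}"
    if [ ! -d "$mapr_home" ]; then
        echo "missing: $mapr_home does not exist"
        return 1
    fi
    if [ ! -f "$mapr_home/conf/mapr-clusters.conf" ]; then
        echo "unconfigured: $mapr_home/conf/mapr-clusters.conf not found"
        return 2
    fi
    echo "ok: MapR client found in $mapr_home"
}
```

For example, check_mapr_client /opt/mapr returns a non-zero status if the directory is missing or the client has not been configured.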

Step 2. Install MapR Stage Libraries

When you configure the Control Hub deployment, select the MapR stage libraries to install on each engine.

When installing MapR stage libraries, you must install both the MapR stage library and the MapR Ecosystem Pack (MEP) stage library for your supported version of MapR. For example, if using MapR version 6.1.0, you must install both of the following stage libraries as additional stage libraries:
  • streamsets-datacollector-mapr_6_1-lib
  • streamsets-datacollector-mapr_6_1-mep6-lib
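The pairing of the two library names can be sketched in shell. This is only an illustration: mapr_stage_libs is a hypothetical helper, and the naming pattern is inferred from the 6.1.0 example above rather than taken from a documented rule, so confirm the names against the stage libraries listed in Control Hub.

```shell
# Hypothetical helper: print the two stage library names for a MapR/MEP pair.
# Assumption: names follow the pattern of the 6.1.0 example
# (major_minor of the MapR version, then the single-digit MEP version).
mapr_stage_libs() {
    mapr_version="$1"   # for example 6.1.0
    mep_version="$2"    # for example 6
    major="${mapr_version%%.*}"   # text before the first dot
    rest="${mapr_version#*.}"     # text after the first dot
    minor="${rest%%.*}"           # text before the next dot
    echo "streamsets-datacollector-mapr_${major}_${minor}-lib"
    echo "streamsets-datacollector-mapr_${major}_${minor}-mep${mep_version}-lib"
}

mapr_stage_libs 6.1.0 6
```

With the arguments shown, the helper prints the two library names from the list above.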

Step 3. Run the Command to Set Up MapR

After installing the required client libraries, run the setup-mapr command. The command modifies configuration files and creates the required symbolic links. You can run the command in interactive or non-interactive mode.

In interactive mode, the command prompts you for the MapR version and home directory. In non-interactive mode, you define the MapR version and home directory in environment variables before running the command.

In either mode, the command checks if the MapR distribution of Spark is installed in the specified MapR cluster. If a supported version is installed, the command also installs the MapR Spark stage library for you.

Running the Command in Interactive Mode

When you run the setup-mapr command in interactive mode, the command prompts you for the MapR version and home directory.

  1. Set the following environment variables:
    Environment Variable   Description
    SDC_HOME               Data Collector home directory.
    SDC_CONF               Data Collector configuration directory.
    MAPR_MEP_VERSION       MEP version. Enter a single-digit MEP version number: 4, 5, or 6.
    Use the following command to set an environment variable:
    export <environment variable>=<value>
    For example, use the following commands:
    export SDC_HOME=/streamsets-datacollector-4.1.0
    export SDC_CONF=/streamsets-datacollector-4.1.0/etc
    export MAPR_MEP_VERSION=6
  2. Use the following command from the $SDC_HOME directory to set up MapR:
    bin/streamsets setup-mapr
  3. When prompted, enter the MapR version as the full three-digit version: 6.0.0, 6.0.1, or 6.1.0.
  4. When prompted, enter the absolute path to the MapR home directory, usually /opt/mapr.
  5. Restart Data Collector and verify that MapR stages appear in stage library lists.
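Because the command relies on environment variables that are easy to forget in a new shell session, a small pre-flight check can be useful before step 2. The sketch below is an assumption-laden illustration: require_env is a hypothetical helper, not a StreamSets command.

```shell
# Hypothetical pre-flight check: fail early if a required variable is unset or empty.
require_env() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset:$missing"
        return 1
    fi
    echo "all set"
}
```

For interactive mode, you might call require_env SDC_HOME SDC_CONF MAPR_MEP_VERSION and run bin/streamsets setup-mapr only if it succeeds.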

Running the Command in Non-Interactive Mode

When you run the setup-mapr command in non-interactive mode, you define the MapR version and home directory in environment variables before running the command.

  1. Set the following environment variables:
    Environment Variable   Description
    SDC_HOME               Data Collector home directory.
    SDC_CONF               Data Collector configuration directory.
    MAPR_HOME              MapR home directory, usually /opt/mapr.
    MAPR_VERSION           MapR version. Enter the full three-digit version: 6.0.0, 6.0.1, or 6.1.0.
    MAPR_MEP_VERSION       MEP version. Enter a single-digit MEP version number: 4, 5, or 6.
    Use the following command to set an environment variable:
    export <environment variable>=<value>
    For example, use the following commands:
    export SDC_HOME=/streamsets-datacollector-4.1.0
    export SDC_CONF=/streamsets-datacollector-4.1.0/etc
    export MAPR_HOME=/opt/mapr
    export MAPR_VERSION=6.1.0
    export MAPR_MEP_VERSION=6
  2. Use the following command from the $SDC_HOME directory to set up MapR:
    bin/streamsets setup-mapr
  3. Restart Data Collector and verify that MapR stages appear in stage library lists.
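In non-interactive mode nothing prompts you to correct a mistyped value, so it can be worth validating the variables before running the command. The following sketch checks MAPR_VERSION and MAPR_MEP_VERSION against the accepted values listed above and then prints the setup command instead of running it. The function names are hypothetical and are not part of Data Collector.

```shell
# Hypothetical validators for the accepted values described in this section.
valid_mapr_version() {
    case "$1" in
        6.0.0|6.0.1|6.1.0) return 0 ;;
        *) return 1 ;;
    esac
}

valid_mep_version() {
    case "$1" in
        4|5|6) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the setup command rather than running it, so the values can be reviewed first.
print_setup_command() {
    valid_mapr_version "${MAPR_VERSION:-}" || {
        echo "bad MAPR_VERSION: ${MAPR_VERSION:-}"
        return 1
    }
    valid_mep_version "${MAPR_MEP_VERSION:-}" || {
        echo "bad MAPR_MEP_VERSION: ${MAPR_MEP_VERSION:-}"
        return 1
    }
    echo "cd ${SDC_HOME:-} && bin/streamsets setup-mapr"
}
```

A value such as 6.1 or 61 is rejected, because the command expects the full three-digit MapR version and a single-digit MEP version.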

Step 4. Connect to a MapR Cluster Secured with Built-in Security

If the MapR cluster is enabled with built-in security, you must configure Data Collector to connect to a secure MapR cluster.

In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Java Configuration.

Modify the Java Options property to add the -Dmaprlogin.password.enabled configuration property.

For example, define the Java options as follows:
-Dmaprlogin.password.enabled=true

Save the changes to the deployment and restart all engine instances.
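If the Java Options property already contains other options, the new property is added alongside them rather than replacing them. As a rough sketch of that composition, the hypothetical add_java_opt helper below appends an option to an existing options string only if it is not already present. The -Xmx and -Xms values are placeholders for illustration, not recommended settings.

```shell
# Hypothetical helper: append an option to a Java options string unless it is already there.
add_java_opt() {
    current="$1"
    opt="$2"
    case " $current " in
        *" $opt "*) printf '%s\n' "$current" ;;                   # already present
        *)          printf '%s\n' "${current:+$current }$opt" ;;  # append, with a space if needed
    esac
}

# Placeholder existing options; replace with the value from your deployment.
add_java_opt "-Xmx1024m -Xms1024m" "-Dmaprlogin.password.enabled=true"
```

The resulting string is what you would enter in the Java Options property.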

Step 5. Run Data Collector as a MapR Ticket User

To connect to a secure MapR cluster enabled with built-in security, ensure that a valid user, tenant, or service ticket exists for the Data Collector user.

To generate tickets, see the MapR documentation.

Note: If the MapR ticket that Data Collector uses allows impersonation, then you can configure MapR stages in Data Collector to use Hadoop impersonation mode.

To run MapR commands in the secure cluster, Data Collector must run as the user account granted access in the MapR ticket.

For example, suppose you ran the following MapR command to generate a service ticket for applications running outside of the cluster:

maprlogin generateticket -type service -out /tmp/longlived_ticket -duration 30:0:0 -renewal 90:0:0
MapR credentials of user 'myappuser' for cluster 'mycluster' are written to '/tmp/longlived_ticket'

Then Data Collector must run as the myappuser user account.
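To make the relationship between the ticket and the run-as user concrete, the sketch below extracts the user name from the confirmation line shown above and compares it with the current system user. The ticket_user helper is hypothetical, and the parsing assumes the confirmation line has exactly the form shown in the example.

```shell
# Hypothetical helper: extract the user name from the maprlogin confirmation line.
# Assumption: the line has the form shown in the example above.
ticket_user() {
    printf '%s\n' "$1" |
        sed -n "s/^MapR credentials of user '\([^']*\)' for cluster .*/\1/p"
}

line="MapR credentials of user 'myappuser' for cluster 'mycluster' are written to '/tmp/longlived_ticket'"
required_user=$(ticket_user "$line")

# Compare the ticket user with the account that would start Data Collector.
if [ "$(id -un)" = "$required_user" ]; then
    echo "current user matches ticket user $required_user"
else
    echo "start Data Collector as $required_user, not $(id -un)"
fi
```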

Configure Data Collector to run as the required user account based on how you start Data Collector:

Manual start
When you start Data Collector manually, it runs as the system user account that is logged in to the command prompt and runs the following launch command from the $SDC_DIST directory:
bin/streamsets dc
To connect to a secure MapR cluster, log in to the command prompt as the user account granted access in the MapR ticket. Alternatively, run Data Collector as the required user account by using the following launch command from the $SDC_DIST directory, where <user> is the user account granted access in the MapR ticket:
sudo -u <user> bin/streamsets dc
For example:
sudo -u myappuser /opt/streamsets-datacollector-5.1.0/bin/streamsets dc