MapR Prerequisites

You can use MapR stages only in a self-managed tarball deployment. You cannot use MapR stages with self-managed Docker deployments or with Control Hub-managed deployments, such as Amazon EC2 deployments. You can install Data Collector engine instances on a node in the MapR cluster or on a client machine.

Due to licensing restrictions, StreamSets cannot distribute MapR libraries with Data Collector. As a result, you must perform additional steps to enable the Data Collector machine to connect to MapR. Data Collector does not display MapR stages in stage library lists until you perform these prerequisites.

MapR prerequisites include configuring the deployment, MapR, and the Data Collector machine. Perform the prerequisite tasks in the following order:
  1. On the Data Collector machine, install MapR client libraries as needed.
  2. In Control Hub, configure the deployment.
  3. On each Data Collector machine, run the setup-mapr command to set up MapR.
  4. If using a secure MapR cluster, configure MapR security.
  5. Start Data Collector engines.

Supported Versions

This release supports the following MapR and MapR Ecosystem Pack (MEP) versions:
  • MapR 6.1.x with optional MEP 6.x
  • MapR 7.0.x with optional MEP 8.x

Different MapR versions require different versions of Java. For more information, see Java Versions and Available Features.

To view the complete list of MEPs supported by MapR core versions, see MEP Support by MapR Core Version in the MapR documentation.
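
As a quick illustration of the pairing above, the following shell sketch maps a MapR core version to the optional MEP major version listed for this release. The function name mep_for_mapr is hypothetical and not part of Data Collector, and the mapping covers only the two combinations listed here.

```shell
# Hypothetical helper: print the MEP major version that this release
# lists for a given MapR core version (6.1.x -> 6, 7.0.x -> 8).
mep_for_mapr() {
  case "$1" in
    6.1.*) echo 6 ;;
    7.0.*) echo 8 ;;
    *)     echo "unsupported MapR version: $1" >&2; return 1 ;;
  esac
}
```

The function expects a full three-part version such as 7.0.0 and returns a non-zero status for anything outside the list.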

Step 1. Install Client Libraries as Needed

You can install Data Collector on a node in the MapR cluster or on a client machine. A client machine is any machine outside the cluster, including your local machine. When you install Data Collector on a client machine, the MapR client package must be installed on that machine.

If you install Data Collector on a node in the MapR cluster, or on a client machine that has the MapR client package installed, you can skip this step.

If the MapR client package is not installed on a client machine, download and install the following files:
  • MapR client library - Typically named mapr-client_<version>.<ext>.
    You can download the files for your operating system here:
    http://package.mapr.com/releases/<version>/
  • Kafka client library - Typically named mapr-kafka-<version>.<ext>.
    You can download the files for your operating system here:
    http://package.mapr.com/releases/MEP/MEP-<version>/

Note: If you encounter an error when you run a MapR pipeline that indicates that Data Collector cannot find the MapR client, configure the environment to enable Data Collector to use the MapR libraries. For example, on Linux, you might use export LD_LIBRARY_PATH="${MAPR_HOME}/lib". On macOS, you might use export JAVA_LIBRARY_PATH="${MAPR_HOME}/lib".
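
The choice between the two variables in the note can be scripted. The following is a minimal sketch, assuming the operating system is identified by the kernel name that uname -s reports (Linux or Darwin). The function name mapr_lib_export is hypothetical.

```shell
# Hypothetical helper: print the export line suggested in the note.
# $1 is the kernel name as reported by `uname -s`; $2 is the MapR home path.
mapr_lib_export() {
  case "$1" in
    Linux)  printf 'export LD_LIBRARY_PATH="%s/lib"\n' "$2" ;;
    Darwin) printf 'export JAVA_LIBRARY_PATH="%s/lib"\n' "$2" ;;
    *)      echo "no suggestion for OS: $1" >&2; return 1 ;;
  esac
}

# Possible usage, assuming MAPR_HOME falls back to /opt/mapr:
#   eval "$(mapr_lib_export "$(uname -s)" "${MAPR_HOME:-/opt/mapr}")"
```

Printing the line rather than exporting directly makes it easy to review the value before applying it to the environment that starts Data Collector.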

Step 2. Configure the Deployment

You need to configure a deployment to use MapR. When you configure a deployment, you add MapR stage libraries, remove the stage libraries from a blacklist, add a security policy, and specify Java options as needed. If the MapR cluster is enabled with built-in security, then you need to add a Java option to enable Data Collector to connect to a secure MapR cluster.

You can configure an existing Data Collector self-managed tarball deployment or create a new self-managed tarball deployment to work with MapR.
  1. In Control Hub, create or edit a deployment. If creating a new deployment, set the Deployment Type property to Self Managed, then click Save and Next.
  2. In the Configure Engine step, click Stage Libraries. Add the MapR stage libraries to include in the deployment, then click OK to save your changes.
    When installing MapR stage libraries, you must install both the MapR stage library and the MapR Ecosystem Pack (MEP) stage library for your supported version of MapR. For example, if using MapR version 7.0.x, you must install both of the following stage libraries:
    • streamsets-datacollector-mapr_7_0-lib
    • streamsets-datacollector-mapr_7_0-mep8-lib

    For detailed steps on adding stage libraries to a new self-managed deployment, see "Configure the Engine" in the Control Hub documentation.

    For detailed steps on adding stage libraries to an existing deployment, see Updating Stage Libraries in the Control Hub documentation.

  3. In the Configure Engine step, click Advanced Configuration, then click Data Collector Configuration. Find and edit the system.stagelibs.blacklist property, then remove the MapR stage libraries that you added to the deployment.
  4. At the top of the Engine Configuration window, click Security Policy, and then add the following text to the end of the section:
    //MapR codebase
    grant codebase "file://<MAPR_HOME>-" {
    permission java.security.AllPermission;
    };

    where <MAPR_HOME> is the MapR home path, typically /opt/mapr.

  5. At the top of the Engine Configuration window, click Java Configuration. In the Java Options property, add the following properties as needed.
    • When using MapR 7.0.x or later, add the following properties:
      -Dmapr.library.flatclass -Dsecurity.provider=BCFIPS
    • When connecting to a MapR cluster with built-in security enabled, add the following property:
      -Dmaprlogin.password.enabled=true
  6. To save all of the engine configuration changes, click Save.
  7. If you are configuring a new self-managed deployment, in the Configure Install Type step, choose a tarball installation.
  8. Start or launch the engines:
    • If you edited an existing active deployment, click Save and Next until you reach the Review step, then click Restart Engines to restart running engines. If you have engines that are not running, start those engines manually so they receive updates from the deployment.
    • If you configured a new deployment, configure the rest of the deployment. Then, in the Review and Launch step, click Start & Generate Script. Run the script on every machine where you want the Data Collector engine to run. Install Data Collector on MapR cluster nodes or client machines.

      For more information about launching a self-managed Data Collector tarball, see the Control Hub documentation.

    Important: All existing and new engines will fail to start and generate errors about missing classes. This is expected because the prerequisite tasks are not yet complete. Ignore start engine errors and continue with the next prerequisite task.
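
To double-check the values entered in this step, the following shell sketch derives the stage library names and the Java options from a MapR version. Both function names are hypothetical. The stage library naming pattern is an assumption extrapolated from the single 7.0.x example in this step, so verify the names against the stage library list in Control Hub.

```shell
# Hypothetical helper: print the two stage library names for a MapR
# version (for example 7.0.0) and a MEP major version (for example 8).
# Assumption: other versions follow the pattern of the 7.0.x example.
mapr_stage_libs() {
  major_minor=$(echo "$1" | cut -d. -f1,2 | tr . _)
  echo "streamsets-datacollector-mapr_${major_minor}-lib"
  echo "streamsets-datacollector-mapr_${major_minor}-mep$2-lib"
}

# Hypothetical helper: print the Java options described in this step.
# $1 is the MapR version; $2 is "secure" when built-in security is enabled.
mapr_java_opts() {
  opts=""
  major=${1%%.*}
  # MapR 7.0.x or later needs the two additional properties.
  if [ "$major" -ge 7 ] 2>/dev/null; then
    opts="-Dmapr.library.flatclass -Dsecurity.provider=BCFIPS"
  fi
  # A cluster with built-in security needs the login property.
  if [ "$2" = "secure" ]; then
    opts="${opts:+$opts }-Dmaprlogin.password.enabled=true"
  fi
  echo "$opts"
}
```

For example, calling mapr_java_opts 7.0.0 secure prints all three properties on one line, ready to paste into the Java Options property.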

Step 3. Run the Command to Set Up MapR

After you install all required MapR client libraries and configure a deployment to work with MapR, run the setup-mapr command on every Data Collector machine. This command modifies configuration files and creates the required symbolic links to enable Data Collector to work with MapR. You can run the command in interactive or non-interactive mode.

In interactive mode, the command prompts you for the MapR version and home directory. In non-interactive mode, you define the MapR version and home directory in environment variables before running the command.

In both modes, the command checks if the MapR distribution of Spark is installed in the specified MapR cluster. If a supported version is installed, the command also installs the MapR Spark stage library for you.

Running the Command in Interactive Mode

When you run the setup-mapr command in interactive mode, the command prompts you for the MapR version and home directory.

  1. Set the following environment variables:
    Environment Variable    Description
    SDC_HOME                Data Collector home directory.
    SDC_CONF                Data Collector configuration directory.
    MAPR_MEP_VERSION        MEP version. Enter a single-digit MEP version number: 4, 5, 6, or 8.
    Use the following command to set an environment variable:
    export <environment variable>=<value>
    For example, use the following commands:
    export SDC_HOME=/streamsets-datacollector-5.3.0
    export SDC_CONF=/streamsets-datacollector-5.3.0/etc
    export MAPR_MEP_VERSION=8
  2. Use the following command from the engine installation directory to set up MapR:
    bin/streamsets setup-mapr
  3. When prompted, enter the MapR version.

    Enter the full three-digit version: 6.0.0, 6.0.1, 6.1.0, or 7.0.0.

  4. When prompted, enter the absolute path to the MapR home directory, usually /opt/mapr.
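
Before running the command, it can help to confirm that the environment variables from step 1 are set in the current shell. The following is a minimal sketch of such a check. The function name missing_vars is hypothetical and not part of Data Collector.

```shell
# Hypothetical pre-flight check: print the name of each listed variable
# that is unset or empty, and return non-zero if any is missing.
missing_vars() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      status=1
    fi
  done
  return $status
}

# Possible usage before running bin/streamsets setup-mapr:
#   missing_vars SDC_HOME SDC_CONF MAPR_MEP_VERSION || echo "set these first"
```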

Running the Command in Non-Interactive Mode

When you run the setup-mapr command in non-interactive mode, you define the MapR version and home directory in environment variables before running the command.

  1. Set the following environment variables:
    Environment Variable    Description
    SDC_HOME                Data Collector home directory.
    SDC_CONF                Data Collector configuration directory.
    MAPR_HOME               MapR home directory, usually /opt/mapr.
    MAPR_VERSION            MapR version. Enter the full three-digit version: 6.0.0, 6.0.1, 6.1.0, or 7.0.0.
    MAPR_MEP_VERSION        MEP version. Enter a single-digit MEP version number: 4, 5, 6, or 8.
    Use the following command to set an environment variable:
    export <environment variable>=<value>
    For example, use the following commands:
    export SDC_HOME=/streamsets-datacollector-5.3.0
    export SDC_CONF=/streamsets-datacollector-5.3.0/etc
    export MAPR_HOME=/opt/mapr
    export MAPR_VERSION=7.0.0
    export MAPR_MEP_VERSION=8
  2. Use the following command from the engine installation directory to set up MapR:
    bin/streamsets setup-mapr
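
Because non-interactive mode does not prompt for values, a typo in a version variable is easy to miss. The following sketch validates MAPR_VERSION and MAPR_MEP_VERSION against the values listed in step 1 before you run the command. All three function names are hypothetical, and the sketch does not call setup-mapr itself.

```shell
# Hypothetical validator: accept only the full three-digit MapR versions
# listed in step 1.
valid_mapr_version() {
  case "$1" in
    6.0.0|6.0.1|6.1.0|7.0.0) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical validator: accept only the single-digit MEP versions
# listed in step 1.
valid_mep_version() {
  case "$1" in
    4|5|6|8) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical check of the current environment; prints the first problem
# found to standard error and returns non-zero.
check_mapr_env() {
  valid_mapr_version "${MAPR_VERSION:-}" || { echo "bad MAPR_VERSION" >&2; return 1; }
  valid_mep_version "${MAPR_MEP_VERSION:-}" || { echo "bad MAPR_MEP_VERSION" >&2; return 1; }
}

# Possible usage from the engine installation directory:
#   check_mapr_env && bin/streamsets setup-mapr
```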

Step 4. Configure MapR in Secure Clusters

To connect to a secure MapR cluster with built-in security enabled, ensure that a valid user, tenant, or service ticket exists for the Data Collector user in MapR. To generate tickets, see the MapR documentation.

Note: If the MapR ticket that Data Collector uses allows impersonation, then you can configure MapR stages in Data Collector to use Hadoop impersonation mode.

To run MapR commands in a secure cluster, Data Collector must run as the user account granted access in the MapR ticket.

For example, if you ran the following MapR command to generate the service ticket for applications running outside of the cluster, then Data Collector must run as the myappuser user account:

maprlogin generateticket -type service -user myappuser -out /tmp/longlived_ticket -duration 30:0:0 -renewal 90:0:0
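
A simple way to catch a mismatch early is to compare the account that will run Data Collector with the account named in the ticket. The following is a minimal sketch, assuming you already know the ticket user name. The function name runs_as_ticket_user is hypothetical, and the sketch does not inspect the ticket file itself.

```shell
# Hypothetical check: succeed only when the account that runs
# Data Collector matches the account named in the MapR ticket.
# $1 is the ticket user; $2 defaults to the current user from `id -un`.
runs_as_ticket_user() {
  current=${2:-$(id -un)}
  [ "$current" = "$1" ]
}

# Possible usage:
#   runs_as_ticket_user myappuser || echo "switch to myappuser first"
```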

Step 5. Start Engine Instances

After you install MapR client libraries as needed, configure a deployment, and run the setup-mapr command, you can start all engine instances for the deployment. After you start engine instances, verify that MapR stages are available in stage library lists.

The start method differs depending on whether you are using a secure MapR cluster:

Unsecure MapR cluster
  To start an engine for an unsecure MapR cluster, run the following command from the engine installation directory:
    bin/streamsets dc

Secure MapR cluster
  You can start an engine for a secure MapR cluster in either of the following ways:
  • Log in to the command prompt as the user account granted access in the MapR ticket, then use the following command from the engine installation directory:
    bin/streamsets dc
  • Impersonate the required user account by using the following launch command from the engine installation directory, where <user> is the user account granted access in the MapR ticket:
    sudo -u <user> bin/streamsets dc
    For example:
    sudo -u myappuser /opt/streamsets-datacollector-5.3.0/bin/streamsets dc
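
The start options above can be summarized in a small helper that prints the launch command to use. This is a sketch only: the function name start_command is hypothetical, and it prints the command rather than running it so that you can review it first.

```shell
# Hypothetical helper: print the launch command for the cases above.
# $1 is the MapR ticket user to impersonate on a secure cluster; leave it
# empty for an unsecure cluster or when already logged in as that user.
start_command() {
  if [ -n "${1:-}" ]; then
    echo "sudo -u $1 bin/streamsets dc"
  else
    echo "bin/streamsets dc"
  fi
}
```

Run the printed command from the engine installation directory.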