MapReduce

The MapReduce executor starts a MapReduce job in HDFS or MapR FS each time it receives an event record. Use the MapReduce executor as part of an event stream.

MapR is now HPE Ezmeral Data Fabric. At times, this documentation uses "MapR" to refer to both MapR and HPE Ezmeral Data Fabric. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

You can use the MapReduce executor to start a custom job, such as a validation job that compares the number of records in files. You define a custom job by configuring it in the executor or by using a configuration object. You can also use the MapReduce executor to start a predefined job. The executor includes two predefined jobs: one that converts Avro files to ORC files and one that converts Avro files to Parquet files.

You can use the executor in any logical way, such as running MapReduce jobs after the Hadoop FS or MapR FS destination closes files. For example, you can use the Avro to ORC job to convert Avro files to ORC files after a MapR FS destination closes a file. Or, you might use the Avro to Parquet job to convert Avro files to Parquet after the Hadoop FS destination closes a file as part of the Drift Synchronization Solution for Hive.

Note: The MapReduce executor starts jobs in an external system. It does not monitor the jobs or wait for them to complete. The executor becomes available for additional processing as soon as it successfully submits a job.

When you configure the MapReduce executor, you specify connection information and job details. For predefined jobs, you specify Avro conversion details, such as the input and output file location, as well as ORC- or Parquet-specific details. For other types of jobs, you specify a job creator or configuration object, and the job configuration properties to use.

When necessary, you can enable Kerberos authentication and specify a MapReduce user. You can also use MapReduce configuration files and add other MapReduce configuration properties as needed.
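
For example, you might add a property that routes submitted jobs to a specific YARN queue. The property name below is a standard MapReduce property; the queue name is a placeholder for your environment:

mapreduce.job.queuename=sdc_jobs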

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

For a solution that describes how to use the MapReduce executor, see Converting Data to the Parquet Data Format.

Prerequisites

Before you run a pipeline that includes the MapReduce executor, you must enable the executor to submit jobs.

You can enable the MapReduce executor to submit jobs in several ways. Perform one of the following tasks:

Configure the YARN Minimum User ID property, min.user.id
The min.user.id property is set to 1000 by default. To allow job submission:
  1. Verify the user ID being used by the Data Collector user, typically named "sdc".
  2. In Hadoop, configure the YARN min.user.id property.

    Set the property to a value equal to or lower than the Data Collector user ID.
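
    For example, if the sdc user has user ID 995, you might lower the property so that the ID is allowed. This is a sketch of how the entry might appear in the YARN container executor configuration; the exact file or management tool depends on your distribution:

    min.user.id=995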

Configure the YARN Allowed System Users property, allowed.system.users
The allowed.system.users property lists allowed user names. To allow job submission:
  1. In Hadoop, configure the YARN allowed.system.users property.

    Add the Data Collector user name, typically "sdc", to the list of allowed users.
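
    For example, you might append sdc to any existing entries. This is a sketch of how the entry might appear in the YARN container executor configuration; the exact file or management tool depends on your distribution:

    allowed.system.users=nobody,sdc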

Configure the MapReduce executor MapReduce User property
In the MapReduce executor, the MapReduce User property allows you to enter a user name for the stage to use when submitting jobs. To allow job submission:
  1. In the MapReduce executor stage, configure the MapReduce User property.

    Enter a user with an ID equal to or higher than the min.user.id value, or with a user name that is listed in the allowed.system.users property.

For information about the MapReduce User property, see Using a MapReduce User.

Related Event Generating Stages

Use the MapReduce executor in the event stream of a pipeline. The MapReduce executor is meant to start MapReduce jobs after output files are written.

Use the MapReduce executor to perform post-processing for files written by the following destinations:
  • Hadoop FS destination
  • MapR FS destination

MapReduce Jobs and Job Configuration Properties

The MapReduce executor can run a custom job that you configure or one of the predefined jobs provided with the executor.

When configuring a custom job, you can either specify a job creator and job configuration properties, or use a custom configuration object and specify job configuration properties.

When using a predefined job, you specify the job, Avro conversion details, and job-related properties.

When configuring job configuration properties, you specify key-value pairs. You can use expressions in the key-value pairs.
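
For example, you might pass the input file path from the event record into a job property. The property name shown here is a hypothetical placeholder for whatever key your job expects; the expression is the same one used by default for the input file:

custom.job.input.path=${record:value('/filepath')}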

Predefined Jobs for Parquet and ORC

The MapReduce executor includes two predefined jobs: Avro to ORC and Avro to Parquet.

The Avro to ORC job converts Avro files to ORC files. The Avro to Parquet job converts Avro files to Parquet files. Both jobs process Avro files after they are written: a destination finishes writing an Avro file and generates an event record that includes the name and location of the file. When the MapReduce executor receives the event record, it starts the selected predefined MapReduce job.

When using a predefined job, you configure input file information and the output directory, whether to keep the input file, and whether to overwrite temporary files.

By default, for the input file, the MapReduce executor uses the file name and location in the "filepath" field of the event record, as follows:
${record:value('/filepath')}

The executor writes output files to the specified output directory. It uses the name of the processed input file as the basis for the output file name and adds .parquet or .orc, depending on the job type.

When using the Avro to ORC job, you specify the ORC batch size on the Avro to ORC tab. To specify additional job information, add job configuration properties on the Job tab. For information about the properties you might want to use, see the Hive documentation.

When using the Avro to Parquet job, you specify job-specific properties on the Avro to Parquet tab. You can specify additional job information by adding job configuration properties on the Job tab.

Event Generation

The MapReduce executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it starts a MapReduce job.

You can use MapReduce executor events in any logical way. For example, you might use the job ID and tracking URL in the event record to monitor or report on submitted jobs.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the MapReduce executor have the following event-related record header attributes. Record header attributes are stored as String values:
Record Header Attribute Description
sdc.event.type Event type. Uses one of the following types:
  • job-created - Generated when the executor creates and starts a MapReduce job.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
Event records generated by the MapReduce executor have the following fields:
Event Field Name Description
tracking-url Tracking URL for the MapReduce job.
job-id Job ID of the MapReduce job.
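
For example, a stage downstream in the event stream can reference these values with expressions such as the following. The record:attribute and record:value functions are standard expression language functions; how you use the values depends on your pipeline:

${record:attribute('sdc.event.type')}
${record:value('/job-id')}
${record:value('/tracking-url')}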

Kerberos Authentication

You can use Kerberos authentication to connect to Hadoop services such as HDFS or YARN. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to authenticate. When Kerberos is not enabled, Data Collector uses the user account that started it to connect.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file, and then enable Kerberos in the MapReduce executor.
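
The Kerberos settings in sdc.properties typically look like the following. This is a sketch that assumes the standard kerberos.client.* property names; the principal and keytab path are examples for your environment:

kerberos.client.enabled=true
kerberos.client.principal=sdc/_HOST@EXAMPLE.COM
kerberos.client.keytab=/path/to/sdc.keytab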

For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.

Using a MapReduce User

Data Collector can use either the currently logged in Data Collector user or a user configured in the executor to submit jobs.

You can set a Data Collector configuration property that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the executor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode in the Data Collector documentation.

Note that the executor uses a different user account to connect to the external system. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.

To configure a user in the executor, perform the following tasks:
  1. On the external system, configure the Data Collector user as a proxy user and authorize it to impersonate the MapReduce user.

    For more information, see the MapReduce documentation. A sample proxy user configuration follows these steps.

  2. In the MapReduce executor, on the MapReduce tab, configure the MapReduce User property.
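
For example, assuming the Data Collector user is sdc, you might add properties like the following to core-site.xml on the cluster. The wildcard values shown here are permissive examples; restrict the hosts and groups as appropriate for your environment:

hadoop.proxyuser.sdc.hosts=*
hadoop.proxyuser.sdc.groups=*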

Configuring a MapReduce Executor

Configure a MapReduce executor to start MapReduce jobs each time the executor receives an event record.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

  2. On the MapReduce tab, configure the following properties:
    MapReduce Property Description
    MapReduce Configuration Directory Absolute path to the directory containing the Hive and Hadoop configuration files. For a Cloudera Manager installation, enter hive-conf.
    The stage uses the following configuration files:
    • core-site.xml
    • yarn-site.xml
    • mapred-site.xml
    Note: Properties in the configuration files are overridden by individual properties defined in this stage.
    MapReduce Configuration Additional properties to use.

    To add properties, click Add and define the property name and value. Use the property names and values as expected by HDFS or MapR FS.

    MapReduce User The MapReduce user to use to connect to the external system. When using this property, make sure the external system is configured appropriately.

    When not configured, the pipeline uses the currently logged in Data Collector user.

    Not configurable when Data Collector is configured to use the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode in the Data Collector documentation.

    Kerberos Authentication Uses Kerberos credentials to connect to the external system.

    When selected, uses the Kerberos principal and keytab defined in the Data Collector configuration file.

  3. On the Jobs tab, configure the following properties:
    Job Property Description
    Job Name Display name for the MapReduce job.

    This name displays in Hadoop web applications and other reporting tools that list MapReduce jobs.

    Job Type Type of MapReduce job to run:
    • Custom - Use a custom job creator interface and job configuration properties to define the job.
    • Configuration Object - Use a configuration object and job configuration properties to define the job.
    • Convert Avro to Parquet - Use a predefined job to convert Avro files to Parquet. Specify Avro conversion properties and optionally configure additional job configuration properties for the job.
    • Convert Avro to ORC - Use a predefined job to convert Avro files to ORC files. Specify Avro conversion properties and optionally configure additional job configuration properties for the job.
    Custom JobCreator MapReduce Job Creator interface to use for custom jobs.
    Job Configuration Key-value pairs of configuration properties to define the job. You can use expressions in both keys and values.

    Using simple or bulk edit mode, click the Add icon to add additional properties.

  4. When using a predefined job, click the Avro Conversion tab, and configure the following properties:
    Avro Conversion Property Description
    Input Avro File Expression that evaluates to the name and location of the Avro file to process.

    By default, processes the file with the name and location specified in the filepath field of the event record.

    Keep Input File Leaves the processed Avro file in place. By default, the executor removes the file after processing.
    Output Directory Location to write the resulting ORC or Parquet file. Use an absolute path.
    Overwrite Temporary File Enables overwriting any existing temporary files that remain from a previous run of the job.
  5. To use the Avro to Parquet job, click the Avro to Parquet tab, and configure the following properties:
    Avro to Parquet Property Description
    Compression Codec Compression codec to use. If you do not enter a compression codec, the executor uses the default compression codec for Parquet.
    Row Group Size Parquet row group size. Use -1 to use the Parquet default.
    Page Size Parquet page size. Use -1 to use the Parquet default.
    Dictionary Page Size Parquet dictionary page size. Use -1 to use the Parquet default.
    Max Padding Size Parquet maximum padding size. Use -1 to use the Parquet default.
  6. To use the Avro to ORC job, click the Avro to ORC tab, and configure the following properties:
    Avro to ORC Property Description
    ORC Batch Size The maximum number of records written to ORC files at one time.