Amazon S3 Requirements

Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.

The requirements for cluster pipelines that read from Amazon S3 depend on the batch mode that the pipeline uses:

Cluster EMR batch mode
Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3. Cluster EMR batch mode pipelines require a supported version of an Amazon EMR cluster with Hadoop. For a list of the supported Amazon EMR and Hadoop versions, see Available Stage Libraries in the Data Collector documentation.
Cluster batch mode
Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3. Cluster batch mode pipelines require a supported version of CDH or HDP. For a list of the supported CDH or HDP versions, see Available Stage Libraries in the Data Collector documentation.

Configuring Cluster EMR Batch Mode for Amazon S3

Cluster EMR batch mode pipelines run on an Amazon EMR cluster to process data from Amazon S3.

Cluster EMR batch mode pipelines can run on an existing Amazon EMR cluster or on a new EMR cluster that is provisioned when the pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops.

Data Collector can be installed on a gateway node in an existing Amazon EMR cluster, or outside of the EMR cluster on an on-premises machine or on another Amazon EC2 instance. Regardless of where Data Collector is installed, you'll likely need to modify the Amazon EMR security group to allow Data Collector to access the master node in the EMR cluster. Security groups control inbound and outbound access to EMR cluster instances. For information on configuring security groups for Amazon EMR clusters, see the Amazon EMR documentation.

All processors and destinations supported in cluster pipelines are supported in a cluster EMR batch pipeline as long as network connectivity is correctly configured from the Amazon EMR cluster to any external system that the processors or destinations use. For example, if you include a JDBC Lookup processor in a cluster EMR batch pipeline, you must ensure that the Amazon EMR cluster can connect to the database.

Note: Cluster EMR batch mode pipelines do not support Kerberos authentication at this time.

Complete the following steps to configure a cluster EMR batch mode pipeline to read from Amazon S3:

  1. In Amazon EMR, modify the master security group used by the EMR cluster to allow Data Collector to access the master node in the cluster.
    For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.
  2. In the pipeline properties, on the General tab, set the Execution Mode property to Cluster EMR Batch.
  3. On the Cluster tab of the pipeline, configure the following properties:
    Cluster Property: Description
    Worker Java Options: Additional Java properties for the pipeline. Separate properties with a space.

    The following properties are set by default:

    • -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC select the Concurrent Mark Sweep (CMS) garbage collector.
    • -Dlog4j.debug enables debug logging for log4j.

    Changing the default properties is not recommended.

    You can add any valid Java property.

    Log Level: Log level to use when the pipeline runs on the Amazon EMR cluster. Default is the INFO severity level.
    Worker Memory (MB): Maximum amount of memory allocated to each Data Collector worker in the cluster.

    Default is 1024 MB.

  4. On the EMR tab of the pipeline, configure the following properties:
    EMR Property: Description
    Region: AWS region that contains the EMR cluster.

    If the region does not display in the list, select Custom and then enter the name of the AWS region.

    AWS Access Key: AWS access key ID.
    AWS Secret Key: AWS secret access key.

    The pipeline uses the access key pair to pass credentials to Amazon Web Services to connect to the EMR cluster.

    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    S3 Staging URI: Temporary staging location in Amazon S3 to store the resources and configuration files required to run the pipeline. Data Collector removes the contents from the folder when the pipeline stops.

    Location must be unique for each pipeline. Use the following format:

    s3://<bucket>/<path>

    The bucket must exist. If the folder in the specified path does not exist, it is created.

    Provision a New Cluster: Provisions a new EMR cluster when the pipeline starts.
    Cluster ID: ID of the existing EMR cluster.
  5. If you chose to provision a new EMR cluster, configure the following properties on the EMR tab of the pipeline.

    For more information about the properties required to provision an EMR cluster, see the Amazon EMR documentation.

    EMR Property to Provision New Cluster: Description
    Cluster Name Prefix: Prefix for the name of the provisioned EMR cluster.
    The Data Collector ID and pipeline ID are appended to the prefix as follows:
    <prefix>::<sdc ID>::<pipeline ID>
    Terminate Cluster: Terminates the cluster when the pipeline stops.

    When cleared, the cluster remains active when the pipeline stops.

    Logging Enabled: Enables logging on the cluster.

    When logging is enabled, Amazon EMR writes the cluster log files to the Amazon S3 location that you specify.

    S3 Log URI: Location in Amazon S3 where the cluster writes log data.

    Location must be unique for each pipeline. Use the following format:

    s3://<bucket>/<path>

    The bucket must exist. If the folder in the specified path does not exist, it is created.

    Enable Debugging: Enables debugging on the cluster.

    When debugging is enabled, you can use the Amazon EMR console to view the cluster log files.

    Service Role: EMR role used by the cluster when provisioning resources and performing other service-level tasks.

    Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Job Flow Role: EMR role for EC2 used by EC2 instances within the cluster.

    Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

    Visible to All Users: Determines whether all AWS Identity and Access Management (IAM) users under your account can access the cluster.
    EC2 Subnet ID: EC2 subnet identifier to launch the cluster in.
    Master Security Group: Security group ID for the master node in the cluster.
    Important: Verify that the master security group allows Data Collector to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.
    Slave Security Group: Security group ID for the slave nodes in the cluster.
    Instance Count: Number of Amazon EC2 instances to initialize. Each instance corresponds to a slave node in the EMR cluster.
    Master Instance Type: Amazon EC2 instance type initialized for the master node in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

    Slave Instance Type: Amazon EC2 instance type initialized for the slave nodes in the EMR cluster.

    If an instance type does not display in the list, select Custom and then enter the instance type.

  6. In the pipeline, use the Hadoop FS origin for cluster EMR batch mode.
  7. On the General tab of the origin, select the appropriate EMR stage library for cluster EMR batch mode.
  8. On the Hadoop FS tab of the origin, configure the Hadoop FS URI property to point to the Amazon S3 bucket to read from.

    Use the following format: s3a://<bucket>

    For example: s3a://WebServer

    Then in the Input Paths property, enter the full path to the data to be read within that Amazon S3 bucket. You can enter multiple paths for the Input Paths property, for example:
    • Input Path 1 - /2016/February
    • Input Path 2 - /2016/March

    For more information, see Reading from Amazon S3.

  9. On the S3 tab of the origin, enter the same access key pair that you entered on the EMR tab of the pipeline.

    The origin uses the access key pair to pass credentials to Amazon Web Services to read from Amazon S3.

    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
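
As a quick sanity check before you start the pipeline, the following shell sketch validates the URI formats described in the steps above and previews the name of a provisioned cluster. The function names (is_staging_uri, is_hadoop_fs_uri, emr_cluster_name) and the sample values are hypothetical and are not part of Data Collector. The checks only test the shape of each value, not whether the bucket exists.

```shell
# Hypothetical helpers; none of these names are part of Data Collector.

# Succeeds if the value has the form s3://<bucket>/<path>,
# as used by the S3 Staging URI and S3 Log URI properties.
is_staging_uri() {
  case "$1" in
    s3:///*) return 1 ;;      # empty bucket name
    s3://?*/?*) return 0 ;;   # bucket and path are both present
    *) return 1 ;;
  esac
}

# Succeeds if the value has the form s3a://<bucket>, with no path,
# as used by the Hadoop FS URI property of the origin.
is_hadoop_fs_uri() {
  case "$1" in
    s3a://*/*) return 1 ;;    # a path belongs in Input Paths instead
    s3a://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the provisioned cluster name: <prefix>::<sdc ID>::<pipeline ID>.
emr_cluster_name() {
  printf '%s::%s::%s\n' "$1" "$2" "$3"
}
```

For example, is_staging_uri "s3://my-bucket/sdc/staging" succeeds, while is_hadoop_fs_uri "s3a://WebServer/2016" fails because the path should be entered in the Input Paths property.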

Configuring Cluster Batch Mode for Amazon S3

Cluster batch mode pipelines run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3.

Complete the following steps to configure a cluster batch mode pipeline to read from Amazon S3:

  1. Verify the installation of HDFS and YARN.
  2. Install Data Collector on a YARN gateway node.
  3. Grant the user defined in the user environment variable write permission on /user/$SDC_USER.
    The user environment variable defines the system user used to run Data Collector as a service. The file that defines the user environment variable depends on your operating system. For more information, see User and Group for Service Start in the Data Collector documentation.
    For example, if the user environment variable is defined as sdc and the cluster does not use Kerberos, you might use the following commands to create the directory and configure the necessary write permissions:
    $ sudo -u hdfs hadoop fs -mkdir /user/sdc
    $ sudo -u hdfs hadoop fs -chown sdc /user/sdc
  4. To enable Data Collector to submit YARN jobs, perform one of the following tasks:
    • On YARN, set the min.user.id property to a value equal to or lower than the user ID of the Data Collector user, typically named "sdc".
    • On YARN, add the Data Collector user name, typically "sdc", to the allowed.system.users property.
    • After you create the pipeline, specify a Hadoop FS user in the Hadoop FS origin.

      For the Hadoop FS User property, enter a user with an ID that is equal to or higher than the min.user.id property, or with a user name that is listed in the allowed.system.users property.

  5. On YARN, verify that the Hadoop logging level is set to a severity of INFO or lower.
    YARN sets the Hadoop logging level to INFO by default. To change the logging level:
    1. Edit the log4j.properties file.
      By default, the file is located in the following directory:
      /etc/hadoop/conf
    2. Set the log4j.rootLogger property to a severity of INFO or lower, such as DEBUG or TRACE.
  6. If YARN is configured to use Kerberos authentication, configure Data Collector to use Kerberos authentication.
    When you configure Kerberos authentication for Data Collector, you enable Data Collector to use Kerberos and define the principal and keytab.
    Important: For cluster pipelines, enter an absolute path to the keytab when configuring Data Collector. Standalone pipelines do not require an absolute path.
    Once enabled, Data Collector automatically uses the Kerberos principal and keytab to connect to any YARN cluster that uses Kerberos. For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
  7. In the pipeline properties, on the General tab, set the Execution Mode property to Cluster Batch.
  8. On the Cluster tab, configure the following properties:
    Cluster Property: Description
    Worker Java Options: Additional Java properties for the pipeline. Separate properties with a space.

    The following properties are set by default:

    • -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC select the Concurrent Mark Sweep (CMS) garbage collector.
    • -Dlog4j.debug enables debug logging for log4j.

    Changing the default properties is not recommended.

    You can add any valid Java property.

    Launcher Env Configuration: Additional configuration properties for the cluster launcher. Using simple or bulk edit mode, click the Add icon and define the property name and value.

    Worker Memory (MB): Maximum amount of memory allocated to each Data Collector worker in the cluster.

    Default is 1024 MB.

  9. In the pipeline, use the Hadoop FS origin for cluster batch mode.

    For more information about the origin, see Hadoop FS (deprecated).

  10. On the General tab of the origin, select the appropriate CDH or HDP stage library for cluster mode.
  11. On the Hadoop FS tab of the origin, configure the Hadoop FS URI property to point to the Amazon S3 bucket to read from.

    Use the following format: s3a://<bucket>

    For example: s3a://WebServer

    Then in the Input Paths property, enter the full path to the data to be read within that Amazon S3 bucket. You can enter multiple paths for the Input Paths property, for example:
    • Input Path 1 - /2016/February
    • Input Path 2 - /2016/March

    For more information, see Reading from Amazon S3.

  12. On the Hadoop FS tab of the origin, enable the Kerberos Authentication property if YARN is configured to use Kerberos authentication.
  13. On the S3 tab of the origin, enter the AWS access key pair used to pass credentials to Amazon Web Services to read from Amazon S3.
    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
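
Step 4 above lists the conditions under which YARN accepts jobs from the Data Collector user. The following shell sketch checks the first two conditions against a YARN container executor configuration file. It assumes that min.user.id and allowed.system.users are defined as key=value lines in a file such as container-executor.cfg. The file location varies by distribution, and the function names are hypothetical.

```shell
# Hypothetical helpers; not part of Data Collector or Hadoop.

# Prints the value of a key=value property in a file (last definition wins).
prop_value() {
  awk -F= -v k="$2" '$1 == k { v = $2 } END { print v }' "$1"
}

# Usage: can_submit_yarn_jobs <config file> <user name> <numeric user ID>
# Succeeds if the user ID is at or above min.user.id, or if the
# user name is listed in allowed.system.users.
can_submit_yarn_jobs() (
  cfg=$1 user=$2 uid=$3
  min_id=$(prop_value "$cfg" min.user.id)
  allowed=$(prop_value "$cfg" allowed.system.users)

  # Condition 1: min.user.id is equal to or lower than the user ID.
  if [ -n "$min_id" ] && [ "$uid" -ge "$min_id" ]; then
    exit 0
  fi

  # Condition 2: the user name appears in the comma-separated list.
  case ",$allowed," in
    *",$user,"*) exit 0 ;;
  esac

  exit 1
)
```

On a gateway node, you might call it as can_submit_yarn_jobs <path to config file> sdc "$(id -u sdc)". If the check fails, the third option in step 4, specifying a Hadoop FS user in the origin, still applies.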