Amazon S3 Requirements

Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.

The requirements for cluster pipelines that read from Amazon S3 depend on the batch mode:

Cluster EMR batch mode: Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3. Cluster EMR batch mode pipelines require a supported version of an Amazon EMR cluster with Hadoop. For a list of the supported Amazon EMR and Hadoop versions, see Available Stage Libraries in the Data Collector documentation.
Cluster batch mode: Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3. Cluster batch mode pipelines require a supported version of CDH or HDP. For a list of the supported CDH and HDP versions, see Available Stage Libraries in the Data Collector documentation.