Hadoop FS (deprecated)
Supported pipeline types:
- Cluster batch mode: Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from HDFS, Amazon S3, or other file systems that implement the Hadoop FileSystem interface, as sketched below.
- Cluster EMR batch mode: Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3.
For more information about cluster pipelines, see Cluster Pipelines (deprecated). To read from HDFS in standalone execution mode, use the Hadoop FS Standalone origin.
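The Hadoop FileSystem interface referenced above is the standard Apache Hadoop client API. As a point of reference, a minimal read through that interface looks roughly like the following sketch; the NameNode address and input path are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally loaded from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/input/part-00000")),
                                   StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```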
When you configure the Hadoop FS origin, you specify the input path and data format for the data to be read. You can configure the origin to read from all subdirectories and to generate a single record when an input record includes multiple objects.
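When the origin reads from all subdirectories, it effectively performs a recursive walk of the input path. A rough equivalent using the Hadoop client API, with a hypothetical input path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListRecursiveSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The boolean flag requests a recursive walk of all subdirectories.
    RemoteIterator<LocatedFileStatus> it =
        fs.listFiles(new Path("/data/input"), true);
    while (it.hasNext()) {
      System.out.println(it.next().getPath());
    }
  }
}
```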
The origin detects and reads compressed data based on the file extension, for all Hadoop-supported compression codecs.
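Hadoop itself resolves a codec from the file extension through CompressionCodecFactory, which illustrates the extension-based detection the origin relies on; the file name below is hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecByExtensionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/input/events.log.gz");

    // getCodec() matches a registered codec by file extension (.gz -> GzipCodec).
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        codec == null ? fs.open(path) : codec.createInputStream(fs.open(path)),
        StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
  }
}
```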
When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
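At the Hadoop client level, Kerberos login and user impersonation map to the UserGroupInformation API. The following sketch shows both together; the principal, keytab path, and impersonated user are hypothetical:

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosImpersonationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Additional Hadoop properties can be set directly on the Configuration.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Log in with a keytab (principal and path are hypothetical).
    UserGroupInformation.loginUserFromKeytab(
        "sdc@EXAMPLE.COM", "/etc/security/keytabs/sdc.keytab");

    // Impersonate another Hadoop user, as the origin's impersonation option does.
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        "etl-user", UserGroupInformation.getLoginUser());
    proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
      System.out.println(FileSystem.get(conf).exists(new Path("/data/input")));
      return null;
    });
  }
}
```

In the origin itself, these steps are driven by the stage configuration rather than code; the sketch only shows what those options correspond to in the Hadoop client.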
The Hadoop FS origin generates record header attributes that enable you to use information about the origins of a record in pipeline processing.
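For example, a downstream processor can read a header attribute through the StreamSets Record API, or an expression can reference it with the record:attribute EL function. The attribute name "file" used below is an assumption for a file-based origin, not confirmed by this document:

```java
import com.streamsets.pipeline.api.Record;

public class HeaderAttributeSketch {
  // Returns the originating file path stored in the record header, if present.
  // The attribute name "file" is an assumption for a file-based origin.
  static String sourceFile(Record record) {
    return record.getHeader().getAttribute("file");
  }
}
```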