Hadoop FS (deprecated)
- Cluster batch mode: Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from HDFS, Amazon S3, or other file systems using the Hadoop FileSystem interface.
- Cluster EMR batch mode: Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3.
For more information about cluster pipelines, see Cluster Pipelines (deprecated). To read from HDFS in standalone execution mode, use the Hadoop FS Standalone origin.
The Hadoop FS origin reads compressed data based on file extension for all Hadoop-supported compression codecs. It also generates record header attributes that enable you to use the origins of a record in processing.
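Selecting a decompressor from the file extension can be illustrated with a minimal Python sketch. This is not the origin's implementation: the `OPENERS` mapping and the `open_by_extension` helper are hypothetical names introduced here, and only two codecs from the Python standard library are covered, whereas Hadoop supports a larger set.

```python
import bz2
import gzip
import os

# Hypothetical mapping for illustration only; Hadoop's own codec list is larger.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(path):
    """Open a file as text, decompressing it when the extension is recognized."""
    _, ext = os.path.splitext(path)
    # Fall back to the built-in open() for unrecognized extensions.
    opener = OPENERS.get(ext.lower(), open)
    return opener(path, "rt", encoding="utf-8")
```

A caller reads `events.log.gz` and `events.log` the same way, because the choice of codec depends only on the file name.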
When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
When the pipeline stops, the origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested data.
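The stop, resume, and reset behavior can be sketched in Python as a saved read position. This is an illustrative model only: how the origin actually stores its position is not described here, and the JSON offset file and all function names below are assumptions made for the example.

```python
import json
import os


def load_offset(offset_path):
    """Return the saved line offset, or 0 when none exists (first run or after a reset)."""
    if not os.path.exists(offset_path):
        return 0
    with open(offset_path, encoding="utf-8") as f:
        return json.load(f)["line"]


def save_offset(offset_path, line):
    """Record how many lines have been processed so far."""
    with open(offset_path, "w", encoding="utf-8") as f:
        json.dump({"line": line}, f)


def reset_offset(offset_path):
    """Forget the saved position so the next run processes all data again."""
    if os.path.exists(offset_path):
        os.remove(offset_path)


def read_from_offset(data_path, offset_path, max_lines):
    """Read up to max_lines lines past the saved offset, then save the new offset."""
    start = load_offset(offset_path)
    records = []
    with open(data_path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            if index < start:
                continue  # already processed in an earlier run
            if len(records) >= max_lines:
                break
            records.append(line.rstrip("\n"))
    save_offset(offset_path, start + len(records))
    return records
```

Calling `read_from_offset` twice continues where the first call ended, and calling `reset_offset` in between makes the next call start from the first line.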