Hadoop FS (deprecated)

Supported pipeline types:
  • Data Collector

The Hadoop FS origin reads data from the Hadoop Distributed File System (HDFS), Amazon S3, or other file systems using the Hadoop FileSystem interface. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Important: This stage is deprecated along with cluster pipelines and may be removed in a future release. StreamSets recommends using StreamSets Transformer instead. For more information, see the Transformer documentation.
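
The Hadoop FileSystem interface resolves a concrete implementation from the URI scheme, which is how a single origin can read from HDFS, Amazon S3, and other stores. The following is a minimal Java sketch of that API, assuming hadoop-client (and hadoop-aws for S3) is on the classpath; the host, port, and bucket names are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsSchemes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI scheme selects the implementation behind the common interface:
        // hdfs:// resolves to HDFS, s3a:// to Amazon S3 (requires hadoop-aws).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        System.out.println(hdfs.getClass().getSimpleName()); // DistributedFileSystem
        System.out.println(s3.getClass().getSimpleName());   // S3AFileSystem
      }
    }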
Use this origin only in pipelines configured for one of the following cluster modes:
Cluster batch mode
Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from HDFS, Amazon S3, or other file systems using the Hadoop FileSystem interface.
Cluster EMR batch mode
Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3.

For more information about cluster pipelines, see Cluster Pipelines (deprecated). To read from HDFS in standalone execution mode, use the Hadoop FS Standalone origin.

When you configure the Hadoop FS origin, you specify the input path and data format for the data to be read. You can configure the origin to read from all subdirectories and to generate a single record for records that include multiple objects.
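
At the FileSystem API level, reading from all subdirectories corresponds to a recursive listing of the input path. A rough Java sketch of that operation, with a placeholder input path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ListRecursive {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The second argument enables recursion, comparable to the origin's
        // option to read from all subdirectories of the input path.
        RemoteIterator<LocatedFileStatus> files =
            fs.listFiles(new Path("/data/input"), true);
        while (files.hasNext()) {
          LocatedFileStatus f = files.next();
          System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
        }
      }
    }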

The origin detects compression by file extension and can read data compressed with any Hadoop-supported compression codec.
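
Hadoop resolves the codec from the file name, which is what extension-based reading relies on. A short Java sketch using the Hadoop compression API; the file path is a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadByExtension {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/events.log.gz");
        // Map the file extension to a codec: .gz -> GzipCodec, .bz2 -> BZip2Codec, etc.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = (codec == null)
            ? fs.open(file)                           // no known extension: read raw
            : codec.createInputStream(fs.open(file)); // decompress on the fly
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          System.out.println(reader.readLine());
        }
      }
    }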

When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
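
These options map onto standard Hadoop client facilities: configuration resources, individual properties, Kerberos keytab login, and proxy-user impersonation. A hedged Java sketch of those facilities; the configuration directory, principal, keytab path, property, and user names are all placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pick up site files from a Hadoop configuration directory.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        // Additional Hadoop properties can be set individually.
        conf.set("dfs.client.use.datanode.hostname", "true");

        // Kerberos login from a keytab.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab("sdc@EXAMPLE.COM", "/etc/sdc/sdc.keytab");

        // Impersonate another Hadoop user; the cluster's core-site.xml must
        // allow the login principal as a proxy user.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "etl_user", UserGroupInformation.getLoginUser());
        System.out.println("Acting as: " + proxy.getUserName());
      }
    }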

The Hadoop FS origin generates record header attributes that enable you to track the origins of a record, such as the source file it was read from, in pipeline processing.