Hadoop FS (deprecated)

Supported pipeline types:
  • Data Collector

The Hadoop FS origin reads data from the Hadoop Distributed File System (HDFS), Amazon S3, or other file systems using the Hadoop FileSystem interface. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Important: This stage is deprecated along with cluster pipelines and may be removed in a future release. StreamSets recommends using StreamSets Transformer instead. For more information, see the Transformer documentation.
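
The Hadoop FileSystem interface resolves a concrete implementation from the URI scheme, which is how a single origin can read from HDFS, Amazon S3, and other stores. The following is a minimal Java sketch of that API, assuming hadoop-client (and hadoop-aws for S3) is on the classpath; the host, port, and bucket names are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FsSchemes {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The URI scheme selects the implementation behind the common interface:
        // hdfs:// resolves to HDFS, s3a:// to Amazon S3 (requires hadoop-aws).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        System.out.println(hdfs.getClass().getSimpleName()); // DistributedFileSystem
        System.out.println(s3.getClass().getSimpleName());   // S3AFileSystem
      }
    }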
Use this origin only in pipelines configured for one of the following cluster modes:
Cluster batch mode
Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from HDFS, Amazon S3, or other file systems using the Hadoop FileSystem interface.
Cluster EMR batch mode
Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3.

For more information about cluster pipelines, see Cluster Pipelines (deprecated). To read from HDFS in standalone execution mode, use the Hadoop FS Standalone origin.

When you configure the Hadoop FS origin, you specify the input path and data format for the data to be read. You can configure the origin to read from all subdirectories and to generate a single record for records that include multiple objects.
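
At the FileSystem API level, reading from all subdirectories corresponds to a recursive listing of the input path. A rough Java sketch of that operation, with a placeholder input path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ListRecursive {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The second argument enables recursion, comparable to the origin's
        // option to read from all subdirectories of the input path.
        RemoteIterator<LocatedFileStatus> files =
            fs.listFiles(new Path("/data/input"), true);
        while (files.hasNext()) {
          LocatedFileStatus f = files.next();
          System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
        }
      }
    }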

The origin detects compression by file extension and can read data compressed with any Hadoop-supported compression codec.
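
Hadoop resolves the codec from the file name, which is what extension-based reading relies on. A short Java sketch using the Hadoop compression API; the file path is a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadByExtension {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/events.log.gz");
        // Map the file extension to a codec: .gz -> GzipCodec, .bz2 -> BZip2Codec, etc.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        InputStream in = (codec == null)
            ? fs.open(file)                           // no known extension: read raw
            : codec.createInputStream(fs.open(file)); // decompress on the fly
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
          System.out.println(reader.readLine());
        }
      }
    }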

When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
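
These options map onto standard Hadoop client facilities: configuration resources, individual properties, Kerberos keytab login, and proxy-user impersonation. A hedged Java sketch of those facilities; the configuration directory, principal, keytab path, property, and user names are all placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pick up site files from a Hadoop configuration directory.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        // Additional Hadoop properties can be set individually.
        conf.set("dfs.client.use.datanode.hostname", "true");

        // Kerberos login from a keytab.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab("sdc@EXAMPLE.COM", "/etc/sdc/sdc.keytab");

        // Impersonate another Hadoop user; the cluster's core-site.xml must
        // allow the login principal as a proxy user.
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
            "etl_user", UserGroupInformation.getLoginUser());
        System.out.println("Acting as: " + proxy.getUserName());
      }
    }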

The Hadoop FS origin generates record header attributes that enable you to track the origins of a record, such as the source file it was read from, in pipeline processing.