Hadoop FS (deprecated)
Supported pipeline types:
- Cluster batch mode
- Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from HDFS, Amazon S3, or other file systems using the Hadoop FileSystem interface.
- Cluster EMR batch mode
- Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3.
For more information about cluster pipelines, see Cluster Pipelines (deprecated). To read from HDFS in standalone execution mode, use the Hadoop FS Standalone origin.
The Hadoop FS origin reads compressed data based on file extension for all Hadoop-supported compression codecs. It also generates record header attributes that enable you to use the origins of a record in pipeline processing.
When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
Reading from Amazon S3
The Hadoop FS origin included in a cluster batch or cluster EMR batch pipeline allows you to read from Amazon S3.
To read from Amazon S3, specify the appropriate URI for Amazon S3 when you configure the Hadoop FS origin. Use the s3a scheme in the URI. S3A is the active connector maintained by open source Hadoop and is the only connector that works with Hadoop and Amazon S3.
Configure the URI to point to the Amazon S3 bucket to read from, as follows:
s3a://<bucket>
For example, to read from a bucket named WebServer, enter:
s3a://WebServer
Then in the Input Paths property, enter the full path to the data to be read within that Amazon S3 bucket. You can enter multiple paths in the Input Paths property, as shown in the example below.
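For example, to read two monthly log directories within the bucket, you might enter input paths such as the following. The directory names are hypothetical:
/logs/2020/01
/logs/2020/02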
For additional requirements when using the Hadoop FS origin to read from Amazon S3, see Amazon S3 Requirements.
Reading from Other File Systems
The Hadoop FS origin included in a cluster batch pipeline allows you to read from file systems other than HDFS using the Hadoop FileSystem interface.
For example, you can use the Hadoop FS origin to read data from Microsoft Azure Data Lake Storage in a cluster batch pipeline if the origin system has the Hadoop FileSystem interface installed. To read from another file system, complete the following steps:
- Make sure the Hadoop FileSystem interface is installed on the file system.
- Install all required file system application JAR files as external libraries for the Hadoop FS stage library that you use. See the file system documentation for details about the files to install. For instructions on installing external libraries, see Install External Libraries in the Data Collector documentation.
- When you configure the Hadoop FS origin, specify the appropriate URI for the origin system. For example, instead of hdfs://<authority>, to connect to Azure Data Lake Storage, you might use adls://<authority>.
Kerberos Authentication
You can use Kerberos authentication to connect to HDFS. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to HDFS. When Kerberos is not enabled, Data Collector uses the user account that started it to connect by default.
The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file, and then enable Kerberos in the Hadoop FS origin.
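For example, the Kerberos properties in $SDC_CONF/sdc.properties might look like the following sketch. The principal name and keytab path are placeholder values:
# Enable Kerberos authentication for Data Collector
kerberos.client.enabled=true
# Kerberos principal and keytab that Data Collector uses to connect
kerberos.client.principal=sdc/_HOST@EXAMPLE.COM
kerberos.client.keytab=/etc/sdc/sdc.keytab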
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
Using a Hadoop User
Data Collector can either use the currently logged in Data Collector user or a user configured in the Hadoop FS origin to read from HDFS.
A Data Collector configuration property can be set that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the origin. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode in the Data Collector documentation.
Note that the origin uses a different user account to connect to HDFS. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal. To configure a user in the origin to read from HDFS, perform the following tasks:
- On Hadoop, configure the user as a proxy user and authorize the user to impersonate a Hadoop user, as shown in the example after this list. For more information, see the Hadoop documentation.
- In the Hadoop FS origin, on the Hadoop FS tab, configure the Hadoop FS User property.
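For example, if Data Collector runs as a system user named sdc, the proxy user configuration in core-site.xml on the Hadoop cluster might look like the following sketch. The user name and wildcard values are placeholders; restrict the allowed hosts and groups to match your environment:
<!-- Hosts from which the sdc user can submit impersonated requests -->
<property>
  <name>hadoop.proxyuser.sdc.hosts</name>
  <value>*</value>
</property>
<!-- Groups whose members the sdc user can impersonate -->
<property>
  <name>hadoop.proxyuser.sdc.groups</name>
  <value>*</value>
</property>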
Hadoop Properties and Configuration Files
- Hadoop configuration files
- You can use the following Hadoop configuration files with the Hadoop FS origin:
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- Individual properties
- You can configure individual Hadoop properties in the origin. To add a Hadoop property, you specify the exact property name and the value, as shown in the example below. The Hadoop FS origin does not validate the property names or values.
Note: Individual properties override properties defined in the Hadoop configuration files.
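For example, you might add a standard HDFS client property such as the following, entered as an exact property name and value in the origin. The property shown here is illustrative:
Name: dfs.client.use.datanode.hostname
Value: true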
Record Header Attributes
The Hadoop FS origin creates record header attributes that include information about the originating file for the record.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
The origin creates the following record header attributes:
- file - Provides the file path and file name where the record originated.
- offset - Provides the file offset in bytes. The file offset is the location in the file where the record originated.
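For example, you might use expressions such as the following, perhaps in an Expression Evaluator processor, to copy the file and offset attributes into record fields:
${record:attribute('file')}
${record:attributeOrDefault('offset', '0')}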
Data Formats
- Avro
- Generates a record for every Avro record. Includes a precision and scale field attribute for each Decimal field.
- Delimited
- Generates a record for each delimited line.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.