MapR FS (deprecated)
Supported pipeline types:
Data Collector provides several MapR origins to address different needs. For a quick comparison chart to help you choose the right one, see Comparing MapR Origins.
The MapR FS origin reads compressed data based on file extension for all Hadoop-supported compression codecs. The origin also generates record header attributes that enable you to identify the originating file of each record in pipeline processing.
When necessary, you can enable Kerberos authentication. You can also specify a Hadoop user to impersonate, define a Hadoop configuration file directory, and add Hadoop configuration properties as needed.
Before you use any MapR stage in a pipeline, you must perform additional steps to enable Data Collector to process MapR data. For more information, see MapR Prerequisites in the Data Collector documentation.
Kerberos Authentication
You can use Kerberos authentication to connect to MapR. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to MapR. Otherwise, Data Collector uses the user account that started it to connect.
The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file, and then enable Kerberos in the MapR FS origin.
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
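For reference, the Kerberos properties in $SDC_CONF/sdc.properties look similar to the following sketch. The principal and keytab path are placeholders for your environment; confirm the exact property names against your Data Collector version:

```properties
# Enable Kerberos authentication for Data Collector
kerberos.client.enabled=true

# Kerberos principal that Data Collector uses to connect
kerberos.client.principal=sdc/_HOST@EXAMPLE.COM

# Keytab file for the principal (relative to $SDC_CONF, or an absolute path)
kerberos.client.keytab=sdc.keytab
```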
Using a Hadoop User
Data Collector can either use the currently logged in Data Collector user or a user configured in the MapR FS origin to read files from MapR FS.
A Data Collector configuration property can be set that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the origin. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode in the Data Collector documentation.
Note that the origin can use a different user account to connect to MapR FS. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal. To configure the origin to use a different user account, perform the following tasks:
- On MapR, configure the user as a proxy user and authorize the user to impersonate the Hadoop user. For more information, see the MapR documentation.
- In the MapR FS origin, on the Hadoop FS tab, configure the Hadoop FS User property.
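As an illustration, authorizing a proxy user on the cluster typically involves the standard Hadoop proxy-user properties in core-site.xml, similar to the following sketch. The user name sdc and the wildcard values are placeholders; restrict hosts and groups as appropriate for your environment:

```xml
<!-- Allow the sdc user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.sdc.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sdc.groups</name>
  <value>*</value>
</property>
```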
Hadoop Properties and Configuration Files
You can configure the MapR FS origin to use individual Hadoop properties or Hadoop configuration files:
- Hadoop configuration files
- You can use the following Hadoop configuration files with the MapR FS origin:
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- Individual properties
- You can configure individual Hadoop properties in the origin. To add a Hadoop property, you specify the exact property name and the value. The MapR FS origin does not validate the property names or values.
Note: Individual properties override properties defined in the Hadoop configuration files.
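For example, you might enter a name/value pair such as the following (an illustrative HDFS client property; the origin passes any pair through without validation):

```
dfs.client.use.datanode.hostname = true
```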
Record Header Attributes
The MapR FS origin creates record header attributes that include information about the originating file for the record.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
- file - Provides the file path and file name where the record originated.
- offset - Provides the file offset in bytes. The file offset is the location in the file where the record originated.
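For example, the following expressions, a sketch using the functions named above, read these attributes in an expression-based stage (the default value '0' for the offset is an arbitrary choice):

```
${record:attribute('file')}
${record:attributeOrDefault('offset', '0')}
```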
Data Formats
The MapR FS origin processes data differently based on the data format that you select. The origin processes the following types of data:
- Avro
- Generates a record for every Avro record. Includes a precision and scale field attribute for each Decimal field.
- Delimited
- Generates a record for each delimited line.
- Text
- Generates a record for each line of text or for each section of text based on a custom delimiter.