HDFS File Metadata

The HDFS File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in HDFS or a local file system each time it receives an event. You cannot perform multiple tasks in the same executor. To perform more than one task, use additional executors. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Use the HDFS File Metadata executor as part of an event stream. For example, you might use the executor to move a file or change file permissions after it receives a file closure event from the Hadoop FS destination.

You can use the executor in any logical way, such as changing file metadata after receiving file closure events from the Hadoop FS or Local FS destinations.

When changing metadata, you configure an expression that represents the location and name of the file to process, and then specify the changes you want to perform. When creating an empty file, you specify the output location for the file, and can optionally specify the owner, permissions, and ACLs for the file. When removing a file or directory, you specify the location of the file or directory.

When necessary, you can enable Kerberos authentication and specify an HDFS user. You can also use HDFS configuration files and add other HDFS configuration properties as needed.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

For a solution that describes how to use the HDFS File Metadata executor, see Managing Output Files.

Related Event Generating Stages

Use the HDFS File Metadata executor in the event stream in a pipeline. Though you can use the executor in any logical way, the HDFS File Metadata executor is optimized to update file metadata for output files or whole files written by the following stages:
  • Hadoop FS destination
  • Local FS destination

Changing Metadata

You can configure the HDFS File Metadata executor to change metadata for a file in HDFS or a local file system after receiving an event. For example, you might use the executor to change file permissions after a destination closes a file.

When changing file metadata, the HDFS File Metadata executor can change the following file metadata at the same time:
  • File name
  • File location
  • File owner and group
  • File permissions
  • Access control lists (ACLs)

When changing metadata, the user who impersonates the HDFS user must have the necessary permissions to perform the task. For more information about the HDFS user, see HDFS User.

Specifying the File Path

When using the HDFS File Metadata executor to change file metadata, specify an expression for the File Path property that provides an absolute path to the files you want to use.

Use the default file path expression, ${record:value('/filepath')}, to update output files closed by the Hadoop FS or Local FS destinations. The file closure event records generated by these destinations include a filepath field that contains the location and name of the closed output files.

To update whole files that the Hadoop FS or Local FS destinations have completed streaming, use the following expression:
${record:value('/targetFileInfo/path')}

The whole file processed event records from these destinations include a /targetFileInfo/path field that contains the location and name of the processed whole files.

For more information about the event records generated by destinations, see "Event Record" in the destination documentation.
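
As a rough illustration of how the two File Path expressions resolve, the following Python sketch models event records as nested dictionaries and looks up a field by its path. The record_value helper and the sample paths are hypothetical. Only the /filepath and /targetFileInfo/path field names come from the event records described here.

```python
def record_value(record, field_path):
    """Hypothetical stand-in for record:value(): walk a /-separated field path."""
    value = record
    for part in field_path.strip("/").split("/"):
        value = value[part]
    return value


# Sample event records; the paths are made up for illustration.
file_closure_event = {"filepath": "/server1/weblogs/2016-06/access.log"}
whole_file_event = {"targetFileInfo": {"path": "/server1/archive/image.png"}}

closed_path = record_value(file_closure_event, "/filepath")
whole_file_path = record_value(whole_file_event, "/targetFileInfo/path")
print(closed_path)
print(whole_file_path)
```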

Changing the File Name or Location

When using the HDFS File Metadata executor to change file metadata, you can change the name or location of files after they close. When you specify new file names and locations, you can enter constants or expressions. Use any expression that evaluates to the values that you want to use.

When needed, you can use the file functions in expressions to use part of the existing filepath. File functions can return any part of a path, file name, or extension.
Example for moving files
Say you have Hadoop FS writing JSON files to the following directory structure:
/server1/weblogs/<subdir>/<filename>
After the files are written, you want the HDFS File Metadata executor to move the files to a different root directory. When moving files, you need to specify the new location for the files. So you configure the executor to move files to a different directory while still using the rest of the path to the file, as follows:
/newDir/${file:pathElement(record:value('/filepath'),1)}/${file:pathElement(record:value('/filepath'),2)}/

This expression uses newDir as the new root directory, then uses two levels of subdirectories. Do not include file names when moving files.
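
To see what the move expression would evaluate to for a sample path, the following Python sketch models file:pathElement with a hypothetical path_element helper. It assumes that index 0 refers to the first directory after the leading slash. The sample subdirectory and file name are invented for illustration.

```python
def path_element(filepath, index):
    """Hypothetical model of file:pathElement(): return one path segment.

    Assumes index 0 is the first segment after the leading slash.
    """
    segments = [s for s in filepath.split("/") if s]
    return segments[index]


def new_location(filepath, new_root="/newDir"):
    # Mirrors the expression /newDir/<element 1>/<element 2>/ from the example.
    return f"{new_root}/{path_element(filepath, 1)}/{path_element(filepath, 2)}/"


# Sample filepath following /server1/weblogs/<subdir>/<filename>.
moved_to = new_location("/server1/weblogs/2016-06/data.json")
print(moved_to)
```

Under these assumptions, server1 (element 0) is dropped, and the weblogs and subdirectory levels are kept under the new root.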

Example for renaming files
Say you want to add the .json suffix to the original file name. When you rename files, you need to specify the new name for the files, so you use the following expression:
${file:fileName(record:value('/filepath'))}.json
This expression returns the file name from the filepath field in the event record and adds .json to the file name, e.g. <filename>.json.
If you wanted to strip the extension from written files, you could use the following expression in the New Name field:
${file:removeExtension(file:fileName(record:value('/filepath')))}
This expression returns the file name from the filepath event record field, then strips the extension from the name, and uses the result as the new file name.
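
The two rename expressions can be approximated in Python with the standard posixpath module. The file_name and remove_extension helpers are hypothetical models of file:fileName and file:removeExtension, and the sample paths are invented. How the real functions treat names with several dots is not covered by this sketch.

```python
import posixpath


def file_name(filepath):
    """Hypothetical model of file:fileName(): the last path component."""
    return posixpath.basename(filepath)


def remove_extension(name):
    """Hypothetical model of file:removeExtension(): drop the final suffix."""
    return posixpath.splitext(name)[0]


# Add a .json suffix to the original file name.
renamed = file_name("/server1/weblogs/2016-06/sdc-output") + ".json"

# Strip the extension from the original file name.
stripped = remove_extension(file_name("/server1/weblogs/2016-06/sdc-output.txt"))

print(renamed)
print(stripped)
```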

For more information about file functions, see File Functions.

Defining the Owner, Group, Permissions, and ACLs

When using the HDFS File Metadata executor to change file metadata or create an empty file, you can define the file owner, group, file permissions, and the access control list (ACL).
Important: When the executor changes permissions, it removes existing permissions and implements the requested permissions. The executor does not add permissions to the existing permissions, so be sure to configure permissions exactly how you want them.
You can set permissions using any combination of the following methods:
Define a new owner and group
You can define the owner and group for files. When you use this option, you must enter both an owner and a group name.
Set file permissions using the octal or symbolic formats
You can set file permissions by entering the permissions you want to use in octal or symbolic format.
For example, you can use the following octal format to make files read-only:
0444
You can alternatively use the following symbolic format to make files read-only:
-r--r--r-- 
To make them read-only for the user and the group, forbidding all access to other users, you could use either of the following formats:
0440

-r--r-----
Define ACLs
You can define the ACLs for files. When you define ACLs, note that HDFS expects permissions defined for the user, group, and other. You can optionally add permissions for additional users or groups.
Use the following format to define ACLs:
user::<permissions>,group::<permissions>,other::<permissions>\
[,<user | group>:<user or group name>:<permissions>]
Define permissions using the symbolic format, with r, w, x or - representing the permission type.
For example, the following ACLs allow read access for the user and group only:
user::r--,group::r--,other::---
If you wanted to allow read access to the operations group in addition to the group associated with the file, you would enter the following permissions:
user::r--,group::r--,other::---,group:operations:r--
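
The relationship between the octal and symbolic permission formats, and the structure of an ACL specification, can be checked with a short Python sketch. The octal_to_symbolic and parse_acl_spec helpers are hypothetical and are not part of Data Collector. The ACL parser checks only that each permission string has three characters drawn from r, w, x, and -.

```python
import stat


def octal_to_symbolic(octal_text):
    """Convert an octal permission string such as '0440' to symbolic form."""
    mode = int(octal_text, 8)
    # S_IFREG marks a regular file, so the leading character is '-'.
    return stat.filemode(stat.S_IFREG | mode)


def parse_acl_spec(spec):
    """Split an ACL spec into (type, name, permissions) tuples.

    An empty name refers to the owner or group associated with the file.
    """
    entries = []
    for entry in spec.split(","):
        entry_type, name, perms = entry.split(":")
        if len(perms) != 3 or any(c not in "rwx-" for c in perms):
            raise ValueError(f"bad permissions in ACL entry: {entry!r}")
        entries.append((entry_type, name, perms))
    return entries


print(octal_to_symbolic("0444"))
print(octal_to_symbolic("0440"))

acl = parse_acl_spec("user::r--,group::r--,other::---,group:operations:r--")
for entry in acl:
    print(entry)
```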

Creating an Empty File

You can configure the HDFS File Metadata executor to create empty files in HDFS or a local file system upon receiving an event. You might create empty files to trigger downstream actions in other applications, such as Oozie.

To create an empty file, specify an expression for the File Path property that provides an absolute path to the location where you want the file created.
Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata.
When creating an empty file, you can also specify the following file details:
  • File owner and group
  • File permissions
  • Access control lists (ACLs)

For more information, see Defining the Owner, Group, Permissions, and ACLs.

Removing a File or Directory

You can configure the HDFS File Metadata executor to remove a file or directory from HDFS or a local file system after receiving an event.

For example, say you run a daily pipeline that writes data to HDFS. You can use the HDFS File Metadata executor to remove the target directory and all of its contents before a pipeline starts processing data. Simply configure the pipeline to pass the pipeline start event to an HDFS File Metadata executor, then specify the target directory when you configure the executor. For more information about using pipeline events, see Pipeline Event Generation.

Remove directories with caution. The executor removes directories recursively, deleting any subdirectories and their contents in addition to the specified directory.
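
To make the effect of recursive removal concrete, the following Python sketch performs the same kind of operation on a temporary local directory using shutil.rmtree. This is an analogy only, not how the executor is implemented, and the directory and file names are invented.

```python
import os
import shutil
import tempfile

# Build a throwaway directory tree: <tmp>/target/sub/part-0001
base = tempfile.mkdtemp()
target = os.path.join(base, "target")
os.makedirs(os.path.join(target, "sub"))
with open(os.path.join(target, "sub", "part-0001"), "w") as f:
    f.write("data")

# Recursive removal: the directory, its subdirectory, and the file are all deleted.
shutil.rmtree(target)
target_exists = os.path.exists(target)
print(target_exists)

# Clean up the temporary parent directory.
shutil.rmtree(base)
```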

Event Generation

The HDFS File Metadata executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it changes file metadata, creates an empty file, or removes a file or directory.

HDFS File Metadata events can be used in any logical way.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the HDFS File Metadata executor have the following event-related record header attributes. Record header attributes are stored as String values:
Record Header Attribute Description
sdc.event.type Event type. Uses one of the following event types:
  • file-changed - Generated when the executor changes file metadata, including file name, location, permissions or ACLs.
  • file-created - Generated when the executor creates an empty file.
  • file-removed - Generated when the executor removes a file or directory.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
The HDFS File Metadata executor can generate the following types of event records:
File changed

The executor generates a file changed event record when it changes file metadata, including file name, location, permissions or ACLs.

File changed event records have the sdc.event.type record header attribute set to file-changed and include the following fields:
Event Field Name Description
filepath Most recent path and name of the changed file.
filename Most recent name of the changed file.
File created

The executor generates a file created event record when it creates an empty file.

File created event records have the sdc.event.type record header attribute set to file-created and include the following fields:
Event Field Name Description
filepath Location where the file was created.
filename Name of the file.
File removed

The executor generates a file removed event record when it removes a file or directory.

File removed event records have the sdc.event.type record header attribute set to file-removed and include the following fields:
Event Field Name Description
filepath Location of the directory that was removed, or the directory where the removed file was located.
filename Name of the file that was removed, when applicable.
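
As a sketch of how a downstream consumer might branch on these event records, the following Python example models the header attributes and fields as dictionaries. The describe_event function and the sample values are hypothetical. Only the attribute names, event types, and field names come from the descriptions of the event records.

```python
def describe_event(header, fields):
    """Return a short description based on the sdc.event.type header attribute."""
    event_type = header["sdc.event.type"]
    filepath = fields.get("filepath", "")
    filename = fields.get("filename", "")
    if event_type == "file-changed":
        return f"changed: {filepath}"
    if event_type == "file-created":
        return f"created: {filename} in {filepath}"
    if event_type == "file-removed":
        # filename may be empty when a directory was removed.
        what = filename if filename else "directory"
        return f"removed: {what} at {filepath}"
    raise ValueError(f"unexpected event type: {event_type}")


# Header attributes are String values, so the sample uses strings throughout.
sample_header = {
    "sdc.event.type": "file-removed",
    "sdc.event.version": "1",
    "sdc.event.creation_timestamp": "1700000000000",
}
sample_fields = {"filepath": "/server1/weblogs/2016-06", "filename": "data.json"}

message = describe_event(sample_header, sample_fields)
print(message)
```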

Kerberos Authentication

You can use Kerberos authentication to connect to the external system. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect. By default, Data Collector uses the user account that started it to connect.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file, and then enable Kerberos in the HDFS File Metadata executor.

For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.

HDFS User

Data Collector can either use the currently logged in Data Collector user or a user configured in the executor to change file metadata, create files, or remove files or directories in HDFS or a local file system.

A Data Collector configuration property can be set that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the executor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode in the Data Collector documentation.

Note that the executor uses a different user account to connect. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.

To configure a user in the executor, perform the following tasks:
  1. On the external system, configure the user as a proxy user and authorize the user to impersonate an HDFS user.

    For more information, see the HDFS documentation.

  2. In the HDFS File Metadata executor, enter the HDFS user name.
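
On a Hadoop cluster, proxy users are commonly authorized through the hadoop.proxyuser.<user>.hosts and hadoop.proxyuser.<user>.groups properties in core-site.xml. The following Python sketch generates those property entries. The proxy user name sdc, the host name, and the group name are assumptions for illustration. Confirm the exact settings for your distribution in the HDFS documentation.

```python
import xml.etree.ElementTree as ET


def proxy_user_properties(proxy_user, hosts, groups):
    """Build core-site.xml <property> elements for a Hadoop proxy user."""
    settings = {
        f"hadoop.proxyuser.{proxy_user}.hosts": hosts,
        f"hadoop.proxyuser.{proxy_user}.groups": groups,
    }
    elements = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        elements.append(prop)
    return elements


# Hypothetical values: "sdc" is assumed to be the user that runs Data Collector.
props = proxy_user_properties("sdc", "sdc-host.example.com", "hadoop-users")
for prop in props:
    print(ET.tostring(prop, encoding="unicode"))
```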

HDFS Properties and Configuration Files

You can configure the HDFS File Metadata executor to use HDFS configuration files and individual HDFS properties:
HDFS configuration files
You can use the following HDFS configuration files with the HDFS File Metadata executor:
  • core-site.xml
  • hdfs-site.xml
To use HDFS configuration files:
  1. Store the files or a symlink to the files in the Data Collector resources directory.
  2. In the HDFS File Metadata executor, specify the location of the files.
Note: For a Cloudera Manager installation, Data Collector automatically creates a symlink to the files named hadoop-conf. Enter hadoop-conf for the location of the files in the HDFS File Metadata executor.
Individual properties
You can configure individual HDFS properties in the executor. To add an HDFS property, you specify the exact property name and the value. The HDFS File Metadata executor does not validate the property names or values.
Note: Individual properties override properties defined in the HDFS configuration files.
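
The precedence rule can be pictured as a simple merge in which individual properties are applied last. The following Python sketch is an illustration only, not the executor's actual loading logic. The file contents and property values are invented samples.

```python
import xml.etree.ElementTree as ET


def read_site_xml(xml_text):
    """Read name/value pairs from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


# Sample stand-ins for core-site.xml and hdfs-site.xml.
core_site = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>
</configuration>"""
hdfs_site = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
</configuration>"""

# Individual property configured in the stage.
individual = {"dfs.replication": "2"}

effective = {}
for xml_text in (core_site, hdfs_site):
    effective.update(read_site_xml(xml_text))
effective.update(individual)  # individual properties are applied last, so they win

print(effective)
```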

Configuring an HDFS File Metadata Executor

Configure an HDFS File Metadata executor to create an empty file, change file metadata, or remove a file or directory from HDFS or a local file system upon receiving an event.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

  2. On the HDFS tab, configure the following properties:
    HDFS Property Description

    Hadoop FS URI

    URI to use to access files:
    • To access files in HDFS, enter the HDFS URI to use.

    • To access files in a local directory, enter:
      file:///

    HDFS User

    The HDFS user to use to create empty files, change file metadata, or remove files or directories in the external system. When you use this property, make sure the external system is configured appropriately.

    When not configured, the pipeline uses the currently logged in Data Collector user.

    Not configurable when Data Collector is configured to use the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode in the Data Collector documentation.

    Kerberos Authentication

    Uses Kerberos credentials to connect to the external system.

    When selected, uses the Kerberos principal and keytab defined in the Data Collector configuration file, $SDC_CONF/sdc.properties.

    Hadoop FS Configuration Directory

    Location of the HDFS configuration files.

    For a Cloudera Manager installation, enter hadoop-conf. For all other installations, use a directory or symlink within the Data Collector resources directory.

    You can use the following files with the HDFS File Metadata executor:
    • core-site.xml
    • hdfs-site.xml
    Note: Properties in the configuration files are overridden by individual properties defined in the stage.
    Hadoop FS Configuration

    Additional HDFS properties to use.

    To add properties, click Add and define the property name and value. Use the property names and values as expected by the external system.

  3. On the Tasks tab, configure the following property:
    Task Property Description
    Task Determines the type of task that the executor performs. You can create an empty file, change file metadata, or remove a file or directory.

    To do more than one type of task, add additional executors to the pipeline.

  4. To create an empty file, configure the following properties:
    Task Property Description
    File Path Expression that represents the full path to the file that you want to create.
    By default, the property uses ${record:value('/filepath')}.
    Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata.
    Set Ownership Select to specify a file owner or group.
    New Owner The user name to become the new owner of the file.
    New Group The group to become the new group owner of the file.
    Set Permissions Select to set file permissions in an octal or symbolic format.
    New Permissions File permissions in octal or symbolic format.
    Set ACLs Select to define access control list (ACL) permissions.
    New ACLs Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. For details, see Defining the Owner, Group, Permissions, and ACLs.
  5. To change file metadata, configure the following properties:
    Task Property Description
    File Path Expression that represents the full path to the file.

    By default, the property uses ${record:value('/filepath')}, which processes data in the filepath field. The Hadoop FS and Local FS destinations both generate file closure event records that include the path to closed files in a filepath field.

    To update whole files that the Hadoop FS or Local FS destinations have completed streaming, use the following expression:
    ${record:value('/targetFileInfo/path')}
    Move File Select to move the file.
    New Location New location for the file.
    Rename Select to rename the file.
    New Name New name for the file.
    Change Ownership Select to change the file owner or group.
    New Owner The user name to become the new owner of the file.
    New Group The group to become the new group owner of the file.
    Set Permissions Select to set file permissions in an octal or symbolic format.
    New Permissions File permissions in octal or symbolic format.
    Set ACLs Select to define access control list (ACL) permissions.
    New ACLs Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. For details, see Defining the Owner, Group, Permissions, and ACLs.
  6. To remove a file or directory, configure the following property:
    Task Property Description
    File Path Expression that represents the full path to the file or directory that you want to remove.

    The executor removes directories recursively, removing all subdirectories as well. Use with caution. For more information, see Removing a File or Directory.

    By default, the property uses ${record:value('/filepath')}.
    Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata.