HDFS File Metadata
The HDFS File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in HDFS or a local file system each time it receives an event. You cannot perform multiple tasks in the same executor. To perform more than one task, use additional executors. For information about supported versions, see Supported Systems and Versions.
Use the HDFS File Metadata executor as part of an event stream. For example, you might use the executor to move a file or change file permissions after it receives a file closure event from the Hadoop FS destination.
You can use the executor in any logical way, such as changing file metadata after receiving file closure events from the Hadoop FS or Local FS destinations.
When changing metadata, you configure an expression that represents the location and name of the file to process, and then specify the changes you want to perform. When creating an empty file, you specify the output location for the file, and can optionally specify the owner, permissions, and ACLs for the file. When removing a file or directory, you specify the location of the file or directory.
When necessary, you can enable Kerberos authentication and specify an HDFS user. You can also use HDFS configuration files and add other HDFS configuration properties as needed.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
For a solution that describes how to use the HDFS File Metadata executor, see Managing Output Files.
Related Event Generating Stages
- Hadoop FS destination
- Local FS destination
Changing Metadata
You can configure the HDFS File Metadata executor to change metadata for a file in HDFS or a local file system after receiving an event. For example, you might use the executor to change file permissions after a destination closes a file.
When changing metadata, you can change the following:
- File name
- File location
- File owner and group
- File permissions
- Access control lists (ACLs)
When changing metadata, the user who impersonates the HDFS user must have the necessary permissions to perform the task. For more information about the HDFS user, see HDFS User.
Specifying the File Path
When using the HDFS File Metadata executor to change file metadata, specify an expression for the File Path property that provides an absolute path to the files you want to use.
Use the default file path expression, ${record:value('/filepath')}, to update output files closed by the Hadoop FS or Local FS destinations. The file closure event records generated by these destinations include a filepath field that contains the location and name of the closed output files.
To update whole files that the Hadoop FS or Local FS destinations have completed streaming, use the following expression:
${record:value('/targetFileInfo/path')}
The whole file processed event records from these destinations include a /targetFileInfo/path field that contains the location and name of the processed whole files.
For more information about the event records generated by destinations, see "Event Record" in the destination documentation.
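To make the two expressions concrete, the following Python sketch mimics how a slash-delimited field path selects a value from an event record. It is only an illustration: the `record_value` helper, the dictionary representation of records, and the sample paths are assumptions made for this example, not the Data Collector implementation.

```python
def record_value(record, field_path):
    """Return the value at a slash-delimited field path, or None if absent.

    Hypothetical stand-in for the record:value() expression function.
    """
    value = record
    for part in field_path.strip("/").split("/"):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


# Hypothetical file closure event record: only the filepath field is shown.
file_closed_event = {"filepath": "/server1/weblogs/2024/sdc-0001"}

# Hypothetical whole file processed event record.
whole_file_event = {"targetFileInfo": {"path": "/server1/archive/report.pdf"}}

closed_path = record_value(file_closed_event, "/filepath")
whole_file_path = record_value(whole_file_event, "/targetFileInfo/path")

print(closed_path)
print(whole_file_path)
```

The default expression finds nothing in a whole file processed event record, which is why the second expression is needed for whole files.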
Changing the File Name or Location
When using the HDFS File Metadata executor to change file metadata, you can change the name or location of files after they close. When you specify new file names and locations, you can enter constants or expressions. Use any expression that evaluates to the values that you want to use.
- Example for moving files
  Say you have Hadoop FS writing JSON files to the following directory structure:
  /server1/weblogs/<subdir>/<filename>
- Example for renaming files
  Say you want to add the .json suffix to the original file name. When you rename files, you need to specify the new name for the files, so you use the following expression:
  ${file:fileName(record:value('/filepath'))}.json
For more information about file functions, see File Functions.
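As a rough illustration of what the renaming expression evaluates to, the Python sketch below extracts the file name from a path and appends the suffix. The `file_name` helper and the sample path are hypothetical. The helper only approximates file:fileName by taking the last path component.

```python
import posixpath


def file_name(path):
    """Return the last component of a slash-delimited path.

    Hypothetical approximation of the file:fileName() expression function.
    """
    return posixpath.basename(path)


# Hypothetical value of the filepath field in a file closure event record.
filepath = "/server1/weblogs/2024/sdc-0001"

# Mirrors ${file:fileName(record:value('/filepath'))}.json
new_name = file_name(filepath) + ".json"

print(new_name)
```

The expression produces a name only, not a full path, so the file stays in its current directory unless you also move it.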
Defining the Owner, Group, Permissions, and ACLs
When changing file metadata or creating an empty file, you can define the owner, group, permissions, and ACLs as follows:
- Define a new owner and group
  You can define the owner and group for files. When you use this option, you must enter both an owner and a group name.
- Set file permissions using the octal or symbolic formats
  You can set file permissions by entering the permissions you want to use in octal or symbolic format.
- Define ACLs
  You can define the ACLs for files. When you define ACLs, note that HDFS expects permissions defined for the user, group, and other. You can optionally add permissions for additional users or groups.
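The Python sketch below shows how an octal permission value maps to rwx triplets, and how you might check that an ACL string contains the user, group, and other entries that HDFS expects. Both helpers are hypothetical. The comma-separated ACL syntax follows the format used by the HDFS setfacl command. That the executor accepts exactly this syntax, and this form of symbolic permissions, is an assumption to verify against the product documentation.

```python
def octal_to_symbolic(octal):
    """Convert a three-digit octal permission string, such as '754', to rwx triplets."""
    if len(octal) != 3 or any(c not in "01234567" for c in octal):
        raise ValueError("expected three octal digits, got %r" % octal)
    triplets = []
    for digit in octal:
        bits = int(digit)
        triplets.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(triplets)


def missing_base_entries(acl_spec):
    """Return which unnamed user, group, and other entries an ACL string lacks."""
    present = set()
    for entry in acl_spec.split(","):
        parts = entry.strip().split(":")
        # An unnamed (base) entry has an empty name, as in 'user::rwx'.
        if len(parts) == 3 and parts[1] == "":
            present.add(parts[0])
    return [scope for scope in ("user", "group", "other") if scope not in present]


symbolic = octal_to_symbolic("754")

# Hypothetical ACL string: base entries plus one additional named user.
complete_acl = "user::rwx,group::r-x,other::r--,user:alice:rw-"
incomplete_acl = "user::rwx,user:alice:rw-"

print(symbolic)
print(missing_base_entries(complete_acl))
print(missing_base_entries(incomplete_acl))
```

A check like this can catch an incomplete ACL before you enter it in the stage.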
Creating an Empty File
You can configure the HDFS File Metadata executor to create empty files in HDFS or a local file system upon receiving an event. You might create empty files to trigger downstream actions in other applications, such as Oozie.
When creating an empty file, you can optionally specify the following:
- File owner and group
- File permissions
- Access control lists (ACLs)
For more information, see Defining the Owner, Group, Permissions, and ACLs.
Removing a File or Directory
You can configure the HDFS File Metadata executor to remove a file or directory from HDFS or a local file system after receiving an event.
For example, say you run a daily pipeline that writes data to HDFS. You can use the HDFS File Metadata executor to remove the target directory and all of its contents before a pipeline starts processing data. Simply configure the pipeline to pass the pipeline start event to an HDFS File Metadata executor, then specify the target directory when you configure the executor. For more information about using pipeline events, see Pipeline Event Generation.
Remove directories with caution. The executor removes directories recursively, deleting any subdirectories and their contents in addition to the specified directory.
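To see what recursive removal means in practice, the following Python sketch reproduces the behavior on a local temporary directory with the standard library. It is an analogy only and does not show how the executor works internally. The directory and file names are made up for the example.

```python
import os
import shutil
import tempfile

# Build a hypothetical target directory with nested content.
base = tempfile.mkdtemp()
target = os.path.join(base, "daily-output")
os.makedirs(os.path.join(target, "2024", "01"))
with open(os.path.join(target, "2024", "01", "part-0001"), "w") as handle:
    handle.write("data")


def count_entries(directory):
    """Count every subdirectory and file beneath a directory."""
    return sum(len(dirs) + len(files) for _, dirs, files in os.walk(directory))


# Listing the contents first shows how much a recursive removal would delete.
entries_before = count_entries(target)

# Recursive removal: the directory, its subdirectories, and their files all go.
shutil.rmtree(target)
target_exists_after = os.path.exists(target)

# Clean up the temporary base directory.
shutil.rmtree(base)

print(entries_before, target_exists_after)
```

Specifying a path one level too high removes everything beneath it, so it can help to list the contents of the target path before you configure the executor.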
Event Generation
The HDFS File Metadata executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it changes file metadata, creates an empty file, or removes a file or directory.
You can use the events in any logical way. For example:
- With the Email executor to send a custom email after receiving an event.
  For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
  For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following event types: file-changed, file-created, file-removed. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The executor can generate the following types of event records:
- File changed
  The executor generates a file changed event record when it changes file metadata, including file name, location, permissions, or ACLs.
  File changed event records have the sdc.event.type record header attribute set to file-changed and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Most recent path and name of the changed file. |
  filename | Most recent name of the changed file. |
- File created
  The executor generates a file created event record when it creates an empty file.
  File created event records have the sdc.event.type record header attribute set to file-created and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Location where the file was created. |
  filename | Name of the file. |
- File removed
  The executor generates a file removed event record when it removes a file or directory.
  File removed event records have the sdc.event.type record header attribute set to file-removed and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Location of the directory that was removed, or the directory where the removed file was located. |
  filename | Name of the file that was removed, when applicable. |
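A downstream consumer of these events typically branches on the sdc.event.type header attribute. The Python sketch below shows one way to do that. The `describe_event` function, the dictionary representation of headers and fields, and the sample values are all hypothetical. Only the attribute name, the three event types, and the two field names come from the descriptions above.

```python
def describe_event(header, fields):
    """Summarize an executor event from its header attributes and fields."""
    event_type = header.get("sdc.event.type")
    path = fields.get("filepath")
    name = fields.get("filename")
    if event_type == "file-changed":
        # filepath holds the most recent path and name of the changed file.
        return "changed: %s" % path
    if event_type == "file-created":
        return "created: %s in %s" % (name, path)
    if event_type == "file-removed":
        # filename is populated only when a file, not a directory, was removed.
        if name:
            return "removed: file %s from %s" % (name, path)
        return "removed: directory %s" % path
    return "unhandled event type: %s" % event_type


# Hypothetical event records, each as a (header attributes, fields) pair.
changed = (
    {"sdc.event.type": "file-changed"},
    {"filepath": "/server1/completed/sdc-0001.json", "filename": "sdc-0001.json"},
)
removed_dir = ({"sdc.event.type": "file-removed"}, {"filepath": "/server1/staging"})

print(describe_event(*changed))
print(describe_event(*removed_dir))
```

Handling a missing filename explicitly matters for file removed events, because directory removals do not include one.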
Kerberos Authentication
You can use Kerberos authentication to connect to the external system. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect. By default, Data Collector uses the user account that started it to connect.
The Kerberos principal and keytab are defined in the Data Collector configuration properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration properties, and then enable Kerberos in the HDFS File Metadata executor.
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication.
HDFS User
Data Collector can either use the currently logged in Data Collector user or a user configured in the executor to change file metadata, create files, or remove files or directories in HDFS or a local file system.
You can set a Data Collector configuration property that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the executor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode.
Note that the executor uses a different user account to connect. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.
To configure a user in the executor, perform the following tasks:
- On the external system, configure the user as a proxy user and authorize the user to impersonate an HDFS user.
  For more information, see the HDFS documentation.
- In the HDFS File Metadata executor, enter the HDFS user name.
HDFS Properties and Configuration Files
You can configure the executor to use HDFS configuration files and individual HDFS properties:
- HDFS configuration files
  You can use the following HDFS configuration files with the HDFS File Metadata executor:
  - core-site.xml
  - hdfs-site.xml
- Individual properties
  You can configure individual HDFS properties in the executor. To add an HDFS property, you specify the exact property name and the value. The HDFS File Metadata executor does not validate the property names or values.
  Note: Individual properties override properties defined in the HDFS configuration file.
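The override rule in the note can be pictured as a simple merge in which individual properties are applied last. The Python sketch below parses a small configuration file and overlays one individual property. It illustrates precedence only and is not how Data Collector loads its configuration. The file content, host name, and property values are made up, although fs.defaultFS and io.file.buffer.size are standard Hadoop property names.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content.
CORE_SITE_XML = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://namenode:8020</value></property>
  <property><name>io.file.buffer.size</name><value>4096</value></property>
</configuration>"""


def load_properties(xml_text):
    """Read name/value pairs from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


file_properties = load_properties(CORE_SITE_XML)

# Hypothetical individual property defined in the stage.
individual_properties = {"io.file.buffer.size": "131072"}

# Later entries win, so individual properties override the file values.
effective = {**file_properties, **individual_properties}

print(effective)
```

Because property names are not validated, a misspelled name in the individual properties would be added as a new entry rather than overriding the intended one.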
Configuring an HDFS File Metadata Executor
Configure an HDFS File Metadata executor to create an empty file, change file metadata, or remove a file or directory from HDFS or a local file system upon receiving an event.
- In the Properties panel, on the General tab, configure the following properties:
  General Property | Description |
  ---|---|
  Name | Stage name. |
  Description | Optional description. |
  Stage Library | Library version that you want to use. |
  Produce Events | Generates event records when events occur. Use for event handling. |
  Required Fields | Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the pipeline. |
  Preconditions | Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage. |
- On the HDFS tab, configure the following properties:
  HDFS Property | Description |
  ---|---|
  Hadoop FS URI | URI to use to access files. To access files in HDFS, enter the HDFS URI to use. To access files in a local directory, enter file:/// as the URI. |
  HDFS User | The HDFS user to use to create empty files or change file metadata in the external system. When you use this property, make sure the external system is configured appropriately. When not configured, the pipeline uses the currently logged in Data Collector user. Not configurable when Data Collector is configured to use the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode. |
  Kerberos Authentication | Uses Kerberos credentials to connect to the external system. When selected, uses the Kerberos principal and keytab defined in the Data Collector configuration properties. |
  Hadoop FS Configuration Directory | Location of the HDFS configuration files. Use a directory or symlink within the Data Collector resources directory. You can use the following files with the HDFS File Metadata executor: core-site.xml and hdfs-site.xml. Note: Properties in the configuration files are overridden by individual properties defined in the stage. |
  Hadoop FS Configuration | Additional HDFS properties to use. To add properties, click Add and define the property name and value. Use the property names and values as expected by the external system. |
- On the Tasks tab, configure the following property:
  Task Property | Description |
  ---|---|
  Task | Determines the type of task that the executor performs. You can create an empty file, change file metadata, or remove a file or directory. To do more than one type of task, add additional executors to the pipeline. |
- To create an empty file, configure the following properties:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file that you want to create. By default, the property uses ${record:value('/filepath')}. Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata. |
  Set Ownership | Select to specify a file owner or group. |
  New Owner | The user name to become the new owner of the file. |
  New Group | The group to become the new group owner of the file. |
  Set Permissions | Select to set file permissions in an octal or symbolic format. |
  New Permissions | File permissions in octal or symbolic format. |
  Set ACLs | Select to define access control list (ACL) permissions. |
  New ACLs | Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. For details, see Defining the Owner, Group, Permissions, and ACLs. |
- To change file metadata, configure the following properties:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file. By default, the property uses ${record:value('/filepath')}, which processes data in the filepath field. The Hadoop FS and Local FS destinations both generate file closure event records that include the path to closed files in a filepath field. To update whole files that the Hadoop FS or Local FS destinations have completed streaming, use the following expression: ${record:value('/targetFileInfo/path')} |
  Move File | Select to move the file. |
  New Location | New location for the file. |
  Rename | Select to rename the file. |
  New Name | New name for the file. |
  Change Ownership | Select to change the file owner or group. |
  New Owner | The user name to become the new owner of the file. |
  New Group | The group to become the new group owner of the file. |
  Set Permissions | Select to set file permissions in an octal or symbolic format. |
  New Permissions | File permissions in octal or symbolic format. |
  Set ACLs | Select to define access control list (ACL) permissions. |
  New ACLs | Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. For details, see Defining the Owner, Group, Permissions, and ACLs. |
- To remove a file or directory, configure the following property:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file or directory that you want to remove. The executor removes directories recursively, removing all subdirectories as well. Use with caution. For more information, see Removing a File or Directory. By default, the property uses ${record:value('/filepath')}. Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata. |