ADLS Gen2 File Metadata
The ADLS Gen2 File Metadata executor changes file metadata, creates an empty file, or removes a file or directory in Azure Data Lake Storage Gen2 each time it receives an event. To perform these tasks in Azure Data Lake Storage Gen1, use the ADLS Gen1 File Metadata executor. For information about supported versions, see Supported Systems and Versions.
Before you use the executor, you must perform some prerequisite tasks.
An executor can perform a single task. To perform more than one task, use additional executors.
Use the ADLS Gen2 File Metadata executor as part of an event stream. For example, you might use the executor to move a file or change file permissions after it receives a file closure event from the Azure Data Lake Storage Gen2 destination.
To change metadata, configure an expression that represents the location and name of the file to process, and then specify the changes you want to make. To create an empty file, specify the output location for the file, and optionally specify the owner, permissions, and ACLs for the file. To remove a file or directory, specify the location of the file or directory.
When necessary, you can configure advanced properties to pass to the underlying Hadoop file system.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
For a solution that describes how to use a similar file metadata executor, see Managing Output Files.
Prerequisites
- If necessary, create a new Azure Active Directory application for Data Collector.
  For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Data Collector application has the appropriate access control to perform the necessary tasks.
  The Data Collector application requires Read, Write, and Execute permissions to perform all possible tasks.
  For information about configuring Gen2 access control, see the Azure documentation.
- Retrieve information from Azure to configure the executor.
After you complete all of the prerequisite tasks, you can configure the executor.
Retrieve Authentication Information
The ADLS Gen2 File Metadata executor can use different methods to authenticate connections to Azure.
- OAuth with Service Principal
  Connections made with OAuth with Service Principal authentication require the following information:
  - Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Tenant ID - Tenant ID for the Azure Active Directory Data Collector application. Also known as the directory ID.
    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
  - Application Key - Authentication key for the Azure Active Directory application. Also known as the client secret.
    For information on accessing the application key from the Azure portal, see the Azure documentation.
- Azure Managed Identity
  Connections made with Azure Managed Identity authentication require the following information:
  - Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
- Shared Key
  Connections made with Shared Key authentication require the following information:
  - Account Shared Key - Shared access key that Azure generated for the storage account.
    For information on accessing the shared access key from the Azure portal, see the Azure documentation.
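The requirements for each authentication method can be summarized as a checklist. The following Python sketch is illustrative only: the dictionary and the `missing_values` helper are hypothetical names, not part of Data Collector, and the Shared Key entry assumes the Account Shared Key property described in the configuration steps later in this topic.

```python
# Hypothetical checklist of the values each authentication method needs.
REQUIRED_BY_METHOD = {
    "OAuth with Service Principal": ["Application ID", "Tenant ID", "Application Key"],
    "Azure Managed Identity": ["Application ID"],
    "Shared Key": ["Account Shared Key"],  # assumption based on the Data Lake tab
}


def missing_values(method, collected):
    """Return the required values that have not been collected yet."""
    return [name for name in REQUIRED_BY_METHOD[method] if not collected.get(name)]


# Placeholder values only; real IDs and keys come from the Azure portal.
collected = {"Application ID": "<application-id>", "Tenant ID": "<tenant-id>"}
print(missing_values("OAuth with Service Principal", collected))
```

With the placeholder values shown, the sketch reports that only the application key still needs to be retrieved.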
Related Event Generating Stages
Use the ADLS Gen2 File Metadata executor in the event stream of a pipeline. The ADLS Gen2 File Metadata executor is optimized to update file metadata for output files or whole files processed by the Azure Data Lake Storage Gen2 destination or another ADLS Gen2 File Metadata executor. However, you can use the executor in any logical way.
Changing Metadata
You can configure the ADLS Gen2 File Metadata executor to change metadata for a file in Azure Data Lake Storage Gen2 after receiving an event. For example, you might use the executor to change file permissions after a destination closes a file.
When changing metadata, you can change any of the following:
- File name
- File location
- File owner and group
- File permissions
- Access control lists (ACLs)
To change metadata, the Azure Active Directory application for Data Collector must have the required permission.
Specifying the File Path
When using the ADLS Gen2 File Metadata executor to change file metadata, specify an expression for the File Path property that provides an absolute path to the files you want to use.
Use the default file path expression, ${record:value('/filepath')}, to update output files closed by the Azure Data Lake Storage Gen2 destination. The file closure event records generated by this destination include a filepath field that contains the location and name of the closed output files.
To update whole files that the Azure Data Lake Storage Gen2 destination has finished streaming, use the following expression:
${record:value('/targetFileInfo/path')}
The whole file processed event records from the destination include a /targetFileInfo/path field that contains the location and name of the processed whole files.
For more information about the event records generated by the Azure Data Lake Storage Gen2 destination, see Event Records.
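To make the two file path expressions concrete, the following Python sketch mimics how a field path selects a value from an event record. It is an illustration, not the expression language implementation: `record_value` is a hypothetical stand-in for record:value(), and the sample records and paths are made up. Only the field names come from this topic.

```python
def record_value(record, field_path):
    """Hypothetical stand-in for record:value(): follow a slash-delimited field path."""
    value = record
    for part in field_path.strip("/").split("/"):
        value = value[part]
    return value


# Made-up event records; only the field names come from this topic.
file_closure_event = {"filepath": "/server1/weblogs/sdc-0001"}
whole_file_event = {"targetFileInfo": {"path": "/server1/archive/report.bin"}}

print(record_value(file_closure_event, "/filepath"))
print(record_value(whole_file_event, "/targetFileInfo/path"))
```

The first lookup corresponds to the default expression for file closure events, and the second to the expression for whole file processed events.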
Changing the File Name or Location
When using the ADLS Gen2 File Metadata executor to change file metadata, you can change the name or location of files after they close. To specify new file names and locations, you can enter constants or expressions. Use any expression that evaluates to the values that you want to use.
- Example for moving files
  Say the Azure Data Lake Storage Gen2 destination writes JSON files to the following directory structure:
  /server1/weblogs/<subdir>/<filename>
- Example for renaming files
  Say you want to add the .json suffix to the original file name. To rename files, you need to specify the new name for the files, so you use the following expression:
  ${file:fileName(record:value('/filepath'))}.json
For more information about file functions, see File Functions.
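The renaming expression can be traced by hand. The following Python sketch assumes that file:fileName() returns the last segment of a path, which is how the function is used in the example above. The `file_name` helper and the sample path are hypothetical.

```python
import posixpath


def file_name(path):
    """Assumed behavior of file:fileName(): return the last path segment."""
    return posixpath.basename(path)


filepath = "/server1/weblogs/logs/sdc-0001"  # made-up /filepath value
new_name = file_name(filepath) + ".json"     # mirrors ${file:fileName(...)}.json
print(new_name)
```

For the sample path, the result is the original file name with the .json suffix appended. The directory portion of the path is not part of the new name.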
Defining the Owner, Group, Permissions, and ACLs
- Define a new owner and group
  You can define the owner and group for files. When you use this option, you must enter both an owner and a group name.
- Set file permissions using the octal or symbolic format
  You can set file permissions by entering the permissions you want to use in octal or symbolic format.
- Define ACLs
  You can define the ACLs for files. When you define ACLs, note that Azure Data Lake Storage expects permissions defined for the user, group, and other. You can optionally add permissions for additional users or groups.
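As a rough illustration of the two ideas above, the following Python sketch converts an octal permission value to the equivalent read, write, and execute flags, and checks that an ACL specification defines the user, group, and other entries. Both helpers are hypothetical. The ACL string uses the POSIX short form (`user::rwx,group::r-x,other::---`) as an assumption; this topic does not state the exact symbolic or ACL syntax that the executor accepts.

```python
def octal_to_symbolic(octal):
    """Render an octal permission string such as '640' as 'rw-r-----'."""
    flags = []
    for digit in octal[-3:]:  # ignore a leading 0 or special-bits digit
        bits = int(digit, 8)
        flags.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(flags)


def missing_base_entries(acl_spec):
    """Return which of the user, group, and other base entries are absent.

    Assumes comma-separated 'scope:name:permissions' access entries.
    """
    present = set()
    for entry in acl_spec.split(","):
        scope, name, _permissions = entry.strip().split(":")
        if name == "":  # an empty name marks the owning user, owning group, or other
            present.add(scope)
    return [scope for scope in ("user", "group", "other") if scope not in present]


print(octal_to_symbolic("640"))
print(missing_base_entries("user::rwx,group::r-x,other::---,user:alice:r--"))
print(missing_base_entries("user::rwx,user:alice:r--"))
```

The last call shows an incomplete specification: it names an additional user but omits the group and other entries that Azure Data Lake Storage expects.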
Creating an Empty File
You can configure the ADLS Gen2 File Metadata executor to create empty files in Azure Data Lake Storage Gen2 upon receiving an event. You might create empty files to trigger downstream actions in other applications, such as Oozie.
When you create an empty file, you can optionally define the following for the file:
- File owner and group
- File permissions
- Access control lists (ACLs)
For more information, see Defining the Owner, Group, Permissions, and ACLs.
Removing a File or Directory
You can configure the ADLS Gen2 File Metadata executor to remove a file or directory from Azure Data Lake Storage Gen2 after receiving an event.
For example, say you run a daily pipeline that writes data to Azure Data Lake Storage Gen2. You can use the ADLS Gen2 File Metadata executor to remove the target directory and all of its contents before a pipeline starts processing data. Simply configure the pipeline to pass the pipeline start event to an ADLS Gen2 File Metadata executor, then specify the target directory when you configure the executor. For more information about using pipeline events, see Pipeline Event Generation.
Remove directories with caution. The executor removes directories recursively, deleting any subdirectories and their contents in addition to the specified directory.
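To see what recursive removal implies, consider the following local file system analogy in Python. It does not connect to Azure Data Lake Storage Gen2 and is not how the executor is implemented. It only shows that removing one directory also removes every nested subdirectory and file. All directory and file names are made up.

```python
import shutil
import tempfile
from pathlib import Path

# Build a small, throwaway directory tree on the local file system.
root = Path(tempfile.mkdtemp())
target = root / "output"
(target / "2024" / "01").mkdir(parents=True)
(target / "2024" / "01" / "part-0001.json").write_text("{}")
(target / "summary.txt").write_text("done")

# List everything under the target directory before removing it.
doomed = sorted(p.relative_to(root).as_posix() for p in target.rglob("*"))
print(doomed)

shutil.rmtree(target)           # recursive removal of the directory and its contents
target_exists = target.exists()
print(target_exists)            # the directory and everything under it are gone

shutil.rmtree(root)             # clean up the temporary root
```

Specifying only the top-level directory is enough to delete both subdirectories and both files, which is why the target path deserves a careful review.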
Event Generation
The ADLS Gen2 File Metadata executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it changes file metadata, creates an empty file, or removes a file or directory.
For example, you can use the events in the following ways:
- With the Email executor to send a custom email after receiving an event.
  For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
  For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
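Event records generated by the executor include the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following event types: file-changed, file-created, file-removed. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |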
- File changed
  The executor generates a file-changed event record when it changes file metadata, including file name, location, permissions, or ACLs.
  File-changed event records have the sdc.event.type record header attribute set to file-changed and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Most recent path and name of the changed file. |
  filename | Most recent name of the changed file. |
- File created
  The executor generates a file-created event record when it creates an empty file.
  File-created event records have the sdc.event.type record header attribute set to file-created and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Location where the file was created. |
  filename | Name of the file. |
- File removed
  The executor generates a file-removed event record when it removes a file or directory.
  File-removed event records have the sdc.event.type record header attribute set to file-removed and include the following fields:
  Event Field Name | Description |
  ---|---|
  filepath | Location of the directory that was removed, or the directory where the removed file was located. |
  filename | Name of the file that was removed, when applicable. |
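A downstream consumer typically branches on the sdc.event.type header attribute. The following Python sketch models an event record as two plain dictionaries, one for header attributes and one for fields, which is a simplification for illustration. The `describe_event` helper and all sample values are hypothetical. Only the attribute names, event types, and field names come from this topic.

```python
def describe_event(header, fields):
    """Summarize an event record based on its sdc.event.type header attribute."""
    event_type = header["sdc.event.type"]
    if event_type == "file-changed":
        return f"changed: {fields['filepath']}"
    if event_type == "file-created":
        return f"created: {fields['filename']} in {fields['filepath']}"
    if event_type == "file-removed":
        # filename is present only when a file, not a directory, was removed.
        name = fields.get("filename")
        if name:
            return f"removed file: {name} from {fields['filepath']}"
        return f"removed directory: {fields['filepath']}"
    raise ValueError(f"unexpected event type: {event_type}")


# Made-up sample records.
changed = ({"sdc.event.type": "file-changed"},
           {"filepath": "/archive/sdc-0001.json", "filename": "sdc-0001.json"})
removed_dir = ({"sdc.event.type": "file-removed"}, {"filepath": "/server1/tmp"})

print(describe_event(*changed))
print(describe_event(*removed_dir))
```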
Configuring an ADLS Gen2 File Metadata Executor
- In the Properties panel, on the General tab, configure the following properties:
  General Property | Description |
  ---|---|
  Name | Stage name. |
  Description | Optional description. |
  Produce Events | Generates event records when events occur. Use for event handling. |
  Required Fields | Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the pipeline. |
  Preconditions | Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage. |
  On Record Error | Error record handling for the stage: Discard - Discards the record; Send to Error - Sends the record to the pipeline for error handling; Stop Pipeline - Stops the pipeline. |
- On the Data Lake tab, configure the following properties:
  Data Lake Property | Description |
  ---|---|
  Account FQDN | The host name of the Data Lake Storage Gen2 account. For example: <storage account name>.dfs.core.windows.net |
  Storage Container / File System | Name of the storage container or file system where the executor reads or writes the data. |
  Secure Connection | Uses the abfss protocol to securely connect to Azure using a TLS connection. When cleared, the stage uses the abfs protocol without a TLS connection. |
  Authentication Method | Authentication method used to connect to Azure: OAuth with Service Principal, Azure Managed Identity, or Shared Key. |
  Application ID | Application ID for the Azure Active Directory Data Collector application. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation. Available when using the OAuth with Service Principal or the Azure Managed Identity authentication method. |
  Endpoint Type | Method to provide endpoint details. Available when using the OAuth with Service Principal authentication method. |
  Tenant ID | Tenant ID for the Azure Active Directory Data Collector application. Also known as the directory ID. For information on accessing the tenant ID from the Azure portal, see the Azure documentation. Available when Endpoint Type is set to Tenant ID. |
  Endpoint URL | Endpoint URL for the Azure Active Directory Data Collector application. Default is https://login.microsoftonline.com/<tenant-id>/oauth2/token. In the URL, specify the tenant ID for the Azure Active Directory Data Collector application. For information on accessing the tenant ID from the Azure portal, see the Azure documentation. Available when Endpoint Type is set to Endpoint URL. |
  Application Key | Authentication key for the Azure Active Directory application. Also known as the client secret. For information on accessing the application key from the Azure portal, see the Azure documentation. Available when using the OAuth with Service Principal authentication method. |
  Account Shared Key | Shared access key that Azure generated for the storage account. For more information on accessing the shared access key from the Azure portal, see the Azure documentation. Available when using the Shared Key authentication method. |
  Advanced Configuration | Additional HDFS properties to pass to the underlying file system. ADLS Gen2 accesses data using the Hadoop FileSystem interface. Specified properties override those in Hadoop configuration files. To add properties, click the Add icon and define the HDFS property name and value. Use the property names and values as expected by Hadoop. |
- On the Tasks tab, configure the following property:
  Task Property | Description |
  ---|---|
  Task | Type of task that the executor performs. The executor can create an empty file, change file metadata, or remove a file or directory. To do more than one type of task, add additional executors to the pipeline. |
- To create an empty file, configure the following properties:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file that you want to create. By default, the property uses ${record:value('/filepath')}. Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata. |
  Set Ownership | Select to specify a file owner or group. |
  New Owner | The user name to become the new owner of the file. |
  New Group | The group to become the new group owner of the file. |
  Set Permissions | Select to set file permissions in an octal or symbolic format. |
  New Permissions | File permissions in octal or symbolic format. |
  Set ACLs | Select to define access control list (ACL) permissions. |
  New ACLs | Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. |
- To change file metadata, configure the following properties:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file. By default, the property uses ${record:value('/filepath')}, which processes data in the filepath field. The Azure Data Lake Storage Gen2 destination generates file-closure event records that include the path to closed files in a filepath field. To update whole files that the Azure Data Lake Storage Gen2 destination has completed streaming, use the following expression: ${record:value('/targetFileInfo/path')} |
  Move File | Select to move the file. |
  New Location | New location for the file. |
  Rename | Select to rename the file. |
  New Name | New name for the file. |
  Set Ownership | Select to change the file owner or group. |
  New Owner | The user name to own the file. |
  New Group | The group to own the file. |
  Set Permissions | Select to set file permissions in an octal or symbolic format. |
  New Permissions | File permissions in octal or symbolic format. |
  Set ACLs | Select to define access control list (ACL) permissions. |
  New ACLs | Define ACLs for the owner, group, and other. You can optionally define other user and group permissions. |
- To remove a file or directory, configure the following property:
  Task Property | Description |
  ---|---|
  File Path | Expression that represents the full path to the file or directory that you want to remove. The executor removes directories recursively, removing all subdirectories as well. Use with caution. For more information, see Removing a File or Directory. By default, the property uses ${record:value('/filepath')}. Note: In most cases, you will not want to use the default expression. The default expression is more appropriate for changing file metadata. |
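For orientation, the Account FQDN, Storage Container / File System, and Secure Connection properties on the Data Lake tab correspond to the parts of a Hadoop ABFS URI, and the default endpoint URL embeds the tenant ID. The following Python sketch shows how those pieces fit together. The helper names and sample values are hypothetical, and this topic does not describe how the executor assembles these values internally.

```python
def storage_uri(account_fqdn, container, path, secure=True):
    """Build an ABFS URI of the form scheme://container@account-fqdn/path."""
    scheme = "abfss" if secure else "abfs"  # abfss uses TLS, abfs does not
    return f"{scheme}://{container}@{account_fqdn}/{path.lstrip('/')}"


def token_endpoint(tenant_id):
    """Fill the tenant ID into the default endpoint URL shown in this topic."""
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"


# Made-up account, container, and path names.
print(storage_uri("myaccount.dfs.core.windows.net", "mycontainer", "/server1/weblogs"))
print(storage_uri("myaccount.dfs.core.windows.net", "mycontainer", "/server1/weblogs", secure=False))
print(token_endpoint("<tenant-id>"))
```

Clearing Secure Connection changes only the scheme of the URI. The account, container, and path stay the same.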