Delta Lake
The Delta Lake origin reads data from a Delta Lake table. The origin can read from a managed or unmanaged table.
The origin can only be used in a batch pipeline and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. To process a Delta Lake managed table in streaming mode, or in batch mode while tracking offsets, use the Hive origin. The Hive origin cannot process unmanaged tables.
When you configure the Delta Lake origin, you specify the path to the table to read. You can optionally enable time travel to query older versions of the table.
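As a rough illustration of what the table path and time travel settings amount to, the sketch below builds reader options for a Delta Lake read. The option names versionAsOf and timestampAsOf come from the Delta Lake Spark reader. The function name time_travel_options and the sample values are hypothetical, and this is not a description of how Transformer implements the origin.

```python
from typing import Dict, Optional


def time_travel_options(version: Optional[int] = None,
                        timestamp: Optional[str] = None) -> Dict[str, str]:
    """Build Delta Lake reader options for an optional time travel query.

    Hypothetical helper: returns an empty dict when time travel is disabled.
    """
    if version is not None and timestamp is not None:
        raise ValueError("Specify a version or a timestamp, not both")
    if version is not None:
        if version < 0:
            raise ValueError("Table versions cannot be negative")
        # Corresponds to the "Version As Of" query mode.
        return {"versionAsOf": str(version)}
    if timestamp is not None:
        # Corresponds to the "Timestamp As Of" query mode.
        return {"timestampAsOf": timestamp}
    return {}


# With PySpark and the Delta Lake libraries available, the options could be
# applied along these lines (table_path is a placeholder):
#   spark.read.format("delta").options(**time_travel_options(version=3)).load(table_path)
print(time_travel_options(version=3))
```

Passing neither argument corresponds to reading the current version of the table.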
You configure the storage system for the table. When reading from a table stored on Azure Data Lake Storage (ADLS) Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently.
To access a table stored on ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen2 or Amazon S3, complete these additional prerequisite tasks.
Storage Systems
- Amazon S3
- Azure Data Lake Storage (ADLS) Gen2
- HDFS
- Local file system
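Table paths for these storage systems are commonly written as Hadoop-style URIs. The sketch below guesses the storage system from the URI scheme of a path. The scheme names (s3a, abfs, abfss, hdfs, file) are the usual Hadoop file system schemes, while the function name, the mapping, and the sample paths are assumptions made for illustration and are not taken from the Transformer documentation.

```python
from urllib.parse import urlparse

# Assumed mapping from Hadoop-style URI schemes to the storage systems above.
SCHEME_TO_STORAGE = {
    "s3a": "Amazon S3",
    "abfs": "Azure Data Lake Storage (ADLS) Gen2",
    "abfss": "Azure Data Lake Storage (ADLS) Gen2",
    "hdfs": "HDFS",
    "file": "Local file system",
}


def storage_system_for(table_path: str) -> str:
    """Return the storage system implied by the scheme of a table path."""
    scheme = urlparse(table_path).scheme.lower()
    if scheme == "":
        # A path without a scheme resolves against the default file system.
        return "Default file system"
    try:
        return SCHEME_TO_STORAGE[scheme]
    except KeyError:
        raise ValueError(f"Unrecognized scheme: {scheme}")


# Sample paths are placeholders, not real locations.
for path in ("s3a://my-bucket/tables/sales",
             "abfss://container@account.dfs.core.windows.net/tables/sales",
             "hdfs://namenode:8020/tables/sales",
             "file:///data/tables/sales"):
    print(path, "->", storage_system_for(path))
```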
ADLS Gen2 Prerequisites
- If necessary, create a new Azure Active Directory application for StreamSets Transformer.
For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.
To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.
For information about configuring Gen2 access control, see the Azure documentation.
- Install the Azure Blob File System driver on the cluster where the pipeline runs.
Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.
- Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the origin.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
Retrieve Authentication Information
The Delta Lake origin provides several ways to authenticate connections to ADLS Gen2. Depending on the authentication method that you use, the origin requires different authentication details.
If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
- OAuth
- When connecting using OAuth authentication, the origin requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.
    For information on accessing the application key from the Azure portal, see the Azure documentation.
  - OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
- Managed Service Identity
- When connecting using Managed Service Identity authentication, the origin requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.
    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
- Shared Key
- When connecting using Shared Key authentication, the origin requires the following information:
  - Account Shared Key - Shared access key that Azure generated for the storage account.
    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
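To make the three authentication methods more concrete, the sketch below maps each one to the configuration keys that the Hadoop ABFS connector (hadoop-azure) documents for ADLS Gen2. This is an assumption about how such details are typically expressed in a Hadoop configuration, not a statement of what Transformer sets internally. The function name abfs_auth_conf, the account parameter, and all sample values are hypothetical.

```python
from typing import Dict


def abfs_auth_conf(method: str, account: str, **details: str) -> Dict[str, str]:
    """Map an authentication method to Hadoop ABFS configuration keys.

    `account` is a hypothetical storage account name used to scope the keys.
    """
    host = f"{account}.dfs.core.windows.net"
    if method == "oauth":
        return {
            f"fs.azure.account.auth.type.{host}": "OAuth",
            f"fs.azure.account.oauth.provider.type.{host}":
                "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
            f"fs.azure.account.oauth2.client.id.{host}": details["application_id"],
            f"fs.azure.account.oauth2.client.secret.{host}": details["application_key"],
            f"fs.azure.account.oauth2.client.endpoint.{host}": details["token_endpoint"],
        }
    if method == "managed_service_identity":
        return {
            f"fs.azure.account.auth.type.{host}": "OAuth",
            f"fs.azure.account.oauth.provider.type.{host}":
                "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider",
            f"fs.azure.account.oauth2.client.id.{host}": details["application_id"],
            f"fs.azure.account.oauth2.msi.tenant.{host}": details["tenant_id"],
        }
    if method == "shared_key":
        return {
            f"fs.azure.account.auth.type.{host}": "SharedKey",
            f"fs.azure.account.key.{host}": details["account_shared_key"],
        }
    raise ValueError(f"Unknown authentication method: {method}")


# Placeholder values only; never hard-code real secrets.
conf = abfs_auth_conf(
    "oauth", "myaccount",
    application_id="<application-id>",
    application_key="<application-key>",
    token_endpoint="https://login.microsoftonline.com/<uuid>/oauth2/token",
)
for key in sorted(conf):
    print(key)
```

The sketch prints only the key names so that no credential values end up in logs.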
Amazon S3 Credential Mode
- Instance profile
- When Transformer runs on an Amazon EC2 instance that has an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS.
- AWS access keys
- When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
- None
- When accessing a public bucket, you can connect anonymously using no authentication.
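The credential modes above roughly correspond to credential providers in the Hadoop S3A connector. The sketch below shows one possible mapping, using provider class names documented for S3A with the AWS SDK v1. Newer Hadoop releases may use different class names, and nothing here is confirmed as Transformer's internal behavior. The function name and the mode strings are hypothetical.

```python
from typing import Dict, Optional


def s3a_credential_conf(mode: str,
                        access_key_id: Optional[str] = None,
                        secret_access_key: Optional[str] = None) -> Dict[str, str]:
    """Map a credential mode to Hadoop S3A configuration keys (assumed mapping)."""
    provider_key = "fs.s3a.aws.credentials.provider"
    if mode == "instance_profile":
        # Credentials come from the instance profile of the EC2 instance.
        return {provider_key: "com.amazonaws.auth.InstanceProfileCredentialsProvider"}
    if mode == "aws_keys":
        if not access_key_id or not secret_access_key:
            raise ValueError("AWS keys mode requires both parts of the key pair")
        return {
            provider_key: "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
            "fs.s3a.access.key": access_key_id,
            "fs.s3a.secret.key": secret_access_key,
        }
    if mode == "none":
        # Anonymous access, suitable only for public buckets.
        return {provider_key: "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"}
    raise ValueError(f"Unknown credential mode: {mode}")


print(s3a_credential_conf("none"))
```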
Reading from a Local File System
- On the Cluster tab of the pipeline properties, set Cluster Manager Type to None (Local).
- On the General tab of the stage properties, set Stage Library to Delta Lake Transformer-provided libraries.
- On the Delta Lake tab, for the Table Directory Path property, specify the directory to use.
- On the Storage tab, set Storage System to HDFS.
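Before pointing Table Directory Path at a local directory, it can help to confirm that the directory actually holds a Delta Lake table. A Delta Lake table keeps its transaction log in a _delta_log subdirectory, so a minimal check might look like the sketch below. The function name looks_like_delta_table and the sample path are hypothetical.

```python
from pathlib import Path


def looks_like_delta_table(table_dir: str) -> bool:
    """Return True if the directory contains a _delta_log subdirectory."""
    return (Path(table_dir) / "_delta_log").is_dir()


# Placeholder path; replace with the directory you plan to configure.
print(looks_like_delta_table("/data/tables/sales"))
```

The check only inspects the directory layout. It does not validate the contents of the transaction log.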
Configuring a Delta Lake Origin
Configure a Delta Lake origin to process data from a Delta Lake table in batch execution mode. The origin can only be used in batch pipelines and does not track offsets. So each time the pipeline runs, the origin reads all available data.
Complete the necessary prerequisites before reading a table stored on ADLS Gen2. Also, before you run a local pipeline for a table on ADLS Gen2 or Amazon S3, complete these additional prerequisite tasks.
- On the Properties panel, on the General tab, configure the following properties:
  General Property - Description
  - Name - Stage name.
  - Description - Optional description.
  - Stage Library - Stage library to use to connect to Delta Lake:
    - Delta Lake cluster-provided libraries - The cluster where the pipeline runs has Delta Lake libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    - Delta Lake Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.
      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Delta Lake libraries.
    Note: When using additional Delta Lake stages in the pipeline, ensure that they use the same stage library.
  - Load Data Only Once - Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run.
    Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.
  - Cache Data - Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.
    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
- On the Delta Lake tab, configure the following properties:
  Delta Lake Property - Description
  - Table Directory Path - Path to the Delta Lake table.
  - Time Travel - Queries an earlier version of the table.
    For more information about time travel, see the Delta Lake documentation.
  - Time Travel Query Mode - Mode to use to access the earlier version of data in the table:
    - Version As Of - Returns time travel data with a matching version number.
    - Timestamp As Of - Returns time travel data with a matching date or timestamp.
  - Version - Version of the table to use.
  - Timestamp String - Date or timestamp to use to find matching time travel data.
- On the Storage tab, configure storage and connection information:
  Storage Property - Description
  - Storage System - Storage system for the Delta Lake table:
    - Amazon S3 - Use for a table stored on Amazon S3. To connect, Transformer uses connection information stored in HDFS configuration files.
    - ADLS Gen2 - Use for a table stored on Azure Data Lake Storage Gen2. To connect, Transformer uses the specified connection details.
    - HDFS - Use for a table stored on HDFS or a local file system.
      To connect to HDFS, Transformer uses connection information stored in HDFS configuration files. To connect to a local file system, Transformer uses the directory path specified for the table.
  - Credential Mode - Authentication method used to connect to Amazon Web Services (AWS):
    - AWS Keys - Authenticates using an AWS access key pair.
    - Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
    - None - Connects to a public bucket using no authentication.
  - Access Key ID - AWS access key ID. Required when using AWS keys to authenticate with AWS.
  - Secret Access Key - AWS secret access key. Required when using AWS keys to authenticate with AWS.
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID. Used to connect to Azure Data Lake Storage Gen2 with OAuth or Managed Service Identity authentication.
    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.
    For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key. Used to connect to Azure Data Lake Storage Gen2 with OAuth authentication.
    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.
    For information on accessing the application key from the Azure portal, see the Azure documentation.
  - OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token. Used to connect to Azure Data Lake Storage Gen2 with OAuth authentication.
    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.
  - Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID. Used to connect to Azure Data Lake Storage Gen2 with Managed Service Identity authentication.
    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.
    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
  - Account Shared Key - Shared access key that Azure generated for the storage account. Used to connect to Azure Data Lake Storage Gen2 with Shared Key authentication.
    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.
    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.