Azure Data Lake Storage Gen2

Available when using an authoring Data Collector version 3.20.0 or later.

To create an Azure Data Lake Storage Gen2 connection, the Azure stage library, streamsets-datacollector-azure-lib, must be installed on the selected authoring Data Collector.

For a description of the connection properties, see Azure Data Lake Storage Gen2 Connection Properties.

After you create an Azure Data Lake Storage Gen2 connection, you can use the connection in the following stages:
Engine Stages

Data Collector 5.5.0 or later

  • Azure Data Lake Storage Gen2 origin

Data Collector 3.20.0 or later

  • Azure Data Lake Storage Gen2 (Legacy) origin
  • Azure Data Lake Storage Gen2 destination
  • ADLS Gen2 File Metadata executor

For information about features added to the connection with different engine releases, see the connection requirements for the engine.

Prerequisites

Complete the following prerequisites before you configure an Azure Data Lake Storage Gen2 connection:
  1. If necessary, create a new Azure Active Directory application for Data Collector.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Data Collector application has the appropriate access control to perform the necessary tasks.

    The Data Collector application requires Read and Execute permissions to read data in Azure. To also write data to Azure, the application requires Write permission as well.

    For information about configuring Gen2 access control, see the Azure documentation.

  3. Retrieve information from Azure to configure the connection.

After you complete all of the prerequisite tasks, you can configure an Azure Data Lake Storage Gen2 connection.

Retrieve Authentication Information

An Azure Data Lake Storage Gen2 connection can use different methods to authenticate with Azure.

The authentication information required depends on the selected authentication method:
OAuth with Service Principal
Connections made with OAuth with Service Principal authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Tenant ID - Tenant ID for the Azure Active Directory Data Collector application. Also known as the directory ID.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory application. Also known as the client secret.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

Azure Managed Identity
Connections made with Azure Managed Identity authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

Note: This authentication method is supported by Data Collector 5.5.0 or later.
Shared Key
Connections made with Shared Key authentication require the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
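
The requirements above can be summarized as a lookup from authentication method to the values you need to collect from Azure. The following is a minimal Python sketch for checking that nothing is missing before you configure the connection. The field names (application_id, tenant_id, application_key, account_shared_key) and the helper missing_auth_fields are hypothetical names chosen for illustration; they are not part of any Data Collector API.

```python
# Hypothetical field names: they mirror the items listed above and are not
# identifiers used by Data Collector itself.
REQUIRED_FIELDS = {
    "OAuth with Service Principal": ("application_id", "tenant_id", "application_key"),
    "Azure Managed Identity": ("application_id",),
    "Shared Key": ("account_shared_key",),
}


def missing_auth_fields(method, values):
    """Return the required field names that are absent or empty in `values`."""
    if method not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown authentication method: {method}")
    return [name for name in REQUIRED_FIELDS[method] if not values.get(name)]


if __name__ == "__main__":
    collected = {"application_id": "my-client-id", "tenant_id": "my-tenant-id"}
    # Lists whatever still has to be retrieved from the Azure portal.
    print(missing_auth_fields("OAuth with Service Principal", collected))
```

An empty list means that every value listed for the selected method has been collected.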

Azure Data Lake Storage Gen2 Connection Properties

When creating an Azure Data Lake Storage Gen2 connection, configure the following properties on the Azure tab:
Azure Property - Description
  • Account FQDN - The host name of the Data Lake Storage Gen2 account. For example:

    <storage account name>.dfs.core.windows.net

  • Storage Container / File System - Name of the storage container or file system that contains the data to be read or written.

  • Secure Connection - When selected, uses the abfss protocol to connect to Azure securely over a TLS connection.

    When cleared, the stage uses the abfs protocol without a TLS connection.

  • Authentication Method - Authentication method used to connect to Azure:
      • OAuth with Service Principal
      • Azure Managed Identity - This authentication method is supported by Data Collector 5.5.0 or later.
      • Shared Key

  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

    Available when using the OAuth with Service Principal or the Azure Managed Identity authentication method.

  • Endpoint Type - Method to provide endpoint details.

    Available when using the OAuth with Service Principal authentication method.

  • Tenant ID - Tenant ID for the Azure Active Directory Data Collector application. Also known as the directory ID.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Available when Endpoint Type is set to Tenant ID.

  • Endpoint URL - Endpoint URL for the Azure Active Directory Data Collector application.

    Default is https://login.microsoftonline.com/<tenant-id>/oauth2/token.

    In the URL, specify the tenant ID for the Azure Active Directory Data Collector application.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Available when Endpoint Type is set to Endpoint URL.

  • Application Key - Authentication key for the Azure Active Directory application. Also known as the client secret.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Available when using the OAuth with Service Principal authentication method.

  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

    Available when using the Shared Key authentication method.
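
To show how the Account FQDN, Storage Container / File System, and Secure Connection properties relate to each other, the following Python sketch composes a storage URI and the default endpoint URL. It assumes the standard ABFS URI layout, abfs[s]://<file system>@<account FQDN>/<path>; this page does not state how the stages build the URI internally, so treat this as an illustration only. The function names storage_uri and token_endpoint, and the sample account, container, and tenant values, are hypothetical.

```python
def storage_uri(account_fqdn, container, secure=True, path=""):
    """Compose an ABFS-style URI (assumed layout, not confirmed by this page).

    secure=True corresponds to Secure Connection selected (abfss, with TLS);
    secure=False corresponds to the cleared property (abfs, without TLS).
    """
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account_fqdn}/{path.lstrip('/')}"


def token_endpoint(tenant_id):
    """Fill the tenant ID into the default Endpoint URL shown above."""
    return f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"


if __name__ == "__main__":
    # Sample values are placeholders, not real Azure resources.
    print(storage_uri("myaccount.dfs.core.windows.net", "raw", path="/in/a.csv"))
    print(token_endpoint("my-tenant-id"))
```

Clearing Secure Connection only changes the scheme from abfss to abfs in this sketch; the account FQDN and container stay the same.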