Snowflake
The Snowflake origin reads data from a Snowflake database. You can use the Snowflake origin to read from any accessible Snowflake database, including those hosted on Amazon S3, Microsoft Azure, and private Snowflake installations.
When reading data from Snowflake, the origin stages the data in an internal stage. The origin can read data from a specified table or using a specified query. It can also perform incremental reads.
When you configure the origin, you specify the Snowflake region, database, table, and schema to use. You also specify the user account and password. You can optionally specify a custom role that overrides the default role for the user account. The user account or the custom role must have the required Snowflake privileges.
You can also use a connection to configure the origin.
You define the read type to perform and related properties, such as the table or query to use. If you enable incremental reads, you also specify the initial offset and offset column to use. The Snowflake origin supports numeric and datetime offsets.
By default, the origin performs a bulk read, also known as a copy unload. When not performing a copy unload, you can specify the partition size to use. You can configure the origin to preserve existing capitalization for column names. You can also specify the number of connections to use, enable pushdown optimization, and configure additional Snowflake properties.
Required Privileges and Custom Roles
The Snowflake origin requires a Snowflake role that grants the following privileges:
Object | Privilege |
---|---|
Internal Snowflake Stage | READ |
Table | SELECT |
- Assign the custom role as the default role
- In Snowflake, assign the custom role as the default role for the Snowflake user account specified in the stage. A Snowflake user account is associated with a single default role.
- Override the default role with the custom role
- In the stage, use the Role property to specify the name of the custom role. The custom role overrides the role assigned to the Snowflake user account specified in the stage.
- Use a Snowflake connection
- When working with Control Hub, you can configure a Snowflake connection to provide connection details for Snowflake stages.
Read Mode
The read mode determines how the Snowflake origin reads data from Snowflake.
- Table - Reads all columns from the specified table.
- Query - Reads data from a table based on a specified query. You might use query mode to read a subset of columns from a table, or to read a join or union of multiple tables.
Full or Incremental Read
The Snowflake origin can perform a full read or an incremental read each time you run the pipeline. By default, the origin performs a full read of the specified table or query.
When the origin performs a full read, the origin processes all data available in the table or returned by the query each time that the pipeline runs.
When the origin performs an incremental read, the first pipeline run is the same as a full read. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin reads the table or query starting from the last-saved offset.
SQL Query Guidelines
When the origin runs in Query read mode, you must specify the SQL query to use.
The SQL query defines the data returned from Snowflake. You can use any valid SQL query but the guidelines for the query depend on whether you configure the origin to perform a full or incremental read.
Incremental Read Query
When you define the SQL query for an incremental read, the query must include a WHERE clause and an ORDER BY clause.
- WHERE clause
- In the WHERE clause, include the offset column and the
offset value. Use the
$offset
variable to represent the offset value. - ORDER BY clause
- In the ORDER BY clause, include the offset column as the
first column to avoid returning duplicate data.Note: Using a column that is not a primary key or indexed column in the ORDER BY clause can slow performance.
Full Read Query
The query for a full read has no specific requirements or limitations.
SELECT * FROM invoice
When you define the SQL query for full mode, you can optionally include the WHERE and ORDER BY clauses using the same guidelines as for incremental mode. However, using these clauses to read from large tables can cause performance issues.
Pushdown Optimization
The Snowflake origin can perform pushdown optimization in clusters that use Spark 2.4.0 or later. When you enable pushdown, the origin pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets.
You can enable pushdown in the Snowflake origin independently from ludicrous mode. However, using it in conjunction with Ludicrous mode should provide best results. For details on the Spark SQL operators that can be pushed down to Snowflake, see the Snowflake documentation.
Use the Enable Pushdown property on the Connection tab to enable pushdown for the Snowflake origin.
Snowflake Data Types
When reading from Snowflake, the Snowflake origin converts Snowflake data types to Spark data types. The following table describes how this conversion occurs.
Snowflake data types that are not listed in the table are not supported.
Snowflake Data Type | Spark Data Type |
---|---|
ARRAY | StringType |
BIGINT |
DecimalType(38,0) |
BOOLEAN | BooleanType |
CHAR | StringType |
CLOB | StringType |
DATE | DateType |
DECIMAL | DecimalType |
DOUBLE | DoubleType |
FLOAT | DoubleType |
INTEGER | DecimalType(38,0) |
OBJECT | StringType |
TIMESTAMP | TimestampType |
TIME | StringType |
VARIANT | StringType |
Configuring a Snowflake Origin
Configure a Snowflake origin to read data from a Snowflake database.
-
On the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Stage Library Stage library to use to connect to Snowflake: - Snowflake cluster-provided libraries - The cluster
where the pipeline runs has Snowflake libraries
installed, and therefore has all of the necessary
libraries to run the pipeline.
Available only in Databricks pipelines.
- Snowflake Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline
to enable running the pipeline.
Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Snowflake libraries.
Select the appropriate Spark version for your cluster.
Note: When using additional Snowflake stages in the pipeline, ensure that they use the same stage library.Load Data Only Once Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run. Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.
Cache Data Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run.
- Snowflake cluster-provided libraries - The cluster
where the pipeline runs has Snowflake libraries
installed, and therefore has all of the necessary
libraries to run the pipeline.
-
On the Connection tab, configure the following
properties.
Note: Snowflake JDBC driver versions 3.13.25 or higher convert underscores to hyphens, by default. When needed, you can bypass this behavior by setting the
allowUnderscoresInHost
driver property totrue
. For more information and alternate solutions, see this Snowflake community article.The properties that display differ depending on whether you use a Control Hub connection to provide connection details.When using a connection, configure the following properties:Connection Property - Using Connection Property Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Override Warehouse Snowflake warehouse. If the selected Snowflake connection specifies a warehouse to use, this property overrides it.
Available only when using a connection.
Override Database Snowflake database. If the selected Snowflake connection specifies a database to use, this property overrides it.
Available only when using a connection.
Override Schema Snowflake schema. If the selected Snowflake connection specifies a schema to use, this property overrides it.
Available only when using a connection.
Enable Pushdown Pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets. Use only when the cluster that runs the pipeline uses Spark 2.4.0 or later.
When not using a Control Hub connection, configure the following properties:Connection Property - Not using Connection Description Use Custom JDBC URL Enables using a custom JDBC URL. Custom JDBC URL Custom JDBC URL to use. Include Organization Enables specifying the Snowflake organization. Organization Snowflake organization. Snowflake Region Region where the Snowflake warehouse is located. Select one of the following: - An available Snowflake region.
- Other - Enables specifying a Snowflake region not listed in the property.
- Custom JDBC URL - Enables specifying a virtual private Snowflake installation.
Available when Include Organization is disabled.
Custom Snowflake Region Custom Snowflake region. Available when using Other as the Snowflake region. Account Snowflake account name. Authentication Method Authentication method to connect to Snowflake:- User Credentials
- Key Pair Path - Use to specify the location of a private key file.
- Key Pair Content - Use to provide the contents of a private key file.
- OAuth
- None
User Snowflake user name. The user account or the custom role that overrides the default role for this user account must have the required Snowflake privileges.
For details, see Required Privileges and Custom Roles.
Available when using an authentication method.
Password Snowflake password. Required when using User Credentials authentication.
Private Key Path Path to the private key file. Required when using Key Pair Path authentication.
Private Key Content Contents of the private key file. Required when using Key Pair Content authentication.
Private Key Password Optional password for the specified private key file or content. Use for encrypted private keys. Available when using Key Pair Path or Key Pair Content authentication.
OAuth Token OAuth token to use. You can use a Snowflake security integration access token, refresh token, or code grant. Required when using OAuth authentication.
OAuth Client ID Client ID for Snowflake security integration access tokens. Available when using OAuth authentication.
OAuth Client Secret Client secret for Snowflake security integration access tokens. Available when using OAuth authentication.
OAuth Redirect URI Redirect URI for Snowflake security integration access tokens. Available when using OAuth authentication.
Enable Pushdown Pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets. Use only when the cluster that runs the pipeline uses Spark 2.4.0 or later.
Role Overrides the default role for the specified user account. The custom role must have the required Snowflake privileges.
Warehouse Snowflake warehouse. Database Snowflake database. Schema Snowflake schema. Use Private Link Snowflake URL Enables using a private link URL. You can specify the URL in the Custom JDBC URL property above. Or, you can define the appropriate values for the Account property and either the Snowflake Region or Organization properties.
Connection Properties Additional Snowflake Connection for Spark properties to use. For information about additional options, see the Snowflake documentation. To add properties, click the Add icon and define the property name and value. Specify the property names and values as expected by Snowflake.
You can use simple or bulk edit mode to configure the properties.
-
On the Table tab, configure the following
properties:
Table Property Description Read Mode Read mode to use: - Table - The origin reads all columns from the specified table.
- Query - The origin reads data based on the specified SQL query.
Table Table to read. Available with Table read mode. SQL Query SQL query to use for the read. Available with Query read mode. For guidelines on creating queries for full and incremental reads, see SQL Query Guidelines.
Copy Unload Enables the origin to perform a bulk read of the data using a COPY INTO command. When cleared, the origin uses a SELECT command to read the data. By default this option is selected, which is the default Snowflake read.
Partition Size Size of the partitions to use for the read, in MB. Available when Copy Unload is not selected.
Incremental Mode Enables the origin to read data from last-saved offset during subsequent pipeline runs. Initial Offset Initial offset value to use in the query. This value is substituted for the $offset
variable when you start the pipeline.The origin supports numeric and datetime offsets.
Available and required for incremental reads.
Offset Column Column to track the progress of the read. The origin supports numeric and datetime offsets.
As a best practice, an offset column should be an incremental and unique column that does not contain null values. Having an index on this column is strongly encouraged since the underlying query uses an ORDER BY clause and inequality operators on this column.
Available and required for incremental reads.
Keep Column Case Prevents adding quotation marks around column names that contain characters besides uppercase letters, numbers, and underscores. -
Optionally, on the Advanced tab, configure the following
property:
Advanced Property Description Connection Pool Maximum number of connections to Snowflake that the stage uses. Default is 4.
Increasing this property can improve performance. However, Snowflake warns that setting this property to an arbitrarily high value can adversely affect performance. The default is the recommended value.