Snowflake Bulk
The Snowflake Bulk origin reads all available data from one or more tables in a Snowflake database, then stops the pipeline. This design enables processing Snowflake data without incurring the costs of continuously connecting to Snowflake with streaming pipelines.
You can use the Snowflake Bulk origin to read from any Snowflake database hosted on Amazon S3, Google Cloud Storage, Microsoft Azure, and private Snowflake installations. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
During the read process, the Snowflake Bulk origin sends a command to Snowflake to stage data as CSV files on either an internal Snowflake stage or a hosted external stage. Then the Snowflake Bulk origin downloads and processes those CSV files. The Snowflake Bulk origin can use multiple threads to process the files in parallel.
When you configure the Snowflake Bulk origin, you specify the Snowflake region, account, and connection information. You define the number of threads to use to connect to Snowflake and process data. You can specify an organization name to use and define additional Snowflake connection properties, as needed. You can also use a connection to configure the origin.
You specify the tables to read and optional WHERE clauses for each table. You configure staging location details and the data format of the stage files.
Prerequisites
Create a Snowflake Stage
Before using the Snowflake Bulk origin in a pipeline, you must create a Snowflake internal or external stage.
During the read process, the Snowflake Bulk origin sends a command to Snowflake to stage data as CSV files on either an internal Snowflake stage or a hosted external stage. Then the Snowflake Bulk origin downloads and processes those CSV files.
To use an external stage, create the external stage with the cloud service provider that hosts your Snowflake warehouse.
- Snowflake internal stage
- You can stage data in Snowflake internal user stages or named stages. Do not use internal table stages.
- Amazon S3 external stage
- To stage data in an Amazon S3 external stage, create a Snowflake external stage in a bucket in the same S3 region that hosts your Snowflake virtual warehouse. For example, if your Snowflake warehouse is in AWS US West, then create the Snowflake external stage in a bucket in the AWS US West region.
- Google Cloud Storage external stage
- To stage data in a Google Cloud Storage external stage, create a Snowflake storage integration in Google Cloud Storage. This is a multistep process described in the Snowflake documentation that ends with creating a Snowflake external stage. Be sure to complete all required steps.
- Microsoft Azure external stage
- To stage data in a Microsoft Azure external stage, complete the
following tasks:
- Configure Snowflake authentication for the Microsoft
Azure Blob Storage container that you want to use.
You can use an SAS token or an Azure account name and key for authentication. For information about configuring SAS token authentication, see the Snowflake documentation.
- Create a Snowflake external stage in the container.
When you create a Snowflake external stage, you specify a URL that defines the name and location for the stage. Include a trailing slash in the URL to ensure that Snowflake loads all staged data. You might also include a prefix in the stage name to indicate that the external stage is for Data Collector.
For example, the following URL creates an external stage namedsdc-externalstage
inazure://myaccount.blob.core.windows.net/mycontainer/load/
and loads all staged data to Snowflake:azure://myaccount.blob.core.windows.net/mycontainer/load/sdc-externalstage/
You can create an Azure stage using the Snowflake web interface or SQL. For more information, see Creating an Azure Stage in the Snowflake documentation.
- Configure Snowflake authentication for the Microsoft
Azure Blob Storage container that you want to use.
AWS Credentials
When the Snowflake Bulk origin reads data staged on Amazon S3, it must pass credentials to Amazon Web Services.
- Instance profile
- When the execution Data Collector runs on an Amazon EC2 instance that has an associated instance profile, Data Collector uses the instance profile credentials to automatically authenticate with AWS.
- AWS access key pair
-
When the execution Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can connect with an AWS access key pair.
To connect with an AWS access key pair, on the Staging tab, set the Authentication Method property to AWS Keys. Then define the access key ID and secret access key.Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
Google Cloud Storage Credentials
Before processing data staged on Google Cloud Storage, the Snowflake Bulk origin must pass credentials to Google Cloud.
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
Assign Privileges
The Snowflake Bulk origin requires a Snowflake role that grants all privileges needed to process data.
Object Type | Privilege |
---|---|
External stage or internal Snowflake stage | READ, WRITE |
Table | SELECT |
Batch Processing
Unlike most Data Collector origins, the Snowflake Bulk origin performs batch processing only. After processing available data, the origin stops the pipeline rather than waiting for additional data, as with streaming pipelines. This design enables processing Snowflake data without incurring the costs of continuously connecting to Snowflake with streaming pipelines.
Define a Role
The Snowflake Bulk origin requires a Snowflake role that can read from Snowflake.
Before configuring the origin, ensure that you have granted the required privileges to a Snowflake role, as explained in Prerequisites.
- Assign the custom role as the default role
- In Snowflake, assign the custom role as the default role for the Snowflake user account specified in the stage. A Snowflake user account is associated with a single default role.
- Override the default role with the custom role
- In the stage, use the Role property to specify the name of the custom role. The custom role overrides the role assigned to the Snowflake user account specified in the stage.
- Use a Snowflake connection
- When working with Control Hub, you can configure a Snowflake connection to provide connection details for Snowflake stages.
Multithreaded Processing
The Snowflake Bulk origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the Snowflake Bulk origin determines the tables to read based on the specified table configuration. The origin then uses multiple concurrent threads for processing based on the Connection Pool Size property on the Snowflake Connection Info tab.
As the pipeline runs, each thread processes one table: first copying data to the specified staging area, then downloading the data from the staging area. Each thread then creates batches based on the maximum batch size configured for the origin, and passes them to pipeline runners.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Table Configurations
When you configure the Snowflake Bulk origin, you specify the tables to read. You can optionally specify WHERE clauses to limit the records that the origin processes.
- One table
- To process one table, you simply specify the table name in the Name Pattern property.
- Multiple tables
- To process multiple tables, configure one or more sets of the following
properties:
- Table Name Pattern - A single table name or a table name pattern that uses SQL LIKE syntax and evaluates to a set of tables.
- Exclusion Pattern - An optional name pattern that evaluates to table
names to exclude from processing. Use to exclude a subset of tables
defined by the Table Name Pattern property.
Define the property with a Java-based regular expression. For more information about using regular expressions, see Regular Expressions Overview.
- Where Clause - An optional WHERE clause that limits the records
processed from a specified table or table name pattern. When used
with a table name pattern, the WHERE clause is applied to every
table associated with the pattern.
Use Snowflake SQL to define a WHERE clause, and omit
WHERE
from the specified clause. For example, to use a WHERE clause to read sales data and omit returns, you might enter the following in the Where Clause property:total > 0
.For information about WHERE clause syntax, see the Snowflake documentation.
Example
Say you have a set of marketing tables that all have a -marketing
suffix for the name. You want to process all data in all of the tables except the
AU-Marketing
table. And you want to include data from the
Marketing-EU
table where Feedback
is not null.
To do this, you define two sets of table configuration properties as follows:
First set:
- Name Pattern:
%-Marketing
- Exclusion Pattern:
AU-Marketing
- Name Pattern:
Marketing-EU
- Where:
Feedback IS NOT NULL
Error Handling
You can configure Snowflake error handling in the Snowflake Bulk origin. The error
handling properties determine how the on_error
option is defined in the
Snowflake SQL query.
- Default - Does not set the
on_error
option. When not explicitly set, the Snowflake defaulton_error
behavior is Abort Statement. - Continue - Ignores errors and continues processing.
- Skip File
- When encountering errors, skips reading the batch.When you use this option, you also configure a Skip File On Error property to specify when to skip the file:
- First - After discovering the first error record.
- Number - After discovering the specified number of error records in the batch.
- Percentage - After discovering the specified percentage of error records in the batch.
- Abort Statement - Skips reading the batch upon the first error.
Record Header Attributes
The Snowflake Bulk origin includes basic Snowflake information in the record header attributes of the generated record.
- database - The database that contains the table that was processed.
- schema - The schema that contains the table that was processed.
- table - The table that contains the record that was processed.
Snowflake Data Types
The Snowflake Bulk origin converts Snowflake data types to Data Collector data types. The origin supports the following data types:
Snowflake Data Type | Data Collector Data Type |
---|---|
Array | List |
Bigint | Long |
Binary | Byte Array |
Boolean | Boolean |
Byteint | Long |
Char | String |
Date | Date |
Datetime | Datetime |
Decimal | Long |
Double | Double |
Double Precision | Double |
Float | Double |
Geography | String |
Int | Long |
Integer | Long |
Number | Long |
Number(<precision>, 0) |
Long |
Number(<precision>, <scale>) |
Long |
Nchar | String |
Nvarchar2 | String |
Real | Double |
Smallint | Long |
Time | Time |
Timestamp_LZT | Zoned Datetime |
Timestamp_NTZ | Datetime |
Timestamp_TZ | Zoned Datetime |
Tinyint | Long |
Varchar | String |
Variant | Conversion occurs based on content |
Configuring a Snowflake Bulk Origin
Configure a Snowflake Bulk origin to read all available data from Snowflake tables, and then stop the pipeline. Before you use the origin, complete the prerequisite tasks.
-
In the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the Snowflake Connection Info tab, configure the
following properties:
Note: Snowflake JDBC driver versions 3.13.25 or higher convert underscores to hyphens, by default. When needed, you can bypass this behavior by setting the
allowUnderscoresInHost
driver property totrue
. For more information and alternate solutions, see this Snowflake community article.Snowflake Connection Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Organization Snowflake organization. Account Snowflake account name. Authentication Method Authentication method to connect to Snowflake:- User Credentials
- Key Pair Path - Use to specify the location of a private key file.
- Key Pair Content - Use to provide the contents of a private key file.
- OAuth
- None
Available when not using a Control Hub connection for connection details.
User Snowflake user name. The user account or the custom role that overrides the default role for this user account must have the required Snowflake privileges.
For information about required privileges, see Assign Privileges.
Required when using authentication.
Password Snowflake password. Required when using User Credentials authentication.
Private Key Path Path to the private key file. Required when using Key Pair Path authentication.
Private Key Content Contents of the private key file. Required when using Key Pair Content authentication.
Private Key Password Optional password for the specified private key file or content. Use for encrypted private keys. Available when using Key Pair Path or Key Pair Content authentication.
OAuth Token OAuth token to use. You can use a Snowflake security integration access token, refresh token, or code grant. Required when using OAuth authentication.
OAuth Client ID Client ID for Snowflake security integration access tokens. Available when using OAuth authentication.
OAuth Client Secret Client secret for Snowflake security integration access tokens. Available when using OAuth authentication.
OAuth Redirect URI Redirect URI for Snowflake security integration access tokens. Available when using OAuth authentication.
Role Overrides the default role for the specified user account. The custom role must have the required Snowflake privileges. For more information, see Assign Privileges.
Available when not using a Control Hub connection for connection details.
Warehouse Snowflake warehouse. Available when not using a Control Hub connection for connection details.
Database Snowflake database. Available when not using a Control Hub connection for connection details.
Schema Snowflake schema. Available when not using a Control Hub connection for connection details.
Use Custom JDBC URL Custom JDBC URL to use when using a virtual private Snowflake installation. Include Organization Enables specifying the Snowflake organization. Use Private Link Snowflake URL Enables using a private link URL. You can specify the URL in the Custom JDBC URL property above. Or, you can define the appropriate values for the Account property and either the Snowflake Region or Organization properties.
Connection Properties Additional Snowflake connection properties to use. To add properties, click Add and define the property name and value. Use the property names and values as expected by Snowflake.
Connection Pool Size Number of threads the origin generates and uses for multithreaded processing. You might increase the number of connections when reading from multiple tables.
For more information, see Multithreaded Processing.
-
On the Snowflake tab, configure the following
properties:
Snowflake Property Description Table or View Configuration Snowflake data to read. Define one or more sets of the following properties: - Name Pattern - A table or view name or an expression that evaluates to the set of tables to process. Use SQL LIKE syntax.
- Exclusion Pattern - A Java-based regular expression
that defines the tables to exclude from the read.
For more information about using regular expressions, see Regular Expressions Overview.
- Optional WHERE clause to limit the records to be
read. Omit
WHERE
from the specified clause.
Read Values as String Reads all values as strings. Max Columns Maximum number of columns to read. Default is 512. Increase as needed to process larger tables.
Max Characters Per Column Maximum number of characters to read in a column. Default is 4096. Increase as needed to process larger columns.
Max Batch Size (records) Maximum number of records processed at one time. Honors values up to the Data Collector maximum batch size. Default is 1000. The Data Collector default is 1000.
Error Behavior Action to take when errors occur. Determines how the Snowflake on_error
option is used in a query:- Default - Does not set the
on_error
option. When not explicitly set, the Snowflake defaulton_error
behavior is Abort Statement. - Continue - Ignores errors and continues processing.
- Skip File - When encountering errors, skips reading the batch. Requires configuring the Skip File on Error property.
- Abort Statement - Skips reading the batch upon the first error.
For more information about the
on_error
option, see the Snowflake documentation.Skip File on Error Indicates when to skip reading a batch: - First - After discovering the first error record.
- Number - After discovering the specified number of error records in the batch.
- Percentage - After discovering the specified percentage of error records in the batch.
Available when using the Skip File error behavior.
-
On the Staging tab, configure the following
properties:
Staging Property Description Stage Location Location of the Snowflake stage: - Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- Snowflake Internal Stage
This property configuration determines the properties that display on this tab and the Staging Advanced tab.
Snowflake Stage Name Name of the Snowflake stage used to stage the data. You generally create this stage as part of the prerequisite tasks. To use a Snowflake internal user stage, enter a tilde (
~
).Connection Connection that defines the information required to connect to an external system. Available when processing data staged on Google Cloud Storage. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Authentication Method Amazon authentication method: - AWS Keys
- Instance Profile
- None
AWS Access Key ID AWS access key ID.
Required when not using an instance profile to read from an external stage on Amazon S3.
AWS Secret Key ID AWS secret access key.
Required when not using an instance profile to read from an external stage on Amazon S3.
S3 Stage File Name Prefix Optional prefix for the external stage name. Default is sdc
.S3 Compressed File Ensures that files are compressed before being staged on S3. Keep this option enabled for optimum performance. Azure Authentication Type of authentication to use to connect to Azure. Azure Account Name Azure account name. Azure Account Key Azure account key. Used only for Account Name and Key authentication.
To secure sensitive information, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
Azure SAS Token Azure SAS Token. Used only for SAS Token authentication.
To secure sensitive information, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
Azure Stage File Name Prefix Optional prefix for the external stage name. Azure Compressed File Compresses files before writing them to Azure. Keep this option enabled for optimum performance. Project ID Google Cloud project ID to use.
Credentials Provider Provider for Google Cloud credentials: - Default credentials provider - Uses Google Cloud default credentials.
- Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
- Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
Credentials File Path (JSON) Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file. Enter a path relative to the Data Collector resources directory,
$SDC_RESOURCES
, or enter an absolute path.Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store. For more information about credential stores, see Credential Stores in the Data Collector documentation.
Stage Database Optional database for the Snowflake stage. Configure this property when the stage is located in a different database than the Snowflake table. When not defined, the origin uses the database defined for the Snowflake table, on the Snowflake tab.
Stage Schema Optional schema for the Snowflake stage. Configure this property when the stage is located in a different schema than the Snowflake table. When not defined, the origin uses the schema defined for the Snowflake table, on the Snowflake tab.
Stage File Prefix Prefix to use for staged files. Default is
sdc
.Purge Stage File After Ingesting Removes a stage file after the data is read. Compressed File Enables compressing staged files. Parallel Transfers Number of threads to use to upload and download stage files. Available when staging to Snowflake internal stages.
-
When using a Snowflake external stage, on the Staging
Advanced tab, configure the following properties.
This tab displays different properties depending on the location of the external stage.
When using an external stage in Amazon S3, you can configure the following properties:When using an external stage in Azure, you can configure the following properties:Amazon S3 Advanced Property Description Connection Timeout Seconds to wait for a response before closing the connection. Socket Timeout Seconds to wait for a response to a query. Retry Count Maximum number of times to retry requests. Use Proxy Specifies whether to use a proxy to connect. Proxy Host Proxy host. Proxy Port Proxy port. Proxy User User name for proxy credentials. Proxy Password Password for proxy credentials. Proxy Domain Optional domain name for the proxy server. Proxy Workstation Optional workstation for the proxy server. Uploading Threads Size of the thread pool for parallel uploads. Used when working with multiple partitions and processing large objects in multiple parts. When working with multiple partitions, setting this property up to the number of partitions being used to can improve performance.
For more information about this and the following properties, see the Amazon S3 TransferManager documentation.
Minimum Upload Part Size (MB) Minimum part size in bytes for multipart uploads. Multipart Upload Threshold (MB) Minimum batch size in bytes for multipart uploads. Tags Tags to add to an object. Using simple or bulk edit mode, click Add Another to configure a tag. You can configure multiple tags. When you configure a tag, you can define a tag with just the key or specify a key and value.
Azure Advanced Property Description Use Custom Blob Service URL Enables using a custom Azure Blob Storage URL. Custom Blob Service URL Custom Azure Blob Storage URL. Typically uses the following format: https://<Azure Account>.blob.core.windows.net
Azure Encryption Enables using Azure default encryption at this time.