Bulk Loading Data into a Delta Lake Table
This solution describes how to build a pipeline that bulk loads Salesforce data into a Delta Lake table on Databricks.
Tip: You can download the sample Salesforce to Delta Lake pipeline from the StreamSets Data Collector pipeline library, import the pipeline into Data Collector, and then follow these steps for details on how the solution works.
Let's say that you want to bulk load Salesforce account data into Databricks Delta Lake for further analysis. You'd like the pipeline to clean some of the account data before loading it into Delta Lake. When the pipeline passes the cleaned data to the Databricks Delta Lake destination, the destination first stages the data in an Amazon S3 staging location, and then uses the COPY command to copy the data from the staging location to a Delta Lake table.
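For a rough mental model of what the destination does behind the scenes, the Python sketch below walks through the equivalent two-step load: upload a staged file to the Amazon S3 staging location with boto3, then issue a COPY INTO statement through the Databricks SQL connector. The bucket, paths, warehouse endpoint, credentials, and table name are placeholders, and the destination handles all of this for you; the sketch only illustrates the stage-then-copy pattern.

```python
# Illustrative only: the Databricks Delta Lake destination performs the
# equivalent of these two steps automatically. All names are placeholders.
import boto3
from databricks import sql  # pip install databricks-sql-connector

# Step 1: stage the cleaned account data as a text file in the S3 staging location.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="accounts_batch_0.csv",            # locally written, cleaned records
    Bucket="my-staging-bucket",                  # hypothetical staging bucket
    Key="delta-staging/accounts_batch_0.csv",
)

# Step 2: copy the staged data into the target Delta Lake table with COPY INTO.
with sql.connect(
    server_hostname="<databricks-workspace-host>",
    http_path="<sql-warehouse-http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            COPY INTO sales.accounts
            FROM 's3://my-staging-bucket/delta-staging/'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('header' = 'true')
        """)
```

The two-step approach matters for performance: COPY INTO reads the staged files in bulk from S3, which is far faster than inserting records into the Delta Lake table row by row.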
To build this pipeline, complete the following tasks:
- Create the pipeline and configure a Salesforce origin to read account data from Salesforce.
- Configure an Expression Evaluator processor to clean the input data (a rough Python equivalent of the read-and-clean steps appears after this list).
- Configure a Databricks Delta Lake destination to stage the pipeline data in text files in Amazon S3 and then copy the staged data to the target Delta Lake table.
- Run the pipeline to move the data from Salesforce to Delta Lake.
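The steps above map to stage configuration in Data Collector rather than to code. Still, as a point of reference for the first two tasks, the sketch below shows an equivalent read-and-clean step in Python: it queries Account records over SOQL with the simple-salesforce library and tidies a few fields. The field list and cleanup rules are assumptions for illustration; your origin query and Expression Evaluator expressions may differ.

```python
# Illustrative only: the Salesforce origin and Expression Evaluator handle this
# declaratively in the pipeline. Credentials and field names are placeholders.
from simple_salesforce import Salesforce  # pip install simple-salesforce

sf = Salesforce(
    username="<salesforce-username>",
    password="<salesforce-password>",
    security_token="<salesforce-security-token>",
)

# Read account data, as the Salesforce origin would (SOQL query).
result = sf.query("SELECT Id, Name, Phone, BillingCountry FROM Account")

# Clean the records before loading, as the Expression Evaluator would.
cleaned = []
for record in result["records"]:
    cleaned.append({
        "Id": record["Id"],
        "Name": (record["Name"] or "").strip(),             # drop stray whitespace
        "Phone": (record["Phone"] or "").replace(" ", ""),   # normalize phone format
        "BillingCountry": (record["BillingCountry"] or "").upper(),
    })
```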