Merging Changed Data into a Delta Lake Table
This solution describes how to design a pipeline that reads change data capture (CDC) data from a database and replicates the changes to a Delta Lake table on Databricks.
Let's say that you want to track customer transactions in a MySQL table and apply those changes to a Delta Lake table for further analysis. That is, you need to apply to the Delta Lake table the same inserts, updates, and deletes made to the MySQL table. You first design and run a pipeline to bulk load the initial set of transactions from the MySQL table into the Delta Lake table. Then you design the CDC pipeline that processes subsequent changes.
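For orientation, the result of the initial bulk load can also be produced outside of a pipeline. The following PySpark sketch shows the equivalent one-time snapshot, assuming a Databricks or Spark environment with the MySQL JDBC driver available; the host, credentials, and table and database names are placeholders:

```python
# Conceptual sketch of the initial bulk load, shown outside the pipeline.
# Assumes the MySQL JDBC driver is on the classpath; connection details,
# table names, and credentials below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the full customer transactions table from MySQL over JDBC.
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/sales")
    .option("dbtable", "customer_transactions")
    .option("user", "cdc_user")
    .option("password", "<password>")
    .load()
)

# Write the initial snapshot as a Delta Lake table.
(
    transactions.write.format("delta")
    .mode("overwrite")
    .saveAsTable("sales.customer_transactions")
)
```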
In the CDC pipeline, you use a MySQL Binary Log origin to capture the changes from the MySQL master database. Because of the structure of MySQL binary log records, you add processors to the pipeline to restructure each record and keep only the necessary fields. When the pipeline passes the data to the Databricks Delta Lake destination, the destination first stages the changed data in an Amazon S3 staging location, and then uses the MERGE command to merge the changed data from the staging location into the Delta Lake table.
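The merge that the destination runs against the Delta Lake table is functionally similar to the following sketch. The `staged_changes` view name, the `customer_id` key, and the `op` column are illustrative assumptions; in the pipeline you configure the table, key columns, and operation handling in the destination rather than writing this SQL yourself:

```python
# Conceptual sketch of the merge performed from the S3 staging location into
# the target Delta Lake table. Assumes the staged CDC rows are exposed as a
# view named staged_changes with a customer_id key and an op column that
# indicates INSERT, UPDATE, or DELETE; actual names depend on your pipeline.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    MERGE INTO sales.customer_transactions AS target
    USING staged_changes AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED AND source.op = 'DELETE' THEN DELETE
    WHEN MATCHED AND source.op = 'UPDATE' THEN UPDATE SET *
    WHEN NOT MATCHED AND source.op = 'INSERT' THEN INSERT *
""")
```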
- Create the pipeline and configure a MySQL Binary Log origin to read CDC information provided by MySQL in binary logs.
- Configure several processors to restructure the record based on the type of operation performed: INSERT, UPDATE, or DELETE. A conceptual sketch of this restructuring follows the list.
- Configure a Databricks Delta Lake destination to stage the changed data in text files in Amazon S3 and then merge the staged data to the target Delta Lake table.
- Run the pipeline to replicate data from MySQL binary logs to the Delta Lake target table.
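The following plain-Python sketch illustrates the kind of restructuring the processors perform on each change record. The `Type`, `Data`, and `OldData` field names and the flattened output are simplified assumptions; the records emitted by the MySQL Binary Log origin are richer, and in the pipeline this work is done by processor stages, not code:

```python
# Conceptual sketch (plain Python, not a pipeline processor) of restructuring
# a binlog-style change record by operation type. Field names are simplified
# assumptions for illustration only.

def restructure(record: dict) -> dict:
    """Flatten a simplified binlog-style record into a row plus an operation."""
    op = record["Type"]  # e.g. "INSERT", "UPDATE", or "DELETE"
    if op == "DELETE":
        # Deletes carry only prior values; keep the key so the merge can match it.
        row = {"customer_id": record["OldData"]["customer_id"]}
    else:
        # Inserts and updates carry the new column values.
        row = dict(record["Data"])
    row["op"] = op  # operation column consumed by the merge
    return row

# Example: an update to a customer transaction.
print(restructure({
    "Type": "UPDATE",
    "Data": {"customer_id": 42, "amount": 19.99},
    "OldData": {"customer_id": 42, "amount": 17.50},
}))
```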