JDBC
The JDBC destination writes data to a database table.
The destination can write each batch of data to a new or existing database table. If the specified table doesn't exist, the destination creates the new table in the database. If the specified table exists, the destination can either append data to the table or overwrite data in the table.
When you configure the JDBC destination, you specify the database connection information and any additional JDBC configuration properties you want to use. You specify the table to write to, the mode the destination uses to write to an existing table, and the number of partitions used to write to the database table. You can optionally configure advanced properties related to the JDBC driver.
You can also use a connection to configure the destination.
Before using the JDBC destination, verify if you need to install a JDBC driver.
Database Vendors and Drivers
The JDBC destination can write database data to multiple database vendors.
Database Vendor | Versions and Drivers |
---|---|
Microsoft SQL Server | SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver |
MySQL | MySQL 5.7 with the MySQL Connector/J 8.0.12 driver |
Oracle | Oracle 11g with the Oracle 19.3.0 JDBC driver |
PostgreSQL | PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver |
Installing the JDBC Driver
- Microsoft SQL Server
- PostgreSQL
When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
- Apache Derby
- IBM DB2
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Teradata
If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
Write Mode
When the JDBC destination is configured to write to a table that doesn't exist, the destination creates the table in the specified database. The table is created based on the first batch of data that the destination processes when you start the pipeline, and that first batch is inserted to the new table.
The write mode defines how the destination writes to an existing table, including the second batch to a table that the destination created.
- Append to table
- The destination appends rows to the table.
- Overwrite table
- The destination removes all rows in the table before writing new rows to the
table. When you select overwrite mode, you also configure how the
destination removes rows from the table:
- Truncates all rows in the table before the write
- Drops and recreates the table before the write.
- Write to a slowly changing dimension
- The destination inserts and updates rows provided by the Slowly Changing Dimension processor.
By default, the destination appends rows to existing tables.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing to a database table, Spark creates one connection to the database for each partition. To change the number of partitions that write to the database table, configure the number of partitions for the JDBC destination.
- The size and configuration of the cluster.
- The amount of data being processed.
- The number of concurrent connections that can be made to the database.
If the pipeline fails because the JDBC destination encounters an out of memory error, you likely need to increase the number of partitions for the destination.
Configuring a JDBC Destination
Configure a JDBC destination to write data to a database table. Before using the destination, verify if you need to install a JDBC driver.
-
In the Properties panel, on the General
tab, configure the following properties:
General Property Description Name Stage name. Description Optional description. -
On the Connection tab, configure the following
properties:
Connection Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor. For example, use the following formats for these database vendors:
- MySQL -
jdbc:mysql://<host>:<port>/<database_name>
- Oracle -
jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
- PostgreSQL -
jdbc:postgresql://<host>:<port>/<database_name>
- SQL Server -
jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
- SQL Server 2019 Big Data Cluster
accessed from a Transformer pipeline -
jdbc:sqlserver://<ip>:<port>;databaseName=<database_name>
Where
<ip>
is the IP address of the SQL Server master instance, and<port>
is the port number of the SQL Server master instance. Use the SQL Server user name and password with this connection string. For more information, see SQL Server 2019 JDBC Connection Information.You can optionally include the user name and password in the connection string.
Use Credentials Enables entering credentials. Select when you do not include credentials in the JDBC connection string. Username User name for the JDBC connection. The user account must have the correct permissions or privileges in the database.
Password Password for the JDBC user name. Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by JDBC.
- MySQL -
-
On the Table tab, configure the following
properties:
Table Property Description Schema Name of the schema in which the tables are located. Define when the database requires a schema. Table Name of the database table to write to. If the table does not exist, the destination creates the table. Write Mode Mode to write to an existing database table: - Append to table - Appends rows to the table.
- Overwrite table - Removes all rows in the table before writing new rows to the table.
- Write to a slowly changing dimension - Inserts and updates rows in a slowly changing dimension. Use only in a slowly changing dimension pipeline.
By default, the destination appends rows.
Remove Rows Determines how the destination removes rows from a table: - Truncate table - Truncates all rows in the database table before writing to the table.
- Drop and recreate table - Drops and recreates the table before writing to it.
By default, the destination truncates the table.
Available when the destination uses Overwrite write mode.
Number of Partitions Number of partitions used to write to the database table. Each partition opens its own connection to the database. Use -1 to use the same number of partitions as the pipeline.
Default is -1.
-
On the Advanced tab, optionally configure advanced
properties.
The defaults for these properties should work in most cases:
Advanced Property Description Write Batch Size JDBC batch size used by Spark, which determines the maximum number of rows to insert with each database round-trip. Use -1 to use Spark's default value. Default is -1.
JDBC Driver JDBC driver to include with the pipeline: - Bundle driver from connection string
- Transformer bundles a JDBC driver with the pipeline.
Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.
- Bundle custom driver - Transformer bundles the specified driver with the pipeline.
Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.
- Do not bundle driver - Transformer does not include a driver with the pipeline.
Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.
Default is to bundle the driver from the JDBC connection string.
Driver Class Name Class name of the custom JDBC driver to use. - Bundle driver from connection string
- Transformer bundles a JDBC driver with the pipeline.