JDBC

The JDBC destination writes data to a database table.

The destination can write each batch of data to a new or existing database table. If the specified table doesn't exist, the destination creates the new table in the database. If the specified table exists, the destination can either append data to the table or overwrite data in the table.

When you configure the JDBC destination, you specify the database connection information and any additional JDBC configuration properties you want to use. You specify the table to write to, the mode the destination uses to write to an existing table, and the number of partitions used to write to the database table. You can optionally configure advanced properties related to the JDBC driver.

You can also use a connection to configure the destination.

Before using the JDBC destination, verify if you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC destination can write database data to multiple database vendors.

StreamSets has tested this destination with the following database vendors, versions, and JDBC drivers:
Database Vendor Versions and Drivers
Microsoft SQL Server SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver
MySQL MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
Oracle Oracle 11g with the Oracle 19.3.0 JDBC driver
PostgreSQL PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver

Installing the JDBC Driver

The JDBC destination includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.

Write Mode

When the JDBC destination is configured to write to a table that doesn't exist, the destination creates the table in the specified database. The table is created based on the first batch of data that the destination processes when you start the pipeline, and that first batch is inserted to the new table.

The write mode defines how the destination writes to an existing table, including the second batch to a table that the destination created.

You can use one of the following write modes with the JDBC destination:
Append to table
The destination appends rows to the table.
Use append mode only when each new row has a unique primary key. If a row with the same primary key exists in the table, the pipeline fails with a primary key violation.
Overwrite table
The destination removes all rows in the table before writing new rows to the table. When you select overwrite mode, you also configure how the destination removes rows from the table:
  • Truncates all rows in the table before the write
  • Drops and recreates the table before the write.
Write to a slowly changing dimension
The destination inserts and updates rows provided by the Slowly Changing Dimension processor.
Use only in a slowly changing dimension pipeline.

By default, the destination appends rows to existing tables.

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing to a database table, Spark creates one connection to the database for each partition. To change the number of partitions that write to the database table, configure the number of partitions for the JDBC destination.

Consider the following when you define the number of partitions to use:
  • The size and configuration of the cluster.
  • The amount of data being processed.
  • The number of concurrent connections that can be made to the database.

If the pipeline fails because the JDBC destination encounters an out of memory error, you likely need to increase the number of partitions for the destination.

Configuring a JDBC Destination

Configure a JDBC destination to write data to a database table. Before using the destination, verify if you need to install a JDBC driver.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
  2. On the Connection tab, configure the following properties:
    Connection Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor.

    For example, use the following formats for these database vendors:

    • MySQL - jdbc:mysql://<host>:<port>/<database_name>
    • Oracle - jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
    • PostgreSQL - jdbc:postgresql://<host>:<port>/<database_name>
    • SQL Server - jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
    • SQL Server 2019 Big Data Cluster accessed from a Transformer pipeline - jdbc:sqlserver://<ip>:<port>;databaseName=<database_name>

      Where <ip> is the IP address of the SQL Server master instance, and <port> is the port number of the SQL Server master instance. Use the SQL Server user name and password with this connection string. For more information, see SQL Server 2019 JDBC Connection Information.

      You can optionally include the user name and password in the connection string.

    Use Credentials Enables entering credentials. Select when you do not include credentials in the JDBC connection string.
    Username User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database.

    Password Password for the JDBC user name.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Additional JDBC Configuration Properties Additional JDBC configuration properties to use.

    To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by JDBC.

  3. On the Table tab, configure the following properties:
    Table Property Description
    Schema Name of the schema in which the tables are located. Define when the database requires a schema.
    Table Name of the database table to write to. If the table does not exist, the destination creates the table.
    Write Mode Mode to write to an existing database table:
    • Append to table - Appends rows to the table.
    • Overwrite table - Removes all rows in the table before writing new rows to the table.
    • Write to a slowly changing dimension - Inserts and updates rows in a slowly changing dimension. Use only in a slowly changing dimension pipeline.

    By default, the destination appends rows.

    Remove Rows Determines how the destination removes rows from a table:
    • Truncate table - Truncates all rows in the database table before writing to the table.
    • Drop and recreate table - Drops and recreates the table before writing to it.

    By default, the destination truncates the table.

    Available when the destination uses Overwrite write mode.

    Number of Partitions Number of partitions used to write to the database table. Each partition opens its own connection to the database.

    Use -1 to use the same number of partitions as the pipeline.

    Default is -1.

  4. On the Advanced tab, optionally configure advanced properties.
    The defaults for these properties should work in most cases:
    Advanced Property Description
    Write Batch Size JDBC batch size used by Spark, which determines the maximum number of rows to insert with each database round-trip. Use -1 to use Spark's default value.

    Default is -1.

    JDBC Driver JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name Class name of the custom JDBC driver to use.