JDBC Query

The JDBC Query origin reads data from a database based on the specified query.

Use the JDBC Query origin in batch pipelines only. In a batch pipeline, the origin reads all available data with the specified query, then stops the pipeline. The origin does not support streaming execution mode and does not save offsets or support partitioning.

Use this origin to read from a database when you require a specific query. To read from a partitioned table or to read in streaming execution mode, use the JDBC Table origin.
Note: To read from most database vendors, the origin requires that Apache Spark version 2.4.0 or later is installed on the Transformer machine and on each node in the cluster. To read from Oracle databases, the JDBC Query origin requires Spark version 2.4.4 or later.

When you configure the JDBC Query origin, you specify the database connection information and any additional JDBC configuration properties you want to use. You can also use a connection to configure the origin.

You specify the SQL query to use for the read. You can specify a separate query to use when previewing data for pipeline development and testing. You can define a fetch size that is used by both queries.

You can configure the origin to cache the data for reuse throughout the pipeline run. You can also specify the JDBC driver to include with the pipeline.

Before using the JDBC Query origin, verify if you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC Query origin can read database data from multiple database vendors.

StreamSets has tested this origin with the following database vendors, versions, and JDBC drivers:
Database Vendor Versions and Drivers
Microsoft SQL Server SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver
MySQL MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
Oracle Oracle 11g with the Oracle 19.3.0 JDBC driver
PostgreSQL PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver

Installing the JDBC Driver

The JDBC Query origin includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.

SQL Queries

You can configure two separate queries for the JDBC Query origin:
Pipeline run query
The pipeline run query defines the data that is read from the database and passed to the pipeline for processing.
You can use any SQL query that is valid for the database that the origin reads from. For example, you might write a query that joins the data from two different tables.
Configure this query using the Query property on the JDBC tab.
Preview query
The preview query defines the data that is read from the database when you preview a pipeline. Use the preview query to provide a smaller data set for pipeline development and testing.
You can use any SQL query that is valid for the database that the origin reads from. If you do not specify a preview query, the origin uses the pipeline run query to provide preview data. This can generate an unnecessarily large data set.
Important: To prevent an unnecessary use of resources and a long preview wait time, configure a preview query that returns approximately the number of rows that you want data preview to display.
For example, say your pipeline run query returns 25,000 rows of data and the preview batch size is 10 records. If you do not specify a preview query, when you run data preview, the origin must use the pipeline run query and return 25,000 records to provide the 10 record batch of preview data.
Configure this query using the Preview Query property on the Advanced tab.

Configuring a JDBC Query Origin

Configure a JDBC Query origin to read data from a database table. Before using the origin, verify if you need to install a JDBC driver.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the JDBC tab, configure the following properties:
    JDBC Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor.

    For example, use the following formats for these database vendors:

    • MySQL - jdbc:mysql://<host>:<port>/<database_name>
    • Oracle - jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
    • PostgreSQL - jdbc:postgresql://<host>:<port>/<database_name>
    • SQL Server - jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
    Use Credentials Enables entering credentials on the Credentials tab. Select when you do not include credentials in the JDBC connection string.
    Query SQL query that provides data for pipeline processing.

    Use any SQL query that is valid for the database.

    Additional JDBC Configuration Properties Additional JDBC configuration properties to use.

    To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by JDBC.

  3. On the Credentials tab, optionally configure the following properties:
    Credentials Property Description
    Username User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database. For example, the user account specified for an origin must be able to read data from the database. The user account specified for a destination must be able to write to the database.

    Password Password for the JDBC user name.
  4. On the Advanced tab, optionally configure the following properties:
    Table Property Description
    Specify Preview Query Allows specifying a query to provide preview data.
    Preview Query Query to provide preview data. Used only when you preview the pipeline.
    Specify Fetch Size Enables defining a fetch size to use for all SQL queries.
    Fetch Size Suggests the number of rows that the JDBC driver should fetch with each database round-trip. The specified fetch size is used for both the pipeline and preview queries.

    Use 0 to skip defining a fetch size.

    For more information about configuring a fetch size, see the database documentation.

    JDBC Driver JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name Class name of the custom JDBC driver to use.