JDBC Query

The JDBC Query executor connects through JDBC to a database and runs one or more user-defined SQL queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Use the JDBC Query executor as part of an event stream in the pipeline. For example, you might use the JDBC Query executor to execute a stored procedure in the database when the pipeline generates a pipeline stop event.
Important: This stage does not support connecting to non-RDBMS systems, including Hive, Impala, Kudu, or Snowflake. Support for untested systems is not guaranteed. For a list of tested systems, see "Database Vendors and Drivers". Use the Hive Query executor to run queries on Hive or Impala.

The JDBC Query executor can commit data to the database after each batch or can commit data to the database for each record. By default, the executor commits after each batch.

When you configure the JDBC Query executor, you specify JDBC connection properties and the queries to run. You can configure custom properties that your driver requires, advanced connection properties, and the commit type to use. You can also enable the executor to run insert and delete queries in parallel to improve throughput.

To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.

You can also use a connection to configure the executor.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Database Vendors and Drivers

The JDBC Query executor can run queries on database data from multiple database vendors.

The following table lists the supported and tested database versions for this stage. You can use the stage with other JDBC-compliant databases, but full support is not guaranteed. For a full list of supported versions, see Supported Systems and Versions in the Data Collector documentation.
Database Vendor Supported Versions Tested Versions
MySQL MySQL 5.7 and later
  • MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
  • MySQL 8.0 with the MySQL Connector/J 8.0.12 driver
Oracle
  • Oracle 11g Release 2, 12c, 18c, 19c, 21c
  • Oracle Real Application Clusters (RAC) 11g Release 2, 12c, 18c, 19c, 21c
Also supported:
  • Hosted systems, such as Amazon RDS, that run supported versions of Oracle RAC
  • Derived systems, such as Oracle Exadata, that run supported versions of Oracle RAC
  • Oracle 11g Release 2, 19c with the Oracle 21.8.0.0 JDBC driver version
PostgreSQL PostgreSQL 9.x and later
  • PostgreSQL 9.6.9
  • PostgreSQL 10.4
  • PostgreSQL 11.7
  • PostgreSQL 12.2
  • PostgreSQL 13.0
  • PostgreSQL 14.0
  • PostgreSQL 15.0
Microsoft SQL Server
  • SQL Server 2017
  • SQL Server 2019
  • SQL Server 2017
  • SQL Server 2019

Installing the JDBC Driver

Before you use the JDBC Query executor, install the JDBC driver for the database. You cannot access the database until you install the required driver.

Note: When connecting to a PostgreSQL, Microsoft SQL Server, or MariaDB database, you do not need to install a JDBC driver. Data Collector includes the JDBC driver required for those databases.

You install the driver into the JDBC stage library, streamsets-datacollector-jdbc-lib, which includes the executor.

To use the JDBC driver with multiple stage libraries, install the driver into each stage library associated with the stages. For example, if you want to use a MySQL JDBC driver with the JDBC Lookup processor and with the MySQL Binary Log origin, you install the driver as an external library for the JDBC stage library, streamsets-datacollector-jdbc-lib, and for the MySQL Binary Log stage library, streamsets-datacollector-mysql-binlog-lib.

For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

SQL Queries

You define one or more SQL queries that the JDBC Query executor runs on the database each time it receives an event record.

The JDBC Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.

When running multiple queries for an event record, the executor skips the subsequent queries if a query fails.

You can include a subset of the functions provided with the StreamSets expression language in a query. When entering a query in the executor, press Ctrl + Space Bar to view the list of valid functions you can use.

Event Generation

The JDBC Query executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events for each successful or failed query.

JDBC Query events can be used in any logical way. For example:

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the JDBC Query executor have the following event-related record header attributes. Record header attributes are stored as String values.
Record Header Attribute Description
sdc.event.type Event type. Uses the following event types:
  • successful-query - Generated after a query successfully completes.
  • failed-query - Generated after a query fails.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
The JDBC Query executor can generate the following types of event records:
Successful query

The executor generates a successful-query event record after successfully completing a query.

Successful-query event records have the sdc.event.type record header attribute set to sucessful-query and include the following fields:
Event Field Name Description
query Query completed.
query-result Number of rows affected by query. Included if the Include Query Result Count in Events property is selected.
Failed query

The executor generates a failed-query event record after failing to complete a query.

Failed-query event records have the sdc.event.type record header attribute set to failed-query and include the following field:
Event Field Name Description
query Query attempted.

Configuring a JDBC Query Executor

Configure a JDBC Query executor as part of an event stream to run one or more database queries each time it receives an event record.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the JDBC tab, configure the following properties:
    JDBC Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor.

    For example, use the following formats for these database vendors:

    • MySQL - jdbc:mysql://<host>:<port>/<database_name>
    • Oracle - jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
    • PostgreSQL - jdbc:postgresql://<host>:<port>/<database_name>
    • SQL Server - jdbc:sqlserver://<host>:<port>;databaseName=<database_name>

    You can optionally include the user name and password in the connection string.

    For Azure Managed Identity, use the JDBC connection string provided in your Azure database connection string settings.

    SQL Queries One or more SQL queries to run each time the executor receives an event record.

    Using simple or bulk edit mode, click the Add icon to add additional queries.

    The executor processes multiple queries in order, and waits for each query to complete before continuing to the next query.

    Use Credentials Enables entering credentials on the Credentials tab. Select when you do not include credentials in the JDBC connection string.
    Include Query Result Count in Events Includes the number of rows impacted by a query in generated event records.
    Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value.

    Use the property names and values as expected by JDBC.

  3. If you configured the origin to enter JDBC credentials separately from the JDBC connection string on the JDBC tab, then configure the following properties on the Credentials tab:
    Credentials Property Description
    Username User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database.

    Password Password for the JDBC user name.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
  4. On the Resilience tab, configure the following properties:
    Resilience Property Description
    Max Connection Attempts Maximum number of attempts to establish a connection to the data source.

    Default is 3.

    Wait Duration Time in seconds to wait between attempts to establish a connection to the data source.

    Default is 10 seconds.

  5. When using JDBC versions older than 4.0, on the Legacy Drivers tab, optionally configure the following properties:
    Legacy Drivers Property Description
    JDBC Class Driver Name Class name for the JDBC driver. Required for JDBC versions older than version 4.0.
    Connection Health Test Query Optional query to test the health of a connection. Recommended only when the JDBC version is older than 4.0.
  6. On the Advanced tab, optionally configure advanced properties.
    The defaults for these properties should work in most cases:
    Advanced Property Description
    Maximum Pool Size Maximum number of connections to create.

    Default is 1. The recommended value is 1.

    Minimum Idle Connections Minimum number of connections to create and maintain. To define a fixed connection pool, set to the same value as Maximum Pool Size.

    Default is 1.

    Enable Parallel Queries

    Runs insert and delete queries in parallel to improve throughput. When enabled, the executor groups all queries from the batch into insert, delete, or other queries. The executor then runs all insert queries, followed by all other queries, and then all delete queries. It runs insert and delete queries simultaneously on all the configured connections to the database, but runs other queries serially.

    When disabled, the executor runs each query serially in the order that it arrives.

    For best performance, select Auto Commit when you select Enable Parallel Queries so that the executor commits to the database after each query.
    Note: If you configure the executor to run multiple queries in parallel on the same table in a PostgreSQL database, StreamSets recommends disabling this property. Enabling the property in this situation might cause the pipeline to fail if a query violates the transaction started in an earlier query.
    Connection Timeout (seconds) Maximum time to wait for a connection. Use a time constant in an expression to define the time increment.
    Default is 30 seconds, defined as follows:
    ${30 * SECONDS}
    Idle Timeout (seconds) Maximum time to allow a connection to idle. Use a time constant in an expression to define the time increment.

    Use 0 to avoid removing any idle connections.

    When the entered value is close to or more than the maximum lifetime for a connection, Data Collector ignores the idle timeout.

    Default is 10 minutes, defined as follows:
    ${10 * MINUTES}
    Max Connection Lifetime (seconds) Maximum lifetime for a connection. Use a time constant in an expression to define the time increment.

    Use 0 to set no maximum lifetime.

    When a maximum lifetime is set, the minimum valid value is 30 minutes.

    Default is 30 minutes, defined as follows:
    ${30 * MINUTES}
    Batch Commit Determines if the executor commits to the database after each batch.

    Default is enabled.

    Auto Commit Determines if auto-commit mode is enabled. When enabled, the executor commits to the database after each query. When disabled, the executor commits to the database after each record.

    Default is disabled.

    Transaction Isolation Transaction isolation level used to connect to the database.

    Default is the default transaction isolation level set for the database. You can override the database default by setting the level to any of the following:

    • Read committed
    • Read uncommitted
    • Repeatable read
    • Serializable
    Init Query SQL query to perform immediately after the stage connects to the database. Use to set up the database session as needed.

    The query is performed after each connection to the database. If the stage disconnects from the database during the pipeline run, for example if a network timeout occurrs, the stage performs the query again when it reconnects to the database.

    For example, in case of Oracle, the following query returns 1 to verify that the stage is connected to the database: Select 1 from dual;