JDBC Query

The JDBC Query executor connects through JDBC to a database and runs one or more user-defined SQL queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Use the JDBC Query executor as part of an event stream in the pipeline. For example, you might use the JDBC Query executor to execute a stored procedure in the database when the pipeline generates a pipeline stop event.

Important: This stage does not support connecting to non-RDBMS systems, including Hive, Impala, Kudu, or Snowflake. Support for untested systems is not guaranteed. For a list of tested systems, see "Database Vendors and Drivers". Use the Hive Query executor to run queries on Hive or Impala.

The JDBC Query executor can commit data to the database after each batch or can commit data to the database for each record. By default, the executor commits after each batch.

When you configure the JDBC Query executor, you specify JDBC connection properties and the queries to run. You can configure custom properties that your driver requires, advanced connection properties, and the commit type to use. You can also enable the executor to run insert and delete queries in parallel to improve throughput.

To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.

You can also use a connection to configure the executor.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Database Vendors and Drivers

The JDBC Query executor can run queries on database data from multiple database vendors.

The following table lists the supported and tested database versions for this stage. You can use the stage with other JDBC-compliant databases, but full support is not guaranteed. For a full list of supported versions, see Supported Systems and Versions in the Data Collector documentation.


Database Vendor	Supported Versions	Tested Versions
MySQL	MySQL 5.7 and later	MySQL 5.7 with the MySQL Connector/J 8.0.12 driver MySQL 8.0 with the MySQL Connector/J 8.0.12 driver
Oracle	Oracle 11g Release 2, 12c, 18c, 19c, 21c Oracle Real Application Clusters (RAC) 11g Release 2, 12c, 18c, 19c, 21c Also supported: Hosted systems, such as Amazon RDS, that run supported versions of Oracle RAC Derived systems, such as Oracle Exadata, that run supported versions of Oracle RAC	Oracle 11g Release 2, 19c with the Oracle 21.8.0.0 JDBC driver version
PostgreSQL	PostgreSQL 9.x and later	PostgreSQL 9.6.9 PostgreSQL 10.4 PostgreSQL 11.7 PostgreSQL 12.2 PostgreSQL 13.0 PostgreSQL 14.0 PostgreSQL 15.0
Microsoft SQL Server	SQL Server 2017 SQL Server 2019	SQL Server 2017 SQL Server 2019

Installing the JDBC Driver

Before you use the JDBC Query executor, install the JDBC driver for the database. You cannot access the database until you install the required driver.

Note: When connecting to a PostgreSQL, Microsoft SQL Server, or MariaDB database, you do not need to install a JDBC driver. Data Collector includes the JDBC driver required for those databases.

You install the driver into the JDBC stage library, streamsets-datacollector-jdbc-lib, which includes the executor.

To use the JDBC driver with multiple stage libraries, install the driver into each stage library associated with the stages. For example, if you want to use a MySQL JDBC driver with the JDBC Lookup processor and with the MySQL Binary Log origin, you install the driver as an external library for the JDBC stage library, streamsets-datacollector-jdbc-lib, and for the MySQL Binary Log stage library, streamsets-datacollector-mysql-binlog-lib.

For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

SQL Queries

You define one or more SQL queries that the JDBC Query executor runs on the database each time it receives an event record.

The JDBC Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.

When running multiple queries for an event record, the executor skips the subsequent queries if a query fails.

You can include a subset of the functions provided with the StreamSets expression language in a query. When entering a query in the executor, press Ctrl + Space Bar to view the list of valid functions you can use.

Event Generation

The JDBC Query executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events for each successful or failed query.

JDBC Query events can be used in any logical way. For example:

With the Email executor to send a custom email after receiving an event.
For an example, see Sending Email During Pipeline Processing.
With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the JDBC Query executor have the following event-related record header attributes. Record header attributes are stored as String values.


Record Header Attribute	Description
sdc.event.type	Event type. Uses the following event types: successful-query - Generated after a query successfully completes. failed-query - Generated after a query fails.
sdc.event.version	Integer that indicates the version of the event record type.
sdc.event.creation_timestamp	Epoch timestamp when the stage created the event.

The JDBC Query executor can generate the following types of event records:

Successful query

The executor generates a successful-query event record after successfully completing a query.

Successful-query event records have the sdc.event.type record header attribute set to sucessful-query and include the following fields:


Event Field Name	Description
query	Query completed.
query-result	Number of rows affected by query. Included if the Include Query Result Count in Events property is selected.

Failed query

The executor generates a failed-query event record after failing to complete a query.

Failed-query event records have the sdc.event.type record header attribute set to failed-query and include the following field:


Event Field Name	Description
query	Query attempted.

Configuring a JDBC Query Executor

Configure a JDBC Query executor as part of an event stream to run one or more database queries each time it receives an event record.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Produce Events	Generates event records when events occur. Use for event handling.
Required Fields	Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the pipeline.
Preconditions	Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
On Record Error	Error record handling for the stage: Discard - Discards the record. Send to Error - Sends the record to the pipeline for error handling. Stop Pipeline - Stops the pipeline.

On the JDBC tab, configure the following properties:


JDBC Property	Description
Connection	Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
JDBC Connection String	Connection string used to connect to the database. Use the connection string format required by the database vendor. For example, use the following formats for these database vendors: MySQL - `jdbc:mysql://<host>:<port>/<database_name>` Oracle - `jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>` PostgreSQL - `jdbc:postgresql://<host>:<port>/<database_name>` SQL Server - `jdbc:sqlserver://<host>:<port>;databaseName=<database_name>` You can optionally include the user name and password in the connection string. For Azure Managed Identity, use the JDBC connection string provided in your Azure database connection string settings.
SQL Queries	One or more SQL queries to run each time the executor receives an event record. Using simple or bulk edit mode, click the Add icon to add additional queries. The executor processes multiple queries in order, and waits for each query to complete before continuing to the next query.
Use Credentials	Enables entering credentials on the Credentials tab. Select when you do not include credentials in the JDBC connection string.
Include Query Result Count in Events	Includes the number of rows impacted by a query in generated event records.
Additional JDBC Configuration Properties	Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value. Use the property names and values as expected by JDBC.

If you configured the origin to enter JDBC credentials separately from the JDBC connection string on the JDBC tab, then configure the following properties on the Credentials tab:


Credentials Property	Description
Username	User name for the JDBC connection. The user account must have the correct permissions or privileges in the database.
Password	Password for the JDBC user name. Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.

On the Resilience tab, configure the following properties:


Resilience Property	Description
Max Connection Attempts	Maximum number of attempts to establish a connection to the data source. Default is 3.
Wait Duration	Time in seconds to wait between attempts to establish a connection to the data source. Default is 10 seconds.

When using JDBC versions older than 4.0, on the Legacy Drivers tab, optionally configure the following properties:


Legacy Drivers Property	Description
JDBC Class Driver Name	Class name for the JDBC driver. Required for JDBC versions older than version 4.0.
Connection Health Test Query	Optional query to test the health of a connection. Recommended only when the JDBC version is older than 4.0.

On the Advanced tab, optionally configure advanced properties.

The defaults for these properties should work in most cases:


Advanced Property	Description
Maximum Pool Size	Maximum number of connections to create. Default is 1. The recommended value is 1.
Minimum Idle Connections	Minimum number of connections to create and maintain. To define a fixed connection pool, set to the same value as Maximum Pool Size. Default is 1.
Enable Parallel Queries	Runs insert and delete queries in parallel to improve throughput. When enabled, the executor groups all queries from the batch into insert, delete, or other queries. The executor then runs all insert queries, followed by all other queries, and then all delete queries. It runs insert and delete queries simultaneously on all the configured connections to the database, but runs other queries serially. When disabled, the executor runs each query serially in the order that it arrives. For best performance, select Auto Commit when you select Enable Parallel Queries so that the executor commits to the database after each query. Note: If you configure the executor to run multiple queries in parallel on the same table in a PostgreSQL database, disable this property. Enabling the property in this situation might cause the pipeline to fail if a query violates the transaction started in an earlier query.
Connection Timeout (seconds)	Maximum time to wait for a connection. Use a time constant in an expression to define the time increment. Default is 30 seconds, defined as follows: `${30 * SECONDS}`
Idle Timeout (seconds)	Maximum time to allow a connection to idle. Use a time constant in an expression to define the time increment. Use 0 to avoid removing any idle connections. When the entered value is close to or more than the maximum lifetime for a connection, Data Collector ignores the idle timeout. Default is 10 minutes, defined as follows: `${10 * MINUTES}`
Max Connection Lifetime (seconds)	Maximum lifetime for a connection. Use a time constant in an expression to define the time increment. Use 0 to set no maximum lifetime. When a maximum lifetime is set, the minimum valid value is 30 minutes. Default is 30 minutes, defined as follows: `${30 * MINUTES}`
Batch Commit	Determines if the executor commits to the database after each batch. Default is enabled.
Auto Commit	Determines if auto-commit mode is enabled. When enabled, the executor commits to the database after each query. When disabled, the executor commits to the database after each record. Default is disabled.
Transaction Isolation	Transaction isolation level used to connect to the database. Default is the default transaction isolation level set for the database. You can override the database default by setting the level to any of the following: Read committed Read uncommitted Repeatable read Serializable
Init Query	SQL query to perform immediately after the stage connects to the database. Use to set up the database session as needed. The query is performed after each connection to the database. If the stage disconnects from the database during the pipeline run, for example if a network timeout occurrs, the stage performs the query again when it reconnects to the database. For example, in case of Oracle, the following query returns 1 to verify that the stage is connected to the database: `Select 1 from dual;`