JDBC Query
The JDBC Query executor connects through JDBC to a database and runs one or more user-defined SQL queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions.
The JDBC Query executor can commit data to the database after each batch or can commit data to the database for each record. By default, the executor commits after each batch.
When you configure the JDBC Query executor, you specify JDBC connection properties and the queries to run. You can configure custom properties that your driver requires, advanced connection properties, and the commit type to use. You can also enable the executor to run insert and delete queries in parallel to improve throughput.
To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.
You can also use a connection to configure the executor.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Database Vendors and Drivers
The JDBC Query executor can run queries on database data from multiple database vendors.
Database Vendor | Supported Versions | Tested Versions |
---|---|---|
MySQL | MySQL 5.7 and later |
|
Oracle |
Also supported:
|
|
PostgreSQL | PostgreSQL 9.x and later |
|
Microsoft SQL Server |
|
|
Installing the JDBC Driver
Before you use the JDBC Query executor, install the JDBC driver for the database. You cannot access the database until you install the required driver.
You install the driver into the JDBC stage library, streamsets-datacollector-jdbc-lib
, which includes the executor.
To use the JDBC driver
with multiple stage libraries, install the driver into each stage library associated
with the stages.
For example, if you want to use a MySQL JDBC driver with the JDBC Lookup processor
and with the MySQL Binary Log origin, you install the driver as an external library
for the JDBC stage library, streamsets-datacollector-jdbc-lib
, and for the MySQL Binary Log stage library, streamsets-datacollector-mysql-binlog-lib
.
For information about installing additional drivers, see Install External Libraries.
SQL Queries
You define one or more SQL queries that the JDBC Query executor runs on the database each time it receives an event record.
The JDBC Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.
When running multiple queries for an event record, the executor skips the subsequent queries if a query fails.
You can include a subset of the functions provided with the StreamSets expression language in a query. When entering a query in the executor, press Ctrl + Space Bar to view the list of valid functions you can use.
Event Generation
The JDBC Query executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events for each successful or failed query.
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following event types:
|
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Successful query
-
The executor generates a successful-query event record after successfully completing a query.
Successful-query event records have thesdc.event.type
record header attribute set tosucessful-query
and include the following fields:Event Field Name Description query Query completed. query-result Number of rows affected by query. Included if the Include Query Result Count in Events property is selected. - Failed query
-
The executor generates a failed-query event record after failing to complete a query.
Failed-query event records have thesdc.event.type
record header attribute set tofailed-query
and include the following field:Event Field Name Description query Query attempted.
Configuring a JDBC Query Executor
Configure a JDBC Query executor as part of an event stream to run one or more database queries each time it receives an event record.
-
In the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Produce Events Generates event records when events occur. Use for event handling. Required Fields Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses.Records that do not include all required fields are processed based on the error handling configured for the pipeline.
Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the JDBC tab, configure the following properties:
JDBC Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
To create a new connection, click the Add New Connection icon: . To view and edit the details of the selected connection, click the Edit Connection icon: .
JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor. For example, use the following formats for these database vendors:
- MySQL -
jdbc:mysql://<host>:<port>/<database_name>
- Oracle -
jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
- PostgreSQL -
jdbc:postgresql://<host>:<port>/<database_name>
- SQL Server -
jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
You can optionally include the user name and password in the connection string.
For Azure Managed Identity, use the JDBC connection string provided in your Azure database connection string settings.
SQL Queries One or more SQL queries to run each time the executor receives an event record. Using simple or bulk edit mode, click the Add icon to add additional queries.
The executor processes multiple queries in order, and waits for each query to complete before continuing to the next query.
Use Credentials Enables entering credentials on the Credentials tab. Select when you do not include credentials in the JDBC connection string. Include Query Result Count in Events Includes the number of rows impacted by a query in generated event records. Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value. Use the property names and values as expected by JDBC.
- MySQL -
-
If you configured the origin to enter JDBC credentials
separately from the JDBC connection string on the JDBC tab,
then configure the following properties on the Credentials
tab:
Credentials Property Description Username User name for the JDBC connection. The user account must have the correct permissions or privileges in the database.
Password Password for the JDBC user name. Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. -
On the Resilience tab, configure the following
properties:
Resilience Property Description Max Connection Attempts Maximum number of attempts to establish a connection to the data source. Default is 3.
Wait Duration Time in seconds to wait between attempts to establish a connection to the data source. Default is 10 seconds.
-
When using JDBC versions older than 4.0, on the Legacy
Drivers tab, optionally configure the following
properties:
Legacy Drivers Property Description JDBC Class Driver Name Class name for the JDBC driver. Required for JDBC versions older than version 4.0. Connection Health Test Query Optional query to test the health of a connection. Recommended only when the JDBC version is older than 4.0. -
On the Advanced tab, optionally configure advanced
properties.
The defaults for these properties should work in most cases:
Advanced Property Description Maximum Pool Size Maximum number of connections to create. Default is 1. The recommended value is 1.
Minimum Idle Connections Minimum number of connections to create and maintain. To define a fixed connection pool, set to the same value as Maximum Pool Size. Default is 1.
Enable Parallel Queries Runs insert and delete queries in parallel to improve throughput. When enabled, the executor groups all queries from the batch into insert, delete, or other queries. The executor then runs all insert queries, followed by all other queries, and then all delete queries. It runs insert and delete queries simultaneously on all the configured connections to the database, but runs other queries serially.
When disabled, the executor runs each query serially in the order that it arrives.
For best performance, select Auto Commit when you select Enable Parallel Queries so that the executor commits to the database after each query.Note: If you configure the executor to run multiple queries in parallel on the same table in a PostgreSQL database, StreamSets recommends disabling this property. Enabling the property in this situation might cause the pipeline to fail if a query violates the transaction started in an earlier query.Connection Timeout (seconds) Maximum time to wait for a connection. Use a time constant in an expression to define the time increment. Default is 30 seconds, defined as follows:${30 * SECONDS}
Idle Timeout (seconds) Maximum time to allow a connection to idle. Use a time constant in an expression to define the time increment. Use 0 to avoid removing any idle connections.
When the entered value is close to or more than the maximum lifetime for a connection, Data Collector ignores the idle timeout.
Default is 10 minutes, defined as follows:${10 * MINUTES}
Max Connection Lifetime (seconds) Maximum lifetime for a connection. Use a time constant in an expression to define the time increment. Use 0 to set no maximum lifetime.
When a maximum lifetime is set, the minimum valid value is 30 minutes.
Default is 30 minutes, defined as follows:${30 * MINUTES}
Batch Commit Determines if the executor commits to the database after each batch. Default is enabled.
Auto Commit Determines if auto-commit mode is enabled. When enabled, the executor commits to the database after each query. When disabled, the executor commits to the database after each record. Default is disabled.
Transaction Isolation Transaction isolation level used to connect to the database. Default is the default transaction isolation level set for the database. You can override the database default by setting the level to any of the following:
- Read committed
- Read uncommitted
- Repeatable read
- Serializable
Init Query SQL query to perform immediately after the stage connects to the database. Use to set up the database session as needed. The query is performed after each connection to the database. If the stage disconnects from the database during the pipeline run, for example if a network timeout occurrs, the stage performs the query again when it reconnects to the database.
For example, in case of Oracle, the following query returns 1 to verify that the stage is connected to the database:
Select 1 from dual;