JDBC Query
The JDBC Query origin reads data from a database based on the specified query.
Use the JDBC Query origin in batch pipelines only. In a batch pipeline, the origin reads all available data with the specified query, then stops the pipeline. The origin does not support streaming execution mode and does not save offsets or support partitioning.
When you configure the JDBC Query origin, you specify the database connection information and any additional JDBC configuration properties you want to use. You can also use a connection to configure the origin.
You specify the SQL query to use for the read. You can specify a separate query to use when previewing data for pipeline development and testing. You can define a fetch size that is used by both queries.
You can configure the origin to cache the data for reuse throughout the pipeline run. You can also specify the JDBC driver to include with the pipeline.
Before using the JDBC Query origin, verify if you need to install a JDBC driver.
Database Vendors and Drivers
The JDBC Query origin can read database data from multiple database vendors.
Database Vendor | Versions and Drivers |
---|---|
Microsoft SQL Server | SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver |
MySQL | MySQL 5.7 with the MySQL Connector/J 8.0.12 driver |
Oracle | Oracle 11g with the Oracle 19.3.0 JDBC driver |
PostgreSQL | PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver |
Installing the JDBC Driver
- Microsoft SQL Server
- PostgreSQL
When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
- Apache Derby
- IBM DB2
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Teradata
If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
SQL Queries
- Pipeline run query
- The pipeline run query defines the data that is read from the database and passed to the pipeline for processing.
- Preview query
- The preview query defines the data that is read from the database when you preview a pipeline. Use the preview query to provide a smaller data set for pipeline development and testing.
Configuring a JDBC Query Origin
Configure a JDBC Query origin to read data from a database table. Before using the origin, verify if you need to install a JDBC driver.
-
On the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Cache Data Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
-
On the JDBC tab, configure the following properties:
JDBC Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
To create a new connection, click the Add New Connection icon: . To view and edit the details of the selected connection, click the Edit Connection icon: .
JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database vendor. For example, use the following formats for these database vendors:
- MySQL -
jdbc:mysql://<host>:<port>/<database_name>
- Oracle -
jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
- PostgreSQL -
jdbc:postgresql://<host>:<port>/<database_name>
- SQL Server -
jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
- SQL Server 2019 Big Data Cluster
accessed from a Transformer pipeline -
jdbc:sqlserver://<ip>:<port>;databaseName=<database_name>
Where
<ip>
is the IP address of the SQL Server master instance, and<port>
is the port number of the SQL Server master instance. Use the SQL Server user name and password with this connection string. For more information, see SQL Server 2019 JDBC Connection Information.You can optionally include the user name and password in the connection string.
Use Credentials Enables entering credentials on the Credentials tab. Select when you do not include credentials in the JDBC connection string. Query SQL query that provides data for pipeline processing. Use any SQL query that is valid for the database.
Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.
Use the property names and values as expected by JDBC.
- MySQL -
-
On the Credentials tab, optionally configure the following
properties:
Credentials Property Description Username User name for the JDBC connection. The user account must have the correct permissions or privileges in the database. For example, the user account specified for an origin must be able to read data from the database. The user account specified for a destination must be able to write to the database.
Password Password for the JDBC user name. -
On the Advanced tab, optionally configure the following
properties:
Table Property Description Specify Preview Query Allows specifying a query to provide preview data. Preview Query Query to provide preview data. Used only when you preview the pipeline. Specify Fetch Size Enables defining a fetch size to use for all SQL queries. Fetch Size Suggests the number of rows that the JDBC driver should fetch with each database round-trip. The specified fetch size is used for both the pipeline and preview queries. Use 0 to skip defining a fetch size.
For more information about configuring a fetch size, see the database documentation.
JDBC Driver JDBC driver to include with the pipeline: - Bundle driver from connection string
- Transformer bundles a JDBC driver with the pipeline.
Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.
- Bundle custom driver - Transformer bundles the specified driver with the pipeline.
Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.
- Do not bundle driver - Transformer does not include a driver with the pipeline.
Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.
Default is to bundle the driver from the JDBC connection string.
Driver Class Name Class name of the custom JDBC driver to use. - Bundle driver from connection string
- Transformer bundles a JDBC driver with the pipeline.