JDBC Lookup

The JDBC Lookup processor performs a lookup on a database table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found.

When you configure the JDBC Lookup processor, you specify database connection information, the name of the lookup table, and any additional JDBC configuration properties that you want to use. You can optionally configure advanced properties related to the JDBC driver.

You configure the record field to use and the table column to match against. You also specify the operator to use. You select the information to return, then configure related properties.

When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.

When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.

If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.

If not loading only once, and if the processor passes data to multiple stages, you might enable caching to improve pipeline performance.

Before using the JDBC Lookup processor, verify if you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC Lookup processor can return data from multiple database vendors.

StreamSets has tested this processor with the following database vendors, versions, and JDBC drivers:


Database Vendor	Versions and Drivers
Microsoft SQL Server	SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver
MySQL	MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
Oracle	Oracle 11g with the Oracle 19.3.0 JDBC driver
PostgreSQL	PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver

Installing the JDBC Driver

The JDBC Lookup processor includes drivers for the following databases:

Microsoft SQL Server
PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:

Apache Derby
IBM DB2
Microsoft SQL Server
MySQL
Oracle
PostgreSQL
Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.

Configuring a JDBC Lookup Processor

Configure a JDBC Lookup processor to perform lookups on a database table. Before using the processor, verify if you need to install a JDBC driver.

On the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.

On the Lookup tab, configure the following properties:


Lookup Property	Description
Lookup Behavior	Lookup task to perform: Return the first matching row Return all matching rows, generating a record for each Return a count of matching rows Return true if matches exist, otherwise false
Lookup Keys	Lookup keys to use to find matching records. Specify the following information: Lookup Field - Field in the record to use for the lookup. Lookup Column - Column in the table to use for the lookup. Operator - Spark SQL operator to use for the lookup. Record field names must match the lookup key column names exactly. This property is case sensitive. Click the Add icon to configure additional keys. You can use simple or bulk edit mode to configure the keys.
Return Columns	Columns in the table to return and include in the matching record. Click the Add icon to specify additional columns. You can use simple or bulk edit mode to configure the columns. This property is case sensitive. Available when Lookup Behavior is set to return the first matching row or all matching rows.
Add Prefix to Column Names	Adds the specified prefix to the returned columns before adding fields to records. Use when the original record might include a field with the same name. Using this property is strongly recommended. Available when Lookup Behavior is set to return the first matching row or all matching rows.
Prefix	Prefix to add to the returned columns. Available when Lookup Behavior is set to return the first matching row or all matching rows.
Sorting	Sorts the matching rows in the specified order to determine how records are generated. Specify the following information: Column to Sort - Column to use for the sort. Sort Order - Sort order to use: ascending or descending. Click the Add icon to add additional sort columns. You can use simple or bulk edit mode to configure the columns. Columns are sorted in the configured order. Available when Lookup Behavior is set to return the first matching row or all matching rows.
Max Rows	Maximum number of rows to return. The processor generates a record for each row. Available when Lookup Behavior is set to return all matching rows.
Target Field	Name of the field to store the results of the lookup. Available when Lookup behavior is set to perform a count of matching rows or to return true if a match exists.
Load Lookup Data Only Once	Reads the lookup table in a single batch and caches the results for reuse. Use when data in the lookup table is not expected to change. This property is ignored in batch execution mode.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

On the Connection tab, configure the following properties:


Connection Property	Description
JDBC Connection String	Connection string used to connect to the database. Use the connection string format required by the database vendor. For example, use the following formats for these database vendors: MySQL - `jdbc:mysql://<host>:<port>/<database_name>` Oracle - `jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>` PostgreSQL - `jdbc:postgresql://<host>:<port>/<database_name>` SQL Server - `jdbc:sqlserver://<host>:<port>;databaseName=<database_name>`
Use Credentials	Enables entering credentials. Select when you do not include credentials in the JDBC connection string.
Username	User name for the JDBC connection. The user account must have the correct permissions or privileges in the database.
Password	Password for the JDBC user name. Tip: To secure sensitive information, you can use credential stores or runtime resources.
Additional JDBC Configuration Properties	Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by JDBC.

On the Table tab, configure the following properties:


Table Property	Description
Schema	Name of the schema in which the tables are located. Define when the database requires a schema.
Table	Database table to read.

On the Advanced tab, optionally configure advanced properties.

The default for this property should work in most cases:


Advanced Property	Description
JDBC Driver	JDBC driver to include with the pipeline: Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline. Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes. Bundle custom driver - Transformer bundles the specified driver with the pipeline. Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes. Do not bundle driver - Transformer does not include a driver with the pipeline. Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use. Default is to bundle the driver from the JDBC connection string.
Driver Class Name	Class name of the custom JDBC driver to use.