JDBC Lookup

The JDBC Lookup processor performs a lookup on a database table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found.
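
For illustration, the four lookup behaviors map to familiar SQL queries. The following is a minimal sketch in plain Java JDBC, assuming a hypothetical products table and PostgreSQL connection details; in a pipeline, the processor runs the lookup for you:

  import java.sql.*;

  public class LookupBehaviorsSketch {
      public static void main(String[] args) throws SQLException {
          // Hypothetical connection details, table, and columns.
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://dbhost:5432/sales", "user", "password");
               Statement stmt = conn.createStatement()) {

              // Return the first matching row.
              try (ResultSet rs = stmt.executeQuery(
                      "SELECT name, price FROM products WHERE sku = 'A100' LIMIT 1")) {
                  if (rs.next()) System.out.println(rs.getString("name"));
              }

              // Return all matching rows, generating a record for each.
              try (ResultSet rs = stmt.executeQuery(
                      "SELECT name, price FROM products WHERE category = 'widgets'")) {
                  while (rs.next()) System.out.println(rs.getString("name"));
              }

              // Return a count of matching rows.
              try (ResultSet rs = stmt.executeQuery(
                      "SELECT COUNT(*) AS c FROM products WHERE category = 'widgets'")) {
                  if (rs.next()) System.out.println(rs.getLong("c"));
              }

              // Return true if any match exists, otherwise false.
              try (ResultSet rs = stmt.executeQuery(
                      "SELECT EXISTS(SELECT 1 FROM products WHERE sku = 'A100') AS found")) {
                  if (rs.next()) System.out.println(rs.getBoolean("found"));
              }
          }
      }
  }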

When you configure the JDBC Lookup processor, you specify database connection information, the name of the lookup table, and any additional JDBC configuration properties that you want to use. You can optionally configure advanced properties related to the JDBC driver.

You can also use a connection to configure the processor.

You configure the record field to use, the table column to match against, and the operator to use for the comparison. You then select the information to return and configure related properties.

When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.

When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.

If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.

If you do not load the data only once and the processor passes data to multiple stages, you might enable caching to improve pipeline performance.
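
Conceptually, loading lookup data only once amounts to reading the table into memory a single time and reusing it for every subsequent batch. A minimal sketch of the idea, assuming a hypothetical products table:

  import java.sql.*;
  import java.util.HashMap;
  import java.util.Map;

  public class StaticLookupCache {
      private static Map<String, String> cache; // loaded once, reused for every batch

      // Hypothetical: read the static lookup table a single time.
      static synchronized Map<String, String> lookupTable(Connection conn) throws SQLException {
          if (cache == null) {
              cache = new HashMap<>();
              try (Statement stmt = conn.createStatement();
                   ResultSet rs = stmt.executeQuery("SELECT sku, name FROM products")) {
                  while (rs.next()) {
                      cache.put(rs.getString("sku"), rs.getString("name"));
                  }
              }
          }
          return cache; // later batches skip the database read entirely
      }
  }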

Before using the JDBC Lookup processor, verify whether you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC Lookup processor can perform lookups against databases from multiple vendors.

StreamSets has tested this processor with the following database vendors, versions, and JDBC drivers:
  • Microsoft SQL Server - SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver
  • MySQL - MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
  • Oracle - Oracle 11g with the Oracle 19.3.0 JDBC driver
  • PostgreSQL - PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver

Installing the JDBC Driver

The JDBC Lookup processor includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
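
As a rough sketch of how detection from the connection string can work, the widely published default driver class names for these vendors map directly to the JDBC URL subprotocol. The mapping below uses those public class names for illustration; it is not necessarily the exact list the stage uses internally:

  import java.util.Map;

  public class DriverClassDetector {
      // Widely published default driver class names, keyed by URL prefix.
      private static final Map<String, String> KNOWN_DRIVERS = Map.of(
          "jdbc:derby:",      "org.apache.derby.jdbc.ClientDriver",
          "jdbc:db2:",        "com.ibm.db2.jcc.DB2Driver",
          "jdbc:sqlserver:",  "com.microsoft.sqlserver.jdbc.SQLServerDriver",
          "jdbc:mysql:",      "com.mysql.cj.jdbc.Driver",
          "jdbc:oracle:",     "oracle.jdbc.OracleDriver",
          "jdbc:postgresql:", "org.postgresql.Driver",
          "jdbc:teradata:",   "com.teradata.jdbc.TeraDriver");

      static String detect(String connectionString) {
          return KNOWN_DRIVERS.entrySet().stream()
              .filter(e -> connectionString.startsWith(e.getKey()))
              .map(Map.Entry::getValue)
              .findFirst()
              .orElseThrow(() -> new IllegalArgumentException(
                  "Unknown vendor: specify the driver class name explicitly"));
      }
  }

For any other subprotocol, no match is found, which is why a custom driver requires an explicit class name.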

Configuring a JDBC Lookup Processor

Configure a JDBC Lookup processor to perform lookups on a database table. Before using the processor, verify whether you need to install a JDBC driver.

  1. On the Properties panel, on the General tab, configure the following properties:
    Name - Stage name.
    Description - Optional description.
  2. On the Lookup tab, configure the following properties:
    Lookup Behavior - Lookup task to perform:
    • Return the first matching row
    • Return all matching rows, generating a record for each
    • Return a count of matching rows
    • Return true if matches exist, otherwise false
    Lookup Keys - Lookup keys to use to find matching records. Specify the following information:
    • Lookup Field - Field in the record to use for the lookup.
    • Lookup Column - Column in the table to use for the lookup.
    • Operator - Spark SQL operator to use for the lookup.

    Record field names must match the lookup key column names exactly. This property is case sensitive.

    Click the Add icon to configure additional keys. You can use simple or bulk edit mode to configure the keys.

    Return Columns - Columns in the table to return and include in the matching record. Click the Add icon to specify additional columns. You can use simple or bulk edit mode to configure the columns.

    This property is case sensitive.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Add Prefix to Column Names - Adds the specified prefix to the returned columns before adding fields to records. Use when the original record might include a field with the same name.

    Using this property is strongly recommended.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Prefix - Prefix to add to the returned columns.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Sorting - Sorts the matching rows in the specified order to determine how records are generated. Specify the following information:
    • Column to Sort - Column to use for the sort.
    • Sort Order - Sort order to use: ascending or descending.

    Click the Add icon to add additional sort columns. You can use simple or bulk edit mode to configure the columns.

    Columns are sorted in the configured order.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Max Rows - Maximum number of rows to return. The processor generates a record for each row.

    Available when Lookup Behavior is set to return all matching rows.

    Target Field - Name of the field to store the results of the lookup.

    Available when Lookup Behavior is set to return a count of matching rows or to return true if a match exists.

    Load Lookup Data Only Once - Reads the lookup table in a single batch and caches the results for reuse.

    Use when data in the lookup table is not expected to change.

    This property is ignored in batch execution mode.

    Cache Data - Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
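
    Taken together, the lookup keys, operator, return columns, sorting, and maximum rows correspond to a query along the following lines. This is a minimal sketch in plain Java JDBC, assuming a hypothetical products table with sku, name, and price columns; the processor builds and runs the actual query through Spark:

      import java.sql.*;
      import java.util.ArrayList;
      import java.util.List;

      public class LookupQuerySketch {
          // Hypothetical: key column "sku" with operator "=", return columns
          // "name" and "price", sorted by price descending, at most 10 rows.
          static List<String> lookup(Connection conn, String skuValue) throws SQLException {
              String query = "SELECT name, price FROM products"
                  + " WHERE sku = ?"            // Lookup Column + Operator + Lookup Field value
                  + " ORDER BY price DESC"      // Sorting: Column to Sort + Sort Order
                  + " LIMIT 10";                // Max Rows
              List<String> names = new ArrayList<>();
              try (PreparedStatement ps = conn.prepareStatement(query)) {
                  ps.setString(1, skuValue);    // value taken from the record's lookup field
                  try (ResultSet rs = ps.executeQuery()) {
                      while (rs.next()) {
                          names.add(rs.getString("name")); // one record per returned row
                      }
                  }
              }
              return names;
          }
      }

    In the processor, each returned row becomes a record, with the optional prefix added to the name and price field names.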

  3. On the Connection tab, configure the following properties:
    Connection - Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    JDBC Connection String - Connection string used to connect to the database. Use the connection string format required by the database vendor.

    For example, use the following formats for these database vendors:

    • MySQL - jdbc:mysql://<host>:<port>/<database_name>
    • Oracle - jdbc:oracle:<driver_type>:@<host>:<port>:<service_name>
    • PostgreSQL - jdbc:postgresql://<host>:<port>/<database_name>
    • SQL Server - jdbc:sqlserver://<host>:<port>;databaseName=<database_name>
    • SQL Server 2019 Big Data Cluster accessed from a Transformer pipeline - jdbc:sqlserver://<ip>:<port>;databaseName=<database_name>

      Where <ip> is the IP address of the SQL Server master instance, and <port> is the port number of the SQL Server master instance. Use the SQL Server user name and password with this connection string. For more information, see SQL Server 2019 JDBC Connection Information.

      You can optionally include the user name and password in the connection string.

    Use Credentials - Enables entering credentials. Select when you do not include credentials in the JDBC connection string.
    Username - User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database.

    Password - Password for the JDBC user name.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Additional JDBC Configuration Properties - Additional JDBC configuration properties to use.

    To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by JDBC.
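
    For illustration, the connection string, credentials, and additional JDBC configuration properties combine just as they would in a direct JDBC call. A minimal sketch, assuming a hypothetical PostgreSQL database and using connectTimeout as an example additional property:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.SQLException;
      import java.util.Properties;

      public class ConnectSketch {
          public static void main(String[] args) throws SQLException {
              Properties props = new Properties();
              props.setProperty("user", "lookup_user");   // Username
              props.setProperty("password", "secret");    // Password
              props.setProperty("connectTimeout", "10");  // additional JDBC property

              // Vendor-specific connection string, as entered on the Connection tab.
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:postgresql://dbhost:5432/sales", props)) {
                  System.out.println("Connected: " + !conn.isClosed());
              }
          }
      }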

  4. On the Table tab, configure the following properties:
    Schema - Name of the schema in which the table is located. Define when the database requires a schema.
    Table - Database table to use for the lookup.
  5. On the Advanced tab, optionally configure advanced properties.
    The defaults for these properties should work in most cases:
    JDBC Driver - JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name - Class name of the custom JDBC driver to use.
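
    For example, a custom driver class name is the fully qualified name of the vendor class that implements java.sql.Driver. The class name and connection string below are hypothetical placeholders, shown only to illustrate what the property expects:

      import java.sql.Connection;
      import java.sql.DriverManager;

      public class CustomDriverCheck {
          public static void main(String[] args) throws Exception {
              // Hypothetical class name, as entered in the Driver Class Name property.
              // Loading it verifies that the driver is on the classpath; JDBC 4 drivers
              // also self-register with DriverManager when loaded.
              Class.forName("com.example.jdbc.CustomDriver");
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:custom://dbhost:1234/sales", "user", "password")) {
                  System.out.println(conn.getMetaData().getDriverName());
              }
          }
      }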