JDBC Table
The JDBC Table origin reads data from a database table. Use the JDBC Table origin to process data from a database that is not natively supported.
Transformer provides database-specific origins, such as Google Big Query, Oracle JDBC Table, and Snowflake. When possible, StreamSets recommends using available database-specific origins. To read from one or more tables using a custom query, use the JDBC Query origin.
The origin can read all the columns from a table or only specified columns from a table. In each batch, the origin reads a specified number of rows, distributing those rows uniformly across the specified partitions. When reading the last row, the origin saves the value from a specified offset column. In the subsequent batch, the origin uses the offset to locate the last row read and starts reading from the next row.
When you configure the JDBC Table origin, you specify the database connection information and any additional JDBC configuration properties you want to use. You configure the table to read and optionally specify the columns to read from the table. You can specify an additional predicate to include any conditions that you would place in a WHERE clause.
You can also use a connection to configure the origin.
You define the offset column, the maximum number of rows to include in each batch, and the number of partitions used to read from the database table. You can optionally configure advanced properties related to the JDBC driver.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Before using the JDBC Table origin, verify if you need to install a JDBC driver.
Database Vendors and Drivers
The JDBC Table origin can read database data from multiple database vendors.
Database Vendor | Versions and Drivers |
---|---|
Microsoft SQL Server | SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver |
MySQL | MySQL 5.7 with the MySQL Connector/J 8.0.12 driver |
Oracle | Oracle 11g with the Oracle 19.3.0 JDBC driver |
PostgreSQL | PostgreSQL 9.6.2 with the PostgreSQL 42.2.5 driver |
Installing the JDBC Driver
- Microsoft SQL Server
- PostgreSQL
When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
- Apache Derby
- IBM DB2
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Teradata
If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
For the JDBC Table origin, Spark determines the partitioning based on the number of partitions that you configure for the origin. Spark creates one connection to the database for each partition.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
- The size and configuration of the cluster.
- The amount of data being processed.
- The number of concurrent connections that can be made to the database.
If the pipeline fails because the origin encounters an out of memory error, you likely need to increase the number of partitions for the origin.
Offset Column Requirement
The JDBC Table origin uses a single offset column. The offset column should contain unique, incremental values and should not contain null values. supported data type.
When a table includes a single primary key column, the origin uses it as the offset column, by default.
You can configure the origin to use a different offset column. You might specify an alternate offset column when the table uses a composite key or when the data type of the primary key column is not supported.
- Track processing
- The origin tracks processing using values in the offset column. When reading the last row for a batch, the origin saves the value from the offset column. In the subsequent batch, the origin starts reading from the following row.
- Create partitions
- When creating partitions, the origin determines the data to be processed and
then divides the data into partitions based on ranges of offset values.
For example, say you have rows with integer offsets from 1 to 1000 and you configure the origin to create two partitions. The first partition might include records with offsets from 1-500, and the second partition, the offsets from 501-1000.
Supported Offset Data Types
- Bigint
- Decimal or Numeric with a scale of 0
- Integer
- Smallint
Configuring a JDBC Table Origin
Configure a JDBC Table origin to read data from a database table. Before using the origin, verify if you need to install a JDBC driver.