Groovy Scripting

Data Collector

The Groovy Scripting origin runs a Groovy script to create Data Collector records. The origin supports Groovy versions 2.4 and 4.0.

The script runs for the duration of the pipeline. The origin can support a complex multithreaded script or a simple single-threaded script. The script can act on script parameters configured in the stage. At a minimum, the script must perform the following steps:

Create threads if supporting multithreaded processing
Create batches
Create records
Add the records to a batch
Process the batch
Stop when the pipeline stops
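
The steps above can be sketched as a simple single-threaded loop. This is a minimal illustration, not the origin's actual API: SketchContext and SketchBatch are hypothetical stand-ins for the objects that the origin supplies to a script, and the "stop" condition is simulated by a batch limit so that the example terminates on its own. The sample code provided with the origin shows the real object and method names.

```groovy
// Hypothetical stand-in for a batch: it only collects records.
class SketchBatch {
    List<Map> records = []
    void add(Map record) { records << record }
}

// Hypothetical stand-in for the context that the origin hands to the script.
class SketchContext {
    int batchSize = 3
    int maxBatches = 2                       // simulates the user stopping the pipeline
    List<List<Map>> processed = []
    Map<String, String> offsets = [:]

    boolean isStopped() { processed.size() >= maxBatches }
    SketchBatch createBatch() { new SketchBatch() }
    Map createRecord(String id) { [id: id, value: null] }

    // Processing a batch also saves the offset for the given entity.
    void process(SketchBatch batch, String entity, String offset) {
        processed << batch.records
        offsets[entity] = offset
    }
}

def ctx = new SketchContext()
long counter = 0

// Stop when the pipeline stops.
while (!ctx.isStopped()) {
    def batch = ctx.createBatch()                              // create a batch
    ctx.batchSize.times {
        def record = ctx.createRecord("generated::${counter}") // create a record
        record.value = counter
        batch.add(record)                                      // add it to the batch
        counter++
    }
    ctx.process(batch, 'single', counter.toString())           // process the batch
}

println "batches=${ctx.processed.size()} offset=${ctx.offsets['single']}"
```

A multithreaded script would run a loop like this one in each thread, with a different entity per thread.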

The script must handle all necessary processing, such as generating events, sending errors for handling, and stopping when users stop the pipeline or when there is no more data. You can call external Java code from the script.

To handle restarts, the script must maintain an offset that tracks where the origin stopped and where it should resume. For the offset, the script requires a key, called an entity, associated with a unique value. For multithreaded processing, the entity must identify the partition of data processed by each thread. The method that processes batches saves an offset value for each entity.

For example, suppose your script processes data about U.S. states, using an API to read data with a URL of the form ../<state>&page=<number>. In the script, each thread reads data from one state until finished with that state. You can set the entity to the state and the offset to the page number.
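
As a rough sketch of this state and page example, the following script keeps one offset per entity and resumes each state from the page after its saved offset. Everything here is assumed for illustration: pagesPerState stands in for the paged API, lastOffsets stands in for the offsets saved by a previous run, and the in-memory maps replace the origin's own offset handling.

```groovy
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the paged API: the number of pages available per state.
def pagesPerState = [alabama: 3, alaska: 2]

// Hypothetical offsets saved by an earlier run: entity (state) -> last page processed.
def lastOffsets = [alabama: '1']

def savedOffsets = new ConcurrentHashMap<String, String>(lastOffsets)
def pagesRead = new ConcurrentHashMap<String, List<Integer>>()

// One thread per state, so each entity identifies the data handled by one thread.
def threads = pagesPerState.collect { state, pageCount ->
    Thread.start {
        // Resume after the saved page, or start at page 1 when no offset exists.
        int startPage = Integer.parseInt(savedOffsets.getOrDefault(state, '0')) + 1
        def read = []
        for (int page = startPage; page <= pageCount; page++) {
            read << page                           // stands in for reading ../<state>&page=<number>
            savedOffsets[state] = page.toString()  // save the offset for this entity
        }
        pagesRead[state] = read
    }
}
threads*.join()

println "pages read: ${pagesRead}"
println "offsets:    ${savedOffsets}"
```

Because each thread writes only to its own entity, a restart can pick up every state independently of how far the other states had progressed.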

You can reset the origin to process all available data.

The origin provides extensive sample code that you can use to develop your script.

When configuring the origin, you enter the script and the inputs required, including the batch size and number of threads, along with any script parameters used in the script.