External Resources

StreamSets engines can require access to external files and libraries, depending on how you design pipelines.

For example, JDBC stages require a JDBC driver to access the database. When you use a JDBC stage, you must make the driver available as an external resource.

When you create a deployment, you specify one of the following sources for external resources:
None
Use no configured source when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.
When you use no source, external resources are defined in the engine instance rather than the deployment. When your pipelines require access to external resources, you can upload the files to the engine instance.
To use no configured source, set the External Resource Source property to None when configuring a deployment.
Archive file
Use an archive file that includes the external resources when the deployment launches multiple engine instances and when your pipelines require external resources.
When you use an external resource archive, external resources are defined in the deployment. When your pipelines require access to external resources, you extract the archive file, add the additional resources, and then compress the archive file again.
To use an external resource archive, set the External Resource Source property to Archive File when configuring a deployment.

External Resource Types

Depending on how you design pipelines, StreamSets engines can require access to the following types of external resources:
Runtime resource files
Files that define pipeline property values that are called from within a pipeline.
External libraries
External libraries required by pipeline stages. External libraries can include JDBC or JMS drivers or external Java libraries.
Custom stage libraries
Stage libraries for custom stages. For example, you might develop a custom processor to perform custom processing in a pipeline.
For more information, see Custom Stage Libraries in the Data Collector engine documentation.
Important: To use custom stage libraries, you must configure the deployment to use an external resource archive.

No Source

Configure the deployment to use no source for external resources when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.

When using no configured source for the deployment, you can upload external resources to the engine instance.
Important: To use custom stage libraries, you must configure the deployment to use an external resource archive.
The location that you use to upload external resources depends on the external resource type:
  • Runtime resource files - Use the engine details page.
  • External libraries - Use the pipeline canvas or the engine details page.
    Tip: In most cases, using the pipeline canvas is simpler because the stage library that requires access to the external library is automatically selected for you.

Uploading Resources from the Pipeline Canvas

When using no configured source for the deployment, you can upload external libraries, such as JDBC drivers, from the pipeline canvas.

  1. In the pipeline canvas, select the stage that requires an external library.
  2. In the Properties panel, click the External Libraries tab, and then click Upload External Library.

    The stage library that includes the stage is automatically selected.

  3. Click Browse File to choose the file to upload, and then click Upload.
  4. Click Restart Engine to enable the changes.

Uploading Resources from the Engine Details Page

When using no configured source for the deployment, you can upload runtime resource files and external libraries from the engine details page.

  1. In the Navigation panel, click Set Up > Engines.
  2. Click an engine type tab.
  3. Expand the details for the engine that you want to configure.
  4. Click View Engine Configuration, and then click the External Resources tab.
  5. Select the type of resource you are uploading:
    • Resources
    • External libraries
  6. Click the Upload icon.
  7. If uploading an external library, select the stage library that needs to access the external library.
    For example, if you are installing a JDBC driver for the JDBC Multitable Consumer origin, select the JDBC stage library. If you are installing an external Java library for the Groovy Evaluator processor, select the Groovy stage library.
    Tip: If you upload external libraries from the pipeline canvas, the required stage library is automatically selected for you.
  8. Click Browse File to choose the file to upload, and then click Upload.
  9. Click Restart Engine to enable the changes.

Removing Resources

When using no configured source for the deployment, you can remove existing runtime resource files and external libraries from the engine details page.

Remove outdated or unused runtime resource files and external libraries to prevent them from being incorrectly used when running a pipeline.

  1. In the Navigation panel, click Set Up > Engines.
  2. Click an engine type tab.
  3. Click the name of the engine that you want to configure.
  4. Click the External Resources tab, and then click the external resource type that you want to remove:
    • Resources
    • External libraries
  5. Select the resources that you want to remove, and then click the Delete icon.
  6. Click Restart Engine to enable the changes.

Archive File as the Source

Configure the deployment to use an external resource archive when a deployment launches multiple engine instances and when your pipelines require external resources.

You typically configure a deployment to use an external resource archive when you are ready to move to production, after you have finished building your pipelines and have finalized the list of external resources that your pipelines require.

You generate an archive file in the TGZ or ZIP format, using the required folder names and directory structure. You store the file in a location that is accessible to all machines running an engine instance for the deployment. Then, you edit the deployment to define the location of the archive file.

After you configure the external resource archive and restart all engine instances in the deployment, the archive file contents are extracted and copied into each engine instance.

When your pipelines require additional external resources, you extract the archive file, add the additional resources, and then compress the archive file again.

Archive Structure

An external resource archive file must use the required folder names and directory structure.

The root folder must be named externalResources and include the following directories:
resources
The resources directory must include text files created for runtime resources.
streamsets-libs-extras
The streamsets-libs-extras directory must include a subdirectory for each set of required external libraries based on the stage library name, as follows: <stage library name>/lib/
For example, external libraries used by stages included in the Data Collector JMS stage library must be included in the following subdirectory: streamsets-datacollector-jms-lib/lib/
For a list of Data Collector stage library names and the stages included in each library, see Stage Libraries in the Data Collector documentation.
user-libs
The user-libs directory must include a subdirectory for each custom stage.

If your pipelines do not use one of the external resource types, you can omit that directory. For example, if you have not developed custom stage libraries, you do not need to include the user-libs directory.

Tip: You can download a sample externalResources.tgz file by clicking the download icon next to the External Resource Location property when you configure the engine details for a deployment. Or, if you configure a deployment to use no source and upload external resources to the engine instance, Control Hub stores the resources in the externalResources folder in the engine installation directory. You can locate the folder on the engine installation machine.
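
The layout rules above can be checked before you store an archive. The following Python sketch is illustrative only: the function name check_layout is hypothetical, and the checks cover just the folder names and the <stage library name>/lib/ rule described in this section, not any validation that Control Hub itself may perform.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

ROOT = "externalResources"
# Directory names taken from the list above; any of them may be omitted.
ALLOWED_DIRS = {"resources", "streamsets-libs-extras", "user-libs"}


def check_layout(member_names: Iterable[str]) -> List[str]:
    """Return a list of layout problems found in the archive member names."""
    problems = []
    for name in member_names:
        parts = PurePosixPath(name).parts
        # Every entry must sit under the externalResources root folder.
        if not parts or parts[0] != ROOT:
            problems.append(f"{name}: not under {ROOT}/")
            continue
        # Only the three documented directories are expected below the root.
        if len(parts) >= 2 and parts[1] not in ALLOWED_DIRS:
            problems.append(f"{name}: unexpected entry {parts[1]!r} under {ROOT}/")
            continue
        # External libraries must live in <stage library name>/lib/.
        if (len(parts) >= 4 and parts[1] == "streamsets-libs-extras"
                and parts[3] != "lib"):
            problems.append(f"{name}: expected {parts[2]}/lib/")
    return problems
```

You might pass the function the member names of a TGZ file (for example, the result of tarfile.open(path).getnames()) and fix anything it reports before you configure the deployment.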

Sample

Let's look at the contents of a sample external resource archive file created for Data Collector.

This sample archive file includes a runtime resource file named JDBC.txt, the MySQL JDBC driver for stages included in the JDBC stage library, and the Oracle JDBC driver for the Oracle Bulkload origin included in the JDBC Oracle stage library. It does not include any custom stage libraries:

externalResources
  resources
    JDBC.txt
  streamsets-libs-extras
    streamsets-datacollector-jdbc-lib
      lib
        mysql-connector-java-8.0.12.jar
    streamsets-datacollector-jdbc-oracle-lib
      lib
        ojdbc8-19.3.0.0.jar
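
As a rough sketch of how an archive with this layout could be generated, the Python example below stages placeholder files in the sample structure and compresses them into a TGZ file. The helper names stage_sample and build_archive are hypothetical, and the file contents are placeholders: in practice you would copy the real runtime resource file and driver JAR files into the staging directory.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import List

# Relative paths taken from the sample layout above.
SAMPLE_FILES = [
    "resources/JDBC.txt",
    "streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib/"
    "mysql-connector-java-8.0.12.jar",
    "streamsets-libs-extras/streamsets-datacollector-jdbc-oracle-lib/lib/"
    "ojdbc8-19.3.0.0.jar",
]


def stage_sample(staging_dir: Path) -> None:
    """Create placeholder files under staging_dir/externalResources."""
    for rel in SAMPLE_FILES:
        target = staging_dir / "externalResources" / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"placeholder")


def build_archive(staging_dir: Path, archive_path: Path) -> List[str]:
    """Compress staging_dir/externalResources into a TGZ; return member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps externalResources as the root folder in the archive.
        tar.add(staging_dir / "externalResources", arcname="externalResources")
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        stage_sample(tmp_path)
        for name in build_archive(tmp_path, tmp_path / "externalResources.tgz"):
            print(name)
```

Adding the staged folder with an explicit arcname is one way to make sure the archive root is named externalResources regardless of where the staging directory lives.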

Archive Location

You store an external resource archive file in one of the following types of locations:
  • File system
  • Web server
  • Cloud service

Use the location most appropriate for your deployment type.

For example, for a self-managed deployment of engines to local on-premises machines, you might store the external resource archive file on a networked file system.

For a cloud service provider deployment, it's typically simpler to store the external resource archive file with that same cloud service provider. For example, for an Amazon EC2 deployment, you might store the file in Amazon S3.

File System

You can store an external resource archive file on a local or network file system. To ensure that all engine instances managed by the deployment can access the file, mount the directory that contains the file on all engine machines, or provide all engine machines access to the file using an HTTP URL.

When you configure the external resource location for the deployment, enter the path to the file. For example:

/mnt/shared/externalResources.tgz
Important: To use a file system when using a Docker image for a self-managed deployment, you must modify the installation script command to mount the file to the engine container. To use a file system when using a Kubernetes deployment, you must use advanced mode to edit the deployment YAML file to mount the file to the engine container when you create the deployment.

Web Server

You can store an external resource archive file on a web server. To ensure that all engine instances managed by the deployment can access the file, provide all engine machines access to that file using an HTTP URL.

When you configure the external resource location for the deployment, enter the required URL format for the web server. For example:

https://<hostname>:<port>/shared/externalResources.tgz

Cloud Service

You can store an external resource archive file on one of the following cloud services:

Amazon S3
You can store the file in a private or public Amazon S3 bucket, based on the deployment type:
  • Private bucket - Supported for Amazon EC2 deployments only. To ensure that all engine instances managed by the Amazon EC2 deployment can access the file, your AWS administrator must grant the AWS instance profile associated with the provisioned EC2 instances read access to the bucket.
  • Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
When you configure the external resource location for the deployment, enter the URL as follows, based on whether the file is private or publicly available:
  • Private URL - s3://<bucket_name>/<path>/externalResources.tgz
  • Public URL - https://<bucket_name>.s3.<region>.amazonaws.com/externalResources.tgz
Google Cloud Storage
You can store the file in a private or public Google Cloud Storage bucket, based on the deployment type:
  • Private bucket - Supported for GCE deployments only. To ensure that all engine instances managed by the GCE deployment can access the file, your Google Cloud administrator must grant the GCP instance service account associated with the provisioned VM instances read access to the bucket.
  • Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
When you configure the external resource location for the deployment, enter the URL as follows, based on whether the file is private or publicly available:
  • Private URL - gs://<bucket_name>/<path>/externalResources.tgz
  • Public URL - https://storage.googleapis.com/<bucket_name>/externalResources.tgz
Azure Blob Storage or Azure Data Lake Storage Gen2
You can store the file in a private or public Azure Blob Storage or Azure Data Lake Storage Gen2 container, based on the deployment type:
  • Private container - Supported for Azure VM deployments only. To ensure that all engine instances managed by the Azure VM deployment can access the file, your Azure administrator must grant the Azure managed identity associated with the provisioned VM instances read access to the container.
  • Public container - Supported for all deployment types, as long as you share the container publicly.
When you configure the external resource location for the deployment, enter the URL as follows:
https://<storage_account_name>.blob.core.windows.net/<container_name>/<blob_name>/externalResources.tgz
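
To keep the location formats straight, it can help to see them side by side. The Python sketch below sorts a location string into the formats listed in this section, based only on the URL scheme and host name. The function name classify_location and its return labels are hypothetical, and the sketch does not check whether the file exists or whether an engine can read it.

```python
from urllib.parse import urlparse


def classify_location(location: str) -> str:
    """Return a label for the location format, based on scheme and host."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        return "Amazon S3 private URL"
    if parsed.scheme == "gs":
        return "Google Cloud Storage private URL"
    if parsed.scheme in ("http", "https"):
        # hostname drops any port and is already lowercased.
        host = parsed.hostname or ""
        if host.endswith(".amazonaws.com"):
            return "Amazon S3 public URL"
        if host == "storage.googleapis.com":
            return "Google Cloud Storage public URL"
        if host.endswith(".blob.core.windows.net"):
            return "Azure storage URL"
        return "web server URL"
    if parsed.scheme == "" and location.startswith("/"):
        return "file system path"
    return "unrecognized"


if __name__ == "__main__":
    for example in (
        "/mnt/shared/externalResources.tgz",
        "s3://my-bucket/libs/externalResources.tgz",
        "https://storage.googleapis.com/my-bucket/externalResources.tgz",
    ):
        print(example, "->", classify_location(example))
```

The bucket and host names in the usage lines are made-up examples that follow the placeholder patterns shown above.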

Setting Up an Archive

Set up an external resource archive after you have finalized the list of external resources that your pipelines require.

  1. Generate an archive file in the TGZ or ZIP format that includes all external resources required by your pipelines.

    Ensure that the file uses the required folder names and directory structure.

  2. Store the file in a location that all engine instances managed by the deployment can access.

    For the list of valid locations, see Archive Location.

  3. In the Control Hub Navigation panel, click Set Up > Deployments.
  4. Locate the deployment that you want to modify.
  5. In the Actions column, click the More icon, and then click Edit.
  6. In the Edit Deployment dialog box, expand the Configure Deployment section.
  7. Select Archive File for the External Resource Source property.
  8. In the External Resource Location property, enter the location of the archive file using the required format.

    For the list of required location formats, see Archive Location.

  9. Click Save.
  10. If associated engines are running, click Restart Engines to restart all engine instances for the changes to take effect.

    If associated engines are not running, they inherit the changes when the engines restart.

Updating an Archive

When a deployment uses an external resource archive and your pipelines require additional resources, you manually update the archive file to include new external resources and then restart all engine instances in the deployment.

  1. Download the archive file from the configured external resource location.
  2. Extract the file, and add the additional external resources to the required subfolder for the resource type.
  3. Compress the archive file in the TGZ or ZIP format.
  4. Upload the updated archive file to the configured external resource location.

    To update associated engines, you must restart each engine.

  5. In the Navigation panel, click Set Up > Engines.
  6. Click an engine type tab.
  7. For each engine in the deployment, in the Actions column, click the More icon, and then click Restart Engine.
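
The extract, add, and compress steps of this procedure can be scripted. The following Python sketch assumes a TGZ archive that you have already downloaded and a new external library (a JAR file) to add for one stage library. The function name add_external_library and its parameters are hypothetical. Downloading, uploading, and restarting engines are not covered here.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def add_external_library(archive: Path, stage_library: str,
                         jar: Path, output: Path) -> None:
    """Extract archive, add jar for stage_library, and recompress to output."""
    with tempfile.TemporaryDirectory() as work:
        work_dir = Path(work)
        # Extract the existing archive. Only do this with archives you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(work_dir)
        # External libraries belong in streamsets-libs-extras/<name>/lib/.
        lib_dir = (work_dir / "externalResources" / "streamsets-libs-extras"
                   / stage_library / "lib")
        lib_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(jar, lib_dir / jar.name)
        # Compress again, keeping externalResources as the root folder.
        with tarfile.open(output, "w:gz") as tar:
            tar.add(work_dir / "externalResources", arcname="externalResources")
```

For example, calling add_external_library(Path("externalResources.tgz"), "streamsets-datacollector-jdbc-lib", Path("driver.jar"), Path("updated.tgz")) would write a new archive that contains the existing resources plus driver.jar in the JDBC stage library folder. The file names in this call are placeholders.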