External Resources
StreamSets engines can require access to external files and libraries, depending on how you design pipelines.
For example, JDBC stages require a JDBC driver to access the database. When you use a JDBC stage, you must make the driver available as an external resource.
You configure each deployment to use one of the following sources for external resources:
- None
- Use no configured source when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.
- Archive file
- Use an archive file that includes the external resources when the deployment launches multiple engine instances and when your pipelines require external resources.
External Resource Types
External Resource Type | Description |
---|---|
Runtime resource files | Files that define pipeline property values that are called from within a pipeline. |
External libraries | External libraries required by pipeline stages. External libraries can include JDBC or JMS drivers or external Java libraries. |
Custom stage libraries | Stage libraries for custom stages. For example, you might develop a custom processor to perform custom processing in a pipeline. For more information, see Custom Stage Libraries in the Data Collector engine documentation. Important: To use custom stage libraries, you must configure the deployment to use an external resource archive. |
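As a rough illustration, the guidance on choosing a source can be sketched as a small Python function. The function name `recommend_source` and its parameters are hypothetical and are not part of StreamSets. The logic only restates the rules given in this topic.

```python
def recommend_source(engine_instances, needs_external_resources,
                     uses_custom_stage_libraries=False):
    """Suggest an external resource source for a deployment (illustrative only)."""
    # Custom stage libraries require an external resource archive.
    if uses_custom_stage_libraries:
        return "Archive file"
    # Multiple engine instances whose pipelines need external resources
    # share a single archive file.
    if engine_instances > 1 and needs_external_resources:
        return "Archive file"
    # A single engine instance, or pipelines without external resources.
    return "None"
```

For example, `recommend_source(3, True)` suggests the archive file, while `recommend_source(1, True)` suggests no configured source.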
No Source
Configure the deployment to use no source for external resources when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.
With no configured source, you upload each type of external resource to the engine as follows:
- Runtime resource files - Use the engine details page.
- External libraries - Use the pipeline canvas or the engine details page. Tip: In most cases, using the pipeline canvas is simpler because the stage library that requires access to the external library is automatically selected for you.
Uploading Resources from the Pipeline Canvas
When using no configured source for the deployment, you can upload external libraries, such as JDBC drivers, from the pipeline canvas.
- In the pipeline canvas, select the stage that requires an external library.
- In the Properties panel, click the External Libraries tab, and then click Upload External Library.
  The stage library that includes the stage is automatically selected.
- Click Browse File to choose the file to upload, and then click Upload.
- Click Restart Engine to enable the changes.
Uploading Resources from the Engine Details
When using no configured source for the deployment, you can upload runtime resource files and external libraries from the engine details page.
- In the Navigation panel, click .
- Click an engine type tab.
- Expand the details for the engine that you want to configure.
- Click View Engine Configuration, and then click the External Resources tab.
- Select the type of resource you are uploading:
  - Resources
  - External libraries
- Click the Upload icon.
- If uploading an external library, select the stage library that needs to access the external library.
  For example, if you are installing a JDBC driver for the JDBC Multitable Consumer origin, select the JDBC stage library. If you are installing an external Java library for the Groovy Evaluator processor, select the Groovy stage library.
  Tip: If you upload external libraries from the pipeline canvas, the required stage library is automatically selected for you.
- Click Browse File to choose the file to upload, and then click Upload.
- Click Restart Engine to enable the changes.
Removing Resources
When using no configured source for the deployment, you can remove existing runtime resource files and external libraries from the engine details page.
Remove outdated or unused runtime resource files and external libraries to prevent them from being incorrectly used when running a pipeline.
- In the Navigation panel, click .
- Click an engine type tab.
- Click the name of the engine that you want to configure.
- Click the External Resources tab, and then click the external resource type that you want to remove:
  - Resources
  - External libraries
- Select the resources that you want to remove, and then click the Delete icon.
- Click Restart Engine to enable the changes.
Archive File as the Source
Configure the deployment to use an external resource archive when a deployment launches multiple engine instances and when your pipelines require external resources.
You typically configure a deployment to use an external resource archive when you are ready to move to production, after you have finished building your pipelines and have finalized the list of external resources that your pipelines require.
You generate an archive file in the TGZ or ZIP format, using the required folder names and directory structure. You store the file in a location that is accessible to all machines running an engine instance for the deployment. Then, you edit the deployment to define the location of the archive file.
After you configure the external resource archive and restart all engine instances in the deployment, the archive file contents are extracted and copied into each engine instance.
When your pipelines require additional external resources, you extract the archive file, add the additional resources, and then compress the archive file again.
Archive Structure
An external resource archive file must use the required folder names and directory structure.
- resources
- The resources directory must include text files created for runtime resources.
- streamsets-libs-extras
- The streamsets-libs-extras directory must include a subdirectory for each set of required external libraries based on the stage library name, as follows:
  <stage library name>/lib/
- user-libs
- The user-libs directory must include a subdirectory for each custom stage.
If your pipelines do not use one of the external resource types, you can omit that directory. For example, if you have not developed custom stage libraries, you do not need to include the user-libs directory.
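Before publishing an archive, it can help to check its file paths against the required layout. The following is a minimal sketch, not a StreamSets utility: the function name `check_layout` is hypothetical, and it assumes that a single wrapper folder may sit above the required directories. The list of paths could come from `tarfile.TarFile.getnames()` or `zipfile.ZipFile.namelist()`.

```python
from pathlib import PurePosixPath

# Directory names required by the archive structure described above.
ALLOWED_DIRS = {"resources", "streamsets-libs-extras", "user-libs"}


def check_layout(file_paths):
    """Return a list of problems found in archive file paths (empty if none)."""
    problems = []
    for path in file_paths:
        parts = PurePosixPath(path).parts
        # Assumption: one wrapper folder may sit above the required
        # directories, so skip it when present.
        if parts and parts[0] not in ALLOWED_DIRS:
            parts = parts[1:]
        if not parts or parts[0] not in ALLOWED_DIRS:
            problems.append(f"{path}: not under a recognized directory")
        elif parts[0] == "streamsets-libs-extras" and (
            len(parts) < 4 or parts[2] != "lib"
        ):
            problems.append(f"{path}: expected <stage library name>/lib/<file>")
    return problems
```

An empty result means that every file sits under one of the three recognized directories and that each external library uses the `<stage library name>/lib/` subdirectory.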
Sample
Let's look at the contents of a sample external resource archive file created for Data Collector.
This sample archive file includes a runtime resource file named JDBC.txt, the MySQL JDBC driver for stages included in the JDBC stage library, and the Oracle JDBC driver for the Oracle Bulkload origin included in the JDBC Oracle stage library. It does not include any custom stage libraries:
externalResources
    resources
        JDBC.txt
    streamsets-libs-extras
        streamsets-datacollector-jdbc-lib
            lib
                mysql-connector-java-8.0.12.jar
        streamsets-datacollector-jdbc-oracle-lib
            lib
                ojdbc8-19.3.0.0.jar
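Given a local folder laid out like the sample above, a TGZ archive could be produced with Python's standard `tarfile` module. This is a sketch under assumptions: the helper name `build_archive` is hypothetical, and keeping the top-level `externalResources` folder inside the archive simply mirrors the sample.

```python
import tarfile
from pathlib import Path


def build_archive(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar (TGZ) archive."""
    source_dir = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps the folder name (for example, externalResources)
        # as the root of the archive instead of the full local path.
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive_path
```

For example, `build_archive("externalResources", "externalResources.tgz")` packs the folder in the current directory. The ZIP format is also accepted, so `shutil.make_archive` with the `"zip"` format would be a comparable option.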
Archive Location
Use the location most appropriate for your deployment type.
For example, for a self-managed deployment of engines to local on-premises machines, you might store the external resource archive file on a networked file system.
For a cloud service provider deployment, it's typically simpler to store the external resource archive file with that same cloud service provider. For example, for an Amazon EC2 deployment, you might store the file in Amazon S3.
File System
You can store an external resource archive file on a local or network file system. To ensure that all engine instances managed by the deployment can access the file, mount that directory from all engine machines or provide all engine machines access to that file using an HTTP URL.
When you configure the external resource location for the deployment, enter the path to the file. For example:
Web Server
You can store an external resource archive file on a web server. To ensure that all engine instances managed by the deployment can access the file, provide all engine machines access to that file using an HTTP URL.
When you configure the external resource location for the deployment, enter the required URL format for the web server. For example:
https://<hostname>:<port>/shared/externalResources.tgz
Cloud Service
You can store an external resource archive file on one of the following cloud services:
- Amazon S3
- You can store the file in a private or public Amazon S3 bucket, based on the deployment type:
- Private bucket - Supported for Amazon EC2 deployments only. To ensure that all engine instances managed by the Amazon EC2 deployment can access the file, your AWS administrator must grant the AWS instance profile associated with the provisioned EC2 instances read access to the bucket.
- Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
- Google Cloud Storage
- You can store the file in a private or public Google Cloud Storage bucket, based on the deployment type:
- Private bucket - Supported for GCE deployments only. To ensure that all engine instances managed by the GCE deployment can access the file, your Google Cloud administrator must grant the GCP instance service account associated with the provisioned VM instances read access to the bucket.
- Public bucket - Supported for all deployment types, as long as you share the bucket publicly.
- Azure Blob Storage or Azure Data Lake Storage Gen2
- You can store the file in a private or public Azure Blob Storage or Azure Data Lake Storage Gen2 container, based on the deployment type:
- Private container - Supported for Azure VM deployments only. To ensure that all engine instances managed by the Azure VM deployment can access the file, your Azure administrator must grant the Azure managed identity associated with the provisioned VM instances read access to the container.
- Public container - Supported for all deployment types, as long as you share the container publicly.
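The support rules above can be summarized as a lookup. The following Python sketch is only an illustration: the function `is_supported` and the string labels used for services and deployment types are hypothetical names chosen for this example, not values that Control Hub defines.

```python
# Deployment type that can read a private bucket or container,
# as described in the list above.
PRIVATE_ACCESS = {
    "Amazon S3": "Amazon EC2",
    "Google Cloud Storage": "GCE",
    "Azure Blob Storage": "Azure VM",
    "Azure Data Lake Storage Gen2": "Azure VM",
}


def is_supported(service, deployment_type, public):
    """Return True if the deployment type can use the given storage location."""
    if service not in PRIVATE_ACCESS:
        raise ValueError(f"unknown cloud service: {service}")
    # Publicly shared buckets and containers work for all deployment types.
    if public:
        return True
    # Private storage works only for the matching deployment type.
    return PRIVATE_ACCESS[service] == deployment_type
```

For example, a private Amazon S3 bucket is supported for an Amazon EC2 deployment but not for a GCE deployment, while a publicly shared bucket is supported for both.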
Setting Up an Archive
Set up an external resource archive after you have finalized the list of external resources that your pipelines require.
- Generate an archive file in the TGZ or ZIP format that includes all external resources required by your pipelines.
  Ensure that the file uses the required folder names and directory structure.
- Store the file in a location that all engine instances managed by the deployment can access.
  For the list of valid locations, see Archive Location.
- In the Control Hub Navigation panel, click .
- Locate the deployment that you want to modify.
- In the Actions column, click the More icon, and then click Edit.
- In the Edit Deployment dialog box, expand the Configure Deployment section.
- Select Archive File for the External Resource Source property.
- In the External Resource Location property, enter the location of the archive file using the required format.
  For the list of required location formats, see Archive Location.
- Click Save.
- If associated engines are running, click Restart Engines to restart all engine instances for the changes to take effect.
  If associated engines are not running, they inherit the changes when the engines restart.
Updating an Archive
When a deployment uses an external resource archive and your pipelines require additional resources, you manually update the archive file to include new external resources and then restart all engine instances in the deployment.
- Download the archive file from the configured external resource location.
- Extract the file, and add the additional external resources to the required subfolder for the resource type.
- Compress the archive file in the TGZ or ZIP format.
- Upload the updated archive file to the configured external resource location.
To update associated engines, you must restart each engine.
- In the Navigation panel, click .
- Click an engine type tab.
- For each engine in the deployment, in the Actions column, click the More icon, and then click Restart Engine.
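The extract, add, and recompress steps can be scripted once the archive file has been downloaded. The following Python sketch is one possible approach for a TGZ archive, not a StreamSets tool: the function name `add_external_library` is hypothetical, and the `root` default assumes the archive has a top-level `externalResources` folder, as in the sample. Downloading, uploading, and restarting engines are still done separately.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def add_external_library(archive_path, stage_library, jar_path,
                         root="externalResources"):
    """Add a library file to an existing TGZ archive and recompress it."""
    archive_path, jar_path = Path(archive_path), Path(jar_path)
    with tempfile.TemporaryDirectory() as work:
        # Extract the current archive contents (use only trusted archives).
        with tarfile.open(str(archive_path), "r:gz") as tar:
            tar.extractall(work)
        # Copy the new file into <stage library name>/lib/.
        lib_dir = (Path(work) / root / "streamsets-libs-extras"
                   / stage_library / "lib")
        lib_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(str(jar_path), str(lib_dir / jar_path.name))
        # Compress the archive again, overwriting the original file.
        with tarfile.open(str(archive_path), "w:gz") as tar:
            tar.add(str(Path(work) / root), arcname=root)
    return archive_path
```

Writing the result back to the same path keeps the file name unchanged, so the location configured for the deployment does not need to be edited after the upload.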