User-Provided Stage Library Mode

For advanced use cases, you can configure a Data Collector or Transformer deployment to use the user-provided stage library mode, where you provide the stage library files yourself during the engine installation. For example, you might use the user-provided stage library mode if your organization requires that stage library files be scanned for security purposes before they are installed on the engine machines.

Note: A Transformer for Snowflake engine includes all available stages and credential stores. As a result, you cannot configure the user-provided stage library mode for a Transformer for Snowflake deployment.
Important: In most situations, you can use the default managed stage library mode. The managed stage library mode allows you to select stage libraries as part of the deployment, and then IBM StreamSets automatically synchronizes the files to your engines.

Before you can use the user-provided stage library mode, you must complete the prerequisite tasks.

Prerequisites

Complete the following tasks before you configure a deployment to use the user-provided stage library mode: display the Stage Library Mode property, and then download the stage libraries.

Display the Stage Library Mode Property

By default, deployments use the managed stage library mode.

Before you can use the user-provided stage library mode, a user with the Organization Administrator role must modify the organization configuration properties to display the Stage Library Mode property for deployments.

  1. As an organization administrator, click Manage > My Organization in the Navigation panel.
  2. Click Advanced.
  3. Select the Show the Stage Library Mode UI field for CSP Deployments property.
  4. Click Save to save changes made to advanced properties.

    It can take a few minutes for the change to take effect.

Download the Stage Libraries

Download the stage library files that you want to install on engines.

  1. Enter the full URL of the stage library file that you want to download from https://archives.streamsets.com. Ensure that the stage library file version matches the engine version.

    The download URL depends on the engine type:

    Data Collector
    To download all stage library files, enter the following URL:

    https://archives.streamsets.com/datacollector/<version>/tarball/streamsets-datacollector-all-<version>.tgz

    For example, to download all stage library files for Data Collector 5.10.0, enter:

    https://archives.streamsets.com/datacollector/5.10.0/tarball/streamsets-datacollector-all-5.10.0.tgz

    Downloading all stage library files can take some time. Alternatively, you can individually download stage library files by entering the following URL:

    https://archives.streamsets.com/datacollector/<version>/tarball/streamsets-datacollector-<stagelib_name>-lib-<version>.tgz

    For example, to download the Amazon Web Services stage library for Data Collector 5.10.0, enter:

    https://archives.streamsets.com/datacollector/5.10.0/tarball/streamsets-datacollector-aws-lib-5.10.0.tgz

    You can find the stage library name in the Data Collector documentation. Alternatively, configure a deployment for the managed stage library mode, select the stage libraries from the Control Hub UI, and then view the Summary tab.

    Transformer
    To download all stage library files, enter the following URL:

    https://archives.streamsets.com/transformer/<version>/<scala_version>/tarball/streamsets-transformer-all_<scala_version>-<version>.tgz

    For example, to download all stage library files for Transformer 5.7.0 using Scala 2.12, enter:

    https://archives.streamsets.com/transformer/5.7.0/2.12/tarball/streamsets-transformer-all_2.12-5.7.0.tgz

    Downloading all stage library files can take some time. Alternatively, you can individually download stage library files by entering the following URL:

    https://archives.streamsets.com/transformer/<version>/<scala_version>/tarball/streamsets-spark-<stagelib_name>-lib_<scala_version>-<version>.tgz

    For example, to download the JDBC stage library for Transformer 5.7.0 using Scala 2.12, enter:

    https://archives.streamsets.com/transformer/5.7.0/2.12/tarball/streamsets-spark-jdbc-lib_2.12-5.7.0.tgz

    To locate a stage library name, configure a deployment for the managed stage library mode, select the stage libraries from the Control Hub UI, and then view the Summary tab.

  2. Locate the downloaded TGZ file in your default downloads directory.
  3. Extract the TGZ file and locate the stage library folder under the streamsets-libs folder.
    For example, if you downloaded the Amazon Web Services stage library for Data Collector 5.10.0, the extracted TGZ file contains the following folders:
    streamsets-datacollector-5.10.0
       streamsets-libs
          streamsets-datacollector-aws-lib

    The streamsets-datacollector-aws-lib folder includes the Amazon Web Services stage library files.

  4. Copy the downloaded TGZ file or the extracted stage library folders to another location as needed.
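
The URL patterns and the extraction step above can also be scripted. The following is a minimal sketch, not part of any IBM StreamSets tooling: the function names sdc_stagelib_url, transformer_stagelib_url, and list_stagelibs are hypothetical helpers, and the sketch assumes that the extracted file has the folder layout shown in step 3.

```shell
#!/bin/bash
# Sketch: build archive URLs from the documented patterns and locate the
# extracted stage library folders. All function names are hypothetical.

# Print the download URL for one Data Collector stage library,
# or for all stage libraries when the name is "all".
sdc_stagelib_url() {
  local version="$1" name="$2"
  local base="https://archives.streamsets.com/datacollector/${version}/tarball"
  if [ "$name" = "all" ]; then
    echo "${base}/streamsets-datacollector-all-${version}.tgz"
  else
    echo "${base}/streamsets-datacollector-${name}-lib-${version}.tgz"
  fi
}

# Print the download URL for one Transformer stage library,
# or for all stage libraries when the name is "all".
transformer_stagelib_url() {
  local version="$1" scala="$2" name="$3"
  local base="https://archives.streamsets.com/transformer/${version}/${scala}/tarball"
  if [ "$name" = "all" ]; then
    echo "${base}/streamsets-transformer-all_${scala}-${version}.tgz"
  else
    echo "${base}/streamsets-spark-${name}-lib_${scala}-${version}.tgz"
  fi
}

# Extract a downloaded TGZ file and print the stage library folders it contains.
list_stagelibs() {
  local tarball="$1" dest="$2"
  mkdir -p "$dest"
  tar -zxf "$tarball" -C "$dest"
  # Stage library folders sit directly under <engine folder>/streamsets-libs.
  find "$dest" -mindepth 3 -maxdepth 3 -type d -path '*/streamsets-libs/*'
}

# Example usage (requires network access):
#   wget -q "$(sdc_stagelib_url 5.10.0 aws)" -P /tmp/
#   list_stagelibs /tmp/streamsets-datacollector-aws-lib-5.10.0.tgz /tmp/stagelibs
```

With the arguments 5.10.0 and aws, sdc_stagelib_url prints the same Amazon Web Services URL shown in the example above.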

Provide Files for Self-Managed Deployments

To provide stage library files for a self-managed deployment, in the Configure Engine step of the deployment wizard, select User-Provided for the Stage Library Mode property.

When you launch engines for the deployment, the streamsets-libs directory in the engine installation contains a few default stage libraries. Copy the downloaded stage library files into that directory, and then restart the engine.

For example, for a Data Collector 5.10.0 tarball, copy the downloaded stage library files into the /streamsets-datacollector-5.10.0/streamsets-libs directory and then restart the engine.
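
For a tarball installation, the copy step might be scripted as in the following sketch. The function name install_stagelibs is a hypothetical helper, and the paths in the usage comment are assumptions based on the examples in this topic. The sketch does not restart the engine.

```shell
#!/bin/bash
# Sketch: copy every extracted stage library folder into the engine's
# streamsets-libs directory. install_stagelibs is a hypothetical helper name.
install_stagelibs() {
  local extracted_libs="$1" engine_home="$2"
  local target="${engine_home}/streamsets-libs"
  if [ ! -d "$target" ]; then
    echo "streamsets-libs directory not found: $target" >&2
    return 1
  fi
  local lib
  for lib in "$extracted_libs"/*/; do
    # Skip the unexpanded pattern when no folders match.
    [ -d "$lib" ] || continue
    # Strip the trailing slash so that the folder itself is copied,
    # not only its contents.
    cp -R "${lib%/}" "$target/"
  done
}

# Example usage (paths are assumptions):
#   install_stagelibs /tmp/stagelibs/streamsets-datacollector-5.10.0/streamsets-libs \
#     /streamsets-datacollector-5.10.0
# Restart the engine afterwards.
```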

For a Docker image installation, you can provide the files to the engine by editing the running container. For example, for a Data Collector 5.10.0 Docker image, you can start a Bash shell in the running Docker container, copy the downloaded stage library files into the /opt/streamsets-datacollector-5.10.0/streamsets-libs directory, and then restart the engine.

Alternatively, you can configure the Docker image to mount an external directory containing the downloaded stage library files, or you can create a custom Docker image derived from an IBM StreamSets engine image that includes the downloaded stage library files.
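
As one possible way to edit a running container, the following sketch uses docker cp from the host instead of a Bash shell inside the container, and restarts the container as a way to restart the engine. Both choices are assumptions rather than the documented procedure, and copy_stagelib_to_container is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: copy one extracted stage library folder into a running
# Data Collector container, then restart the container.
# copy_stagelib_to_container is a hypothetical helper name.
copy_stagelib_to_container() {
  local container="$1" lib_dir="$2" engine_version="$3"
  local target="/opt/streamsets-datacollector-${engine_version}/streamsets-libs/"
  # Strip the trailing slash so that the folder itself is copied into streamsets-libs.
  docker cp "${lib_dir%/}" "${container}:${target}" || return 1
  # Assumption: restarting the container restarts the engine.
  docker restart "$container"
}

# Example usage (container name and source path are assumptions):
#   copy_stagelib_to_container my-sdc /tmp/streamsets-datacollector-aws-lib 5.10.0
```

Files copied with docker cp might not be owned by the user that runs the engine, so verify that the engine can read the copied files.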

Provide Files for Cloud Service Provider Deployments

To provide stage library files for cloud service provider deployments, such as Amazon EC2, Azure VM, or GCE deployments, in the Configure Engine step of the deployment wizard, select User-Provided for the Stage Library Mode property.

Then in the Configure Autoscaling Group step of the deployment wizard, define the Init Script property to include commands that copy the downloaded stage library files into the streamsets-libs folder in the engine installation.

For example, if you copied the Amazon Web Services stage library TGZ file for Data Collector 5.10.0 to your own web server, you might define the initialization script as follows:
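#!/bin/bash
# Download the stage library TGZ file from your web server to /tmp.
wget -q https://<web_server>.com/streamsets-datacollector-aws-lib-5.10.0.tgz -P /tmp/
# Extract the stage library folder into streamsets-libs, dropping the two leading folders in the archive.
tar -zxf /tmp/streamsets-datacollector-aws-lib-5.10.0.tgz -C /opt/streamsets-datacollector/streamsets-libs/ --strip-components=2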

When you start the deployment, the initialization script copies the files into the engine installation on each provisioned instance in your cloud account.
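
If you need several stage libraries, the initialization script can loop over their names. The following is a minimal sketch under stated assumptions: fetch_stagelibs is a hypothetical helper, the library names in the usage comment are illustrative, and each TGZ file is assumed to use the folder layout described in the download steps, which is why two leading folders are stripped.

```shell
#!/bin/bash
# Sketch: download and extract several Data Collector stage libraries.
# fetch_stagelibs is a hypothetical helper; base_url points at your own web server.
fetch_stagelibs() {
  local base_url="$1" version="$2" libs_dir="$3"
  shift 3
  local name file
  for name in "$@"; do
    file="streamsets-datacollector-${name}-lib-${version}.tgz"
    # Stop at the first failed download instead of extracting a partial file.
    wget -q "${base_url}/${file}" -P /tmp/ || return 1
    # Drop the engine folder and the streamsets-libs folder so that only
    # the stage library folder lands in the target directory.
    tar -zxf "/tmp/${file}" -C "$libs_dir" --strip-components=2 || return 1
  done
}

# Example init script body (library names are illustrative):
#   fetch_stagelibs "https://<web_server>.com" 5.10.0 \
#     /opt/streamsets-datacollector/streamsets-libs aws jdbc
```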

Provide Files for Kubernetes Deployments

To provide stage library files for a Kubernetes deployment, in the Configure Engine step of the deployment wizard, select User-Provided for the Stage Library Mode property.

Then in the Configure Kubernetes Deployment step of the deployment wizard, use advanced mode to directly edit the deployment YAML file so that the downloaded stage library files are copied or mounted into the streamsets-libs folder in the engine installation.

For example, you might create a static persistent volume in Kubernetes that contains the downloaded stage library directories, along with a persistent volume claim for that volume. For details on Kubernetes persistent volumes, see the Kubernetes documentation.

Then you would edit the deployment YAML file to use your persistent volume and persistent volume claim to mount the stage library directory volume as read-only. If you use Data Collector, you might add the following lines to the spec/template/spec/containers[0] section:
volumeMounts:
 - mountPath: /opt/streamsets-datacollector-<version>/streamsets-libs
   name: stagelibs
   readOnly: true
   subPath: streamsets-libs
And then add the following lines to the spec/template/spec section:
volumes:
 - name: stagelibs
   persistentVolumeClaim:
     claimName: stage-libs-claim
     readOnly: true

When you start the deployment, the stage library files are mounted into the engine installation on each Kubernetes pod.
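
The persistent volume claim referenced in the example might be created as in the following sketch. The claim name stage-libs-claim comes from the example above. The function name stagelib_pvc_manifest, the volume name, the storage size, and the ReadOnlyMany access mode are assumptions that must match your own static persistent volume.

```shell
#!/bin/bash
# Sketch: print a persistent volume claim manifest that binds to an existing
# static persistent volume. stagelib_pvc_manifest is a hypothetical helper.
stagelib_pvc_manifest() {
  local pv_name="$1" size="$2"
  # An empty storageClassName prevents dynamic provisioning, and
  # volumeName binds the claim to the named static volume.
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stage-libs-claim
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""
  volumeName: ${pv_name}
  resources:
    requests:
      storage: ${size}
EOF
}

# Example usage (requires kubectl access; volume name and size are assumptions):
#   stagelib_pvc_manifest stage-libs-pv 1Gi | kubectl apply -f -
```

Create the claim in the same namespace as the engine deployment, because a pod can only use claims in its own namespace.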

Alternatively, you can create a custom Docker image derived from an IBM StreamSets engine image that includes the downloaded stage library files, and then use advanced mode to configure the deployment YAML file to use the custom image.