Azure VM Deployments

You can create an Azure Virtual Machine (Azure VM) deployment for an active Azure environment.

When you create an Azure VM deployment, you define the engine type, version, and configuration to deploy to the Azure virtual network (VNet) specified in the environment. You also specify the number of engine instances to deploy. Each engine instance runs on a dedicated VM instance.

When you start an Azure VM deployment, Control Hub connects to the Azure VNet specified in the environment and then uses an Azure Resource Manager template to create an Azure Deployment. Azure Resource Manager provisions the group of VM instances in the VNet and then deploys and launches one StreamSets engine instance on each VM instance.

Azure Resource Manager manages the provisioning and monitoring of the VM instances. Control Hub simply receives the status of the deployed StreamSets engine instances and sends any updates to Resource Manager.

When you stop an Azure VM deployment, Resource Manager deletes the existing VM instances.

Important: You are responsible for all costs from Microsoft Azure incurred by the resources provisioned by Control Hub. StreamSets strongly advises against directly modifying the provisioned resources in Azure. Doing so may cause unexpected errors.

For more information about Azure Resource Manager, see the Azure Resource Manager documentation.

Before you create an Azure VM deployment, you must complete several prerequisites.

VM Instance Details

Each provisioned Azure VM instance is set up with the following software, based on the deployed engine type and version.
Note: If you need to set up the provisioned instances with additional software, you can define an initialization script for the deployment.
Engine Type and Version Software
Data Collector 5.8.0 and later
  • Ubuntu 22.04
  • OpenJDK 8 by default, or the OpenJDK version defined in the deployment
  • StreamSets Data Collector engine as a tarball
Data Collector 5.7.x and earlier
  • CentOS 7.x
  • OpenJDK 8
  • Latest Docker version
  • StreamSets Data Collector engine as a Docker image
Transformer - all versions
  • CentOS 7.x
  • For Scala 2.11:
    • OpenJDK 8
    • Apache Spark 2.4.8 prebuilt with Scala 2.11
  • For Scala 2.12:
    • OpenJDK 11
    • Apache Spark 3.0.3 prebuilt with Scala 2.12
  • Latest Docker version
  • StreamSets Transformer engine as a Docker image
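
Because the base operating system differs by engine type and version (Ubuntu 22.04 for Data Collector 5.8.0 and later, CentOS 7.x otherwise), an init script that installs additional software may need to choose a package manager at run time. The following is a minimal sketch, assuming that /etc/os-release is present on the provisioned image. The helper names pkg_manager_for and os_id_from are hypothetical and are not part of any StreamSets tooling.

```sh
#!/bin/sh
# Sketch only: map an os-release ID to a package manager name.
# The mapping mirrors the operating systems listed in the table above.
pkg_manager_for() {
    case "$1" in
        ubuntu) echo "apt-get" ;;
        centos) echo "yum" ;;
        *)      echo "unknown" ;;
    esac
}

# Read the ID field from an os-release style file (defaults to /etc/os-release).
os_id_from() {
    file=${1:-/etc/os-release}
    # Strip the key and any surrounding quotes, for example ID="centos" -> centos
    sed -n 's/^ID=//p' "$file" | tr -d '"'
}

if [ -r /etc/os-release ]; then
    echo "Detected package manager: $(pkg_manager_for "$(os_id_from)")"
fi
```

An init script could branch on the returned name before installing packages required by your organization.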

Secrets

When you start an Azure VM deployment, the following information is stored as secrets in Azure Key Vault:
  • Authentication token that the deployment uses to communicate with the StreamSets platform.
  • Proxy credentials, including the HTTP and HTTPS proxy user and password, when you configure engines to use a proxy server.

Control Hub creates a unique key vault for each Azure VM deployment.

Prerequisites

Before you create an Azure VM deployment, complete the following prerequisites:
Create a Microsoft Azure (Azure) environment
Create and activate an Azure environment in Control Hub, as described in Azure Environments.
Configure a managed identity
Ask your Azure administrator to configure a managed identity in Azure to associate with the provisioned VM instances. If a default managed identity is defined for the parent Azure environment, you can skip this prerequisite and simply use the default. If a default is not set or if you'd like to override the default for the deployment, see Configure Managed Identities for VM Instances.
Create a resource group
Ask your Azure administrator to create a resource group in Azure that the provisioned VM instances are assigned to. If a default resource group is defined for the parent Azure environment, you can skip this prerequisite and simply use the default. If a default is not set or if you'd like to override the default for the deployment, see Configure Resource Groups for VM Instances.
Create an SSH key pair
Control Hub does not use a secure shell (SSH) key pair to access the VM instances. However, Azure requires that an SSH key be assigned to all VM instances. Designate the SSH key to use in one of the following ways:
  • Create a local key pair. When you create the deployment, enter the full contents of the public key to assign to the provisioned VM instances.
  • Ask your Azure administrator to create a new key pair or to designate an existing key pair in Azure. When you create the deployment, select the key pair name to assign to the provisioned VM instances. For more information on using the Azure portal to create SSH keys to access Linux VM instances, see the Azure Virtual Machines documentation.
Optionally, set up an external resource archive
When your pipelines require external resources and when you plan to deploy multiple engine instances, you must set up an external resource archive that all engine instances can access. When your pipelines do not require external resources or when using a single engine instance to get started with StreamSets, you do not need to complete this prerequisite.
You typically configure a deployment to use an external resource archive when you are ready to move to production, after you have finished building your pipelines and have finalized the list of external resources that your pipelines require. For more information, see External Resources.
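
For the SSH key pair prerequisite, a local key pair can be created with the standard OpenSSH ssh-keygen tool. The following is a minimal sketch, assuming an RSA key with an empty passphrase is acceptable for your organization. The key location and the looks_like_public_key helper are hypothetical.

```sh
#!/bin/sh
# Sketch only: create a local RSA key pair for the "Public SSH Key" option.
# KEY_DIR is a throwaway directory; choose a permanent, protected location in practice.
KEY_DIR=$(mktemp -d)
KEY_FILE="$KEY_DIR/azure_vm_deploy_key"

# Return success if the argument looks like a single-line OpenSSH RSA public key.
looks_like_public_key() {
    case "$1" in
        "ssh-rsa "*) return 0 ;;
        *)           return 1 ;;
    esac
}

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase (an assumption), -f sets the output file.
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$KEY_FILE"
    # The .pub file holds the full contents to enter in the deployment.
    looks_like_public_key "$(cat "$KEY_FILE.pub")" &&
        echo "Public key written to $KEY_FILE.pub"
fi
```

When you create the deployment, you would paste the full contents of the .pub file, never the private key file.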

Init Script for Custom DNS Servers

When the Azure VNet uses custom DNS servers and you deploy Data Collector 5.7.x and earlier or any version of Transformer, the Azure deployment requires an initialization script so that the StreamSets engine can detect the hostname of the provisioned VM instance.

Copy the following script, and then paste the exact text into the Init Script property in the Configure the Azure VM Autoscaling Group step of the deployment wizard:
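#!/bin/sh
# Exit on errors and unset variables, and echo each command as it runs.
set -eux

# Append hostname options after the ExecStart line of the engine's Docker
# service unit. %H is the systemd specifier for the host name.
sed -i '/^ExecStart=/a\
    --hostname %H --add-host %H:127.0.0.1 \\
' /etc/systemd/system/sdc.docker.service

# Reload systemd so that it picks up the modified unit file.
systemctl daemon-reload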

If needed, you can add additional commands to the end of this init script.

For more details about defining an init script, see Init Script.
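
To see what the sed command in the init script does before using it, you can run the same expression against a sample file. The following is a minimal sketch: the sample unit file content is invented for illustration and does not reflect the real service unit on the provisioned instances. The output is written to a preview file rather than edited in place.

```sh
#!/bin/sh
# Sketch only: preview the effect of the sed expression on a sample unit file.
# The ExecStart content below is a made-up placeholder.
sample=$(mktemp)
preview=$(mktemp)
cat > "$sample" <<'EOF'
[Service]
ExecStart=/usr/bin/docker run \
    example/image
EOF

# Same sed expression as the init script, but writing to a preview file
# instead of editing /etc/systemd/system/sdc.docker.service in place.
sed '/^ExecStart=/a\
    --hostname %H --add-host %H:127.0.0.1 \\
' "$sample" > "$preview"

cat "$preview"
# Remove the temporary files when done: rm -f "$sample" "$preview"
```

In the preview, the hostname options appear on a new line directly after the ExecStart line, with a trailing backslash so that the command continues on the following line.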

Configuring an Azure VM Deployment

Configure an Azure Virtual Machine (Azure VM) deployment to define the group of engine instances to deploy to an Azure environment.
Important: Before configuring a deployment, you must complete the required prerequisites.

To create a new deployment, click Set Up > Deployments in the Navigation panel, and then click the Create Deployment icon.

To edit an existing deployment, click Set Up > Deployments in the Navigation panel, click the deployment name, and then click Edit.

Define the Deployment

Define the deployment essentials, including the deployment name and type, the environment that the deployment belongs to, and the engine type and version to deploy.

Once saved, you cannot change the deployment type, the engine version, or the environment.

  1. Configure the following properties:
    Define Deployment Property Description
    Deployment Name

    Name of the deployment.

    Use a brief name that informs your team of the deployment use case.

    Deployment Type

    Type of deployment to create. For an Azure VM deployment, select Azure VM.

    Environment

    Active environment where engine instances will be deployed.

    The selected deployment type determines the list of environments that display. For example, if creating an Azure VM deployment, then you can select an active Azure environment.

    Engine Type

    Type of engine to deploy:
    • Data Collector
    • Transformer

    Engine Version

    Engine version to deploy.

    Deployment Tags

    Optional tags that identify similar deployments within Control Hub. Use deployment tags to easily search and filter deployments.

    Enter nested tags using the following format:

    <tag1>/<tag2>/<tag3>

  2. If creating the deployment, click one of the following buttons:
    • Cancel - Cancels creating the deployment and exits the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
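
The nested tag format used by the Deployment Tags property, <tag1>/<tag2>/<tag3>, is a slash-separated path of tag levels. The following is a minimal sketch of composing such a value. The nested_tag helper and the sample tag names are hypothetical.

```sh
#!/bin/sh
# Sketch only: join tag levels into the <tag1>/<tag2>/<tag3> nested format.
# nested_tag is a hypothetical helper, not part of any StreamSets tooling.
nested_tag() {
    result=""
    for level in "$@"; do
        if [ -z "$result" ]; then
            result=$level
        else
            result="$result/$level"
        fi
    done
    printf '%s\n' "$result"
}

nested_tag azure westus2 ingest   # prints: azure/westus2/ingest
```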

Configure the Engine

Define the configuration of the engine to deploy. You can use the defaults to get started.

  1. Configure the following properties:
    Engine Property Description
    Stage Libraries

    Stage libraries to install on the engine.

    The available stage libraries depend on the selected engine type and version.

    Advanced Configuration

    Access to advanced configuration properties to further customize the engine. As you get started with StreamSets, the default values should work in most cases.

    The available properties depend on the selected engine type.

    External Resource Source

    Source of the external files and libraries, such as JDBC drivers, required by the engine:
    • None - External resources are not defined in the deployment.

      Select when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.

    • Archive File - External resources are included in an archive file defined in the deployment.

      Select when the deployment launches multiple engine instances and when your pipelines require external resources.

    External Resource Location

    Location of the archive file that contains the external resources used by the engine. The archive file must be in TGZ or ZIP format.

    Enter the location using one of the following formats:
    • File path. For example: /mnt/shared/externalResources.tgz
    • URL. For example, if the file is stored in an Azure Blob Storage or Azure Data Lake Storage Gen2 container: https://<storage account name>.blob.core.windows.net/<container name>/<blob name>/externalResources.tgz
    Tip: Click the download icon to download a sample externalResources.tgz file to view the required directory structure.

    Available when using an archive file as the source for external resources.

    Engine Labels

    Labels to assign to all engine instances launched for this deployment. Labels determine the group of engine instances that run a job.

    Default is the name of the deployment.

    Max CPU Load (%)

    Maximum percentage of CPU on the host machine that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine.

    All engine instances belonging to the deployment inherit these resource threshold values.

    Default is 80.

    Max Memory (%)

    Maximum percentage of the configured Java heap size that an engine instance can use. When an engine equals or exceeds this threshold, Control Hub does not start new pipeline instances on the engine.

    Default is 100.

    Max Running Pipeline Count

    Maximum number of pipelines that can be running on each engine instance. When an engine equals this threshold, Control Hub does not start new pipeline instances on the engine.

    Default is 1,000,000.

  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
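
The Max CPU Load, Max Memory, and Max Running Pipeline Count properties described in this step act together: an engine that reaches any one threshold receives no new pipeline instances. The following is a minimal sketch of that rule using the documented defaults. It illustrates the described behavior only and is not Control Hub's implementation. The function name is hypothetical, and the inputs are assumed to be whole numbers.

```sh
#!/bin/sh
# Sketch only: the documented threshold rule, not Control Hub's implementation.
# Defaults mirror the property defaults: 80% CPU, 100% memory, 1,000,000 pipelines.
MAX_CPU_LOAD=${MAX_CPU_LOAD:-80}
MAX_MEMORY=${MAX_MEMORY:-100}
MAX_RUNNING_PIPELINES=${MAX_RUNNING_PIPELINES:-1000000}

# Usage: accepts_new_pipelines <cpu_percent> <heap_percent> <running_count>
# Returns 0 when every value is below its threshold, 1 otherwise.
accepts_new_pipelines() {
    [ "$1" -lt "$MAX_CPU_LOAD" ] &&
    [ "$2" -lt "$MAX_MEMORY" ] &&
    [ "$3" -lt "$MAX_RUNNING_PIPELINES" ]
}

if accepts_new_pipelines 45 60 12; then
    echo "engine can take another pipeline"
fi
if ! accepts_new_pipelines 80 60 12; then
    echo "CPU at threshold: no new pipelines"
fi
```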

Configure Azure VM Zones

Select the zones to provision the VM instances in. If the Azure region selected for the parent environment does not support zones, the deployment wizard skips this step.

  1. Select the zones to provision the VM instances in.
  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Configure the Azure VM Autoscaling Group

Configure details about the Azure VM instances that will be provisioned.

  1. Configure the following properties:
    Azure VM Autoscaling Group Property Description
    Desired Instances

    Number of engine instances to deploy.

    For each engine instance, Azure Resource Manager provisions one VM instance in the VNet, and then deploys and launches the engine instance on that VM instance.

    Important: If your pipelines require external resources, you must set up an external resource archive that all engine instances can access before increasing the number of instances.

    Default is 1. The minimum value is 0. Set to 0 to temporarily prevent engine instances from running, as an alternative to stopping the deployment. Note that a deployment set to 0 instances still incurs minimal costs from the cloud service provider.

    VM Size

    Size to use for the provisioned VM instances.

    For more information about VM sizes, see the Azure Virtual Machines documentation.

    Managed Identity

    Managed identity to associate with the provisioned VM instances. Select the managed identity created as an environment prerequisite by your Azure administrator.

    If a default managed identity is defined for the Azure environment, you can accept the default or override it with a different managed identity.

    Resource Group

    Resource group that the provisioned VM instances are assigned to. Select the resource group created as an environment prerequisite by your Azure administrator.

    If a default resource group is defined for the Azure environment, you can accept the default or override it with a different resource group.

    Azure Tags

    Tags to apply to all Azure resources provisioned for this deployment.

    Enter the tags as key-value pairs. For tag naming requirements, see the Azure Resource Manager documentation.

    You can define the tags using simple or bulk edit mode. In simple edit mode, click Add Another to define additional tags. In bulk edit mode, configure tags in JSON format.

    Important: These tags are applied to Azure resources, not to Control Hub deployments.
    Init Script

    Initialization script to run on each provisioned instance.

    Use the script to set up provisioned instances with additional software as required by your organization. The script must be a valid shell script with a maximum size of 8 KB.

    Enter the script directly in the property or upload a shell script file that uses an .sh extension. After uploading, you can edit the contents of the script.

    Important: You must include a specific init script when the Azure VNet uses custom DNS servers.
  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
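
Before entering or uploading a script in the Init Script property, you can check it locally against the stated requirements: a valid shell script no larger than 8 KB. The following is a minimal sketch, assuming that 8 KB means 8192 bytes. The check_init_script helper is hypothetical, and it does not verify the .sh extension required for uploads.

```sh
#!/bin/sh
# Sketch only: pre-check an init script before pasting or uploading it.
# Assumes "8 KB" means 8192 bytes; check_init_script is a hypothetical helper.
MAX_INIT_SCRIPT_BYTES=8192

check_init_script() {
    script=$1
    [ -f "$script" ] || { echo "not a file: $script" >&2; return 1; }
    size=$(wc -c < "$script" | tr -d ' ')
    if [ "$size" -gt "$MAX_INIT_SCRIPT_BYTES" ]; then
        echo "too large: $size bytes" >&2
        return 1
    fi
    # sh -n parses the script without executing it.
    sh -n "$script" || { echo "syntax error in $script" >&2; return 1; }
    echo "ok: $size bytes"
}
```

For example, running check_init_script ./init.sh reports the script size when both checks pass, and returns a non-zero status otherwise.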

Configure Azure VM SSH Access

Configure the SSH key to assign to all provisioned VM instances.

  1. Configure the following properties:
    Azure VM SSH Access Property Description
    SSH Key Source

    Source for the SSH keys:
    • Public SSH Key - Assigns the full contents of a public SSH key to the provisioned VM instances.
    • Existing SSH Key Pair Name - Assigns an existing SSH key pair in your Azure account to the provisioned VM instances.

    Public SSH Key

    Full contents of the public SSH key to assign to the provisioned VM instances.

    Enter the contents of the key created as a deployment prerequisite.

    Key Pair Name

    Name of the existing Azure key pair to assign to the provisioned VM instances.

    Select the key pair created as a deployment prerequisite by your Azure administrator.

    Attach Public IP

    Attaches a public IP address to the provisioned VM instances.

    Select when you need to use SSH from a machine outside of the Azure VNet to access the VM instances.

  2. If creating the deployment, click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Share the Deployment

By default, the deployment can only be seen by you. Share the deployment with other users and groups to grant them access to it.

  1. In the Select Users and Groups field, type a user email address or a group name.
  2. Select users or groups from the list, and then click Add.

    The added users and groups display in the User / Group table.

  3. Modify permissions as needed. By default, each added user or group is granted the following permissions:
    • Read - View the details of the deployment and of all engines managed by the deployment. Restart or shut down individual engines managed by the deployment in the Engines view.
    • Write - Edit, start, stop, and delete the deployment. Delete engines managed by the deployment. Also requires read access on the parent environment.
    • Execute - Start jobs on engines managed by the deployment. Starting jobs also requires execute access on the job and read access on the pipeline.

    For more information, see Deployment Permissions.

  4. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Review and Launch the Deployment

You've successfully finished creating the deployment.

  1. Click one of the following buttons:
    • Exit - Saves the deployment and exits the wizard, displaying the Deactivated deployment in the Deployments view. You can start the deployment at a later time.
    • Launch Deployment - Starts the deployment, provisions VM instances in your Azure account, and launches a StreamSets engine on each instance.
      Note: It can take a few minutes to provision the VM instances and then launch the StreamSets engine instances.
  2. If the deployment launches a Transformer engine that works with a Spark cluster, you must grant the Spark cluster access to Transformer.

    For instructions, see Granting the Spark Cluster Access to Transformer in the Transformer engine documentation.

Editing an Azure VM Deployment

You can edit an Azure VM deployment while it is deactivated or active.

When you stop a deployment, all existing VM instances are deleted. After you edit properties and then restart the deployment, Control Hub uses Azure Resource Manager to provision a new group of VM instances and launch a new StreamSets engine instance on each VM instance.

When you edit a deployment while it is active, existing VM instances might be deleted, depending on the following types of edited properties:

General deployment or engine properties
When you edit general deployment or engine properties while the deployment is active, Azure Resource Manager continues running the existing VM instances. Changes are replicated to all StreamSets engine instances on the next restart of the engines.
For example, let's say you edit the deployment to install additional stage libraries on the engine instances, and then you instruct Control Hub to restart all engine instances. Control Hub restarts the StreamSets engine instances on the running VM instances, which triggers the installation of the additional stage libraries and the engine property changes.
Azure VM properties
When you edit Azure VM properties while the deployment is active, Azure Resource Manager might need to replace the existing VM instances, depending on the change. When a replacement is needed, Resource Manager replaces all of the existing VM instances, which results in engine downtime while the new instances are being provisioned.
For example, if you edit a deployment to add or change the init script, Resource Manager does not replace the existing VM instances. You must restart the deployment so that Resource Manager provisions a new group of VM instances using the changed init script. If you edit a deployment to change the VM size, Resource Manager deletes all existing VM instances, and then provisions new VM instances to replace them.
Note: You cannot change the zones or resource group while the deployment is active. You must stop the deployment to change these properties.
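
The rules above for editing an active deployment can be summarized as a small lookup. The following is a minimal sketch that restates them. The property keys are hypothetical labels chosen for illustration, and only the examples named in this section are covered.

```sh
#!/bin/sh
# Sketch only: summarize what an edit to an ACTIVE deployment triggers,
# per the rules described above. The property keys are hypothetical labels.
edit_effect() {
    case "$1" in
        stage_libraries|engine_config)
            echo "applied on next engine restart; VM instances keep running" ;;
        init_script)
            echo "VM instances not replaced; restart the deployment to apply" ;;
        vm_size)
            echo "all VM instances deleted and reprovisioned; expect downtime" ;;
        zones|resource_group)
            echo "not editable while active; stop the deployment first" ;;
        *)
            echo "unknown property: check the deployment documentation" ;;
    esac
}

for prop in stage_libraries init_script vm_size zones; do
    printf '%-16s %s\n' "$prop" "$(edit_effect "$prop")"
done
```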

To edit a deployment, locate the deployment in the Deployments view. In the Actions column, click the More icon, and then click Edit.

Tracking URL

When you view the details of an active Azure VM deployment, you can access a tracking URL to the Azure portal. Use the URL to view additional information about the Azure resources automatically provisioned for the deployment.

To access the tracking URL, click an Azure VM deployment name in the Deployments view and then locate the Tracking URL property in the deployment details.

Click the URL to open the Azure portal. The portal displays the overview page of the Azure deployment created for your StreamSets deployment, listing the resources provisioned for the deployment.

Use the Azure portal to explore details about each resource and locate errors that might have occurred.

For example, if you configured the Azure VM deployment to attach a public IP address to the provisioned VM instances, you can expand the deployment details to access the public IP addresses, as follows:

Important: Viewing details about provisioned resources in the portal can help you troubleshoot deployment configuration issues. However, StreamSets strongly advises against directly modifying the provisioned resources using the portal. Doing so may cause unexpected errors.

The following topic provides brief tips on finding the most useful information about the provisioned resources. For more details about monitoring an Azure deployment, see the Microsoft Azure documentation.

Instances

In the overview page of the Azure deployment, click the virtual machine scale set resource, and then click Settings > Instances. The Azure portal lists all VM instances provisioned for the deployment.

Click an instance name to view specific details about the VM instance, including the public and private IP addresses. For example, the following image displays a sample VM instance details page: