Azure VM Deployments

You can create an Azure Virtual Machine (Azure VM) deployment for an active Azure environment.

When you create an Azure VM deployment, you define the engine type, version, and configuration to deploy to the Azure virtual network (VNet) specified in the environment. You also specify the number of engine instances to deploy. Each engine instance runs on a dedicated VM instance.

When you start an Azure VM deployment, Control Hub connects to the Azure VNet specified in the environment and then uses an Azure Resource Manager template to create an Azure Deployment. Azure Resource Manager provisions the group of VM instances in the VNet and then deploys and launches one StreamSets engine instance on each VM instance.

Azure Resource Manager manages the provisioning and monitoring of the VM instances. Control Hub simply receives the status of the deployed StreamSets engine instances and sends any updates to Resource Manager.

When you stop an Azure VM deployment, Resource Manager deletes the existing VM instances.

Important: You are responsible for all costs from Microsoft Azure incurred by the resources provisioned by Control Hub. StreamSets strongly advises against directly modifying the provisioned resources in Azure. Doing so may cause unexpected errors.

For more information about Azure Resource Manager, see the Azure Resource Manager documentation.

Before you create an Azure VM deployment, you must complete several prerequisites.

Note: Due to your account agreement, creating deployments for Azure environments might be disabled for your organization. For more information, contact your StreamSets account team.

VM Instance Details

Each provisioned Azure VM instance is set up with the following software, based on the selected engine type.
Note: If you need to set up the provisioned instances with additional software, you can define an initialization script for the deployment.
Engine Type Software
Data Collector
  • CentOS 7.x
  • OpenJDK 8
  • Docker
  • StreamSets Data Collector engine as a Docker image
Transformer
  • CentOS 7.x
  • For Scala 2.11:
    • OpenJDK 8
    • Apache Spark 2.4.8 prebuilt with Scala 2.11
  • For Scala 2.12:
    • OpenJDK 11
    • Apache Spark 3.0.3 prebuilt with Scala 2.12
  • Docker
  • StreamSets Transformer engine as a Docker image

Prerequisites

Before you create an Azure VM deployment, complete the following prerequisites:
Create a Microsoft Azure (Azure) environment
Create and activate an Azure environment in Control Hub, as described in Azure Environments.
Configure a managed identity
Ask your Azure administrator to configure a managed identity in Azure to associate with the provisioned VM instances. If a default managed identity is defined for the parent Azure environment, you can skip this prerequisite and simply use the default. If a default is not set or if you'd like to override the default for the deployment, see Configure Managed Identities for VM Instances.
Create a resource group
Ask your Azure administrator to create a resource group in Azure that the provisioned VM instances are assigned to. If a default resource group is defined for the parent Azure environment, you can skip this prerequisite and simply use the default. If a default is not set or if you'd like to override the default for the deployment, see Configure Resource Groups for VM Instances.
Create an SSH key pair
Control Hub does not use a secure shell (SSH) key pair to access the VM instances. However, Azure requires that an SSH key be assigned to all VM instances. Designate the SSH key to use in one of the following ways:
  • Create a local key pair. When you create the deployment, enter the full contents of the public key to assign to the provisioned VM instances.
  • Ask your Azure administrator to create a new key pair or to designate an existing key pair in Azure. When you create the deployment, select the key pair name to assign to the provisioned VM instances. For more information on using the Azure portal to create SSH keys to access Linux VM instances, see the Azure Virtual Machines documentation.
Optionally, set up an external resource archive
When your pipelines require external resources and when you plan to deploy multiple engine instances, you must set up an external resource archive that all engine instances can access. When your pipelines do not require external resources or when using a single engine instance to get started with StreamSets, you do not need to complete this prerequisite.
You typically configure a deployment to use an external resource archive when you are ready to move to production, after you have finished building your pipelines and have finalized the list of external resources that your pipelines require. For more information, see External Resources.

Init Script for Custom DNS Servers

When the Azure VNet uses custom DNS servers, you must define the following initialization script in the Azure deployment so that a StreamSets engine can detect the hostname of the provisioned VM instance.

Copy the following script, and then paste the exact text into the Init Script property in the Configure Autoscaling Group step of the deployment wizard:
#!/bin/sh
requireddomain=reddog.microsoft.com
new_ip_address=$(ip -f inet a show eth0| grep inet| awk '{ print $2}' | cut -d/ -f1)

host=`hostname`
nsupdatecmds=/var/tmp/nsupdatecmds
echo "update delete $host.$requireddomain a" > $nsupdatecmds
echo "update add $host.$requireddomain 3600 a $new_ip_address" >> $nsupdatecmds
echo "send" >> $nsupdatecmds

nsupdate $nsupdatecmds

If needed, you can add additional commands to the end of this init script.

For more details about defining an init script, see Init Script.
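
Before pasting the init script, it can help to see what its two steps produce without touching DNS. The following is a minimal dry-run sketch, not part of StreamSets: the function names, the sample `ip` output, and the host name `vm000001` are illustrative assumptions. It reuses the same `grep`, `awk`, and `cut` pipeline as the script above and prints the command lines that would be handed to `nsupdate`.

```sh
#!/bin/sh
# Dry-run helpers mirroring the two steps of the init script above.
# Function names and sample values are hypothetical.

# Read `ip -f inet a show <interface>` style output on stdin and print the
# IPv4 address without the prefix length (same pipeline as the init script).
extract_ipv4() {
    grep inet | awk '{ print $2 }' | cut -d/ -f1
}

# Print the nsupdate command lines for a host ($1), DNS domain ($2),
# and IP address ($3).
build_nsupdate_cmds() {
    echo "update delete $1.$2 a"
    echo "update add $1.$2 3600 a $3"
    echo "send"
}

# Canned input that stands in for a live network interface.
sample_ip_output='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
    inet 10.0.0.4/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever'

new_ip_address=$(printf '%s\n' "$sample_ip_output" | extract_ipv4)
build_nsupdate_cmds "vm000001" "reddog.microsoft.com" "$new_ip_address"
```

Because the sketch only prints text, it is safe to run on any machine. The actual init script should still be pasted as shown above.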

Creating an Azure VM Deployment

Create an Azure Virtual Machine (Azure VM) deployment to define the group of engine instances to deploy to an Azure environment.
Important: Before creating a deployment, you must complete the required prerequisites.

To create a new Azure VM deployment, click Set Up > Deployments in the Navigation panel, and then click the Create Deployment icon.

Define the Deployment

Define the deployment essentials, including the deployment name and type, the environment that the deployment belongs to, and the engine type and version to deploy.

Once saved, you cannot change the deployment type, the engine version, or the environment.

  1. Configure the following properties:
    Define Deployment Property Description
    Deployment Name

    Name of the deployment.

    Use a brief name that informs your team of the deployment use case.

    Deployment Type

    Type of deployment to create. Select Azure VM.

    Environment

    Active environment where engine instances will be deployed.

    The selected deployment type determines the list of environments that display. For example, when you create an Azure VM deployment, you can select an active Azure environment.

    Engine Type

    Type of engine to deploy:
    • Data Collector
    • Transformer

    Engine Version

    Engine version to deploy.

    Deployment Tags

    Optional tags that identify similar deployments within Control Hub. Use deployment tags to easily search and filter deployments.

    Enter nested tags using the following format:

    <tag1>/<tag2>/<tag3>

  2. Click one of the following buttons:
    • Cancel - Cancels creating the deployment and exits the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
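
If you generate deployment tags from other metadata, the nested tag format described above can be assembled with a small helper. This is a minimal sketch: the function name `join_nested_tags` and the sample segments are hypothetical, and the result is a string to type into the Deployment Tags property.

```sh
#!/bin/sh
# Join tag segments into the nested format <tag1>/<tag2>/<tag3>.
# The function name and sample segments are hypothetical.
join_nested_tags() {
    result=""
    for segment in "$@"; do
        if [ -z "$result" ]; then
            result=$segment
        else
            result="$result/$segment"
        fi
    done
    printf '%s\n' "$result"
}

join_nested_tags azure production orders
```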

Configure the Engine

Define the configuration of the engine to deploy. You can use the defaults to get started.

  1. Configure the following properties:
    Engine Property Description
    Stage Libraries

    Stage libraries to install on the engine.

    The available stage libraries depend on the selected engine type and version.

    Advanced Configuration

    Access to advanced configuration properties to further customize the engine. As you get started with StreamSets, the default values should work in most cases.

    The available properties depend on the selected engine type.

    External Resource Source

    Source of the external files and libraries, such as JDBC drivers, required by the engine:
    • None - External resources are not defined in the deployment.

      Select when using a single engine instance to get started with StreamSets, or when your pipelines do not require external resources.

    • Archive File - External resources are included in an archive file defined in the deployment.

      Select when the deployment launches multiple engine instances and when your pipelines require external resources.

    External Resource Location

    Location of the archive file that contains the external resources used by the engine. The archive file must be in TGZ or ZIP format.

    Enter the location using one of the following formats:
    • File path. For example: /mnt/shared/externalResources.tgz
    • HTTP URL. For example, if the file is stored in an Azure Blob Storage or Azure Data Lake Storage Gen2 container and the URL is shared publicly: https://<storage account name>.blob.core.windows.net/<container name>/<blob name>/externalResources.tgz
    Tip: Click the download icon to download a sample externalResources.tgz file to view the required directory structure.

    Available when using an archive file as the source for external resources.

    Engine Labels

    Labels to assign to all engine instances launched for this deployment. Labels determine the group of engine instances that run a job.

    Default is the name of the deployment.

  2. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
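
The External Resource Location rules above can be checked before you enter a value in the wizard. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the format check looks only at the file extension (`.tgz` or `.zip`) rather than the file contents, and the URL helper simply fills in the placeholders of the URL pattern shown in the property description.

```sh
#!/bin/sh
# Hypothetical helpers for the External Resource Location property.

# Return 0 when the location ends in .tgz or .zip.
# Assumption: the extension reflects the actual archive format.
has_supported_archive_extension() {
    case "$1" in
        *.tgz|*.zip) return 0 ;;
        *) return 1 ;;
    esac
}

# Print a Blob Storage URL from a storage account name ($1),
# container name ($2), and blob name ($3), following the documented pattern.
build_blob_url() {
    printf 'https://%s.blob.core.windows.net/%s/%s/externalResources.tgz\n' "$1" "$2" "$3"
}

if has_supported_archive_extension "/mnt/shared/externalResources.tgz"; then
    echo "file path uses a supported archive extension"
fi
build_blob_url "examplestorage" "examplecontainer" "exampleblob"
```

The storage account, container, and blob names in the last line are placeholders to replace with your own values.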

Configure Azure VM Zones

Select the zones to provision the VM instances in. If the Azure region selected for the parent environment does not support zones, the deployment wizard skips this step.

  1. Select the zones to provision the VM instances in.
  2. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Configure the Azure VM Autoscaling Group

Configure details about the Azure VM instances that will be provisioned.

  1. Configure the following properties:
    Azure VM Autoscaling Group Property Description
    Engine Instances

    Number of engine instances to deploy.

    For each engine instance, Azure Resource Manager provisions a VM instance in the VNet, and then deploys and launches the engine instance on that VM instance.

    Important: If your pipelines require external resources, you must set up an external resource archive that all engine instances can access before increasing the number of engine instances.

    VM Size

    Size to use for the provisioned VM instances.

    For more information about VM sizes, see the Azure Virtual Machines documentation.

    Managed Identity

    Managed identity to associate with the provisioned VM instances. Select the managed identity created as an environment prerequisite by your Azure administrator.

    If a default managed identity is defined for the Azure environment, you can accept the default or override it with a different managed identity.

    Resource Group

    Resource group that the provisioned VM instances are assigned to. Select the resource group created as an environment prerequisite by your Azure administrator.

    If a default resource group is defined for the Azure environment, you can accept the default or override it with a different resource group.

    Azure Tags

    Tags to apply to all Azure resources provisioned for this deployment.

    Enter the tags as key-value pairs. For tag naming requirements, see the Azure Resource Manager documentation.

    You can define the tags using simple or bulk edit mode. In simple edit mode, click Add Another to define additional tags. In bulk edit mode, configure tags in JSON format.

    Important: These tags are applied to Azure resources, not to Control Hub deployments.
    Init Script

    Initialization script to run on each provisioned instance.

    Use the script to set up provisioned instances with additional software as required by your organization. The script must be a valid shell script with a maximum size of 8 KB.

    Enter the script directly in the property or upload a shell script file that uses an .sh extension. After uploading, you can edit the contents of the script.

    Important: You must include a specific init script when the Azure VNet uses custom DNS servers.
  2. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
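
The Init Script limits above (a shell script file with an .sh extension and a maximum size of 8 KB) can be checked locally before uploading. The following is a minimal sketch: the function name `check_init_script` is hypothetical, and it assumes that 8 KB means 8192 bytes, which the property description does not spell out.

```sh
#!/bin/sh
# Hypothetical pre-upload check for an init script file.
# Assumption: the 8 KB limit is interpreted as 8 * 1024 = 8192 bytes.
MAX_INIT_SCRIPT_BYTES=8192

check_init_script() {
    script_path=$1
    # Uploaded files must use an .sh extension.
    case "$script_path" in
        *.sh) ;;
        *) echo "rejected: $script_path does not use an .sh extension"; return 1 ;;
    esac
    if [ ! -f "$script_path" ]; then
        echo "rejected: $script_path not found"
        return 1
    fi
    # Count bytes; tr strips the padding that some wc implementations add.
    size=$(wc -c < "$script_path" | tr -d ' ')
    if [ "$size" -gt "$MAX_INIT_SCRIPT_BYTES" ]; then
        echo "rejected: $size bytes exceeds $MAX_INIT_SCRIPT_BYTES"
        return 1
    fi
    echo "ok: $size bytes"
}
```

For example, `check_init_script ./init.sh` prints the size of the file when it passes both checks and returns a nonzero status otherwise. The check does not confirm that the file is a valid shell script.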

Configure Azure VM SSH Access

Configure the SSH key to assign to all provisioned VM instances.

  1. Configure the following properties:
    Azure VM SSH Access Property Description
    SSH Key Source

    Source for the SSH keys:
    • Public SSH Key - Assigns the full contents of a public SSH key to the provisioned VM instances.
    • Existing SSH Key Pair Name - Assigns an existing SSH key pair in your Azure account to the provisioned VM instances.

    Public SSH Key

    Full contents of the public SSH key to assign to the provisioned VM instances.

    Enter the contents of the key created as a deployment prerequisite.

    Key Pair Name

    Name of the existing Azure key pair to assign to the provisioned VM instances.

    Select the key pair created as a deployment prerequisite by your Azure administrator.

    Attach Public IP

    Attaches a public IP address to the provisioned VM instances.

    Select when you need to use SSH from a machine outside of the Azure VNet to access the VM instances.

  2. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.
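
When you use the Public SSH Key source, a common slip is pasting the private half of the key pair. The following is a rough heuristic sketch, not a StreamSets or Azure check: the function name is hypothetical, and it only tests that the text does not contain a private key marker and that it starts with an OpenSSH key type such as `ssh-rsa`. It does not verify that Azure accepts the key type.

```sh
#!/bin/sh
# Hypothetical sanity check for text destined for the Public SSH Key property.
# Reads the key text on stdin. Returns 0 when it looks like an OpenSSH
# public key line, 1 otherwise.
looks_like_public_key() {
    key_text=$(cat)
    # Private key files contain a "PRIVATE KEY" marker in their header.
    case "$key_text" in
        *"PRIVATE KEY"*) return 1 ;;
    esac
    # OpenSSH public keys start with a key type followed by a space.
    case "$key_text" in
        ssh-*" "*|ecdsa-*" "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example with a truncated, made-up key body.
if printf '%s\n' 'ssh-rsa AAAAB3NzaC1yc2E user@example' | looks_like_public_key; then
    echo "looks like a public key"
fi
```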

Share the Deployment

By default, the deployment can only be seen by you. Share the deployment with other users and groups to grant them access to it.

  1. In the Select Users and Groups field, type a user email address or a group name.
  2. Select users or groups from the list, and then click Add.

    The added users and groups display in the User / Group table.

  3. Modify permissions as needed. By default, each added user or group is granted the following permissions:
    • Read - View the details of the deployment and of all engines managed by the deployment. Restart or shut down individual engines managed by the deployment in the Engines view.
    • Write - Edit, start, stop, and delete the deployment. Delete engines managed by the deployment. Also requires read access on the parent environment.
    • Execute - Start jobs on engines managed by the deployment. Starting jobs also requires execute access on the job and read access on the pipeline.

    For more information, see Deployment Permissions.

  4. Click one of the following buttons:
    • Back - Returns to the previous step in the wizard.
    • Save & Next - Saves the deployment and continues.
    • Save & Exit - Saves the deployment and exits the wizard, displaying the incomplete deployment in the Deployments view.

Review and Launch the Deployment

You've successfully finished creating the deployment.

  1. Click one of the following buttons:
    • Exit - Saves the deployment and exits the wizard, displaying the Deactivated deployment in the Deployments view. You can start the deployment at a later time.
    • Launch Deployment - Starts the deployment, provisions VM instances in your Azure account, and launches a StreamSets engine on each instance.
      Note: It can take a few minutes to provision the VM instances and then launch the StreamSets engine instances.
  2. If the deployment launches a Transformer engine that works with a Spark cluster, you must grant the Spark cluster access to Transformer.

    For instructions, see Granting the Spark Cluster Access to Transformer in the Transformer engine documentation.

Editing an Azure VM Deployment

You can edit an Azure VM deployment while it is deactivated or active.

When you stop a deployment, all existing VM instances are deleted. After you edit properties and then restart the deployment, Control Hub uses Azure Resource Manager to provision a new group of VM instances and launch a new StreamSets engine instance on each VM instance.

When you edit a deployment while it is active, existing VM instances might be deleted, depending on the following types of edited properties:

General deployment or engine properties
When you edit general deployment or engine properties while the deployment is active, Azure Resource Manager continues running the existing VM instances. Changes are replicated to all StreamSets engine instances on the next restart of the engines.
For example, let's say you edit the deployment to install additional stage libraries on the engine instances, and then you instruct Control Hub to restart all engine instances. Control Hub restarts the StreamSets engine instances on the running VM instances, which triggers the installation of the additional stage libraries and the engine property changes.
Azure VM properties
When you edit Azure VM properties while the deployment is active, Azure Resource Manager might need to replace the existing VM instances, depending on the change. When a replacement is needed, Resource Manager replaces all of the existing VM instances, which results in engine downtime while the new instances are provisioned.
For example, if you edit a deployment to add or change the init script, Resource Manager does not replace the existing VM instances. You must restart the deployment so that Resource Manager provisions a new group of VM instances using the changed init script. If you edit a deployment to change the VM size, Resource Manager deletes all existing VM instances, and then provisions new VM instances to replace them.
Note: You cannot change the zones or resource group while the deployment is active. You must stop the deployment to change these properties.
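
The rules above for editing an active deployment can be summarized as a small lookup. This is a sketch only: the function name and the property keys (`stage_libraries`, `init_script`, `vm_size`, `zones`, `resource_group`) are hypothetical labels for the examples named in this section, not Control Hub identifiers, and properties not mentioned here are reported as unknown rather than guessed.

```sh
#!/bin/sh
# Hypothetical summary of what happens when a property is edited while the
# deployment is active. Keys are illustrative labels, not Control Hub names.
edit_effect() {
    case "$1" in
        stage_libraries)
            echo "engine restart: change applies on the next restart of the engines" ;;
        init_script)
            echo "deployment restart: restart the deployment to provision VM instances with the new script" ;;
        vm_size)
            echo "replace: existing VM instances are deleted and reprovisioned" ;;
        zones|resource_group)
            echo "blocked: stop the deployment before editing" ;;
        *)
            echo "unknown: not covered by this sketch"
            return 1 ;;
    esac
}

edit_effect vm_size
```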

To edit a deployment, locate the deployment in the Deployments view. In the Actions column, click the More icon, and then click Edit.

Tracking URL

When you view the details of an active Azure VM deployment, you can access a tracking URL to the Azure portal. Use the URL to view additional information about the Azure resources automatically provisioned for the deployment.

To access the tracking URL, click an Azure VM deployment name in the Deployments view and then locate the Tracking URL property in the deployment details.

Click the URL to open the Azure portal. The portal displays the overview page of the virtual machine scale set created for your StreamSets deployment. The overview page includes links to the VM instances and the resource group provisioned for the deployment.

Use the Azure portal to explore details about each resource and locate errors that might have occurred.
Important: Viewing details about provisioned resources in the portal can help you troubleshoot deployment configuration issues. However, StreamSets strongly advises against directly modifying the provisioned resources using the portal. Doing so may cause unexpected errors.

The following topics provide brief tips on finding the most useful information about the provisioned resources. For more details about monitoring an Azure virtual machine scale set, see the Microsoft Azure documentation.

Instances

In the overview page of the virtual machine scale set, click Settings > Instances. The Azure portal lists all VM instances provisioned for the deployment.

Click an instance name to view specific details about the VM instance, including the public and private IP addresses. For example, the following image displays a sample VM instance details page:

Resource Group

In the overview page of the virtual machine scale set, click the name of the resource group. The Azure portal displays the overview page for the resource group.

To view all deployments created for the resource group, click Settings > Deployments. The Azure portal lists each deployment with a Succeeded or Failed status.

Click the name of the first deployment in the list to view the deployment overview page. Expand the deployment details to access the public IP addresses.