Install StreamSets for Databricks on Azure

You can install StreamSets for Databricks on Microsoft Azure. StreamSets for Databricks includes both StreamSets Data Collector and Transformer.

Data Collector and Transformer are installed as RPM packages on a Linux virtual machine hosted on Microsoft Azure. Data Collector and Transformer are available as services on the instance after the deployment is complete.

Note: The StreamSets for Databricks offer includes the medium-sized Transformer offer.
  1. Log in to the Microsoft Azure portal.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets, and then select StreamSets for Databricks.
  4. Click Create.
  5. On the Create virtual machine page, enter a name for the new virtual machine, the user name to log in to that virtual machine, and the authentication method to use for logins.
    Important: Do not use sdc or transformer as the user name to log in to the virtual machine. The sdc and transformer user accounts must be reserved as the system user accounts that run Data Collector and Transformer as services.

    You can create the virtual machine in a new or existing resource group.

    You can optionally change the virtual machine size, but the default size is sufficient in most cases. If you change the default, select a size that meets both the Data Collector installation requirements and the Transformer installation requirements.

    For example, you might create a Standard A2 size virtual machine named streamsets-databricks, with a user named streamsets-user who logs in using password authentication, in a new resource group named streamsets-databricks.

  6. Click Next.
  7. On the Disks page under Advanced, verify that Use managed disks is enabled.
  8. On the Networking page, select an existing network security group or create a new one for the virtual machine.
  9. On the remaining pages, accept the defaults or configure the optional features.
  10. Verify the details on the Review and Create page, and then click Create.
    It can take several minutes for the resource to deploy and for Data Collector and Transformer to start as services.
  11. On the Overview page for the deployment, click the name of the network security group.
  12. In the Inbound security rules section for the security group, click the name of each of the following rules and then configure the range of IP addresses allowed for each port.
    Important: The default range of 0.0.0.0/0 gives all IP addresses access to Data Collector and Transformer. Be sure to modify the default values to restrict access to known IP addresses only.
    Inbound Security Rule    Description
    Data_Collector           Range of IP addresses that can access the Data Collector web-based UI on port 18630.
    Transformer              Range of IP addresses that can access the Transformer web-based UI on port 19630.
    default-allow-ssh        Range of IP addresses that can use SSH to access the virtual machine on port 22 to run the Data Collector or Transformer command line interface.
    Note: If you change the default port or enable HTTPS for Data Collector or Transformer after installation, you also need to modify the appropriate rule to reflect the changed port number.
  13. To access Data Collector, enter the following URL in the address bar of your browser:
    http://<virtual machine IP address>:18630
  14. To access Transformer, enter the following URL in the address bar of your browser:
    http://<virtual machine IP address>:19630
  15. Log in with the default credentials: user name admin and password admin.
    Tip: If you are new to Data Collector, consider starting with the Databricks Delta Lake solutions. If you are new to Transformer, start with the Transformer basics.
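
Several checks in this procedure are easy to script before or after deployment. The following Python code is a minimal sketch, not part of the StreamSets or Azure tooling. It uses only the standard library, and all function names are hypothetical helpers introduced for illustration. It shows how you might confirm that a login name is not one of the reserved service accounts, flag an inbound rule that still uses the default 0.0.0.0/0 range, build the two UI URLs from the virtual machine IP address, and test whether a port accepts connections. It assumes the default ports (18630 and 19630) and plain HTTP. If you changed the port or enabled HTTPS, adjust the values accordingly.

```python
import ipaddress
import socket

# Values taken from the procedure above (defaults; change if you reconfigured them).
DATA_COLLECTOR_PORT = 18630
TRANSFORMER_PORT = 19630
RESERVED_USERS = {"sdc", "transformer"}  # system accounts that run the services


def is_reserved_user(name):
    """Return True if the name must not be used as the VM login user."""
    return name in RESERVED_USERS


def is_open_to_all(cidr):
    """Return True if the range covers every IPv4 address, like 0.0.0.0/0."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.version == 4 and network.prefixlen == 0


def ui_urls(host):
    """Build the Data Collector and Transformer URLs for a VM address."""
    return {
        "Data Collector": f"http://{host}:{DATA_COLLECTOR_PORT}",
        "Transformer": f"http://{host}:{TRANSFORMER_PORT}",
    }


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, is_open_to_all("0.0.0.0/0") returns True, which indicates a rule that still needs to be restricted. After the deployment finishes, calling port_is_open with the virtual machine IP address and DATA_COLLECTOR_PORT gives a quick indication of whether the service has started and the inbound rule admits your address. A False result does not distinguish between a service that is still starting and a rule that blocks your address.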