Install StreamSets for Databricks on Amazon Web Services

You can install StreamSets for Databricks on Amazon Web Services (AWS). StreamSets for Databricks includes both StreamSets Data Collector and Transformer.

Data Collector and Transformer are installed as RPM packages on an Amazon Linux 2 machine hosted on EC2. Data Collector and Transformer are available as services on the instance after the deployment is complete.

Note: The StreamSets for Databricks offer includes the medium-sized Transformer offer.

For more details about StreamSets for Databricks on AWS, see the AWS Marketplace listing.

  1. In the AWS Marketplace, search for StreamSets, and then subscribe to the StreamSets for Databricks offering.
  2. Accept the terms and conditions, and then click Continue to Configuration.
  3. Select the appropriate AWS fulfillment options, and then click Continue to Launch.
  4. To launch StreamSets for Databricks from the AWS Marketplace website, choose Launch from Website and then complete the following steps:
    1. Select the recommended EC2 instance type or choose another instance type based on your expected workload.

      See the Data Collector installation requirements and the Transformer installation requirements for details.

    2. Select the appropriate VPC, subnet, and key pair settings.
    3. For the security group settings, click Create New Based on Seller Settings, enter a name for the new security group, and then configure the range of IP addresses allowed for each firewall rule.
      Important: The default range of 0.0.0.0/0 gives all IP addresses access to Data Collector and Transformer. Be sure to modify the default values to restrict access to known IP addresses only.
      Firewall Rule          Description
      Rule for port 18630    Range of IP addresses that can access the Data Collector web-based UI on port 18630.
      Rule for port 19630    Range of IP addresses that can access the Transformer web-based UI on port 19630.
    4. Click Launch.
  5. To launch StreamSets for Databricks from the AWS EC2 console, choose Launch through EC2 and then complete the following steps:
    1. Click Launch.
    2. Select the recommended EC2 instance type or choose another instance type based on your expected workload.

      See the Data Collector installation requirements and the Transformer installation requirements for details.

    3. When configuring the security group for the instance, configure the range of IP addresses allowed for each firewall rule.
      Important: The default range of 0.0.0.0/0 gives all IP addresses access to Data Collector and Transformer. Be sure to modify the default values to restrict access to known IP addresses only.
      Firewall Rule          Description
      Rule for port 18630    Range of IP addresses that can access the Data Collector web-based UI on port 18630.
      Rule for port 19630    Range of IP addresses that can access the Transformer web-based UI on port 19630.
    4. After reviewing the details, click Launch.
  6. When launching the instance, note the instance ID on the Launch Status page.

    The password for Data Collector and Transformer is the instance ID.

    AWS might require a few minutes to launch an instance.

  7. To access Data Collector, enter the following URL in the address bar of your browser:
    http://<Public DNS of EC2 instance>:18630

    For example, if your DNS is ec2-12-345-678-999.compute-1.amazonaws.com, enter:

    http://ec2-12-345-678-999.compute-1.amazonaws.com:18630
  8. To access Transformer, enter the following URL in the address bar of your browser:
    http://<Public DNS of EC2 instance>:19630

    For example, if your DNS is ec2-12-345-678-999.compute-1.amazonaws.com, enter:

    http://ec2-12-345-678-999.compute-1.amazonaws.com:19630
  9. To log in to either Data Collector or Transformer, enter admin as the user name and the EC2 instance ID as the password.
    Tip: If you are new to Data Collector, consider starting with the Databricks Delta Lake solutions. If you are new to Transformer, here are the basics.
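
The access details in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the StreamSets product. The helper names (ui_urls, is_open_to_all, default_login) and the sample instance ID are hypothetical. Only the ports, the admin user name, and the instance-ID password rule come from this procedure.

```python
import ipaddress

# Ports documented in the firewall rules above.
DATA_COLLECTOR_PORT = 18630
TRANSFORMER_PORT = 19630


def ui_urls(public_dns: str) -> dict[str, str]:
    """Build the two web UI URLs from the public DNS name of the EC2 instance."""
    host = public_dns.strip()
    return {
        "data_collector": f"http://{host}:{DATA_COLLECTOR_PORT}",
        "transformer": f"http://{host}:{TRANSFORMER_PORT}",
    }


def is_open_to_all(cidr: str) -> bool:
    """Return True if the CIDR range admits every address, as 0.0.0.0/0 does."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.prefixlen == 0


def default_login(instance_id: str) -> tuple[str, str]:
    """Return the (user name, password) pair described in the last step."""
    return ("admin", instance_id)


if __name__ == "__main__":
    urls = ui_urls("ec2-12-345-678-999.compute-1.amazonaws.com")
    print(urls["data_collector"])
    print(urls["transformer"])

    # Flag firewall ranges that were left at the wide-open default.
    for cidr in ("0.0.0.0/0", "203.0.113.0/24"):
        status = "open to all addresses" if is_open_to_all(cidr) else "restricted"
        print(cidr, "is", status)

    # Hypothetical instance ID, used only for illustration.
    print(default_login("i-0123456789abcdef0"))
```

A check like is_open_to_all can be run against the ranges you plan to enter for ports 18630 and 19630 to confirm that neither is left at the default before you launch.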