Enabling HTTPS

To secure communication within Data Collector, enable HTTPS for the following components:
Data Collector
Enable HTTPS for Data Collector to secure the communication to the Data Collector UI and REST API and to use the Data Collector as an authoring Data Collector in Control Hub.
For Control Hub cloud, an authoring Data Collector must use the HTTPS protocol because Control Hub cloud also uses the HTTPS protocol. For Control Hub on-premises, an authoring Data Collector must use the same protocol as the Control Hub on-premises installation.
Cluster pipelines
If you run cluster pipelines, enable HTTPS for cluster pipelines to secure the communication between the gateway and worker nodes in the cluster.
Pipeline stages that connect to external systems
During pipeline development, developers can enable specific stages to use SSL/TLS to secure the communication with an external system. For example, when designing a pipeline that writes to a Cassandra cluster enabled for SSL/TLS, the developer must configure the Cassandra destination to use SSL/TLS to connect to Cassandra.
For information about enabling HTTPS for pipeline stages, see SSL/TLS Encryption.

By default, Data Collector and cluster pipelines use the HTTP protocol. StreamSets recommends using HTTPS in a production environment. HTTPS requires SSL/TLS certificates.

Prerequisites

Before you enable HTTPS for Data Collector and cluster pipelines, complete the following requirements:

Obtain access to OpenSSL and Java keytool
If you do not have keystore files that include SSL/TLS certificates signed by a certificate authority (CA), you can request certificates and create the keystore files using the following tools:
  • OpenSSL - Use OpenSSL to create a Certificate Signing Request (CSR) that you send to the CA of your choice, as well as to create the keystore and truststore files. For more information, see the OpenSSL documentation.
  • Java keytool - You can also use Java keytool to create a CSR and to create keystore and truststore files. Java keytool is part of the Java Development Kit (JDK). For more information, see the keytool documentation.
Generate SSL/TLS certificate and private key pairs signed by a certificate authority (CA)
To enable HTTPS for Data Collector, generate a single private key and public certificate pair for Data Collector. Data Collector provides a self-signed certificate that you can use. However, web browsers generally issue a warning for self-signed certificates. StreamSets strongly recommends that you generate a key and certificate pair signed by a CA.
To enable HTTPS for cluster pipelines, each worker node requires a certificate. You can generate multiple private key and public certificate pairs - one for each worker node. Or you can generate a single Subject Alternative Names (SAN) certificate valid for all of the worker nodes. Data Collector runs on the gateway node in the cluster, so the gateway node uses the same certificate that you generate for Data Collector.
Important: Each signed certificate must include the fully qualified domain name (FQDN) for the Data Collector machine or worker node.
To obtain a certificate from a trusted CA, you must provide proof that you are the owner of the domain name for which you are requesting the certificate. Use OpenSSL or keytool to generate a key pair and then submit a Certificate Signing Request (CSR) to the CA. The exact procedure depends on the CA that you choose to use - see the documentation provided by the CA.
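As a rough illustration of the OpenSSL route described above, the following sketch generates a private key and a CSR whose subject and Subject Alternative Name both carry the FQDN. The function name generate_csr, the 2048-bit RSA key size, and the output file names are assumptions made for this example, and the -addext option requires OpenSSL 1.1.1 or later. Your CA may require additional subject fields.

```shell
# Hypothetical helper: create <fqdn>.key and <fqdn>.csr in the given directory.
# Assumes OpenSSL 1.1.1 or later (needed for -addext).
generate_csr() {
  fqdn="$1"
  outdir="$2"
  # -nodes leaves the private key unencrypted; protect the file accordingly.
  openssl req -new -newkey rsa:2048 -nodes \
    -keyout "$outdir/$fqdn.key" \
    -out "$outdir/$fqdn.csr" \
    -subj "/CN=$fqdn" \
    -addext "subjectAltName=DNS:$fqdn"
}
```

For example, generate_csr sdc.company.com . writes sdc.company.com.key and sdc.company.com.csr to the current directory. You can inspect the request with openssl req -in sdc.company.com.csr -noout -text before sending it to the CA.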

Step 1. Create Keystore Files

Create a keystore file that includes each private key and public certificate pair signed by the CA. A keystore holds the private key and certificate that a machine presents to prove its own identity during an SSL/TLS handshake.

To enable HTTPS for Data Collector, create a single keystore file for Data Collector.

To enable HTTPS for cluster pipelines, each worker node requires a keystore file. If you generated a unique certificate for each worker node, create a unique keystore file for each of those certificates. Or if you generated a SAN certificate valid for all of the worker nodes, create a single keystore file that all the worker nodes can use. Data Collector runs on the gateway node in the cluster, so the gateway node uses the same keystore file that you create for Data Collector.

StreamSets recommends creating keystores in the PKCS #12 (p12 file) format. In most cases, a CA issues certificates in PEM format. Use OpenSSL to directly import the certificate into a PKCS #12 keystore.

  1. Use the following command to import the certificate and private key issued in PEM format to a PKCS #12 keystore for Data Collector:
    openssl pkcs12 -export -in <PEM certificate> -inkey <private key> -out <keystore filename> -name <keystore name> 

    You will be prompted to create a password for the keystore file.

    For example, the following command converts the certificate sdc_company_com.pem and private key sdc_company_com.key to the PKCS #12 keystore file named sdc_company_com.p12:
    openssl pkcs12 -export -in sdc_company_com.pem -inkey sdc_company_com.key -out sdc_company_com.p12 -name sdc_company_com
  2. Store the Data Collector keystore file in the Data Collector resources directory, $SDC_RESOURCES.
  3. Store the keystore password in a text file named keystore-password.txt in the Data Collector resources directory, $SDC_RESOURCES.
  4. If enabling HTTPS for cluster pipelines, repeat step 1 for each worker node in the cluster.
  5. Store each worker keystore file in the same absolute location on each worker node in the cluster.
    For example, if we generated worker keystore files named sdc_worker.p12, we'd store the files in the following directory on each worker node:
    /opt/security/sdc_worker.p12
  6. Store the worker keystore password in a password text file in the same absolute location on each worker node in the cluster.
    For example, we'd store the keystore-password.txt file in the following directory on each worker node:
    /opt/security/keystore-password.txt
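Steps 3 and 6 store a password in a plain text file. The following is a minimal sketch of one way to create such a file so that only its owner can read it. The helper name write_password_file is hypothetical, and writing the password without a trailing newline is a cautious assumption rather than a documented Data Collector requirement.

```shell
# Hypothetical helper: write a password to a file that only the owner can read.
# $1 = password, $2 = target file
write_password_file() {
  # The subshell keeps the restrictive umask from leaking into the caller.
  (
    umask 077
    printf '%s' "$1" > "$2"
  )
  # Tighten permissions explicitly in case the file already existed.
  chmod 600 "$2"
}
```

For example, write_password_file 'mypassword' "$SDC_RESOURCES/keystore-password.txt" creates the file referenced in step 3. Passing the password as an argument can leave it in your shell history, so you may prefer to read it from a prompt instead.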

Step 2. Create a Truststore File

A truststore file contains certificates from trusted CAs that an SSL/TLS client uses to verify the identity of an SSL/TLS server.

Data Collector requires a truststore file to verify the identity of the following SSL/TLS servers:
  • Secure LDAP server when Data Collector is configured for secure LDAP authentication.
  • Control Hub on-premises installation enabled for HTTPS when Data Collector is registered with Control Hub on-premises.
  • Worker node when Data Collector runs cluster pipelines enabled for HTTPS.

If you've enabled HTTPS for cluster pipelines, worker nodes require a truststore file to verify the identity of the gateway node where Data Collector is installed.

By default, Data Collector and worker nodes use the default Java truststore file located in $JAVA_HOME/jre/lib/security/cacerts. If your certificates are signed by a trusted CA that is included in the default Java truststore file, you do not need to create a truststore file for Data Collector or worker nodes and can skip this step.

If your certificates are signed by a private CA or not trusted by the default Java truststore, you must create a custom truststore file or modify a copy of the default Java truststore file to add the root and intermediate CA certificates to the Data Collector and worker node truststore file. For example, if your organization generates its own certificates, you must add the root and intermediate certificates for your organization to the truststore file.

You can create a single truststore file used by both Data Collector and worker nodes. Or you can create separate truststore files.

In these steps, we show how to modify a copy of the default truststore file to add an additional CA to the list of trusted CAs. We assume that the same CA signed our certificates used by Data Collector and by each worker node in the cluster. If multiple CAs signed your certificates, you'll need to add each CA to the truststore file.

If you prefer to create a custom truststore file, see the keytool documentation.

You can create the following types of truststores for Data Collector and worker nodes:
  • Java keystore file (JKS)
  • PKCS #12 (p12 file)
  1. Use the following command to set the JAVA_HOME environment variable:
    export JAVA_HOME=<Java home directory>
  2. Use the following command to set the SDC_CONF environment variable:
    export SDC_CONF=<Data Collector configuration directory>
    For example, for an RPM installation use:
    export SDC_CONF=/etc/sdc
  3. Use the following command to copy the default Java truststore file to the Data Collector configuration directory:
    cp "${JAVA_HOME}/jre/lib/security/cacerts" "${SDC_CONF}/truststore.jks"
  4. Use the following keytool command to import the CA certificate into the truststore file:
    keytool -import -file <CA certificate> -trustcacerts -noprompt -alias <CA alias> -storepass <password> -keystore "${SDC_CONF}/truststore.jks"
    For example:
    keytool -import -file sdc_company_com.pem -trustcacerts -noprompt -alias MyCorporateCA -storepass changeit -keystore "${SDC_CONF}/truststore.jks"
  5. If you are enabling HTTPS for cluster pipelines, copy the modified truststore file to the same absolute location on each worker node in the cluster.
    For example, we'd store our truststore.jks file in the following directory on each worker node:
    /opt/security/truststore.jks

    Then store the truststore password in a password text file in the same absolute location on each worker node in the cluster.

    For our example, we'd store the password in a file named truststore-password.txt in the following directory on each worker node:
    /opt/security/truststore-password.txt
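The copy command in step 3 assumes the Java 8 directory layout, where the default truststore sits under jre/lib/security. Java 9 and later have no jre subdirectory and keep the file under lib/security. The following sketch handles both layouts; the function name find_default_cacerts is an assumption made for this example.

```shell
# Hypothetical helper: print the path of the default Java truststore.
# Checks the Java 8 layout first, then the Java 9+ layout.
find_default_cacerts() {
  java_home="$1"
  for candidate in \
    "$java_home/jre/lib/security/cacerts" \
    "$java_home/lib/security/cacerts"
  do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no cacerts file found under $java_home" >&2
  return 1
}
```

With this helper, the copy in step 3 could be written as cp "$(find_default_cacerts "$JAVA_HOME")" "${SDC_CONF}/truststore.jks".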

Step 3. Configure Data Collector to Use HTTPS

Modify Data Collector configuration properties to configure Data Collector to use a secure port and your keystore file. If you created a custom truststore file or modified a copy of the default Java truststore file, configure Data Collector to use that truststore file.

  1. To define the secure port and keystore file, configure the following properties in the Data Collector configuration file, sdc.properties:
    Data Collector HTTPS properties:

    sdc.base.http.url
        Data Collector URL. If the property is uncommented and defined, modify it to use the HTTPS protocol and the secure port number, for example:

        sdc.base.http.url=https://myhost:18636

        If the property is commented out, you do not need to define it.

    https.port
        Secure port number for Data Collector. For example, 18636.

        Any number other than -1 enables the secure port.

        Note: When both the HTTP and HTTPS port properties are defined, requests to the HTTP port are redirected to the HTTPS port.

    https.keystore.path
        Path and name of the keystore file. Enter an absolute path or a path relative to the $SDC_RESOURCES directory.

        For example: https.keystore.path=sdc_company_com.p12

        Note: Default is keystore.jks in the $SDC_CONF directory, which provides a self-signed certificate that you can use. However, StreamSets strongly recommends that you generate a certificate signed by a trusted CA, as described in Prerequisites.

    https.keystore.password
        Password to open the keystore file. To protect the password, store the password in an external location and then use a function to retrieve the password.

        For example, if you added the password to a text file named keystore-password.txt, configure the property as follows:

        https.keystore.password=${file("keystore-password.txt")}

    https.require.hsts
        Requires Data Collector to include the HTTP Strict Transport Security (HSTS) response header.

        To enable HSTS, set to true when Data Collector uses HTTPS.

        Default is false.

  2. For an installation started as a service on operating systems that use the systemd init system, edit the sdc.socket file to use the same secure port that you just defined.
    The location of the sdc.socket file depends on how you installed Data Collector:
    • From the RPM package - /usr/lib/systemd/system/sdc.socket
    • From the tarball - /etc/systemd/system/sdc.socket
    For example, if you defined the Data Collector secure port number as 18636, modify these lines in the file as follows:
    [Socket]
    ListenStream=18636
    ListenStream=0.0.0.0:18636
  3. Use the following command to reload the systemd manager configuration:
    systemctl daemon-reload
  4. If you created a custom truststore file or modified a copy of the default Java truststore file for Data Collector to use, define the following options in the SDC_JAVA_OPTS environment variable:
    • javax.net.ssl.trustStore - Path to the truststore file on the Data Collector machine.
    • javax.net.ssl.trustStorePassword - Truststore password.

    Modify environment variables using the method required by your installation type.

    For example, define the options as follows:
    export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Djavax.net.ssl.trustStore=/etc/sdc/truststore.jks -Djavax.net.ssl.trustStorePassword=mypassword -Xmx1024m -Xms1024m -server -XX:-OmitStackTraceInFastThrow"

    Or to avoid saving the password in the export command, save the password in a text file and then define the truststore password option as follows: -Djavax.net.ssl.trustStorePassword=$(cat passwordfile.txt)

    Then ensure that the password file is readable only by the user executing the export command.

  5. Restart Data Collector to enable the changes.
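Before restarting, it can help to confirm that the properties from step 1 are actually set. The sketch below reads uncommented key=value lines from sdc.properties and reports whether HTTPS is enabled, based on the rule above that any https.port value other than -1 enables the secure port. The function names get_prop and https_enabled are hypothetical, and the parser is deliberately simple: it ignores line continuations and other properties-file syntax.

```shell
# Hypothetical helper: print the value of a key from a properties file.
# Skips commented lines; if the key appears more than once, the last value wins.
# $1 = property name, $2 = properties file
get_prop() {
  awk -v key="$1" '
    /^[ \t]*#/ { next }                  # skip comments
    index($0, "=") == 0 { next }         # skip lines without "="
    {
      k = substr($0, 1, index($0, "=") - 1)
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", k)     # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)     # trim whitespace around the value
      if (k == key) val = v
    }
    END { if (val != "") print val }
  ' "$2"
}

# Succeeds when https.port is set to anything other than -1.
# Treating a missing property as "disabled" is an assumption of this sketch.
https_enabled() {
  port=$(get_prop https.port "$1")
  [ -n "$port" ] && [ "$port" != "-1" ]
}
```

For example, https_enabled "$SDC_CONF/sdc.properties" && echo "HTTPS port: $(get_prop https.port "$SDC_CONF/sdc.properties")" prints the configured secure port only when one is enabled.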

Step 4. Configure Cluster Pipelines to Use HTTPS

To enable HTTPS for cluster pipelines, configure the gateway and worker nodes in the cluster to use HTTPS. If you do not run cluster pipelines, you can skip this step.

Modify the Data Collector configuration file, sdc.properties, on the gateway node to configure the worker nodes to use the keystore file stored on each worker node. If you created a custom truststore file or modified a copy of the default Java truststore file, configure the gateway and worker nodes to use the truststore file.

  1. Verify that you've completed the previous step to configure Data Collector to use HTTPS.
    The gateway node uses the keystore file and keystore password configured for Data Collector.
  2. To define the keystore file used by the worker nodes, configure the following properties in the Data Collector configuration file, sdc.properties, on the gateway node:
    Cluster pipeline keystore properties:

    https.cluster.keystore.path
        Absolute path and file name of the keystore file on worker nodes. The file must be in the same location on each worker node.

        In our example, we'd configure the property as follows:

        https.cluster.keystore.path=/opt/security/sdc_worker.p12

    https.cluster.keystore.password
        Absolute path and name of the file that contains the password to the keystore file on the worker nodes. The file must be in the same location on each worker node.

        In our example, we'd configure the property as follows:

        https.cluster.keystore.password=${file("/opt/security/keystore-password.txt")}
  3. If you created a custom truststore file or modified a copy of the default Java truststore file for cluster pipelines, uncomment and configure the following properties in the Data Collector configuration file, sdc.properties, on the gateway node:
    Cluster pipeline truststore properties:

    https.truststore.path
        Path and name of the truststore file on the gateway node that stores certificates to trust the identity of the worker nodes. Enter the same path that you entered for the truststore file configured for Data Collector.

        In our example, we'd configure the property as follows:

        https.truststore.path=/etc/sdc/truststore.jks

        If you register Data Collector with Control Hub, set the path and name of the truststore file in the SDC_JAVA_OPTS environment variable with the -Djavax.net.ssl.trustStore option. Run the following command:

        export SDC_JAVA_OPTS="-Djavax.net.ssl.trustStore=<path to truststore file> ${SDC_JAVA_OPTS}"

    https.truststore.password
        Password or path and name of the file that contains the password to the truststore file on the gateway node. Enter the same truststore password that you configured for Data Collector.

        If configuring a path to a password file, enter an absolute path or a path relative to the $SDC_CONF directory.

        In our example, we'd configure the property as follows:

        https.truststore.password=mypassword

    https.cluster.truststore.path
        Absolute path and name of the truststore file on the worker nodes that stores the certificate to trust the identity of the gateway node. The file must be in the same location on each worker node.

        In our example, we'd configure the property as follows:

        https.cluster.truststore.path=/opt/security/truststore.jks

    https.cluster.truststore.password
        Absolute path and name of the file that contains the password to the truststore file on the worker nodes. The file must be in the same location on each worker node.

        In our example, we'd configure the property as follows:

        https.cluster.truststore.password=${file("/opt/security/truststore-password.txt")}
  4. Restart Data Collector to enable the changes.
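Because every https.cluster.* property points at an absolute path that must exist on each worker node, a quick per-node check can catch a missing file before you start a cluster pipeline. This is a sketch only: the function name check_node_files is hypothetical, and how you run it on each worker (for example, over SSH) depends on your environment.

```shell
# Hypothetical helper: verify that each listed file exists and is readable.
# Reports every missing path on stderr and returns non-zero if any are missing.
check_node_files() {
  status=0
  for path in "$@"; do
    if [ ! -r "$path" ]; then
      echo "missing or unreadable: $path" >&2
      status=1
    fi
  done
  return "$status"
}
```

With the example paths used in this section, the check on a worker node would be check_node_files /opt/security/sdc_worker.p12 /opt/security/keystore-password.txt /opt/security/truststore.jks /opt/security/truststore-password.txt. Omit the truststore paths if the worker nodes use the default Java truststore.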