Databricks Job Launcher

The Databricks Job Launcher executor starts a Databricks job each time it receives an event. You can run jobs based on notebooks or JARs. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Use the executor to start a Databricks job as part of an event stream. You can use the executor in any logical way, such as running Databricks jobs after the Hadoop FS, MapR FS, or Amazon S3 destination closes files.

Note that the Databricks Job Launcher executor starts a job in an external system. It does not monitor the job or wait for it to complete. The executor becomes available for additional processing as soon as it successfully submits a job.
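
For context, starting a job this way corresponds to a call to the Databricks Jobs run-now endpoint. The following is a minimal sketch of an equivalent request, assuming Jobs API 2.0, a personal access token, and hypothetical values for the domain and job ID:

  import requests

  CLUSTER_BASE_URL = "https://yourdomain.cloud.databricks.com"  # hypothetical domain
  TOKEN = "<personal-access-token>"  # placeholder credential

  # Trigger a run of an existing job; Databricks responds with the run ID.
  response = requests.post(
      f"{CLUSTER_BASE_URL}/api/2.0/jobs/run-now",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json={"job_id": 42},  # hypothetical job ID, noted in the prerequisites
  )
  response.raise_for_status()
  print(response.json()["run_id"])  # the run ID also appears in executor events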

Before you use the executor, perform the necessary prerequisites.

When you configure the executor, you specify the cluster base URL, job type, job ID, and user credentials. You can optionally configure job parameters and security such as an HTTP proxy and SSL/TLS details.

You can configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Prerequisites

Before you run a pipeline that starts jobs on Databricks, perform the following tasks in Databricks:
  1. Create the job.

    The Databricks Job Launcher executor can start jobs based on notebooks or JARs.

  2. Optionally configure the job to allow concurrent runs.

    By default, Databricks does not allow running multiple instances of a job at the same time. As a result, if the Databricks Job Launcher executor receives multiple events in quick succession, it starts multiple instances of the job, but Databricks queues those instances and runs them one at a time.

    To enable parallel processing, configure the job in Databricks to allow concurrent runs. You can set the maximum number of concurrent runs through the Databricks API with the max_concurrent_runs parameter, or through the UI with the Maximum Concurrent Runs property in the Jobs > Advanced menu. For one way to set the limit through the API, see the sketch after these steps.

  3. Save the job and note the job ID.

    When you save the job, Databricks generates a job ID. Use the job ID when you configure the Databricks Job Launcher executor.
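
As referenced in step 2, the concurrency limit can also be raised through the Jobs API. The following is a minimal sketch, assuming Jobs API 2.0, that the update endpoint is available in your workspace, and hypothetical values for the domain, token, and job ID:

  import requests

  CLUSTER_BASE_URL = "https://yourdomain.cloud.databricks.com"  # hypothetical domain
  TOKEN = "<personal-access-token>"  # placeholder credential

  # Allow up to four runs of the job to execute at the same time.
  response = requests.post(
      f"{CLUSTER_BASE_URL}/api/2.0/jobs/update",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json={"job_id": 42, "new_settings": {"max_concurrent_runs": 4}},
  )
  response.raise_for_status()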

Event Generation

The Databricks Job Launcher executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it starts a Databricks job.

Databricks Job Launcher executor events can be used in any logical way. For example, because each event includes the run ID of the started job, you might generate events to keep a log of run IDs.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the Databricks Job Launcher executor have the following event-related record header attributes. Record header attributes are stored as String values:

  sdc.event.type - Event type. Uses the following type:
    • AppSubmittedEvent - Generated when the executor starts a Databricks job.
  sdc.event.version - Integer that indicates the version of the event record type.
  sdc.event.creation_timestamp - Epoch timestamp when the stage created the event.

Event records generated by the Databricks Job Launcher executor have the following field:

  app_id - Run ID of the Databricks job.
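
Putting the header attributes and event field together, a single generated event might look like the following sketch. The shape follows the tables above; the values are hypothetical:

  # Illustrative shape of one Databricks Job Launcher executor event.
  header_attributes = {
      "sdc.event.type": "AppSubmittedEvent",
      "sdc.event.version": "1",                         # stored as a String
      "sdc.event.creation_timestamp": "1616428800000",  # epoch timestamp, as a String
  }
  fields = {
      "app_id": "1234",  # run ID that Databricks assigned to the started job
  }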

Monitoring

Data Collector does not monitor Databricks jobs. Use your regular cluster monitor application to view the status of jobs.

Jobs started by the Databricks Job Launcher executor appear in Databricks under the job ID specified in the stage. The job ID is the same for all instances of the job. To find the run ID for a particular instance, check the Data Collector log.

The executor also writes the run ID of the job to the event record. To keep a record of all run IDs, enable event generation for the stage.
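
To check on a run programmatically instead, you can pass the run ID from the event record to the Databricks Jobs runs/get endpoint. A minimal sketch, assuming Jobs API 2.0 and hypothetical values for the domain, token, and run ID:

  import requests

  CLUSTER_BASE_URL = "https://yourdomain.cloud.databricks.com"  # hypothetical domain
  TOKEN = "<personal-access-token>"  # placeholder credential

  # Look up the state of a single run by its run ID.
  response = requests.get(
      f"{CLUSTER_BASE_URL}/api/2.0/jobs/runs/get",
      headers={"Authorization": f"Bearer {TOKEN}"},
      params={"run_id": 1234},  # run ID taken from the executor event record
  )
  response.raise_for_status()
  print(response.json()["state"])  # life_cycle_state and, once done, result_state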

Configuring a Databricks Job Launcher Executor

Configure a Databricks Job Launcher executor to start a Databricks job each time the executor receives an event record.

  1. In the Properties panel, on the General tab, configure the following properties:

    Name - Stage name.
    Description - Optional description.
    Stage Library - Library version that you want to use.
    Produce Events - Generates event records when events occur. Use for event handling.
    Required Fields - Fields that must include data for the record to be passed into the stage.
      Tip: You might include fields that the stage uses.
      Records that do not include all required fields are processed based on the error handling configured for the pipeline.
    Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.
      Records that do not meet all preconditions are processed based on the error handling configured for the stage.

  2. On the Job tab, configure the following properties:

    Cluster Base URL - Databricks URL for your company. The URL uses the following format:
      https://<your domain>.cloud.databricks.com
    Job Type - Job type to run: Notebook or JAR.
    Job ID - Job ID that Databricks generated when you saved the job, as described in the prerequisites.
    Parameters - Parameters to pass to the job. Enter the parameters exactly as expected, and in the expected order. The executor does not validate the parameters.
      You can use the expression language in job parameters. For example, when performing post-processing on an Amazon S3 object, you can use the following expression to retrieve the object key name from the event record:
        ${record:field('/objectKey')}
      For a sketch of how parameters reach Databricks, see the example after this procedure.
    Use Proxy - Enables using an HTTP proxy to connect to the system.
  3. On the Credentials tab, configure the following properties:

    Credential Type - Type of credential to use to connect to Databricks: Username/Password or Token.
    Username - Databricks user name.
    Password - Password for the account.
      Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Token - Personal access token for the account.
  4. To use an HTTP proxy, on the Proxy tab, configure the following properties:

    Proxy URI - Proxy URI.
    Username - Proxy user name.
    Password - Proxy password.
      Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
  5. To use SSL/TLS, on the TLS tab, configure the following properties:

    Use TLS - Enables the use of TLS.
    Use Remote Keystore - Enables loading the contents of the keystore from a remote credential store or from values entered in the stage properties. For more information, see Remote Keystore and Truststore.
    Private Key - Private key used in the remote keystore. Enter a credential function that returns the key or enter the contents of the key.
    Certificate Chain - Each PEM certificate used in the remote keystore. Enter a credential function that returns the certificate or enter the contents of the certificate.
      Using simple or bulk edit mode, click the Add icon to add additional certificates.
    Keystore File - Path to the local keystore file. Enter an absolute path to the file or enter the following expression to define the file stored in the Data Collector resources directory:
      ${runtime:resourcesDirPath()}/keystore.jks
      By default, no keystore is used.
    Keystore Type - Type of keystore to use. Use one of the following types:
      • Java Keystore File (JKS)
      • PKCS #12 (p12 file)
      Default is Java Keystore File (JKS).
    Keystore Password - Password to the keystore file. A password is optional, but recommended.
      Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Keystore Key Algorithm - Algorithm used to manage the keystore.
      Default is SunX509.
    Use Remote Truststore - Enables loading the contents of the truststore from a remote credential store or from values entered in the stage properties. For more information, see Remote Keystore and Truststore.
    Trusted Certificates - Each PEM certificate used in the remote truststore. Enter a credential function that returns the certificate or enter the contents of the certificate.
      Using simple or bulk edit mode, click the Add icon to add additional certificates.
    Truststore File - Path to the local truststore file. Enter an absolute path to the file or enter the following expression to define the file stored in the Data Collector resources directory:
      ${runtime:resourcesDirPath()}/truststore.jks
      By default, no truststore is used.
    Truststore Type - Type of truststore to use. Use one of the following types:
      • Java Keystore File (JKS)
      • PKCS #12 (p12 file)
      Default is Java Keystore File (JKS).
    Truststore Password - Password to the truststore file. A password is optional, but recommended.
      Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Truststore Trust Algorithm - Algorithm used to manage the truststore.
      Default is SunX509.
    Use Default Protocols - Uses the default TLSv1.2 transport layer security (TLS) protocol. To use a different protocol, clear this option.
    Transport Protocols - TLS protocols to use. To use a protocol other than the default TLSv1.2, click the Add icon and enter the protocol name. You can use simple or bulk edit mode to add protocols.
      Note: Older protocols are not as secure as TLSv1.2.
    Use Default Cipher Suites - Uses a default cipher suite for the SSL/TLS handshake. To use a different cipher suite, clear this option.
    Cipher Suites - Cipher suites to use. To use a cipher suite that is not part of the default set, click the Add icon and enter the name of the cipher suite. You can use simple or bulk edit mode to add cipher suites.
      Enter the Java Secure Socket Extension (JSSE) name for each additional cipher suite that you want to use.
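
As referenced in the Parameters property above, the executor passes the configured parameters to the job when it starts it. The following sketch shows roughly how parameters map onto a Jobs API run-now request, assuming Jobs API 2.0 and hypothetical values for the domain, token, job ID, and parameter names; a notebook job takes named parameters, while a JAR job takes an ordered list of strings:

  import requests

  CLUSTER_BASE_URL = "https://yourdomain.cloud.databricks.com"  # hypothetical domain
  TOKEN = "<personal-access-token>"  # placeholder credential

  # An expression such as ${record:field('/objectKey')} would already be
  # evaluated to a concrete value by the time it reaches Databricks.
  payload = {
      "job_id": 42,  # hypothetical job ID
      "notebook_params": {"objectKey": "path/to/object"},  # notebook job: key-value pairs
      # For a JAR job, pass an ordered list instead:
      # "jar_params": ["path/to/object"],
  }
  response = requests.post(
      f"{CLUSTER_BASE_URL}/api/2.0/jobs/run-now",
      headers={"Authorization": f"Bearer {TOKEN}"},
      json=payload,
  )
  response.raise_for_status()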