Databricks ML Evaluator (deprecated)

The Databricks ML Evaluator processor uses a machine learning model exported with Databricks ML Model Export to generate evaluations, scoring, or classifications of data.

Important: This stage is deprecated and may be removed in a future release.

With the Databricks ML Evaluator processor, you can create pipelines that produce data-driven insights in real time. For example, you can design pipelines that detect fraudulent transactions or that perform natural language processing as data passes through the pipeline.

To use the Databricks ML Evaluator processor, you first build and train the model with Apache Spark MLlib. You then export the trained model with Databricks ML Model Export and save the exported model directory on the Data Collector machine that runs the pipeline.

When you configure the Databricks ML Evaluator processor, you specify the path to the exported model saved on the Data Collector machine. You also specify the root field in the input data to send to the model, the output columns to return from the model, and the record field that stores the model output.

Prerequisites

Before configuring the Databricks ML Evaluator processor, you must complete the following prerequisites:

Build and train a machine learning model with Apache Spark MLlib.
Export the trained model with Databricks ML Model Export. For more information, see the Databricks documentation.
Save the exported directory on the Data Collector machine that runs the pipeline. StreamSets recommends storing the model directory in the Data Collector resources directory, $SDC_RESOURCES.
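
The last prerequisite can be scripted. The following is a minimal sketch, assuming the exported model directory is already reachable on the local file system and that the SDC_RESOURCES environment variable is set on the Data Collector machine. The helper name install_model is hypothetical and is not part of Data Collector.

```python
import os
import shutil
from pathlib import Path


def install_model(exported_dir, resources_dir=None):
    """Copy an exported model directory into the Data Collector resources directory.

    Hypothetical helper: resources_dir defaults to the SDC_RESOURCES
    environment variable, which is assumed to be set on this machine.
    """
    resources = Path(resources_dir or os.environ["SDC_RESOURCES"])
    source = Path(exported_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"exported model directory not found: {source}")
    target = resources / source.name
    # dirs_exist_ok=True (Python 3.8+) lets a fresh export overwrite an older copy.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

The function returns the location of the copied directory, so the directory name can then be entered as a path relative to the resources directory.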

Databricks Model as a Microservice

External clients can use a model exported with Databricks ML Model Export to perform computations when you include a Databricks ML Evaluator processor in a microservice pipeline.

For example, in the following microservice pipeline, a REST API client sends a request with input data to the REST Service origin. The Databricks ML Evaluator processor uses a machine learning model to generate predictions from the data. The processor passes records that contain the model's predictions to the Send Response to Origin destination, labeled Send Predictions, which sends the records back to the REST Service origin. The origin then transmits JSON-formatted responses back to the originating REST API client.
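
From the client side, the exchange might look like the following sketch. It only builds the HTTP request and leaves the call that sends it commented out. The URL, the application ID value, the sample record, and the X-SDC-APPLICATION-ID header name are assumptions for illustration; check them against the REST Service origin configuration in your pipeline.

```python
import json
import urllib.request


def build_prediction_request(url, record, application_id):
    """Build (but do not send) a POST request carrying one input record as JSON."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the application ID of the origin.
            "X-SDC-APPLICATION-ID": application_id,
        },
    )


# Hypothetical endpoint, record, and application ID.
request = build_prediction_request(
    "http://localhost:8000/", {"features": [1.0, 2.0]}, "example-app-id"
)

# To send the request to a running pipeline and read the JSON response:
# with urllib.request.urlopen(request) as response:
#     predictions = json.loads(response.read())
```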

Example: Ground Cover Model

For example, suppose you use Apache Spark MLlib to build and train a model that predicts ground cover in a forest, and then you export the model with Databricks ML Model Export. The model predicts the ground cover based on inputs about soil types, topography, and tree coverage.

You can give the model the following inputs:

{
  "origLabel": -1.0,
  "features": {
    "type": 0,
    "size": 13,
    "indices": [0,2,3,4,6,7,8,9,10,11,12],
    "values": [74.0,2.0,120.0,269.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0]
  }
}
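
The features field in this input looks like a sparse vector: size gives the vector length, and each entry in indices names the position of the matching entry in values. The following sketch expands it into a dense list to make the encoding easier to read. It assumes that type 0 marks a sparse vector, as in the Spark MLlib vector encoding; the function name to_dense is hypothetical.

```python
import json

SAMPLE_INPUT = """
{
  "origLabel": -1.0,
  "features": {
    "type": 0,
    "size": 13,
    "indices": [0,2,3,4,6,7,8,9,10,11,12],
    "values": [74.0,2.0,120.0,269.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0]
  }
}
"""


def to_dense(vector):
    """Expand a sparse vector (assumed: type 0) into a plain list of floats."""
    if vector["type"] != 0:
        raise ValueError("this sketch only handles the sparse encoding (type 0)")
    if len(vector["indices"]) != len(vector["values"]):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * vector["size"]  # positions not listed in indices stay 0.0
    for index, value in zip(vector["indices"], vector["values"]):
        dense[index] = value
    return dense


record = json.loads(SAMPLE_INPUT)
dense_features = to_dense(record["features"])
```

In the dense form, positions 1 and 5 hold 0.0 because they do not appear in indices.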

The model then generates a predicted ground cover type, the corresponding label, and the probability of each type, as shown in the following table:


Label	Prediction	Probability
Moss	0	0 – 0.86; 1 – 0.14
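
The relationship between the prediction and probability columns can be sketched as follows. This assumes that the model reports the most probable class as its prediction, which is typical for classifiers with default thresholds but is not stated for this model. The function name is hypothetical.

```python
def most_probable_class(probabilities):
    """Return the class index with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Values from the table above: class index -> probability.
probability = {0: 0.86, 1: 0.14}
prediction = most_probable_class(probability)
```

Under that assumption, prediction evaluates to 0, which matches the Prediction column for the Moss label.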

To include this model in a pipeline, save the model on the Data Collector machine, add the Databricks ML Evaluator processor to the pipeline, and then configure the processor to use the saved model, to read the needed input, and to include the generated output columns in a field in the record. The following image shows the processor configuration:

Configuring a Databricks ML Evaluator Processor

Configure a Databricks ML Evaluator processor to generate evaluations, scoring, or classifications of data with a machine learning model exported with Databricks ML Model Export.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Required Fields	Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the pipeline.
Preconditions	Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
On Record Error	Error record handling for the stage: Discard - Discards the record. Send to Error - Sends the record to the pipeline for error handling. Stop Pipeline - Stops the pipeline. Not valid for cluster pipelines.

On the Databricks ML tab, configure the following properties:


Databricks ML Property	Description
Saved Model Path	Path to the saved model directory on the Data Collector machine. Specify either an absolute path or a path relative to the Data Collector resources directory. For example, if you saved a model directory named textAnalysis in the Data Collector resources directory /var/lib/sdc-resources, then enter either `/var/lib/sdc-resources/textAnalysis` or `textAnalysis`.
Model Output Columns	Model output columns to return to the record. By default, the processor includes the following columns common to many models: label, prediction, and probability. You can remove a column that does not apply, and you can add other columns from your model as needed.
Input Root Field	Root field in the record passed as input to the model. From the drop-down list, select an input field from the record to pass that field and any child fields to the model, or enter `/` to pass all the fields to the model.
Output Field	Map field that stores model output in the record. Specify as a path.
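
The way the Databricks ML properties interact can be illustrated with a short simulation. This is a sketch of the behavior described in the table, not the processor's implementation: every function name is hypothetical, the model is a stand-in that returns fixed values, the output field is limited to a top-level path, and the handling of missing output columns is an assumption.

```python
from pathlib import PurePosixPath


def resolve_model_path(saved_model_path, resources_dir):
    """Return an absolute path; relative paths are taken from the resources directory."""
    path = PurePosixPath(saved_model_path)
    return path if path.is_absolute() else PurePosixPath(resources_dir) / path


def get_field(record, field_path):
    """Look up a slash-separated field path; '/' selects the whole record."""
    value = record
    for name in filter(None, field_path.split("/")):
        value = value[name]
    return value


def evaluate(record, model, input_root_field, output_columns, output_field):
    """Run the model on the input root field and store the selected columns in a map."""
    model_output = model(get_field(record, input_root_field))
    # Assumption: columns that the model does not produce are skipped.
    selected = {col: model_output[col] for col in output_columns if col in model_output}
    result = dict(record)  # shallow copy so the input record is left unchanged
    result[output_field.lstrip("/")] = selected  # top-level output field only
    return result


def stand_in_model(model_input):
    """Hypothetical model that returns fixed values instead of real predictions."""
    return {"label": "Moss", "prediction": 0, "probability": {0: 0.86, 1: 0.14}, "extra": 1}


record = {"origLabel": -1.0, "features": {"type": 0, "size": 13}}
out = evaluate(record, stand_in_model, "/", ["label", "prediction", "probability"], "/output")
model_dir = resolve_model_path("textAnalysis", "/var/lib/sdc-resources")
```

Here out keeps the original fields and gains an output map holding only the three requested columns, and model_dir resolves to the same location as the absolute path /var/lib/sdc-resources/textAnalysis.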