Elasticsearch

The Elasticsearch destination writes data to an Elasticsearch cluster, including Elastic Cloud clusters (formerly Found clusters) and Amazon Elasticsearch Service clusters. The destination uses the Elasticsearch HTTP module to access the Bulk API and write each record to Elasticsearch as a document.
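
For context, the Bulk API accepts newline-delimited JSON in which an action line precedes each document. The following sketch only illustrates that request format, not the destination's implementation; the cluster URL, index name, and records are hypothetical:

    import json
    import requests

    # Hypothetical records, standing in for pipeline data.
    records = [
        {"id": 1, "name": "Acme"},
        {"id": 2, "name": "Globex"},
    ]

    # Build the newline-delimited bulk body: one action line per document,
    # followed by the document source.
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": "customer"}}))
        lines.append(json.dumps(record))
    body = "\n".join(lines) + "\n"

    # Send the bulk request to a hypothetical cluster URL.
    response = requests.post(
        "http://localhost:9200/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    print(response.json())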

When you configure the Elasticsearch destination, you configure the HTTP URLs used to connect to the Elasticsearch cluster and specify whether security is enabled on the cluster. You specify the index to write to and can configure the destination to automatically create the index if it doesn't exist.

You specify the write mode to use. Before you use the Overwrite related partitions mode, complete the overwrite partition requirement.

You can add advanced Elasticsearch properties as needed.

You can also use a connection to configure the destination.

Security

When security is enabled for the Elasticsearch cluster, you must specify the authentication method:
Basic
Use Basic authentication for Elasticsearch clusters outside of Amazon Elasticsearch Service. With Basic authentication, the stage passes the Elasticsearch user name and password.
AWS Signature V4
Use AWS Signature V4 authentication for Elasticsearch clusters within Amazon Elasticsearch Service. The stage must sign HTTP requests with Amazon Web Services credentials. For details, see the Amazon Elasticsearch Service documentation. Use one of the following methods to sign with AWS credentials:
Instance profile
When the execution engine - Data Collector or Transformer - runs on an Amazon EC2 instance that has an associated instance profile, the engine uses the instance profile credentials to automatically authenticate with AWS.
To use an instance profile, do not configure the Access Key ID and Secret Access Key properties.
For more information about associating an instance profile with an EC2 instance, see the Amazon EC2 documentation.
AWS access key pair
When the execution engine does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you must specify the Access Key ID and Secret Access Key properties.
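
For background on what AWS Signature V4 signing involves, the following sketch signs a single request to a hypothetical Amazon Elasticsearch Service endpoint with botocore. It is not the stage's internal code; the endpoint and region are placeholders, and credentials are resolved through the standard AWS chain, which covers both instance profiles and access key pairs:

    import requests
    from boto3 import Session
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    # Hypothetical endpoint and region.
    endpoint = "https://search-mydomain.us-east-1.es.amazonaws.com/_cluster/health"
    region = "us-east-1"

    # Resolve credentials from the standard AWS chain (instance profile,
    # environment variables, or a configured access key pair).
    credentials = Session().get_credentials()

    # Sign the request with Signature V4 for the "es" service.
    request = AWSRequest(method="GET", url=endpoint)
    SigV4Auth(credentials, "es", region).add_auth(request)

    # Send the signed request; the signature is carried in the headers.
    response = requests.get(endpoint, headers=dict(request.headers))
    print(response.status_code, response.text)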

Write Mode

The write mode determines how the Elasticsearch destination writes documents to Elasticsearch.

The Elasticsearch destination includes the following write modes:
Overwrite files
Removes all existing documents in the index before creating new documents.
To use this mode, do not configure Spark to allow overwriting data within a partition.
Overwrite related partitions
Removes all existing documents in a partition before creating new documents for the partition. Partitions with no data to be written are left intact.
For example, say you have ten partitions. If the processed data belongs in two partitions, the destination overwrites the two partitions with the new data. The other eight partitions remain unchanged.

To use this mode, Spark must be configured to allow overwriting data within a partition.

Write new files to new directory
Creates a new index and writes new documents to the index. Generates an error if the specified index exists when you start the pipeline.
To use this mode, you must also enable Index Auto Creation.
Write new or append to existing files
Creates new documents in the specified index. If a document of the same name exists in the index, the destination appends data to the document.

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing data to Elasticsearch, Spark writes the documents in each partition in parallel. To change the number of partitions that the destination uses, add the Repartition processor before the destination.
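
Transformer manages partitioning for you, but the following sketch illustrates the idea with the open source Elasticsearch Spark connector; the connector options, cluster URL, input path, and partition count are assumptions for illustration, not Transformer internals:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-write-sketch").getOrCreate()

    # Hypothetical data frame produced by upstream stages.
    df = spark.read.json("/tmp/customers.json")

    # Repartitioning changes how many partitions Spark writes in parallel,
    # which is what the Repartition processor controls in a pipeline.
    df = df.repartition(4)

    # Write each record as a document using the elasticsearch-spark connector.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "http://localhost:9200")   # hypothetical cluster URL
       .option("es.resource", "customer")             # target index
       .mode("append")
       .save())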

Overwrite Partition Requirement

When writing to partitioned data, the Elasticsearch destination can overwrite data within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite data in the 03-2019 partition and leave all other partitions untouched.

To overwrite partitioned data, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned data, no action is needed.

To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.

You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.

To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
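
For example, to enable the behavior for all pipelines, you might add the following entry to the Spark configuration, such as spark-defaults.conf; to enable it for a single pipeline, add the same name-value pair as an extra Spark configuration property on the Cluster tab:

    spark.sql.sources.partitionOverwriteMode    dynamic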

Configuring an Elasticsearch Destination

Configure an Elasticsearch destination to write data to Elasticsearch.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
  2. On the Elasticsearch tab, configure the following properties:
    Elasticsearch Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    HTTP URLs Comma-separated list of HTTP or HTTPS URLs used to connect to each Elasticsearch server in the cluster. Use the following format:

    http://<host1>,http://<host2>

    You can specify a port number in the URLs to override the default port defined in the HTTP Port property, as follows:

    http://<host1>:<port>,http://<host2>:<port>

    When a port number is defined in both this property and in the HTTP Port property, the port in this property takes precedence. For example, if you define this property as follows:

    http://server1,http://server2:1234

    And you set the HTTP Port property to 9200, then server1 uses port 9200 and server2 uses port 1234.

    HTTP Port Default port number to use for URLs that do not include a port.

    The default HTTP port is 9200. The default HTTPS port is 443.

    Index Auto Creation Automatically creates the index if it doesn't exist.
    Use Security Specifies whether security is enabled on the Elasticsearch cluster.
    Index Index for the generated documents. Enter an index name or an expression that evaluates to the index name.

    For example, if you enter customer as the index, the destination writes documents to the customer index.

    When Index Auto Creation is enabled, the destination creates the index if it doesn't exist.

    Mapping Mapping type for the generated documents. Enter the mapping type, an expression that evaluates to the mapping type, or a field that includes the mapping type.

    For example, if you enter user as the mapping type, the destination writes the document with a user mapping type.

    Write Mode Mode to write documents:
    • Overwrite files - Removes all existing documents in the index before creating new documents. To use this mode, do not configure Spark to allow overwriting data within a partition.
    • Overwrite related partitions - Removes all existing documents in a partition before creating new documents for the partition. Partitions with no data to be written are left intact. To use this mode, Spark must be configured to allow overwriting data within a partition.
    • Write new or append to existing files - Creates new documents in the specified index. If a document of the same name exists in the index, the destination appends data to the document.
    • Write new files to new directory - Creates a new index and writes new documents to the index. Generates an error if the specified index exists when you start the pipeline.

      To use this mode, you must also enable Index Auto Creation.

    Additional Configurations Additional Elasticsearch configuration properties to use. To add properties, click Add and define the property name and value.

    Use the property names and values as expected by Elasticsearch.

  3. If you enabled security, on the Security tab, configure the following properties:
    Security Property Description
    Mode Authentication method to use:
    • Basic - Authenticate with Elasticsearch user name and password. Select this option for Elasticsearch clusters outside of Amazon Elasticsearch Service.
    • AWS Signature V4 - Authenticate with AWS. Select this option for Elasticsearch clusters within Amazon Elasticsearch Service.
    User Name Elasticsearch user name.

    Available when using Basic authentication.

    Password Password for the user account.

    Available when using Basic authentication.

    Region Amazon Web Services region that hosts the Elasticsearch domain.

    Available when using AWS Signature V4 authentication.

    Access Key ID AWS access key ID. Required when not using instance profile credentials.

    Available when using AWS Signature V4 authentication.

    Secret Access Key AWS secret access key. Required when not using instance profile credentials.

    Available when using AWS Signature V4 authentication.

    Enable SSL Enables the use of SSL.
    SSL Truststore Path Location of the truststore file.

    Configuring this property is equivalent to configuring the shield.ssl.truststore.path Elasticsearch property. A sample command for creating a truststore file appears after these steps.

    Not necessary for Elastic Cloud clusters.

    SSL Truststore Password Password for the truststore file.

    Configuring this property is equivalent to configuring the shield.ssl.truststore.password Elasticsearch property.

    Not necessary for Elastic Cloud clusters.
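
As referenced in the SSL Truststore Path description, one common way to create a truststore file is to import the cluster's CA certificate with the Java keytool utility. The alias, file paths, and password below are placeholders:

    keytool -importcert -alias elasticsearch-ca \
        -file /path/to/elasticsearch-ca.crt \
        -keystore /path/to/truststore.jks \
        -storepass <truststore-password> -noprompt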