Elasticsearch
The Elasticsearch destination writes data to an Elasticsearch cluster, including Elastic Cloud clusters (formerly Found clusters) and Amazon Elasticsearch Service clusters. The destination uses the Elasticsearch HTTP module to access the Bulk API and write each record to Elasticsearch as a document.
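Conceptually, the write resembles a bulk indexing call. The following is a minimal sketch using the Python elasticsearch client, not the destination's actual implementation; the cluster URL, index name, and records are placeholders:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder cluster URL

# Each record becomes one document in the bulk request.
records = [
    {"_index": "sales-2019", "_source": {"id": 1, "item": "widget"}},
    {"_index": "sales-2019", "_source": {"id": 2, "item": "gadget"}},
]
helpers.bulk(es, records)
```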
When you configure the Elasticsearch destination, you configure the HTTP URLs used to connect to the Elasticsearch cluster and specify whether security is enabled on the cluster. You specify the index to write to and can configure the destination to automatically create the index if it doesn't exist.
You specify the write mode to use. Before using the overwrite related partitions mode, complete the overwrite partition requirement.
You can add advanced Elasticsearch properties as needed.
You can also use a connection to configure the destination.
Security
- Basic
- Use Basic authentication for Elasticsearch clusters outside of Amazon Elasticsearch Service. With Basic authentication, the stage passes the Elasticsearch user name and password.
- AWS Signature V4
- Use AWS Signature V4 authentication for Elasticsearch clusters within Amazon Elasticsearch Service. The stage must sign HTTP requests with Amazon Web Services credentials. For details, see the Amazon Elasticsearch Service documentation. Use one of the following methods to sign with AWS credentials, as sketched after this list:
- Instance profile
- When the execution engine (Data Collector or Transformer) runs on an Amazon EC2 instance that has an associated instance profile, the engine uses the instance profile credentials to automatically authenticate with AWS.
- AWS access key pair
- When the execution engine does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you must specify the Access Key ID and Secret Access Key properties.
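The following sketch shows what SigV4 signing involves, using boto3 and botocore for illustration; the domain endpoint, region, and payload are placeholders, and boto3 resolves credentials from an instance profile or an access key pair automatically:

```python
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# boto3 resolves credentials from the environment, AWS config files,
# or the EC2 instance profile, in that order.
credentials = boto3.Session().get_credentials()

request = AWSRequest(
    method="POST",
    url="https://search-example.us-east-1.es.amazonaws.com/_bulk",  # placeholder endpoint
    data=b'{"index":{"_index":"sales-2019"}}\n{"id":1}\n',
    headers={"Content-Type": "application/x-ndjson"},
)
# Sign the request for the Amazon Elasticsearch Service ("es") in us-east-1.
SigV4Auth(credentials, "es", "us-east-1").add_auth(request)
```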
Write Mode
The write mode determines how the Elasticsearch destination writes documents to Elasticsearch. Choose one of the following modes, illustrated in the sketch after this list:
- Overwrite index
- Removes all existing documents in the index before creating new documents.
- Overwrite related partitions
- Removes all existing documents in a partition before creating new documents for the partition. Partitions with no data to be written are left intact.
- Write to new index
- Creates a new index and writes new documents to the index. Generates an error if the specified index already exists when you start the pipeline.
- Write new or append to existing documents
- Creates new documents in the specified index. If a document with the same ID exists in the index, the destination appends data to the document.
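These modes roughly correspond to Spark save modes. Below is a hedged sketch using the open source elasticsearch-spark connector, where the node address, index name, and DataFrame contents are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-write-sketch").getOrCreate()
df = spark.createDataFrame([(1, "widget")], ["id", "item"])  # placeholder data

# mode("overwrite") clears the index first, mode("append") adds documents,
# and mode("error") fails if the index already exists.
(df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")   # placeholder cluster address
    .option("es.mapping.id", "id")          # use the id column as the document ID
    .mode("append")
    .save("sales-2019"))                    # placeholder index name
```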
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing data to Elasticsearch, Spark writes each partition to the cluster in parallel, using one write task per partition. To change the number of partitions that the destination uses, add the Repartition processor before the destination, as sketched below.
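The Repartition processor's effect corresponds roughly to Spark's repartition call, sketched here with an illustrative partition count and placeholder data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # placeholder data

# Redistribute into 4 partitions, yielding 4 parallel write tasks downstream.
df = df.repartition(4)
```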
Overwrite Partition Requirement
When writing to partitioned data, the Elasticsearch destination can overwrite data within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite data in the 03-2019 partition and leave all other partitions untouched.
To overwrite partitioned data, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned data, no action is needed.
To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.
To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
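For example, add a property named spark.sql.sources.partitionOverwriteMode with the value dynamic. The equivalent Spark-level setting is shown here in PySpark for illustration:

```python
from pyspark.sql import SparkSession

# Allow Spark to overwrite only the partitions that receive new data,
# leaving all other partitions intact.
spark = (SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate())
```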
Configuring an Elasticsearch Destination
Configure an Elasticsearch destination to write data to Elasticsearch.