Elasticsearch

Data Collector

The Elasticsearch origin is a multithreaded origin that reads data from an Elasticsearch cluster, including Elastic Cloud clusters (formerly Found clusters) and Amazon Elasticsearch Service clusters. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

The origin generates a record for each Elasticsearch document.

When you configure the Elasticsearch origin, you configure the HTTP URLs used to connect to the Elasticsearch cluster and specify whether security is enabled on the cluster. When Data Collector shares the same network as the Elasticsearch cluster, you can enter one or more node URLs and automatically detect additional Elasticsearch nodes on the cluster.

You configure the origin to run in batch or incremental mode.

The origin uses the Elasticsearch scroll API to run a query that you define. Because a query can retrieve a large number of documents, the scroll lets the origin run a single query and then read multiple batches of data until no results are left. You configure a scroll timeout that defines how long the search context remains valid.
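The scroll pattern described above can be sketched in Python as follows. This is a minimal illustration of the Elasticsearch scroll REST calls, not the origin's actual implementation. The `send` transport function, the `scroll_batches` name, and the index name used in the usage note are assumptions introduced for the example. The transport is injected so that the sketch does not depend on any particular HTTP client.

```python
from typing import Callable, Iterator, List

# Hypothetical transport (an assumption for this sketch):
# takes (method, path, json_body) and returns the parsed JSON response.
Transport = Callable[[str, str, dict], dict]


def scroll_batches(
    send: Transport,
    index: str,
    query: dict,
    scroll_timeout: str = "1m",
    batch_size: int = 1000,
) -> Iterator[List[dict]]:
    """Run one query and yield batches of hits until the scroll is empty."""
    # The initial search opens the search context and returns the first batch.
    response = send(
        "POST",
        f"/{index}/_search?scroll={scroll_timeout}",
        {"size": batch_size, "query": query},
    )
    scroll_id = response.get("_scroll_id")
    try:
        while True:
            hits = response["hits"]["hits"]
            if not hits:
                break  # no results left
            yield hits
            # Each scroll request renews the search context for scroll_timeout.
            response = send(
                "POST",
                "/_search/scroll",
                {"scroll": scroll_timeout, "scroll_id": scroll_id},
            )
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the search context once reading finishes or is abandoned.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

With a real transport, a call such as `scroll_batches(send, "logs", {"match_all": {}}, scroll_timeout="5m")` would yield one list of documents per batch. The index name `logs` here is only a placeholder.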

When the pipeline stops, the Elasticsearch origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped, as long as the scroll timeout has not been exceeded and the origin is not configured to delete the scroll when the pipeline stops. You can reset the origin to process all requested documents.
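The resume condition can be expressed as a small check. The following Python sketch only illustrates the rule stated above. The function and parameter names are hypothetical, and the sketch assumes the timeout is measured from the last scroll request, which is how Elasticsearch keeps a search context alive.

```python
from datetime import datetime, timedelta


def can_resume_scroll(
    last_read_at: datetime,
    now: datetime,
    scroll_timeout: timedelta,
    delete_scroll_on_stop: bool,
) -> bool:
    """Return True if a stopped pipeline could continue from the saved scroll."""
    # A scroll deleted when the pipeline stopped cannot be reused.
    if delete_scroll_on_stop:
        return False
    # Otherwise the search context must still be within its timeout.
    return now - last_read_at <= scroll_timeout
```

When this check fails, the scroll is no longer available, so the origin cannot continue from the saved position.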

When you configure the Elasticsearch origin, you specify the maximum number of slices to split the scroll into. The number of slices determines how many threads the origin uses to read the data.
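As an illustration of how a scroll can be split into slices, the sketch below builds one search request body per slice using the Elasticsearch sliced scroll syntax (`slice.id` and `slice.max`) and reads the slices on separate threads. It is a simplified model rather than the origin's code. The names `slice_bodies`, `read_slices`, and the `read_one` callback are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")


def slice_bodies(query: dict, max_slices: int) -> List[dict]:
    """Build one search request body per slice of the scroll."""
    if max_slices < 1:
        raise ValueError("max_slices must be at least 1")
    if max_slices == 1:
        # Elasticsearch requires slice.max > 1, so a single reader
        # uses a plain, unsliced scroll.
        return [{"query": query}]
    return [
        {"slice": {"id": slice_id, "max": max_slices}, "query": query}
        for slice_id in range(max_slices)
    ]


def read_slices(
    read_one: Callable[[dict], T], query: dict, max_slices: int
) -> List[T]:
    """Read every slice on its own thread; results are in slice order."""
    bodies = slice_bodies(query, max_slices)
    with ThreadPoolExecutor(max_workers=len(bodies)) as pool:
        return list(pool.map(read_one, bodies))
```

Here `read_one` stands in for whatever function runs a scroll for a single request body. Each slice returns a disjoint subset of the matching documents, which is what lets the threads work independently.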

You can also use a connection to configure the origin.