Amazon S3

Supported pipeline types:
  • Data Collector

The Amazon S3 origin reads objects stored in Amazon S3. The object names must share a prefix pattern and should be fully written. To read messages from Amazon SQS, use the Amazon SQS Consumer origin. The Amazon S3 origin can process objects in parallel with multiple threads. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Note: The Amazon S3 origin can be used in standalone pipelines only. To use a cluster pipeline to read from Amazon S3, use a Hadoop FS origin in a cluster EMR batch pipeline that runs on an Amazon EMR cluster. Or, use a Hadoop FS origin in a cluster batch pipeline that runs on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster. For more information, see Amazon S3 Requirements for cluster pipelines.
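As a rough illustration of the kind of prefix-based listing the origin performs, the following sketch uses boto3 to enumerate objects under a common prefix and filter them against a glob pattern. The bucket, prefix, and pattern values are hypothetical, and fnmatch is only a loose stand-in for the origin's Ant-style prefix patterns; this is not how Data Collector is implemented internally.

```python
import fnmatch
import boto3

BUCKET = "my-bucket"        # hypothetical bucket name
COMMON_PREFIX = "logs/"     # hypothetical optional common prefix
PREFIX_PATTERN = "*.json"   # glob stand-in for an Ant-style prefix pattern

s3 = boto3.client("s3")

def matching_keys(bucket, common_prefix, pattern):
    """Yield object keys under the common prefix that match the pattern."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=common_prefix):
        for obj in page.get("Contents", []):
            # Match the key relative to the common prefix, mirroring how
            # the prefix pattern applies below the common prefix.
            relative = obj["Key"][len(common_prefix):]
            if fnmatch.fnmatch(relative, pattern):
                yield obj["Key"]

for key in matching_keys(BUCKET, COMMON_PREFIX, PREFIX_PATTERN):
    print(key)
```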

With the Amazon S3 origin, you define the region, bucket, prefix pattern, optional common prefix, and read order. These properties determine the objects that the origin processes. You configure the authentication method that the origin uses to connect to Amazon S3. You can optionally include Amazon S3 object metadata in the record as record header attributes.
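The sketch below, continuing the same hypothetical boto3 setup, shows what the read-order choices amount to in practice, ordering a listing by key name or by last-modified timestamp, and the kind of system-defined and user-defined object metadata that could be surfaced as record header attributes. The region, bucket, and prefix values are assumptions for illustration.

```python
import boto3

BUCKET = "my-bucket"   # hypothetical bucket name
PREFIX = "logs/"       # hypothetical prefix

s3 = boto3.client("s3", region_name="us-east-1")   # region is part of the origin config
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])

# Read order by lexicographically ascending key names ...
by_key = sorted(objects, key=lambda o: o["Key"])
# ... or by last-modified timestamp, with key name as a tie-breaker.
by_timestamp = sorted(objects, key=lambda o: (o["LastModified"], o["Key"]))

if by_key:
    # System-defined metadata of the kind that can become record header
    # attributes, plus any user-defined x-amz-meta-* values.
    head = s3.head_object(Bucket=BUCKET, Key=by_key[0]["Key"])
    print(head["ContentLength"], head["LastModified"], head["ETag"])
    print(head.get("Metadata", {}))
```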

After processing an object or upon encountering errors, the origin can keep, archive, or delete the object. When archiving, the origin can copy or move the object.
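In plain S3 terms, keeping an object is a no-op, deleting maps to a delete request, and archiving maps to a copy, optionally followed by a delete of the original (S3 has no native move). A minimal sketch of those operations follows, with hypothetical bucket and archive-prefix names; it illustrates the post-processing options rather than the origin's actual implementation.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"        # hypothetical bucket name
ARCHIVE_PREFIX = "archive/" # hypothetical archive location

def post_process(key, action="archive", move=True):
    """Keep, archive, or delete an object after processing.

    Archiving copies the object under the archive prefix; a "move"
    additionally deletes the original, since S3 has no native move.
    """
    if action == "keep":
        return
    if action == "archive":
        s3.copy_object(
            Bucket=BUCKET,
            Key=ARCHIVE_PREFIX + key,
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        if not move:
            return
    # Reached for "delete", and for an archive configured as a move.
    s3.delete_object(Bucket=BUCKET, Key=key)
```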

When the pipeline stops, the Amazon S3 origin notes where it stopped reading. When the pipeline restarts, the origin continues processing from that point by default. You can reset the origin to process all requested objects.
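Data Collector manages this offset internally, but the resume behavior can be sketched with boto3's StartAfter listing parameter, which begins a listing strictly after a given key and so fits a key-ordered read. The offset file path and names below are hypothetical, and resetting the origin corresponds to clearing the stored value.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                   # hypothetical bucket name
PREFIX = "logs/"                       # hypothetical prefix
OFFSET_FILE = "/tmp/s3-origin-offset"  # hypothetical offset store

def load_offset():
    """Return the last processed key, or "" when starting fresh."""
    try:
        with open(OFFSET_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return ""

def save_offset(key):
    with open(OFFSET_FILE, "w") as f:
        f.write(key)

kwargs = {"Bucket": BUCKET, "Prefix": PREFIX}
offset = load_offset()
if offset:
    # Resume strictly after the last key the previous run processed.
    kwargs["StartAfter"] = offset

for obj in s3.list_objects_v2(**kwargs).get("Contents", []):
    print("processing", obj["Key"])    # stand-in for actual record processing
    save_offset(obj["Key"])
```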

You can configure the origin to decrypt data stored on Amazon S3 with server-side encryption and customer-provided encryption keys. You can optionally use a proxy to connect to Amazon S3.
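For context on what these two options involve at the S3 API level, the sketch below reads an object encrypted with a customer-provided key (SSE-C) through a forward proxy. Reading an SSE-C object requires resupplying the same key used to encrypt it; boto3 base64-encodes the key and adds its MD5 digest automatically. The proxy endpoint, bucket, key name, and encryption key are all hypothetical.

```python
import boto3
from botocore.config import Config

PROXY = {"https": "http://proxy.example.com:3128"}  # hypothetical proxy
CUSTOMER_KEY = b"0" * 32   # placeholder 32-byte AES-256 customer key

# Route all S3 traffic through the proxy.
s3 = boto3.client("s3", config=Config(proxies=PROXY))

# SSE-C decryption: the customer key must accompany the read request.
obj = s3.get_object(
    Bucket="my-bucket",           # hypothetical bucket name
    Key="logs/app.json",          # hypothetical object key
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=CUSTOMER_KEY,
)
print(obj["Body"].read()[:100])
```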

You can also use a connection to configure the origin.

The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.