Amazon S3

The Amazon S3 origin reads objects stored in Amazon Simple Storage Service, also known as Amazon S3. The objects must be fully written, use the same supported data format, and share the same schema.

When reading multiple objects in a batch, the origin reads the oldest object first. Upon successfully reading an object, the origin can delete the object, move it to an archive directory, or leave it in the directory.
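As an illustration only, and not Transformer's implementation, the following Python sketch shows the same behavior using boto3: objects are processed oldest-first by last-modified timestamp, and a post-processing action then deletes, archives, or leaves each object. The bucket, prefixes, and `process` helper are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"            # hypothetical bucket
PREFIX = "incoming/"            # hypothetical read path
ARCHIVE_PREFIX = "archive/"     # hypothetical archive directory
POST_PROCESSING = "ARCHIVE"     # one of: NONE, DELETE, ARCHIVE


def process(key):
    """Hypothetical stand-in for reading and handling one object."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    print(f"read {key}: {len(body)} bytes")


objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
for obj in sorted(objects, key=lambda o: o["LastModified"]):  # oldest object first
    key = obj["Key"]
    process(key)

    if POST_PROCESSING == "DELETE":
        s3.delete_object(Bucket=BUCKET, Key=key)
    elif POST_PROCESSING == "ARCHIVE":
        # S3 has no atomic move: copy to the archive prefix, then delete the original.
        s3.copy_object(
            Bucket=BUCKET,
            CopySource={"Bucket": BUCKET, "Key": key},
            Key=ARCHIVE_PREFIX + key.rsplit("/", 1)[-1],
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
    # "NONE" leaves the object in the directory.
```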

When the pipeline stops, the origin notes the last-modified timestamp of the last object that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. You can reset pipeline offsets to process all available objects.
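To make the offset behavior concrete, here is a minimal Python sketch, assuming a local JSON file as the offset store; the file name and helper functions are hypothetical and do not reflect how Transformer persists offsets. Only objects modified after the saved timestamp are selected, and resetting the offset amounts to discarding the saved value so that all available objects are read again.

```python
import json
from datetime import datetime, timezone

OFFSET_FILE = "offset.json"  # hypothetical local offset store


def load_offset():
    """Return the saved last-modified offset, or the epoch floor if none exists."""
    try:
        with open(OFFSET_FILE) as f:
            return datetime.fromisoformat(json.load(f)["last_modified"])
    except FileNotFoundError:
        return datetime.min.replace(tzinfo=timezone.utc)  # no offset: read everything


def save_offset(ts):
    """Store the last-modified timestamp of the last object processed."""
    with open(OFFSET_FILE, "w") as f:
        json.dump({"last_modified": ts.isoformat()}, f)


def select_new_objects(objects):
    """Return only objects modified after the saved offset, oldest first.

    Each object is expected to be a dict with a timezone-aware
    "LastModified" datetime, as returned by an S3 listing.
    """
    offset = load_offset()
    pending = [o for o in objects if o["LastModified"] > offset]
    return sorted(pending, key=lambda o: o["LastModified"])
```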

The Amazon S3 origin reads from Amazon S3 using connection information stored in a Hadoop configuration file. Complete the prerequisite tasks before using the origin in a local pipeline.
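For context only, the sketch below shows the kind of S3 connection information a Hadoop configuration typically supplies to Spark through the S3A connector. The credential values and endpoint are placeholders, and the exact properties Transformer expects depend on your environment and the prerequisite tasks.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-origin-sketch")
    # Hadoop S3A connection properties, passed through Spark; values are placeholders.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
    .getOrCreate()
)
```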

When you configure the origin, you specify the authentication method to use. You define the bucket and path to the objects to read. The origin reads objects from the specified directory and its subdirectories. You also specify the name pattern for the objects to read. You can optionally configure another name pattern to exclude objects from processing and define post-processing actions for successfully read objects.
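The following Python sketch, which is illustrative rather than Transformer's matching logic, shows how an include pattern and an optional exclude pattern might narrow the objects under a bucket path. The prefix and patterns are hypothetical examples.

```python
from fnmatch import fnmatch

PREFIX = "sales/2024/"        # bucket path; subdirectories share this prefix
INCLUDE_PATTERN = "*.json"    # name pattern for objects to read
EXCLUDE_PATTERN = "*_tmp*"    # optional pattern for objects to skip


def should_read(key):
    """Apply the path prefix, include pattern, and exclude pattern to one object key."""
    name = key.rsplit("/", 1)[-1]  # object name without the path
    return (
        key.startswith(PREFIX)
        and fnmatch(name, INCLUDE_PATTERN)
        and not fnmatch(name, EXCLUDE_PATTERN)
    )


keys = [
    "sales/2024/jan/orders.json",
    "sales/2024/jan/orders_tmp.json",
    "sales/2024/readme.txt",
]
print([k for k in keys if should_read(k)])  # ['sales/2024/jan/orders.json']
```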

You can also use a connection to configure the origin.

You select the data format and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties. You can also configure advanced properties such as performance-related properties and proxy server properties.
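As a rough analogy in plain PySpark, not Transformer itself, the sketch below reads delimited data with a custom schema instead of relying on schema inference. The bucket, path, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("custom-schema-sketch").getOrCreate()

# Custom schema for the delimited data; columns are hypothetical examples.
schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

df = (
    spark.read
    .schema(schema)                      # custom schema: no inference pass over the data
    .option("header", "true")            # delimited data with a header line
    .option("delimiter", ",")
    .csv("s3a://my-bucket/sales/2024/")  # hypothetical bucket and path
)
df.printSchema()
```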

Note: The user that Transformer uses to access HDFS depends on the pipeline configuration. For more information, see Hadoop YARN.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
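The trade-off behind caching can be seen in a short PySpark sketch, again illustrative only: the data is persisted after the first read so that later computations reuse it instead of going back to S3. The path and column name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

df = spark.read.json("s3a://my-bucket/events/")  # hypothetical path
df.cache()                                       # keep the loaded data for reuse

total = df.count()                               # first action materializes the cache
by_type = df.groupBy("type").count().collect()   # served from the cached data, not S3
```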