The Amazon S3 origin reads objects stored in Amazon S3. The object names must share a prefix pattern, and the objects should be fully written before the origin reads them. To read messages from Amazon SQS, use the Amazon SQS Consumer origin.
With the Amazon S3 origin, you define the region, bucket, prefix pattern, optional common prefix, and read order. These properties determine the objects that the origin processes. You can optionally include Amazon S3 object metadata in the record as record header attributes.
After processing an object or upon encountering errors, the origin can keep, archive, or delete the object. When archiving, the origin can copy or move the object.
When the pipeline stops, the Amazon S3 origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested objects.
You can configure the origin to decrypt data stored on Amazon S3 with server-side encryption and customer-provided encryption keys. You can optionally use a proxy to connect to Amazon S3.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
When Data Collector reads data from an Amazon S3 origin, it must pass credentials to Amazon Web Services.
Use one of the following methods to pass AWS credentials:

- IAM roles - Assign an IAM role with the required permissions to the EC2 instance that runs Data Collector. When you use IAM roles, you do not specify an access key pair in the origin.
- AWS access key pairs - Specify the access key ID and secret access key in the origin properties.
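For comparison outside of Data Collector, the following is a minimal boto3 sketch of the two approaches; the region and key values are placeholders:

```python
import boto3

# Option 1: instance profile / IAM role -- no keys in the configuration.
# boto3 resolves credentials from the EC2 instance metadata service.
s3 = boto3.client("s3", region_name="us-east-1")

# Option 2: explicit AWS access key pair (values shown are placeholders).
s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    aws_access_key_id="AKIA...",        # placeholder access key ID
    aws_secret_access_key="wJalr...",   # placeholder secret access key
)
```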
The Amazon S3 origin appends the common prefix to the prefix pattern to define the objects that the origin processes. You can specify an exact prefix pattern or you can use Ant-style path patterns to read multiple objects recursively.
For example, to process all log files in US/East/MD, you could use the following:

Common Prefix: US/East/MD/
Prefix Pattern: **/*.log

To process the log files in all weblogs directories nested under US/, you could use either of the following:

Common Prefix: US/
Prefix Pattern: **/weblogs/*.log

Common Prefix:
Prefix Pattern: US/**/weblogs/*.log
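In Ant-style path patterns, `*` matches within a single path segment and `**` matches any number of directory levels. Data Collector performs this matching itself; the following is only a rough, simplified Python sketch of the semantics for the first example above:

```python
import re

def ant_to_regex(pattern: str) -> re.Pattern:
    """Rough translation of an Ant-style path pattern to a regex.
    Handles only '**/' (any directory levels) and '*' (within one segment)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # zero or more directory levels
            i += 3
        elif pattern[i] == "*":
            out.append("[^/]*")      # wildcard within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

common_prefix = "US/East/MD/"
pattern = ant_to_regex("**/*.log")
keys = ["US/East/MD/weblogs/a.log", "US/East/MD/b.log", "US/East/MD/notes.txt"]
for key in keys:
    rel = key[len(common_prefix):]          # pattern applies after the common prefix
    print(key, bool(pattern.match(rel)))    # a.log: True, b.log: True, notes.txt: False
```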
When the Amazon S3 origin processes Avro data, it includes the Avro schema in an avroSchema record header attribute. You can also configure the origin to include Amazon S3 object metadata in record header attributes.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
You can include Amazon S3 object metadata in record header attributes. Include metadata when you want to use the information to help process records. For example, you might include metadata if you want to route records to different branches of a pipeline based on the last-modified timestamp.
The included metadata provides the object name in the following format:

<bucket>/<prefix>/<object_name>
For more information about record header attributes, see Record Header Attributes.
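The same system-defined and user-defined object metadata is visible through a HEAD request, which can help when deciding which attributes to route on. A boto3 sketch with placeholder bucket and key:

```python
import boto3

s3 = boto3.client("s3")

# Inspect the metadata available for an object (bucket/key are placeholders).
resp = s3.head_object(Bucket="ServerEast", Key="LogFiles/file1.log")

print(resp["ContentLength"])   # system-defined metadata
print(resp["LastModified"])    # e.g., route records on this timestamp
print(resp.get("Metadata"))    # user-defined x-amz-meta-* metadata
```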
The Amazon S3 origin reads objects in ascending order based on the object key name or the last modified timestamp. For best performance when reading a large number of objects, configure the origin to read objects based on the key name.
You can configure one of the following read orders:

- Lexicographically ascending key names - The origin reads objects in lexicographically ascending order based on key names. Note that lexicographic order reads numbers in object names as characters rather than numerically. For example, objects numbered 1 through 11 are read in the following order:

  1, 10, 11, 2, 3, 4 ... 9

- Last-modified timestamp - The origin reads objects in ascending order based on the last-modified timestamp.
For example, say you configure the origin with the following properties:

Bucket: WebServer
Common Prefix: 2016/
Prefix Pattern: **/web*.log

Using lexicographically ascending order based on key names, the origin reads the objects in this order:

s3://WebServer/2016/February/web-10.log
s3://WebServer/2016/February/web-11.log
s3://WebServer/2016/February/web-5.log
s3://WebServer/2016/February/web-6.log
s3://WebServer/2016/February/web-7.log
s3://WebServer/2016/February/web-8.log
s3://WebServer/2016/February/web-9.log
s3://WebServer/2016/January/web-1.log
s3://WebServer/2016/January/web-2.log
s3://WebServer/2016/January/web-3.log
s3://WebServer/2016/January/web-4.log

To read the files within each prefix in logical order, add leading zeros to the file naming convention. The origin then reads the objects in this order:

s3://WebServer/2016/February/web-0005.log
s3://WebServer/2016/February/web-0006.log
...
s3://WebServer/2016/February/web-0010.log
s3://WebServer/2016/February/web-0011.log
s3://WebServer/2016/January/web-0001.log
s3://WebServer/2016/January/web-0002.log
s3://WebServer/2016/January/web-0003.log
s3://WebServer/2016/January/web-0004.log
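The effect is easy to reproduce with a plain lexicographic sort; a small Python sketch:

```python
# Lexicographic sort: web-10 and web-11 sort before web-2.
keys = [f"2016/January/web-{n}.log" for n in range(1, 12)]
print(sorted(keys)[:4])
# ['2016/January/web-1.log', '2016/January/web-10.log',
#  '2016/January/web-11.log', '2016/January/web-2.log']

# Zero-padding the numbers restores the expected logical order.
padded = [f"2016/January/web-{n:04d}.log" for n in range(1, 12)]
print(sorted(padded)[:4])
# ['2016/January/web-0001.log', '2016/January/web-0002.log',
#  '2016/January/web-0003.log', '2016/January/web-0004.log']
```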
For example, you configure the origin to read from the ServerEast bucket, using LogFiles/ as the common prefix and *.log as the prefix pattern. You need to process the following log files from two different servers using ascending order based on the last modified timestamp:
s3://ServerEast/LogFiles/fileA.log    04-30-2016 12:03:23
s3://ServerEast/LogFiles/fileB.log    04-30-2016 15:34:51
s3://ServerEast/LogFiles/file1.log    04-30-2016 12:00:00
s3://ServerEast/LogFiles/file2.log    04-30-2016 18:39:44

The origin reads these objects in ascending order based on the last-modified timestamp:

s3://ServerEast/LogFiles/file1.log    04-30-2016 12:00:00
s3://ServerEast/LogFiles/fileA.log    04-30-2016 12:03:23
s3://ServerEast/LogFiles/fileB.log    04-30-2016 15:34:51
s3://ServerEast/LogFiles/file2.log    04-30-2016 18:39:44
If a new object arrives with a timestamp of 04-29-2016 12:00:00, the Amazon S3 origin does not process the object unless you reset the origin.
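Outside of Data Collector, you can preview this read order with boto3. The bucket and prefix below come from the example above; pagination is omitted for brevity:

```python
import boto3

s3 = boto3.client("s3")

# List objects under the common prefix and sort by last-modified timestamp,
# mirroring the origin's timestamp read order (first page of results only).
resp = s3.list_objects_v2(Bucket="ServerEast", Prefix="LogFiles/")
objects = [o for o in resp.get("Contents", []) if o["Key"].endswith(".log")]

for obj in sorted(objects, key=lambda o: o["LastModified"]):
    print(obj["LastModified"], obj["Key"])
```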
The Amazon S3 origin uses a buffer to read objects into memory to produce records. The size of the buffer determines the maximum size of the record that can be processed.
The buffer limit helps prevent out of memory errors. Decrease the buffer limit when memory on the Data Collector machine is limited. Increase the buffer limit to process larger records when memory is available.
When a record is larger than the buffer limit, the origin cannot fully process the object. Instead, the origin displays a message in Monitor mode indicating that a buffer overrun error occurred. The message includes the object and the offset where the buffer overrun error occurred, and the information displays in the pipeline history and as an alert when you monitor the pipeline.
If an error directory is configured for the stage, the origin moves the object to the error directory and continues processing the next object.
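The general pattern is to refuse any record that outgrows a fixed buffer rather than exhaust memory. The following is a minimal Python sketch of that pattern, not Data Collector's implementation; the newline-delimited framing is an assumption:

```python
import io

BUFFER_LIMIT = 128 * 1024  # bytes; analogous to the origin's buffer limit

def read_record(stream, delimiter=b"\n"):
    """Read one delimited record, refusing anything larger than the buffer."""
    buf = bytearray()
    while True:
        chunk = stream.read(1)
        if not chunk:                    # end of object
            return bytes(buf) if buf else None
        buf += chunk
        if chunk == delimiter:
            return bytes(buf)
        if len(buf) > BUFFER_LIMIT:      # buffer overrun: record too large
            raise ValueError("record exceeds buffer limit")

# Example: a well-formed record followed by end-of-object.
print(read_record(io.BytesIO(b"2016-04-30 GET /index.html\n")))
```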
You can configure the origin to decrypt data stored on Amazon S3 with Amazon Web Services server-side encryption and customer-provided encryption keys (SSE-C).
For information about implementing customer-provided encryption keys in the origin system, see the Amazon S3 documentation.
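For reference, reading an SSE-C object requires passing the same customer key with every request. A minimal boto3 sketch, with placeholder bucket, key name, and key bytes:

```python
import boto3

s3 = boto3.client("s3")

# Reading an object encrypted with a customer-provided key (SSE-C):
# the same 256-bit key used on upload must accompany every GET.
customer_key = b"0" * 32            # placeholder 32-byte AES-256 key
resp = s3.get_object(
    Bucket="ServerEast",            # placeholder bucket
    Key="LogFiles/file1.log",       # placeholder object key
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,    # boto3 base64-encodes the key and its MD5
)
data = resp["Body"].read()
```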
The Amazon S3 origin can generate events when it completes processing all available data and the configured batch wait time has elapsed.
You can use the events in any logical way. For example:

- With the Pipeline Finisher executor to stop the pipeline when the origin completes processing available data. Note that when you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin. For an example, see Case Study: Stop the Pipeline.
- With a destination to store event information. For an example, see Case Study: Event Storage.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following event type: no-more-data. Generated when the origin completes processing all available objects and the configured batch wait time has elapsed. |
sdc.event.version | An integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The Amazon S3 origin can generate the following event record: no-more-data. The event record includes the following fields:
Event Record Field | Description |
---|---|
record-count | Number of records successfully generated since the pipeline started or since the last no-more-data event was created. |
error-count | Number of error records generated since the pipeline started or since the last no-more-data event was created. |
file-count | Number of objects that the origin attempted to process. Can include objects that were unable to be processed or were not fully processed. |
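Putting the two tables together, a no-more-data event record might look like the following hypothetical sketch; the structure and values are illustrative only:

```python
# Hypothetical shape of a no-more-data event record, combining the header
# attributes and fields described above (all values are illustrative).
event = {
    "headers": {
        "sdc.event.type": "no-more-data",
        "sdc.event.version": 1,
        "sdc.event.creation_timestamp": 1461996000000,  # epoch millis
    },
    "fields": {
        "record-count": 1024,   # records generated since the last event
        "error-count": 0,       # error records since the last event
        "file-count": 4,        # objects the origin attempted to process
    },
}

# e.g., trigger downstream logic only on no-more-data events
if event["headers"]["sdc.event.type"] == "no-more-data":
    print(f"processed {event['fields']['file-count']} objects")
```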