Web Client

The Web Client origin reads data from an HTTP endpoint. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

The Web Client origin requires that Data Collector use Java version 17. For more information, see Java Versions and Available Features.

Data Collector provides several HTTP origins to address different needs. For a quick comparison chart to help you choose the right one, see Comparing HTTP Origins.

When you configure the Web Client origin, you define the request endpoint, optional headers, and method to use for the requests. You can also use a connection to configure the origin.

You configure the origin to generate one request for each record or to generate a single request containing all records in the batch.

You can configure the actions to take based on the response status and configure pagination properties to enable processing large volumes of data from paginated APIs.

You can configure the timeout, request transfer encoding, and authentication type for both requests and responses.

You can optionally use a proxy server and configure TLS properties. You can also configure the origin to use the OAuth 2 protocol to connect to an HTTP service.

Note: This origin is a Technology Preview feature. It is not meant for use in production.

Ingestion Mode

The Web Client origin can use one of the following processing modes to read source data:

Streaming

The origin maintains a connection and processes data as it becomes available. Use to process streaming data in real time.

Polling
The origin polls the server at the specified interval for available data. Use to access data periodically, such as metrics and events at a REST endpoint.
Note: After the polling interval passes, the origin continues processing from where it stopped. For example, say that you configured the origin to use the polling mode with an interval of two hours and to use page number pagination. After the origin reads 25 pages of results, the 26th page returns no results and so the origin stops reading. After the two-hour interval passes, the origin polls the server again, reading the results starting with page 26.
Batch

The origin processes all available data and then stops the pipeline. Use to process data as needed.
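
The polling example in the note above can be simulated in a few lines. This is a minimal sketch, not the origin's implementation: poll_once, fetch_page, and the in-memory pages store are hypothetical names introduced for illustration, and the polling interval itself is left out.

```python
from typing import Callable, List, Tuple


def poll_once(fetch_page: Callable[[int], List[str]],
              start_page: int) -> Tuple[List[str], int]:
    """Read pages from start_page until an empty page is returned.

    Returns the records read and the page to resume from on the next poll.
    """
    records: List[str] = []
    page = start_page
    while True:
        results = fetch_page(page)
        if not results:
            # An empty page ends this poll; the same page is requested next time.
            return records, page
        records.extend(results)
        page += 1


# Hypothetical server state: 25 pages exist at the first poll.
pages = {n: [f"record-{n}"] for n in range(1, 26)}
first, resume_at = poll_once(lambda n: pages.get(n, []), start_page=1)

# Before the next poll, the server publishes page 26.
pages[26] = ["record-26"]
second, resume_at_after = poll_once(lambda n: pages.get(n, []), start_page=resume_at)
```

In this sketch the first poll stops at page 26 because it is empty, and the second poll starts from that same page rather than from page 1.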

HTTP Method

You can use the following methods with the Web Client origin:
  • GET

  • POST

  • PUT

  • PATCH

  • DELETE

  • HEAD

  • Expression - An expression that evaluates to one of the other methods.

Expression Method

The Expression method allows you to write an expression that evaluates to a standard HTTP method. Use the Expression method to generate a workflow. For example, you can use an expression that passes data to the server using the PUT method based on the data in a field.

Headers

You can configure optional headers to include in the request made by the stage. Configure the headers in the following properties on the Request tab:
  • Security Headers
  • Common Headers

You can define headers in either property. However, only security headers support using credential functions to retrieve sensitive information from supported credential stores.

If you define the same header in both properties, security headers take precedence.
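
The precedence rule can be pictured as a simple merge. The following is a sketch under assumptions: merge_headers is a hypothetical helper, and header names are compared exactly as written, which might differ from how the stage matches names.

```python
from typing import Dict


def merge_headers(common: Dict[str, str], security: Dict[str, str]) -> Dict[str, str]:
    """Combine both header sets; a security header wins on a name clash."""
    merged = dict(common)
    merged.update(security)  # security values overwrite common values
    return merged
```

For example, if both properties define X-Token, the merged request carries the value from Security Headers.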

Grouping Style

The Web Client origin can generate one HTTP request for each record, or it can generate a single request containing all records in the batch.

Configure the origin to generate requests in one of the following ways:

Multiple requests per batch

If you set the Grouping Style property to One Request per Record, the origin generates one HTTP request for each record in the batch and sends multiple requests at a time. To preserve record order, the origin waits until all requests for the entire batch are completed before processing the next batch.

Single request per batch

If you set the Grouping Style property to One Request per Batch, the origin generates a single HTTP request containing all records in the batch.
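
The difference between the two grouping styles can be sketched as follows. This is an illustration only: build_request_bodies is a hypothetical helper, and serializing records as JSON request bodies is an assumption made for the example.

```python
import json
from typing import Any, Dict, List


def build_request_bodies(batch: List[Dict[str, Any]],
                         one_request_per_record: bool) -> List[str]:
    """Return the request bodies that would be sent for one batch."""
    if one_request_per_record:
        # One HTTP request for each record in the batch.
        return [json.dumps(record) for record in batch]
    # A single HTTP request carrying every record in the batch.
    return [json.dumps(batch)]
```

A batch of two records yields two bodies with One Request per Record and one body with One Request per Batch.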

Event Generation

The Web Client origin can generate events that you can use in an event stream. With event generation enabled, the origin generates event records each time the origin completes processing all available data.

Events generated by the Web Client origin can be used in any logical way. For example:
  • With the Pipeline Finisher executor to stop the pipeline and transition the pipeline to a Finished state when the origin completes processing available data.

    When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.

    For an example, see Stopping a Pipeline After Processing All Available Data.

  • With a destination to store event information.

    For an example, see Preserving an Audit Trail of Events.

Event Records

Event records generated by the Web Client origin have the following event-related record header attributes. Record header attributes are stored as String values:

Record Header Attribute | Description
sdc.event.type | Event type. Uses the following type: no-more-data - Generated when the origin completes processing all available data.
sdc.event.version | Integer that indicates the version of the event record type.
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event.

The Web Client origin can generate the following type of event record:

no-more-data
The Web Client origin generates a no-more-data event record when the origin completes processing all data returned by all queries.

The no-more-data event record generated by the origin has the sdc.event.type record header attribute set to no-more-data and does not include any additional fields.
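
As a rough picture of such an event record, the sketch below builds one as a plain dictionary. The make_no_more_data_event helper, the header/fields layout, the default version of 1, and the use of epoch milliseconds are all assumptions for illustration; only the three attribute names and the no-more-data type come from the description above.

```python
import time
from typing import Dict


def make_no_more_data_event(version: int = 1) -> Dict[str, Dict[str, str]]:
    """Build a stand-in for a no-more-data event record."""
    headers = {
        "sdc.event.type": "no-more-data",
        # Record header attributes are stored as String values.
        "sdc.event.version": str(version),
        "sdc.event.creation_timestamp": str(int(time.time() * 1000)),
    }
    # The event carries no additional fields.
    return {"header": headers, "fields": {}}
```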

Per-Status Actions

The Web Client origin accepts only responses that include a status code that has been configured to be read as successful by the stage. When the response includes any other status code, the origin generates an error and handles the record based on the error record handling configured for the stage.

You can configure the origin to perform one of several actions when it encounters an unsuccessful status code.

To configure a per-status action, you enter an HTTP status code, such as 504 for gateway timeouts, and then select one of the following actions for the stage to perform for that code:
  • Retry with constant backoff
  • Retry with linear backoff
  • Retry with exponential backoff
  • Generate output record
  • Generate error record
  • Abort pipeline

When defining the retry with a constant, linear, or exponential backoff action, you also specify the backoff interval to wait in milliseconds. When defining any of the retry actions, you specify the maximum number of retries and timeout failure action. If the stage receives a successful status code during a retry, then it processes the response. If the stage doesn't receive a successful status code after the maximum number of retries, then the stage performs the specified timeout failure action.

You can add multiple status codes and configure a specific action for each code.
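
The three retry actions differ in how the wait grows between attempts. The exact formulas the stage uses are not stated here, so the sketch below uses common definitions as an assumption: constant waits the interval every time, linear multiplies it by the retry number, and exponential doubles it on each retry. backoff_delay_ms is a hypothetical helper.

```python
def backoff_delay_ms(strategy: str, interval_ms: int, attempt: int) -> int:
    """Wait time before retry number `attempt` (1-based), in milliseconds."""
    if attempt < 1:
        raise ValueError("attempt must be 1 or greater")
    if strategy == "constant":
        return interval_ms
    if strategy == "linear":
        return interval_ms * attempt
    if strategy == "exponential":
        return interval_ms * 2 ** (attempt - 1)
    raise ValueError(f"unknown strategy: {strategy}")
```

With a 500 millisecond interval, the first three waits under these assumptions are 500, 500, 500 for constant; 500, 1000, 1500 for linear; and 500, 1000, 2000 for exponential.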

Note: When using OAuth, all per-status actions configured for 401 Unauthorized and 403 Forbidden statuses are ignored. Instead, the stage generates a new OAuth token. If the same error occurs again, the stage generates a stage error.

Per-Timeout Actions

By default, the Web Client origin retries an operation five times before generating an error. You can configure the stage to use different timeout criteria and perform one of several actions when a specific type of timeout has reached its configured timeout limit.

To configure a per-timeout action, you select a timeout type, such as request, enter a timeout interval, and then select one of the following actions for the stage to perform for that timeout type:
  • Retry with constant backoff
  • Retry with linear backoff
  • Retry with exponential backoff
  • Generate output record
  • Generate error record
  • Abort pipeline

When defining the retry with a constant, linear, or exponential backoff action, you also specify the backoff interval to wait in milliseconds. When defining any of the retry actions, you specify the maximum number of retries and timeout failure action. If the stage receives a response during a retry, then it processes the response. If the stage doesn't receive a response after the maximum number of retries, then the stage performs the specified timeout failure action.

You can add multiple timeout types and specify timeout criteria and actions for each of them.
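
The retry flow described above can be sketched as a small loop. This is an illustration under assumptions: run_with_retries is a hypothetical helper, the initial attempt is counted separately from the retries, and the wait between attempts is omitted.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(operation: Callable[[], T], max_retries: int,
                     on_failure: Callable[[], T]) -> T:
    """Try once, retry up to max_retries times on timeout, then fall back."""
    for _ in range(max_retries + 1):
        try:
            # A response during any attempt is processed immediately.
            return operation()
        except TimeoutError:
            continue  # a real stage would wait for the backoff interval here
    # No response after the maximum number of retries: run the failure action.
    return on_failure()
```

The on_failure callable stands in for the configured timeout failure action, such as generating an error record.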

Pagination

The Web Client origin can use pagination to retrieve a large volume of data from a paginated API.

When configuring the Web Client origin to use pagination, use the pagination type supported by the API of the HTTP client. You will likely need to consult the documentation for the origin system API to determine the pagination type to use and the properties to set.

The Web Client origin supports the following common pagination types:

Link in Header
After processing the current page, the stage uses the link in the HTTP header to access the next page. The link in the header can be an absolute URL or a URL relative to the next page link base URL configured for the stage. For example, let's say you configure the following next page link base URL for the stage:
https://myapp.com/api/objects?page=1
The next link in the HTTP header can include an absolute URL, as follows:
link:<https://myapp.com/api/objects?page=2>; rel="next"
Or the next link can include a URL relative to the resource URL, as follows:
link:<objects?page=2>; rel="next"
Link in Body
After processing the current page, the stage uses the link in a field in the response body to access the next page. The link in the response field can be an absolute URL or a URL relative to the next page link base URL configured for the stage. For example, let's say you configure the following next page link base URL for the stage:
http://myapp.com/api/tickets.json?start_time=138301982
The next link in the response field can include an absolute URL, as follows:
"next_page":"http://myapp.com/api/tickets.json?start_time=1389078385",

Or the next link can include a URL relative to the resource URL, as follows:
"next_page":"tickets.json?start_time=1389078385",

Page
The stage begins processing with the specified initial page, and then requests the following page. Use the ${startAt} variable in the resource URL as the value of the page number to request. You can optionally set a final page or offset for the stage to stop reading data.
Offset
The stage begins processing with the specified initial offset, and then requests the following offset. Use the ${startAt} variable in the resource URL as the value of the offset number to request.
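
For the Link in Header and Link in Body types, the sketch below shows one way an absolute or relative next link can be resolved against a base URL. It uses standard URL resolution from the Python standard library, which is an assumption about how the stage resolves links. next_page_url is a hypothetical helper, and its pattern handles only the simple header form shown in the examples above.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Matches <target>; rel="next" in a Link header value (simplified).
_NEXT_LINK = re.compile(r'<([^>]*)>\s*;\s*rel="?next"?')


def next_page_url(base_url: str, link_header: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None if there is none."""
    match = _NEXT_LINK.search(link_header)
    if match is None:
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, match.group(1))
```

With the base URL https://myapp.com/api/objects?page=1, both the absolute and the relative header examples resolve to https://myapp.com/api/objects?page=2. The same urljoin call applies to a link read from a response body field.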

Page or Offset Number

When using page or offset pagination, the API of the HTTP client typically requires that you include a page or offset parameter at the end of the response endpoint URL. The parameter determines the next page or offset of data to request.

The name of the parameter used by the API varies. For example, it might be offset, page, start, or since. Consult the documentation for the origin system API to determine the name of the page or offset parameter.

The Web Client origin provides a ${startAt} variable that you can use in the URL as the value of the page or offset. For example, your resource URL might be any of the following:

  • http://webservice/object?limit=15&offset=${startAt}
  • https://myapp.com/product?limit=5&since=${startAt}
  • https://myotherapp.com/api/v1/products?page=${startAt}

When the pipeline starts, the Web Client stage uses the value of the Initial Page or Initial Offset property as the ${startAt} variable value. After the stage reads a page of results, the stage increments the ${startAt} variable by one if using page pagination, or by the number of records read from the page if using offset pagination.

Example

Say that you configure offset pagination, set the initial offset to 0, and use the following response endpoint:
https://myapp.com/product?limit=5&since=${startAt}
When you start the pipeline, the stage resolves the response endpoint to:
https://myapp.com/product?limit=5&since=0
The first page of results includes items 0 through 4. After reading all 5 records from the first page, the stage increments the ${startAt} variable by 5, such that the next response endpoint is resolved to:
https://myapp.com/product?limit=5&since=5

The second page of results also includes 5 items, starting at item 5.
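
The ${startAt} arithmetic in this example can be sketched as follows. The resolve and next_start_at helpers are hypothetical names for illustration; the increment rules follow the description above.

```python
def resolve(url_template: str, start_at: int) -> str:
    """Substitute the current value for the ${startAt} variable."""
    return url_template.replace("${startAt}", str(start_at))


def next_start_at(mode: str, start_at: int, records_read: int) -> int:
    """Advance ${startAt} after a page of results has been read."""
    if mode == "page":
        return start_at + 1  # page pagination: move to the next page
    if mode == "offset":
        return start_at + records_read  # offset pagination: skip what was read
    raise ValueError(f"unknown pagination mode: {mode}")


template = "https://myapp.com/product?limit=5&since=${startAt}"
first_url = resolve(template, 0)
second_url = resolve(template, next_start_at("offset", 0, records_read=5))
```

Here first_url ends in since=0 and second_url ends in since=5, matching the example.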

OAuth 2 Authentication

The Web Client origin can use the OAuth 2 protocol to connect to an HTTP service that uses basic or digest authentication, OAuth 2 client credentials, OAuth 2 username and password, or OAuth 2 access token.

The OAuth 2 protocol authorizes third-party access to HTTP service resources without sharing credentials. The Web Client origin uses credentials to request an access token from the service. The service returns the token to the origin, and then the origin includes the token in a header in each request to the request endpoint.

The credentials that you enter to request an access token depend on the credentials grant type required by the HTTP service. You can define the following OAuth 2 credentials grant types for Web Client stages:
Client credentials grant

The stage sends its own credentials - the client ID and client secret, or the basic or digest authentication credentials - to the HTTP service. For example, use the client credentials grant to process data from the Twitter API or from the Microsoft Azure Active Directory (Azure AD) API.

For more information about the client credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.4.

Access token grant

The stage sends an access token to an authorization service and obtains an access token for the HTTP service.

Owner credentials grant

The stage sends the credentials for the resource owner - the resource owner user name, password, client ID, and client secret - to the HTTP service. Or, you can use this grant type to migrate existing clients using basic or digest authentication to OAuth 2 by converting the stored credentials to an access token.

For example, you can use this grant to process data from the Getty Images API. For more information about using OAuth 2 to connect to the Getty Images API, see http://developers.gettyimages.com/api/docs/v3/oauth2.html.

For more information about the resource owner password credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.3.
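
To make the client credentials grant more concrete, the sketch below builds, without sending, a token request in the general shape that RFC 6749 section 4.4 describes. It is not a description of what the stage sends: client_credentials_request is a hypothetical helper, the token URL and credentials are placeholders, and passing the client ID and secret through HTTP Basic authentication is one of the options the RFC allows.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def client_credentials_request(token_url: str, client_id: str,
                               client_secret: str) -> Request:
    """Build (but do not send) a client credentials token request."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    basic = base64.b64encode(raw).decode("ascii")
    body = urlencode({"grant_type": "client_credentials"}).encode("ascii")
    return Request(
        token_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


# Placeholder values for illustration only.
request = client_credentials_request(
    "https://auth.example.com/token", "my-id", "my-secret")
```

The service would answer such a request with an access token, which is then included in a header in each request to the request endpoint.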

Generated Records

The Web Client origin generates records based on the responses it receives.

Data in the response body is parsed based on the selected data format. For HEAD responses, when the response body contains no data, the origin creates an empty record. Information returned from the HEAD response appears in record header attributes. For all other methods, when the response body contains no data, no records are created.

In generated records, all standard response header fields, such as Content-Encoding and Content-Type, are written to corresponding record header attributes. Custom response header fields are also written to record header attributes. Record header attribute names match the original response header names.

When you configure the origin to generate records for unsuccessful statuses that are not added as per-status actions, then the record might also include a field that contains the error response body.
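
The record generation rules above can be summarized in a short sketch. The response_to_records helper and the dictionary layout of a record are hypothetical. The body is assumed to be already parsed into a list of field dictionaries, and error response handling is left out.

```python
from typing import Any, Dict, List


def response_to_records(method: str, response_headers: Dict[str, str],
                        parsed_body: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn one HTTP response into zero or more records."""
    # Attribute names match the original response header names.
    attributes = dict(response_headers)
    if parsed_body:
        return [{"header": dict(attributes), "fields": fields}
                for fields in parsed_body]
    if method.upper() == "HEAD":
        # An empty body on HEAD still yields one empty record.
        return [{"header": attributes, "fields": {}}]
    # Any other method with an empty body yields no records.
    return []
```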

Data Formats

The Web Client origin processes data differently based on the data format that you select.

The Web Client origin processes data formats as follows:

Avro
Generates a record for every message. Includes a precision and scale field attribute for each Decimal field.
The stage includes the Avro schema in an avroSchema record header attribute. You can use one of the following methods to specify the location of the Avro schema definition:
  • Message/Data Includes Schema - Use the schema in the message.
  • In Pipeline Configuration - Use the schema that you provide in the stage configuration.
  • Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry. Confluent Schema Registry is a distributed storage layer for Avro schemas. You can configure the stage to look up the schema in Confluent Schema Registry by the schema ID embedded in the message or by the schema ID or subject specified in the stage configuration.
Using a schema in the stage configuration or retrieving a schema from Confluent Schema Registry overrides any schema that might be included in the message and can improve performance.
Binary
Generates a record with a single byte array field at the root of the record.
When the data exceeds the user-defined maximum data size, the origin cannot process the data. Because the record is not created, the origin cannot pass the record to the pipeline to be written as an error record. Instead, the origin generates a stage error.
Datagram
Generates a record for every message. The origin can process collectd messages, NetFlow 5 and NetFlow 9 messages, and the following types of syslog messages:
  • RFC 5424
  • RFC 3164
  • Non-standard common messages, such as RFC 3339 dates with no version digit
When processing NetFlow messages, the stage generates different records based on the NetFlow version. When processing NetFlow 9, the records are generated based on the NetFlow 9 configuration properties. For more information, see NetFlow Data Processing.
Delimited
Generates a record for each delimited line.
The CSV parser that you choose determines the delimiter properties that you configure and how the stage handles parsing errors. You can specify if the data includes a header line and whether to use it. You can define the number of lines to skip before reading, the character set of the data, and the root field type to use for the generated record.
You can also configure the stage to replace a string constant with null values and to ignore control characters.
For more information about reading delimited data, see Reading Delimited Data.
JSON
Generates a record for each JSON object. You can process JSON files that include multiple JSON objects or a single JSON array.
When an object exceeds the maximum object length defined for the origin, the origin processes the object based on the error handling configured for the stage.
Log
Generates a record for every log line.
When a line exceeds the user-defined maximum line length, the origin truncates the line.
You can include the processed log line as a field in the record. If the log line is truncated, and you request the log line in the record, the origin includes the truncated line.
You can define the log format or type to be read.
Protobuf
Generates a record for every protobuf message. By default, the origin assumes messages contain multiple protobuf messages.
Protobuf messages must match the specified message type and be described in the descriptor file.
When the data for a record exceeds 1 MB, the origin cannot continue processing data in the message. The origin handles the message based on the stage error handling property and continues reading the next message.
For information about generating the descriptor file, see Protobuf Data Format Prerequisites.
Text
Generates a record for each line of text.
When a line exceeds the specified maximum line length, the origin truncates the line. The origin adds a boolean field named Truncated to indicate if the line was truncated.
XML
Generates records based on a user-defined delimiter element. Use an XML element directly under the root element or define a simplified XPath expression. If you do not define a delimiter element, the origin treats the XML file as a single record.
Generated records include XML attributes and namespace declarations as fields in the record by default. You can configure the stage to include them in the record as field attributes.
You can include XPath information for each parsed XML element and XML attribute in field attributes. This also places each namespace in an xmlns record header attribute.
Note: Field attributes and record header attributes are written to destination systems automatically only when you use the SDC RPC data format in destinations. For more information about working with field attributes and record header attributes, and how to include them in records, see Field Attributes and Record Header Attributes.
When a record exceeds the user-defined maximum record length, the origin skips the record and continues processing with the next record. It sends the skipped record to the pipeline for error handling.
Use the XML data format to process valid XML documents. For more information about XML processing, see Reading and Processing XML Data.
Tip: If you want to process invalid XML documents, you can try using the text data format with custom delimiters. For more information, see Processing XML Data with Custom Delimiters.
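
For the JSON data format described above, the sketch below shows how a response body that holds either multiple JSON objects or a single JSON array can be split into one record per object. parse_json_records is a hypothetical helper built on the Python standard library. Treating each array element as its own record is an assumption, and maximum object length handling is omitted.

```python
import json
from typing import Any, List


def parse_json_records(text: str) -> List[Any]:
    """Split a body of concatenated JSON values into individual records."""
    decoder = json.JSONDecoder()
    records: List[Any] = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between values
            continue
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            records.extend(value)  # a JSON array: one record per element
        else:
            records.append(value)  # a single JSON object
    return records
```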