Web Client

The Web Client processor sends requests to a resource endpoint and writes responses to records. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

The Web Client processor requires that Data Collector use Java version 17. For more information, see Java Versions and Available Features.

The Web Client processor provides much of the same functionality as the HTTP Client processor. It also provides functionality not available in the HTTP Client processor. For more information, see Comparing Web Client and HTTP Client Processors.

For each request, the processor writes data from the response to the specified output field. When the response contains multiple values, the processor can write the first value to a single record, all values to a list in a single record, or all values to separate records.

You can use the Web Client processor to perform a range of standard requests, or you can use an expression to determine the request for each record.

When you configure the Web Client processor, you define the request endpoint, optional headers, and method to use for the requests. You can also use a connection to configure the processor.

You configure the processor to generate one request for each record or to generate a single request containing all records in the batch.

You define the pagination mode, optional status response actions, and an optional response endpoint for responses.

You can configure the timeout, request transfer encoding, and authentication type for both requests and responses.

You can optionally use a proxy server and configure TLS properties. You can also configure the processor to use the OAuth 2 protocol to connect to an HTTP service.

Note: This processor is a Technology Preview feature. It is not meant for use in production.

Comparing Web Client and HTTP Client Processors

Data Collector provides two processors that send requests to HTTP endpoints and write data to records. The HTTP Client processor is the older of the two. The newer Web Client processor includes key functionality available in the older processor, as well as improvements and new features.

The following is a list of key differences between the two processors:

  • The Web Client processor allows you to configure different data formats for request data and response data.

  • The Web Client processor supports parallel HTTP requests.
  • The Web Client processor allows you to configure per-timeout actions.

  • The HTTP Client processor can be configured to use Universal authentication. Both processors can be configured to use Basic, Digest, OAuth 1, and OAuth 2 authentication.

HTTP Method

You can use the following methods with the Web Client processor:

  • GET

  • POST

  • PUT

  • PATCH

  • DELETE

  • HEAD

  • Expression - An expression that evaluates to one of the other methods.

Expression Method

The Expression method allows you to write an expression that evaluates to a standard HTTP method. Use the Expression method to determine the method dynamically for each record. For example, you can use an expression that passes data to the server using the PUT method based on the data in a field.
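
For example, the following expression, a minimal sketch that assumes a hypothetical /action field in each record, sends updates with the PUT method and all other records with the POST method:

${record:value('/action') == 'update' ? 'PUT' : 'POST'}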

Headers

You can configure optional headers to include in the request made by the stage. Configure the headers in the following properties on the Request tab:

  • Security Headers

  • Common Headers

You can define headers in either property. However, only security headers support using credential functions to retrieve sensitive information from supported credential stores.

If you define the same header in both properties, security headers take precedence.
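
For example, a security header value might use the credential:get function to retrieve a token at runtime. The store ID, group, and secret name below are placeholders for your own credential store configuration:

Authorization: Bearer ${credential:get("jks", "all", "webClientToken")}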

Grouping Style

The Web Client processor can generate one HTTP request for each record, or it can generate a single request containing all records in the batch.

Configure the processor to generate requests in one of the following ways:

Multiple requests per batch
If you set the Grouping Style property to One Request per Record, the processor generates one HTTP request for each record in the batch and sends multiple requests at a time. To preserve record order, the processor waits until all requests for the entire batch are completed before processing the next batch.
Single request per batch
If you set the Grouping Style property to One Request per Batch, the processor generates a single HTTP request containing all records in the batch.

Event Generation

The Web Client processor can generate events that you can use in an event stream. When you enable event generation, the processor generates event records when it starts processing data and each time it completes processing all available data.

You can use events generated by the Web Client processor in any logical way. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the Web Client processor include the following event-related record header attributes. Record header attributes are stored as String values:

Record Header Attribute Description
sdc.event.type Event type. Uses one of the following types:
  • finished - Generated when the processor finishes processing data.
  • start - Generated when the processor starts processing data.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.

The processor can generate the following types of event records:

finished
The processor generates a finished event record when the processor finishes processing data from the endpoint.

The finished event record generated by the processor has the sdc.event.type record header attribute set to finished and includes the following fields:

Field Description
webclient.pagination.calls.count The number of times the stage has paginated.
webclient.offset The pagination value for the current pagination.
webclient.last.page The last page number at which the finished event record is generated.
webclient.time The time the stage finished reading data.
start
The processor generates a start event record when the processor starts reading data from the endpoint.

The start event record generated by the processor has the sdc.event.type record header attribute set to start and includes the following fields:

Field Description
webclient.endpoint The endpoint that the data is read from.
webclient.offset The pagination value for the current pagination.
webclient.method The HTTP method used.
webclient.body The body of the request.
webclient.time The time the request was made.
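
For illustration, a start event record for the first page of a paginated request might carry field values like the following. All values shown are hypothetical:

{
  "webclient.endpoint": "https://myapp.com/api/objects?page=1",
  "webclient.offset": "1",
  "webclient.method": "GET",
  "webclient.body": "",
  "webclient.time": "2024-01-01T12:00:00Z"
}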

Per-Status Actions

The Web Client processor accepts only responses with a status code that the stage is configured to treat as successful. When a response includes any other status code, the processor generates an error and handles the record based on the error record handling configured for the stage.

You can configure the processor to perform one of several actions when it encounters an unsuccessful status code.

To configure a per-status action, you enter an HTTP status code, such as 504 for gateway timeouts, and then select one of the following actions for the stage to perform for that code:
  • Retry with constant backoff
  • Retry with linear backoff
  • Retry with exponential backoff
  • Generate output record
  • Generate error record
  • Abort pipeline

When defining a retry with constant, linear, or exponential backoff, you also specify the backoff interval to wait in milliseconds. When defining any of the retry actions, you specify the maximum number of retries and a status failure action. If the stage receives a successful status code during a retry, then it processes the response. If the stage doesn't receive a successful status code after the maximum number of retries, then the stage performs the specified status failure action. You can only specify a status failure action for a retry action.
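
The strategies differ in how the wait time grows between retries. The following Python sketch illustrates the schedules that constant, linear, and exponential backoff conventionally produce for a given base interval. It is illustrative only and does not document the stage's exact retry arithmetic:

# Illustrative backoff schedules, assuming a 500 ms base interval
# and a maximum of 4 retries.
base_ms = 500
max_retries = 4

for attempt in range(1, max_retries + 1):
    constant = base_ms                          # 500, 500, 500, 500
    linear = base_ms * attempt                  # 500, 1000, 1500, 2000
    exponential = base_ms * 2 ** (attempt - 1)  # 500, 1000, 2000, 4000
    print(attempt, constant, linear, exponential)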

You can add multiple status codes and configure a specific action for each code.

Note: When using OAuth, all per-status actions configured for 401 Unauthorized and 403 Forbidden statuses are ignored. Instead, the stage generates a new OAuth token. If the same error occurs again, the stage generates a stage error.

Per-Timeout Actions

By default, the Web Client processor retries an operation five times before generating an error. You can configure the stage to use different timeout criteria and to perform one of several actions when a specific type of timeout reaches its configured limit.

To configure a per-timeout action, you select a timeout type, such as request, enter a timeout interval, and then select one of the following actions for the stage to perform for that timeout type:
  • Retry with constant backoff
  • Retry with linear backoff
  • Retry with exponential backoff
  • Generate output record
  • Generate error record
  • Abort pipeline

When defining a retry with constant, linear, or exponential backoff, you also specify the backoff interval to wait in milliseconds. When defining any of the retry actions, you specify the maximum number of retries and a timeout failure action. If the stage receives a response during a retry, then it processes the response. If the stage doesn't receive a response after the maximum number of retries, then the stage performs the specified timeout failure action.

You can add multiple timeout types and specify timeout criteria and actions for each of them.

Pagination

The Web Client processor can use pagination to retrieve a large volume of data from a paginated API.

When configuring the Web Client processor to use pagination, use the pagination type supported by the API of the HTTP client. You will likely need to consult the documentation for the origin system API to determine the pagination type to use and the properties to set.

The Web Client processor supports the following common pagination types:

Link in Header
After processing the current page, the stage uses the link in the HTTP header to access the next page. The link in the header can be an absolute URL or a URL relative to the next page link base URL configured for the stage. For example, let's say you configure the following next page link base URL for the stage:
https://myapp.com/api/objects?page=1
The next link in the HTTP header can include an absolute URL, as follows:
link:<https://myapp.com/api/objects?page=2>; rel="next"
Or the next link can include a URL relative to the next page link base URL, as follows:
link:<objects?page=2>; rel="next"
Link in Body
After processing the current page, the stage uses the link in a field in the response body to access the next page. The link in the response field can be an absolute URL or a URL relative to the next page link base URL configured for the stage. For example, let's say you configure the following next page link base URL for the stage:
http://myapp.com/api/tickets.json?start_time=138301982
The next link in the response field can include an absolute URL, as follows:
"next_page":"http://myapp.com/api/tickets.json?start_time=1389078385",

Or the next link can include a URL relative to the next page link base URL, as follows:
"next_page":"tickets.json?start_time=1389078385",

Page
The stage begins processing with the specified initial page, and then requests the following pages. Use the ${startAt} variable in the resource URL as the value of the page number to request. You can optionally set a final page at which the stage stops reading data.
Offset
The stage begins processing with the specified initial offset, and then requests the following offsets. Use the ${startAt} variable in the resource URL as the value of the offset to request. You can optionally set a final offset at which the stage stops reading data.
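
For reference, the following Python sketch shows how the link-in-header convention works: a client reads the next-page URL from the Link header and resolves it against a base URL when it is relative. It illustrates the HTTP convention the stage follows, not the stage's internal implementation:

import re
from urllib.parse import urljoin

# A typical Link header returned by a paginated API.
link_header = '<objects?page=2>; rel="next"'
next_page_link_base = "https://myapp.com/api/objects?page=1"

# Extract the URL whose relation type is "next".
match = re.search(r'<([^>]+)>\s*;\s*rel="next"', link_header)
if match:
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    next_url = urljoin(next_page_link_base, match.group(1))
    print(next_url)  # https://myapp.com/api/objects?page=2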

Page or Offset Number

When using page or offset pagination, the API of the HTTP client typically requires that you include a page or offset parameter at the end of the response endpoint URL. The parameter determines the next page or offset of data to request.

The name of the parameter used by the API varies. For example, it might be offset, page, start, or since. Consult the documentation for the origin system API to determine the name of the page or offset parameter.

The Web Client processor provides a ${startAt} variable that you can use in the URL as the value of the page or offset. For example, your resource URL might be any of the following:

  • http://webservice/object?limit=15&offset=${startAt}
  • https://myapp.com/product?limit=5&since=${startAt}
  • https://myotherapp.com/api/v1/products?page=${startAt}

When the pipeline starts, the Web Client stage uses the value of the Initial Page or Initial Offset property as the ${startAt} variable value. After the stage reads a page of results, the stage increments the ${startAt} variable by one if using page pagination, or by the number of records read from the page if using offset pagination.

Example

Say that you configure offset pagination, set the initial offset to 0, and use the following response endpoint:
https://myapp.com/product?limit=5&since=${startAt}
When you start the pipeline, the stage resolves the response endpoint to:
https://myapp.com/product?limit=5&since=0
The first page of results includes items 0 through 4. After reading all 5 records from the first page, the stage increments the ${startAt} variable by 5, such that the next response endpoint is resolved to:
https://myapp.com/product?limit=5&since=5

The second page of results also includes 5 items, items 5 through 9.
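
The following Python sketch mirrors this arithmetic. It shows how the ${startAt} value advances under each pagination type and is illustrative only:

# Illustrative: how the ${startAt} value advances after each page is read.
def next_start_at(start_at, pagination_type, records_read):
    if pagination_type == "page":
        return start_at + 1              # page numbers advance by one
    if pagination_type == "offset":
        return start_at + records_read   # offsets advance by records read
    raise ValueError("unsupported pagination type")

# Offset pagination with a limit of 5, as in the example above:
print(next_start_at(0, "offset", 5))   # 5  -> ...since=5
print(next_start_at(5, "offset", 5))   # 10 -> ...since=10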

OAuth 2 Authentication

The Web Client processor can use the OAuth 2 protocol to connect to an HTTP service that uses basic or digest authentication, OAuth 2 client credentials, OAuth 2 username and password, or OAuth 2 access token.

The OAuth 2 protocol authorizes third-party access to HTTP service resources without sharing credentials. The Web Client processor uses credentials to request an access token from the service. The service returns the token to the processor, and then the processor includes the token in a header in each request to the request endpoint.

The credentials that you enter to request an access token depend on the credentials grant type required by the HTTP service. You can define the following OAuth 2 credentials grant types for Web Client stages:
Client credentials grant

The stage sends its own credentials - the client ID and client secret, or the basic or digest authentication credentials - to the HTTP service. For example, use the client credentials grant to process data from the Twitter API or from the Microsoft Azure Active Directory (Azure AD) API.

For more information about the client credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.4.

Access token grant

The stage sends an access token to an authorization service and obtains an access token for the HTTP service.

Owner credentials grant

The stage sends the credentials for the resource owner - the resource owner user name, password, client ID, and client secret - to the HTTP service. Or, you can use this grant type to migrate existing clients using basic or digest authentication to OAuth 2 by converting the stored credentials to an access token.

For example, you can use this grant to process data from the Getty Images API. For more information about using OAuth 2 to connect to the Getty Images API, see http://developers.gettyimages.com/api/docs/v3/oauth2.html.

For more information about the resource owner password credentials grant, see https://tools.ietf.org/html/rfc6749#section-4.3.
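
As context for the client credentials grant, the following Python sketch shows the token exchange that the OAuth 2 protocol defines. The endpoint and credentials are placeholders, and the stage performs the equivalent exchange for you based on the properties you configure:

import json
import urllib.parse
import urllib.request

# Placeholders: substitute your token endpoint and client credentials.
token_endpoint = "https://auth.example.com/oauth2/token"
payload = urllib.parse.urlencode({
    "grant_type": "client_credentials",
    "client_id": "my-client-id",
    "client_secret": "my-client-secret",
}).encode()

request = urllib.request.Request(token_endpoint, data=payload, method="POST")
request.add_header("Content-Type", "application/x-www-form-urlencoded")

# The service returns a JSON document that includes the access token,
# which is then sent with each request as: Authorization: Bearer <token>
with urllib.request.urlopen(request) as response:
    access_token = json.load(response)["access_token"]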

Generated Output

For each request that returns a successful status code, the Web Client processor writes the response to the specified output field. The processor parses data in the response body into values based on the selected data format. You configure how the processor writes multiple values. The processor can write either the first value to a single record, all values to a list in a single record, or all values to separate records.

For HEAD responses, the response body contains no data. Therefore, the processor writes output only to record header attributes, leaving the output field empty.

Data Formats

The Web Client processor parses each server response based on the selected data format and writes the response to the specified output field in the selected format.

You configure how the processor writes parsed responses that contain multiple values. The processor can write either the first value to a single record, all values to a list in a single record, or all values to separate records.

Available data formats include:

Avro
Generates a record for every message. Includes a precision and scale field attribute for each Decimal field.
The stage includes the Avro schema in an avroSchema record header attribute. You can use one of the following methods to specify the location of the Avro schema definition:
  • Message/Data Includes Schema - Use the schema in the message.
  • In Pipeline Configuration - Use the schema that you provide in the stage configuration.
  • Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry. Confluent Schema Registry is a distributed storage layer for Avro schemas. You can configure the stage to look up the schema in Confluent Schema Registry by the schema ID embedded in the message or by the schema ID or subject specified in the stage configuration.
Using a schema in the stage configuration or retrieving a schema from Confluent Schema Registry overrides any schema that might be included in the message and can improve performance.
Binary
Generates a record with a single byte array field at the root of the record.
When the data exceeds the user-defined maximum data size, the processor cannot process the data. Because the record is not created, the processor cannot pass the record to the pipeline to be written as an error record. Instead, the processor generates a stage error.
Datagram
Generates a record for every message. The processor can process collectd messages, NetFlow 5 and NetFlow 9 messages, and the following types of syslog messages:
  • RFC 5424
  • RFC 3164
  • Non-standard common messages, such as RFC 3339 dates with no version digit
When processing NetFlow messages, the stage generates different records based on the NetFlow version. When processing NetFlow 9, the records are generated based on the NetFlow 9 configuration properties. For more information, see NetFlow Data Processing.
Delimited
The processor parses each line in the response as a value, and either writes only the first delimited line to a single record, writes all delimited lines to a single record with each line written to a list item, or writes each delimited line to separate records.
The CSV parser that you choose determines the delimiter properties that you configure and how the stage handles parsing errors. You can specify if the data includes a header line and whether to use it. You can define the number of lines to skip before reading, the character set of the data, and the root field type to use for the generated record.
You can also configure the stage to replace a string constant with null values and to ignore control characters.
For more information about reading delimited data, see Reading Delimited Data.
JSON
The processor parses each object in the response into a value, and either writes only the first object to a single record, writes all objects to a list in a single record, or writes each object to separate records.
When an object exceeds the specified maximum object length, the processor processes the object based on the error handling configured for the stage.
Log
Generates a record for every log line.
When a line exceeds the user-defined maximum line length, the processor truncates it.
You can include the processed log line as a field in the record. If the log line is truncated, and you request the log line in the record, the processor includes the truncated line.
You can define the log format or type to be read.
Protobuf
Generates a record for every protobuf message. By default, the processor assumes messages contain multiple protobuf messages.
Protobuf messages must match the specified message type and be described in the descriptor file.
When the data for a record exceeds 1 MB, the processor cannot continue processing data in the message. The processor handles the message based on the stage error handling property and continues reading the next message.
For information about generating the descriptor file, see Protobuf Data Format Prerequisites.
Text
If you specify a custom delimiter, the processor parses the data into values based on the delimiter. Otherwise, the processor parses each line into a value. Then, the processor either writes only the first value to a single record, writes all values to a list in a single record, or writes each value to separate records.
When a line exceeds the specified maximum line length, the processor truncates the line and adds a Boolean field named Truncated.
XML
If you specify a delimiter element, the processor uses the delimiter element to parse the response into values. The processor either writes only the first delimited element to a single record, writes all delimited elements to a list in a single record, or writes each delimited element to separate records.
If you do not specify a delimiter element, the processor writes the entire response to a single record.
When a record exceeds the specified maximum record length, the processor skips the record and continues processing with the next record. It sends the skipped record to the pipeline for error handling.
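
For example, with a delimiter element of msg, a hypothetical response body like the following is parsed into two values, which the processor then writes according to the configured multiple values behavior:

<root>
  <msg>first value</msg>
  <msg>second value</msg>
</root>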

Configuring a Web Client Processor

Configure a Web Client processor to perform requests against a resource endpoint.

This processor is a Technology Preview feature. It is not meant for use in production.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Connection tab, configure the following properties:
    Connection Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    Authentication Scheme

    Determines the authentication type used to connect to the server:

    • None - Performs no authentication.

    • Basic - Uses basic authentication. Requires a username and password.

      Use with HTTPS to avoid passing unencrypted credentials.

    • Digest - Uses digest authentication. Requires a username and password.

    • Bearer - Uses bearer authentication. Requires a token.

    • OAuth 1 - Uses OAuth 1.0 authentication. Requires OAuth credentials.

    • OAuth 2 - Uses OAuth 2.0 authentication. Requires OAuth credentials.

    Request Endpoint

    URL of the request resource.

    Data Interchange Pattern

    Determines whether the request and response share an endpoint:

    • One-Step - The request and response have the same endpoint.

    Keystore Management Determines the authentication and encryption used to connect to the HTTP server.
    • Automatic (for most HTTP and HTTPS requests) - Uses automatically-generated keystore and truststore configurations.
    • Manual (for manually configured HTTPS requests) - Manually configure the keystore and truststore to use.
    Keystore Location

    Where to load the keystore from:

    • Local - Loads the keystore from a local file.

    • Remote - Loads the keystore from the provided key and certificate chain.

    Required for manual keystore management.

    Keystore File

    Path to the local keystore file. Enter an absolute path to the file or enter the following expression to define the file stored in the Data Collector resources directory:

    ${runtime:resourcesDirPath()}/keystore.jks

    Available for manual keystore management.

    Keystore Type Type of keystore to use. Use one of the following types:
    • Java Keystore File (JKS)
    • PKCS #12 (p12 file)

    Default is Java Keystore File (JKS).

    Keystore Password

    Password to the keystore file. A password is optional, but recommended.

    Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Private Key Private key used in the remote keystore. Enter a credential function that returns the key or enter the contents of the key.
    Certificate Chain Each PEM certificate used in the remote keystore. Enter a credential function that returns the certificate or enter the contents of the certificate.

    Using simple or bulk edit mode, click the Add icon to add additional certificates.

    Keystore Key Algorithm

    Algorithm to manage the keystore.

    Default is SunX509.

    Truststore File

    Path to the local truststore file. Enter an absolute path to the file or enter the following expression to define the file stored in the Data Collector resources directory:

    ${runtime:resourcesDirPath()}/truststore.jks

    By default, no truststore is used.

    Truststore Type
    Type of truststore to use. Use one of the following types:
    • Java Keystore File (JKS)
    • PKCS #12 (p12 file)

    Default is Java Keystore File (JKS).

    Truststore Password

    Password to the truststore file. A password is optional, but recommended.

    Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Truststore Key Algorithm

    Algorithm to manage the truststore.

    Default is SunX509.

    Required for manual keystore management.

    Default Protocol Versions

    Use only modern default secure protocol versions.

    Available for manual keystore management.

    Default Cipher Suites

    Use only modern default cipher suites.

    Available for manual keystore management.

    Use Proxy Server Enables using a proxy server to connect to the system.
    Proxy Server Proxy server endpoint.
    Proxy User User name for proxy credentials.
    Proxy Password Password for proxy credentials.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
  3. For Basic or Digest authentication, on the Connection tab, configure the following properties:
    Connection Property Description
    User

    Authentication username.

    Password

    Authentication password.

    Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
  4. For Bearer authentication, on the Connection tab, configure the following property:
    Bearer Property Description
    Token Authentication token to access the requested resource.
  5. For OAuth 1 authentication, on the Connection tab, configure the following properties:
    OAuth 1 Property Description
    Consumer Key

    Name of the OAuth consumer key.

    Consumer Secret

    OAuth consumer secret.

    Access Token

    OAuth 1.0 access token.

    Token Secret

    OAuth 1.0 token secret.

  6. For OAuth 2 authentication, on the Connection tab, configure the following properties.
    For more information about OAuth 2, see OAuth 2 Authentication.
    OAuth 2 Property Description
    Grant Type

    Grant type required by the HTTP service.

    Token Endpoint

    URL to request the access token.

    Client ID

    Client ID that the HTTP service uses to identify the HTTP client.

    Enter for the client credentials grant, which uses a client ID and secret for authentication, or for the resource owner password credentials grant when it requires a client ID and secret.

    Required for the Client Credentials grant.

    Client Secret

    Client secret that the HTTP service uses to authenticate the HTTP client.

    Enter for the client credentials grant, which uses a client ID and secret for authentication, or for the resource owner password credentials grant when it requires a client ID and secret.

    Tip: To secure sensitive information such as the client secret, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.

    Required for the Client Credentials grant.

    Signing Algorithm Algorithm used to sign the access token.

    Required for the Access Token grant.

    Signing Key Private key that the selected signing algorithm uses to sign the access token.
    Tip: To secure sensitive information such as the JWT signing key, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.

    Required for the Access Token grant using a signing algorithm.

    Token Headers Headers to include in the access token.

    Available for the Access Token grant.

    Token Claims Claims to include in the access token. Specify in JSON format. Enter each claim required to obtain an access token. You can include the expression language in the token claims.

    For example, to request an access token to read from Google service accounts, enter the following claims with the appropriate values:

    {
      "iss":"my_name@my_account.iam.gserviceaccount.com",
      "scope":"https://www.googleapis.com/auth/drive",
      "aud":"https://oauth2.googleapis.com/token",
      "exp":${(time:dateTimeToMilliseconds(time:now())/1000) + 50 * 60},
      "iat":${time:dateTimeToMilliseconds(time:now())/1000}
    }

    Required for the Access Token grant.

    Owner User Resource owner user name.

    Required for the Owner Credentials grant.

    Owner Password Resource owner password.

    Required for the Owner Credentials grant.

    Owner Client ID Resource owner client ID.

    Available for the Owner Credentials grant.

    Owner Client Secret Resource owner client secret.

    Required for the Owner Credentials grant.

    Additional Parameters Optional parameters to send to the token endpoint when requesting an access token. For example, you can define the OAuth 2 scope request parameter.

    Using simple or bulk edit mode, click the Add icon to add additional key-value pairs.

  7. On the Requests tab, configure the following properties:
    Request Property Description
    Grouping Style How to group records to generate requests.
    • One Request Per Record
    • One Request Per Batch
    Method HTTP request method. Use one of the standard HTTP methods.
    Security Headers Security headers to include in the request. Using simple or bulk edit mode, click Add to add additional security headers.

    You can use credential functions to retrieve sensitive information from supported credential stores.

    Note: If you define the same header in the Common Headers property, security headers take precedence.
    Request Body Request data to use with the specified method. Available for the POST, PUT, PATCH, and HEAD methods.

    You can use time functions and datetime variables, such as ${YYYY()}, in the request body.

    Wait Time Between Requests (ms) Milliseconds to wait between requests.
    Maximum Parallel Requests Maximum number of requests to make simultaneously.
    Common Headers Common headers to include in the request. Using simple or bulk edit mode, click Add to add additional common headers.
    Note: If you define the same header in the Security Headers property, security headers take precedence.
    Default Request Content Type Request content type to set if not specified as a header.
    Request Time Zone Time zone to use in time expressions.
    Log Requests Include HTTP requests in the Data Collector log.
    Note: If you enable this property, Data Collector may log sensitive data.
  8. On the Request Data tab, configure the following property:
    Request Data Property Description
    Request Data Format

    Format to use to generate HTTP output data. Use one of the following data formats:

    • Avro

    • Binary

    • Datagram
    • Delimited

    • JSON

    • Protobuf

    • Text

    • XML

  9. For Avro data, on the Request Data tab, configure the following properties:
    Avro Property Description
    Avro Schema Location Location of the Avro schema definition to use when writing data:
    • In Pipeline Configuration - Use the schema that you provide in the stage configuration.
    • In Record Header - Use the schema in the avroSchema record header attribute. Use only when the avroSchema attribute is defined for all records.
    • Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry.
    Avro Schema Avro schema definition used to write the data.

    You can optionally use the runtime:loadResource function to load a schema definition stored in a runtime resource file.

    Register Schema Registers a new Avro schema with Confluent Schema Registry.
    Schema Registry URLs Confluent Schema Registry URLs used to look up the schema or to register a new schema. To add a URL, click Add and then enter the URL in the following format:
    http://<host name>:<port number>
    Basic Auth User Info User information needed to connect to Confluent Schema Registry when using basic authentication.

    Enter the key and secret from the schema.registry.basic.auth.user.info setting in Schema Registry using the following format:

    <key>:<secret>
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Look Up Schema By Method used to look up the schema in Confluent Schema Registry:
    • Subject - Look up the specified Avro schema subject.
    • Schema ID - Look up the specified Avro schema ID.
    Schema Subject Avro schema subject to look up or to register in Confluent Schema Registry.

    If the specified subject to look up has multiple schema versions, the processor uses the latest schema version for that subject. To use an older version, find the corresponding schema ID, and then set the Look Up Schema By property to Schema ID.

    Schema ID Avro schema ID to look up in Confluent Schema Registry.
    Include Schema Includes the schema in each file.
    Note: Omitting the schema definition can improve performance, but requires the appropriate schema management to avoid losing track of the schema associated with the data.
    Avro Compression Codec The Avro compression type to use.

    When using Avro compression, do not enable other compression available in the processor.

  10. For binary data, on the Request Data tab, configure the following property:
    Binary Property Description
    Binary Field Path Field that contains the binary data.
  11. For delimited data, on the Request Data tab, configure the following properties:
    Delimited Property Description
    Header Line Indicates whether to create a header line.
    Delimiter Format Format for delimited data:
    • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
    • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
    • MS Excel CSV - Microsoft Excel comma-separated file.
    • MySQL CSV - MySQL comma-separated file.
    • Tab-Separated Values - File that includes tab-separated values.
    • PostgreSQL CSV - PostgreSQL comma-separated file.
    • PostgreSQL Text - PostgreSQL text file.
    • Custom - File that uses user-defined delimiter, escape, and quote characters.
    Replace New Line Characters Replaces new line characters with the configured string.

    Recommended when writing data as a single line of text.

    New Line Character Replacement String to replace each new line character. For example, enter a space to replace each new line character with a space.

    Leave empty to remove the new line characters.

    Charset Character set to use when writing data.
  12. For JSON data, on the Request Data tab, configure the following properties:
    JSON Property Description
    JSON Content Method to write JSON data:
    • JSON Array of Objects - Each file includes a single array. In the array, each element is a JSON representation of each record.
    • Multiple JSON Objects - Each file includes multiple JSON objects. Each object is a JSON representation of a record.
    Charset Character set to use when writing data.
  13. For protobuf data, on the Request Data tab, configure the following properties:
    Protobuf Property Description
    Protobuf Descriptor File Descriptor file (.desc) to use. The descriptor file must be in the Data Collector resources directory, $SDC_RESOURCES.

    For more information about environment variables, see Data Collector Environment Configuration in the Data Collector documentation. For information about generating the descriptor file, see Protobuf Data Format Prerequisites.

    Message Type Fully-qualified name for the message type to use when writing data.

    Use the following format: <package name>.<message type>.

    Use a message type defined in the descriptor file.
  14. For text data, on the Request Data tab, configure the following properties:
    Text Property Description
    Text Field Path Field that contains the text data to be written. All data must be incorporated into the specified field.
    Record Separator Characters to use to separate records. Use any valid Java string literal. For example, when writing to Windows, you might use \r\n to separate records.

    By default, the processor uses \n.

    On Missing Field When a record does not include the text field, determines whether the processor reports the missing field as an error or ignores the missing field.
    Insert Record Separator if No Text When configured to ignore a missing text field, inserts the configured record separator string to create an empty line.

    When not selected, discards records without the text field.

    Charset Character set to use when writing data.
  15. For XML data, on the Request Data tab, configure the following properties:
    XML Property Description
    Pretty Format Adds indentation to make the resulting XML document easier to read. Increases the record size accordingly.
    Validate Schema Validates that the generated XML conforms to the specified schema definition. Records with invalid schemas are handled based on the error handling configured for the stage.
    Important: Regardless of whether you validate the XML schema, the stage requires the record in a specific format. For more information, see Record Structure Requirement.
    XML Schema The XML schema to use to validate records.
  16. On the Response tab, configure the following properties:
    Response Property Description
    Pagination Mode Method of pagination to use. Use a method supported by the API of the HTTP client.
    Continue Without Data Continues pagination even when a page returns empty results.

    Available when using pagination.

    Next Page Link Base Base URL to use for next page relative links.

    For link in header and link in body pagination.

    Next Page Link Header Name of the response header that contains the link to the next page.

    For link in header pagination.

    Next Page Link Field Path Field path in the response that contains the URL to the next page.

    For link in body pagination.

    Stop Condition Condition that evaluates to true when there are no more pages to process.
    For example, let's say that the API of the HTTP client includes a count property that determines the number of items displayed per page. If the count is set to 1000 and a page returns fewer than 1000 items, it is the last page of data. So you'd enter the following expression to stop processing when the count is less than 1000:
    ${record:value('/count') < 1000}

    For link in header and link in body pagination.

    Final Offset Offset at which the stage stops processing records.

    Use -1 to opt out of this property.

    For page pagination.

    Initial Page Initial page number for pagination.

    For page pagination.

    Final Page Page at which the stage stops processing records.

    Use -1 to opt out of this property.

    For page pagination.

    Initial Offset Initial offset for pagination.

    For offset pagination.

    Result Field Path

    Field path in the response that contains the data that you want to process. Must be a list or array field.

    Required when using pagination.

    Keep All Fields Includes all fields from the response in the resulting record when enabled.

    Available when using pagination.

    Per-Status Actions Actions to apply to specified HTTP status codes. Click Add to add per-status actions.
    Per-Timeout Actions Actions to apply to specified timeout types. Click Add to add per-timeout actions.
    Error Field Name of the field to store the error response body in when generating protocol error records.
    Log Responses Include HTTP responses in the Data Collector log.
    Note: If you enable this property, Data Collector may log sensitive data.
  17. On the Response Data tab, configure the following properties:
    Response Data Property Description
    Collect Mode Method for collecting response data.
    Response Data Format
    Format to use to read HTTP response data. Use one of the following data formats:
    • Avro

    • Binary

    • Datagram
    • Delimited

    • JSON

    • Protobuf

    • Text

    • XML

  18. For Avro data, on the Response Data tab, configure the following properties:
    Avro Property Description
    Avro Schema Location Location of the Avro schema definition to use when processing data:
    • Message/Data Includes Schema - Use the schema in the message.
    • In Pipeline Configuration - Use the schema provided in the stage configuration.
    • Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry.

    Using a schema in the stage configuration or in Confluent Schema Registry can improve performance.

    Avro Schema Avro schema definition used to process the data. Overrides any existing schema definitions associated with the data.

    You can optionally use the runtime:loadResource function to load a schema definition stored in a runtime resource file.

    Schema Registry URLs Confluent Schema Registry URLs used to look up the schema. To add a URL, click Add and then enter the URL in the following format:
    http://<host name>:<port number>
    Schema Registry Security Option Authentication and encryption used to connect to the schema registry.
    Truststore Type
    Type of truststore to use. Use one of the following types:
    • Java Keystore File (JKS)
    • PKCS #12 (p12 file)

    Default is Java Keystore File (JKS).

    Truststore File

    Path to the local truststore file. Enter an absolute path to the file or enter the following expression to define the file stored in the Data Collector resources directory:

    ${runtime:resourcesDirPath()}/truststore.jks

    By default, no truststore is used.

    Truststore Password

    Password to the truststore file. A password is optional, but recommended.

    Tip: To secure sensitive information such as passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Basic Auth User Info User information needed to connect to Confluent Schema Registry when using basic authentication.

    Enter the key and secret from the schema.registry.basic.auth.user.info setting in Schema Registry using the following format:

    <key>:<secret>
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Lookup Schema By Method used to look up the schema in Confluent Schema Registry:
    • Subject - Look up the specified Avro schema subject.
    • Schema ID - Look up the specified Avro schema ID.
    • Embedded Schema ID - Look up the Avro schema ID embedded in each message.
    Overrides any existing schema definitions associated with the message.
    Schema Subject Avro schema subject to look up in Confluent Schema Registry.

    If the specified subject has multiple schema versions, the stage uses the latest schema version for that subject. To use an older version, find the corresponding schema ID, and then set the Look Up Schema By property to Schema ID.

    Schema ID Avro schema ID to look up in Confluent Schema Registry.
    Skip Union Indexes Omits header attributes identifying the index number of the element in a union that data is read from.

    If a schema contains many unions and the pipeline does not depend on index information, you can enable this property to avoid long processing times associated with storing a large number of indexes.

  19. For datagram data, on the Response Data tab, configure the following properties:
    Datagram Properties Description
    Charset Character encoding of the messages to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  20. For delimited data, on the Response Data tab, configure the following properties:
    Delimited Property Description
    Header Line Indicates whether to create a header line.
    Lines to Skip Number of lines to skip before reading data.
    CSV Parser Parser to use to process delimited data:
    • Apache Commons - Provides robust parsing and a wide range of delimited format types.
    • Univocity - Can provide faster processing for wide delimited files, such as those with over 200 columns.

    Default is Apache Commons.

    Max Record Length (chars) Maximum length of a record in characters. Longer records are not read.

    This property can be limited by the Data Collector parser buffer size. For more information, see Maximum Record Size.

    Available when using the Apache Commons parser.

    Root Field Type Root field type to use:
    • List-Map - Generates an indexed list of data. Enables you to use standard functions to process data. Use for new pipelines.
    • List - Generates a record with an indexed list with a map for header and value. Requires the use of delimited data functions to process data. Use only to maintain pipelines created before 1.1.0.
    Parse NULLs Replaces the specified string constant with null values.
    Charset Character encoding of the data to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  21. For JSON data, on the Response Data tab, configure the following properties:
    JSON Property Description
    JSON Content Type of JSON content. Use one of the following options:
    • Multiple JSON Objects
    • JSON Array of Objects
    Compression Format The compression format of the files:
    • None - Processes only uncompressed files.
    • Compressed File - Processes files compressed by the supported compression formats.
    • Archive - Processes files archived by the supported archive formats.
    • Compressed Archive - Processes files archived and compressed by the supported archive and compression formats.
    Maximum Object Length (chars) Maximum number of characters in a JSON object.

    Longer objects are diverted to the pipeline for error handling.

    This property can be limited by the Data Collector parser buffer size. For more information, see Maximum Record Size.

    Charset Character encoding of the files to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  22. For log data, on the Response Data tab, configure the following properties:
    Log Property Description
    Log Format Format of the log files. Use one of the following options:
    • Common Log Format
    • Combined Log Format
    • Apache Error Log Format
    • Apache Access Log Custom Format
    • Regular Expression
    • Grok Pattern
    • Log4j
    • Common Event Format (CEF)
    • Log Event Extended Format (LEEF)
    Compression Format The compression format of the files:
    • None - Processes only uncompressed files.
    • Compressed File - Processes files compressed by the supported compression formats.
    • Archive - Processes files archived by the supported archive formats.
    • Compressed Archive - Processes files archived and compressed by the supported archive and compression formats.
    Max Line Length Maximum length of a log line. The processor truncates longer lines.

    This property can be limited by the Data Collector parser buffer size. For more information, see Maximum Record Size.

    Retain Original Line Determines how to treat the original log line. Select to include the original log line as a field in the resulting record.

    By default, the original line is discarded.

    Charset Character encoding of the files to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  23. For protobuf data, on the Response Data tab, configure the following properties:
    Protobuf Property Description
    Protobuf Descriptor File Descriptor file (.desc) to use. The descriptor file must be in the Data Collector resources directory, $SDC_RESOURCES.

    For information about generating the descriptor file, see Protobuf Data Format Prerequisites. For more information about environment variables, see Data Collector Environment Configuration in the Data Collector documentation.

    Message Type The fully-qualified name for the message type to use when reading data.

    Use the following format: <package name>.<message type>.

    Use a message type defined in the descriptor file.
    Delimited Messages Indicates if a message might include more than one protobuf message.
    Compression Format The compression format of the files:
    • None - Processes only uncompressed files.
    • Compressed File - Processes files compressed by the supported compression formats.
    • Archive - Processes files archived by the supported archive formats.
    • Compressed Archive - Processes files archived and compressed by the supported archive and compression formats.
  24. For text data, on the Response Data tab, configure the following properties:
    Text Property Description
    Compression Format The compression format of the files:
    • None - Processes only uncompressed files.
    • Compressed File - Processes files compressed by the supported compression formats.
    • Archive - Processes files archived by the supported archive formats.
    • Compressed Archive - Processes files archived and compressed by the supported archive and compression formats.
    Charset Character encoding of the files to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  25. For XML data, on the Response Data tab, configure the following properties:
    XML Property Description
    Delimiter Element
    Delimiter to use to generate records. Omit a delimiter to treat the entire XML document as one record. Use one of the following:
    • An XML element directly under the root element.

      Use the XML element name without surrounding angle brackets (< >). For example, msg instead of <msg>.

    • A simplified XPath expression that specifies the data to use.

      Use a simplified XPath expression to access data deeper in the XML document or data that requires a more complex access method.

      For more information about valid syntax, see Simplified XPath Syntax.

    Compression Format The compression format of the files:
    • None - Processes only uncompressed files.
    • Compressed File - Processes files compressed by the supported compression formats.
    • Archive - Processes files archived by the supported archive formats.
    • Compressed Archive - Processes files archived and compressed by the supported archive and compression formats.
    Preserve Root Element Includes the root element in the generated records.

    When omitting a delimiter to generate a single record, the root element is the root element of the XML document.

    When specifying a delimiter to generate multiple records, the root element is the XML element specified as the delimiter element or is the last XML element in the simplified XPath expression specified as the delimiter element.

    Include Field XPaths Includes the XPath to each parsed XML element and XML attribute in field attributes. Also includes each namespace in an xmlns record header attribute.

    When not selected, this information is not included in the record. By default, the property is not selected.

    Namespaces Namespace prefix and URI to use when parsing the XML document. Define namespaces when the XML element being used includes a namespace prefix or when the XPath expression includes namespaces.

    For information about using namespaces with an XML element, see Using XML Elements with Namespaces.

    For information about using namespaces with XPath expressions, see Using XPath Expressions with Namespaces.

    Using simple or bulk edit mode, click the Add icon to add additional namespaces.

    Output Field Attributes Includes XML attributes and namespace declarations in the record as field attributes. When not selected, XML attributes and namespace declarations are included in the record as fields.

    By default, the property is not selected.

    Max Record Length (chars)

    The maximum number of characters in a record. Longer records are diverted to the pipeline for error handling.

    This property can be limited by the Data Collector parser buffer size. For more information, see Maximum Record Size.

    Charset Character encoding of the files to be processed.
    Ignore Control Characters Removes all ASCII control characters except for the tab, line feed, and carriage return characters.
  26. On the Output tab, configure the following properties:
    Output Property Description

    Output Field

    Field to use for the response. You can use a new or existing field.

    Multiple Values Behavior

    Action to take when responses contain multiple values:
    • First value only - Write the first value.

    • All values as a list - Write all values to a list in a single record.

    • Split into multiple records - Write all values, each to a separate record.

    When the processor uses pagination, set to All values as a list or Split into multiple records.

    Missing Values Behavior

    Action to take when the response contains no values:
    • Send to error - Sends the record to error.

    • Pass the record along the pipeline unchanged - Passes the record without a return value.