Regular Expressions
Regular Expressions Overview
A regular expression, also known as regex, describes a pattern for a string.
This appendix provides some details and tips specific to using regular expressions with the Data Collector. For a thorough description of how to define regular expressions, use a manual or online reference such as: https://docs.oracle.com/javase/tutorial/essential/regex/index.html.
For testing regular expressions, you might find the following website helpful: https://regex101.com/.
Regular Expressions in the Pipeline
Though generally not required, you can use Java-based regular expressions at various locations within a pipeline to define, search for, or manipulate strings.
For example, in the Field Masker processor, you can use a fixed length, variable length, or custom static mask to mask data in a field. When that doesn't work for you, you can use a regular expression to define a specific custom mask. Similarly, you can use a regular expression to define the format of a log line if it does not use one of the listed formats.
Location | Description |
---|---|
Directory origin | Optionally use to define the pattern of the file name. |
File Tail origin | Use to define the ${PATTERN} constant when you use the Files Matching a Pattern naming option. Use to define the structure of a log line or text. |
Origins that process log data, Log Parser processor | Optionally use to define the pattern of the log line. |
Field Masker processor | Optionally use to define the field mask. |
regexCapture function | Use to define the groups and pattern of the string so you can specify the group to return. |
replaceAll function | Use to define the string to replace. |
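
The last two rows can be illustrated with plain `java.util.regex` calls. This is a minimal sketch, not the Data Collector functions themselves: the class name `CaptureAndReplaceSketch` and the helper `capture` are hypothetical, and matching against the whole input string is an assumption made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureAndReplaceSketch {
    // Returns the requested capture group, or null when the pattern
    // does not match the whole input (whole-string matching is an assumption).
    static String capture(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Group 1 is the three-character prefix, group 2 is the numeric code.
        System.out.println(capture("SVR-30243", "(\\w\\w\\w)-(\\d+)", 2)); // 30243

        // String.replaceAll treats its first argument as a regular expression:
        // every run of digits is replaced with a placeholder.
        System.out.println("order 123 shipped to 94105".replaceAll("\\d+", "#")); // order # shipped to #
    }
}
```

Note that backslashes are doubled here only because the patterns are written as Java string literals.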
Quick Reference
Character | Description | Examples |
---|---|---|
[ ] | Use brackets to define character classes. | [0-9][0-9][0-9] represents 3 digits ranging from 0 through 9, inclusive. |
- | Use the hyphen to define ranges. | [a-z] defines one lowercase letter from a to z. [A-Z] defines one uppercase letter from A to Z. |
| | Indicates an alternate option to the character or group being defined. | [a-z]|[A-Z] represents a single upper or lowercase letter. |
( ) | Use parentheses to create groups, atomic groups, or lookarounds. | ([0-9][0-9][0-9])(-|\.)([0-9][0-9][0-9])(-|\.)([0-9][0-9][0-9][0-9]) represents phone numbers with area codes that are separated by dashes or periods, such as 415-555-5555 and 415.555.5555. The period is escaped because an unescaped period matches any character. |
< > | Use angle brackets to define named capture groups. Use the following syntax: (?<name>pattern) to set up a named field extraction. | |
^ | Use a caret to negate a character class. | [^A-G] defines a character that is not an uppercase letter from A to G. |
. | A wildcard that represents any single character except line terminators, such as newline. | |
&& | Use two ampersands inside a character class to indicate the intersection of two ranges. | [\w&&[^1-9]] represents all word characters except 1-9. |
? | A quantifier that represents zero or one instance of the preceding character or group. | B-?7 represents B7 or B-7. |
+ | A quantifier that represents one or more instances of the preceding character or group. | ([0-9][0-9][0-9]-)+([0-9][0-9][0-9][0-9]) represents phone numbers with or without area codes, such as 415-555-5555 and 555-5555. |
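* | A quantifier that represents zero or more instances of the preceding character or group. | ([0-9][0-9][0-9][0-9][0-9])(-[0-9][0-9][0-9][0-9])* represents 5- and 9-digit zip codes, such as 94105 and 94105-1234. The hyphen belongs inside the repeated group so that it is optional along with the last four digits. |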
\ | Use the backslash as an escape character. | |
\\ | Represents a single literal backslash. | |
\w | Represents a word character - includes alphanumeric characters and the underscore. | \w\w\w-\w\w\w\w\w can represent an error code, such as SVR-30243. |
\W | Represents a non-word character - includes everything except alphanumeric characters and the underscore. | |
\d | Represents a digit. Shorthand for [0-9]. | (\d\d\d\d\d)-(\d\d\d\d) represents a 9-digit zip code. |
\D | Represents a non-digit character. | [\D&&\S] represents any character that is neither a digit nor whitespace, such as a letter or punctuation mark. |
\s | Represents a whitespace character - includes space, tab, line break and form feed. | |
\S | Represents a non-space character - includes everything except the space, tab, line break, and form feed characters. | |
\t | Tab character. | |
\r | Return character. | |
\n | Line break or newline character. | |
\f | Form feed character. | |
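
Several entries from the table can be tried out directly with `java.util.regex`. The following is a minimal sketch rather than anything taken from the Data Collector: the class name `QuickReferenceSketch`, the helper methods, and the group names `zip` and `plus4` are all hypothetical, chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuickReferenceSketch {
    // + quantifier: one or more "ddd-" groups followed by four digits.
    static final Pattern PHONE =
        Pattern.compile("([0-9][0-9][0-9]-)+([0-9][0-9][0-9][0-9])");

    // Named capture groups with (?<name>...); ? makes the ZIP+4 suffix optional.
    static final Pattern ZIP =
        Pattern.compile("(?<zip>\\d\\d\\d\\d\\d)(-(?<plus4>\\d\\d\\d\\d))?");

    // && intersection: word characters that are not the digits 1-9.
    static final Pattern WORD_EXCEPT_1_TO_9 = Pattern.compile("[\\w&&[^1-9]]");

    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    // Returns the named group "plus4", or null when it is absent or the input does not match.
    static String plus4(String s) {
        Matcher m = ZIP.matcher(s);
        return m.matches() ? m.group("plus4") : null;
    }

    static boolean isWordExcept1To9(String s) {
        return WORD_EXCEPT_1_TO_9.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhone("415-555-5555"));   // true
        System.out.println(isPhone("555-5555"));       // true
        System.out.println(plus4("94105-1234"));       // 1234
        System.out.println(plus4("94105"));            // null
        System.out.println(isWordExcept1To9("0"));     // true
        System.out.println(isWordExcept1To9("7"));     // false
    }
}
```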
Regex Examples
- Masking credit card numbers, except for one group
  You can use the following regular expression in the Field Masker processor to mask all numbers in a credit or debit card except for the last 4 digits:
  (.*)([0-9]{4})
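
To see how this expression divides a card number into two groups, here is a minimal sketch in plain Java. It does not reproduce the Field Masker processor or its configuration: the class name `CardMaskSketch`, the method `maskAllButLastFour`, the use of `x` as the mask character, and the choice to leave non-matching input unchanged are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CardMaskSketch {
    // Group 1 greedily takes everything before the final four digits; group 2 is those four digits.
    static final Pattern CARD = Pattern.compile("(.*)([0-9]{4})");

    // Replaces each character captured by group 1 with 'x' and keeps group 2 as is.
    static String maskAllButLastFour(String card) {
        Matcher m = CARD.matcher(card);
        if (!m.matches()) {
            return card; // input does not end in four digits: left unchanged in this sketch
        }
        char[] mask = new char[m.group(1).length()];
        Arrays.fill(mask, 'x');
        return new String(mask) + m.group(2);
    }

    public static void main(String[] args) {
        // The first 15 characters are masked and the trailing 1234 is kept.
        System.out.println(maskAllButLastFour("4111-1111-1111-1234"));
    }
}
```

Because `.*` is greedy, it first consumes the whole string and then backtracks just far enough for `[0-9]{4}` to match, which is why group 2 always holds the last four digits.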