What is a Regex?
Regex stands for "regular expression." It is a sequence of characters that defines a search pattern: a specialized language for describing patterns in strings, used for pattern matching and text manipulation. Oncrawl uses Regex in the segmentation tool and in the data explorer to filter your results.
Let's see how to use Regex.
Standard operators
Anchoring
Most regular expression engines allow you to match any part of a string. If you want the pattern to start at the beginning of the string or finish at the end of the string, you have to anchor it explicitly, using ^ to indicate the beginning or $ to indicate the end.
However, Lucene’s patterns are always anchored: the pattern provided must match the entire string. For string "abcde":
ab.* # match
abcd # no match
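The difference between anchored and unanchored matching can be sketched in Python. This is only an approximation: Python's re module is a different engine from Lucene, and the helper name matches_whole_string is hypothetical. The sketch assumes that re.fullmatch is a reasonable stand-in for Lucene's implicit anchoring.

```python
import re

def matches_whole_string(pattern: str, text: str) -> bool:
    """Return True only if the pattern covers the entire text (Lucene-like anchoring)."""
    return re.fullmatch(pattern, text) is not None

text = "abcde"
anchored_prefix = matches_whole_string("ab.*", text)      # True: ".*" consumes "cde"
anchored_partial = matches_whole_string("abcd", text)     # False: the trailing "e" is not covered
unanchored_partial = re.search("abcd", text) is not None  # True: search accepts a partial match
print(anchored_prefix, anchored_partial, unanchored_partial)
```

In an always-anchored engine, a "contains" filter therefore needs .* on both sides of the pattern, for example .*bcd.* for the string above.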
Allowed characters
Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:
. ? + * | { } [ ] ( ) " \
Any reserved character can be escaped with a backslash, for example \*, including a literal backslash character: \\
Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:
john"@smith.com"
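When a literal value such as an email address or a URL fragment has to go into a pattern, escaping every reserved character by hand is error-prone. A minimal sketch of a helper that does this, assuming the reserved-character list given above is complete for your setup (the name escape_reserved is hypothetical and not part of any Oncrawl or Lucene API):

```python
# Reserved characters as listed in this article: . ? + * | { } [ ] ( ) " \
RESERVED = set('.?+*|{}[]()"\\')

def escape_reserved(literal: str) -> str:
    """Prefix every reserved character with a backslash so it is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)

print(escape_reserved("john@smith.com"))  # john@smith\.com
print(escape_reserved("price(usd)+tax"))  # price\(usd\)\+tax
```

Characters outside the reserved list, such as @, are left untouched.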
Match any character
The period . can be used to represent any character. For string "abcde":
ab... # match
a.c.e # match
One-or-more
The plus sign + can be used to repeat the preceding shortest pattern one or more times. For string "aaabbb":
a+b+ # match
aa+bb+ # match
a+.+ # match
aa+bbb+ # match
Zero-or-more
The asterisk * can be used to match the preceding shortest pattern zero or more times. For string "aaabbb":
a*b* # match
a*b*c* # match
.*bbb.* # match
aaa*bbb* # match
Zero-or-one
The question mark ? makes the preceding shortest pattern optional: it matches zero times or once. For string "aaabbb":
aaa?bbb? # match
aaaa?bbbb? # match
.....?.? # match
aa?bb? # no match
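The three repetition operators +, * and ? can be checked side by side with a short Python sketch. Python's re module is not Lucene, but these operators behave the same way in both when the whole string must match, which re.fullmatch is assumed to emulate here. The variable names are illustrative only.

```python
import re

text = "aaabbb"
patterns = ["a+b+", "aa+bbb+", "a*b*c*", "aaa?bbb?", "aa?bb?"]

# Map each pattern to whether it matches the whole string.
results = {p: re.fullmatch(p, text) is not None for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:10} {'match' if matched else 'no match'}")
```

Only the last pattern fails: aa?bb? can cover at most four characters ("aabb"), so it cannot match all six characters of the string.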
Min-to-max
Curly brackets {} can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:
{5} # repeat exactly 5 times
{2,5} # repeat at least twice and at most 5 times
{2,} # repeat at least twice
For string "aaabbb":
a{3}b{3} # match
a{2,4}b{2,4} # match
a{2,}b{2,} # match
.{3}.{3} # match
a{4}b{4} # no match
a{4,6}b{4,6} # no match
a{4,}b{4,} # no match
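To see how the bounds behave, it can help to test one pattern against strings of increasing length. The following Python sketch builds strings made of n "a" characters followed by n "b" characters and reports which values of n match. It assumes re.fullmatch as a stand-in for Lucene's anchored matching, and matching_lengths is a hypothetical helper name.

```python
import re

def matching_lengths(pattern: str, max_n: int = 5) -> list[int]:
    """Return every n in 1..max_n for which 'a'*n + 'b'*n matches the whole pattern."""
    return [n for n in range(1, max_n + 1)
            if re.fullmatch(pattern, "a" * n + "b" * n)]

print(matching_lengths("a{3}b{3}"))      # [3]
print(matching_lengths("a{2,4}b{2,4}"))  # [2, 3, 4]
print(matching_lengths("a{4,}b{4,}"))    # [4, 5]
```

The string "aaabbb" used above corresponds to n = 3, which is why it matches the first two patterns but not the third.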
Grouping
Parentheses () can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string "ababab":
(ab)+ # match
ab(ab)+ # match
(..)+ # match
(...)+ # match
(ab)* # match
abab(ab)? # match
ab(ab)? # no match
(ab){3} # match
(ab){1,2} # no match
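The effect of grouping is easiest to see by comparing a quantifier applied to a single character with the same quantifier applied to a group. A small Python sketch, assuming re.fullmatch as an approximation of Lucene's anchored matching (the variable names are illustrative):

```python
import re

text = "ababab"

# Without a group, + applies only to the preceding "b": an "a" followed by one or more "b".
without_group = re.fullmatch("ab+", text) is not None       # False
# With a group, + applies to the whole "ab" sub-pattern.
with_group = re.fullmatch("(ab)+", text) is not None         # True
# The group must repeat exactly three times to cover the six characters.
exactly_three = re.fullmatch("(ab){3}", text) is not None    # True
# At most two repetitions cover only four characters, so the full string does not match.
at_most_two = re.fullmatch("(ab){1,2}", text) is not None    # False

print(without_group, with_group, exactly_three, at_most_two)
```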
Alternation
The pipe symbol | acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string "aabb":
aabb|bbaa # match
aacc|bb # no match
aa(cc|bb) # match
a+|b+ # no match
a+b+|b+a+ # match
a+(b|c)+ # match
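Because alternation applies to the longest pattern on each side, parentheses are needed to limit its scope. The Python sketch below contrasts the two cases. It assumes re.fullmatch as a stand-in for Lucene's anchored matching, and the variable names are illustrative.

```python
import re

text = "aabb"

# Unparenthesized: the alternatives are the whole patterns "aacc" and "bb".
whole_alternatives = re.fullmatch("aacc|bb", text) is not None    # False
# Parenthesized: "aa" is fixed, and only the suffix is "cc" or "bb".
scoped_alternatives = re.fullmatch("aa(cc|bb)", text) is not None  # True
# "a+|b+" means all "a" or all "b", which a mixed string cannot satisfy.
all_one_letter = re.fullmatch("a+|b+", text) is not None           # False
# Grouping the alternation and repeating it accepts any mix of "a" and "b".
any_mix = re.fullmatch("(a|b)+", text) is not None                 # True

print(whole_alternatives, scoped_alternatives, all_one_letter, any_mix)
```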
Character classes
Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. A leading ^ negates the character class. The allowed forms are:
[abc] # 'a' or 'b' or 'c'
[a-c] # 'a' or 'b' or 'c'
[-abc] # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc] # any character except 'a' or 'b' or 'c'
[^a-c] # any character except 'a' or 'b' or 'c'
[^-abc] # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'
Note that the dash - indicates a range of characters, unless it is the first character or it is escaped with a backslash.
For string "abcd":
ab[cd]+ # match
[a-d]+ # match
[^a-d]+ # no match
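The character class forms, including the two ways of writing a literal dash, can be tried out with a short Python sketch. Python's re module handles these forms in the same way for the cases shown, and re.fullmatch is assumed here as an approximation of Lucene's anchored matching. The helper name full_match is hypothetical.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(full_match("ab[cd]+", "abcd"))    # True: "c" and "d" both belong to [cd]
print(full_match("[a-d]+", "abcd"))     # True: every character is in the range a-d
print(full_match("[^a-d]+", "abcd"))    # False: the negated class excludes a-d
print(full_match("[-abc]+", "a-b"))     # True: a leading dash is a literal dash
print(full_match(r"[abc\-]+", "a-b"))   # True: an escaped dash is a literal dash
print(full_match("[a-c]+", "a-b"))      # False: here the dash only defines a range
```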