Lucene REGEX Cheat Sheet

In this article, you'll learn how to use Regex in Oncrawl. It is based on the Elasticsearch article.


What is a Regex?

Regex stands for "regular expression," a powerful tool for pattern matching and text manipulation. A regular expression is a sequence of characters that defines a search pattern. Think of it as a specialized language for describing patterns in strings. Oncrawl uses Regex in its segmentation tool and in the Data Explorer to filter your results.

Let's see how to use Regex.

Standard operators

Anchoring

Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.

However, Lucene’s patterns are always anchored. The pattern provided must match the entire string. For string abcde:

ab.*     # match
abcd     # no match
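
As a rough illustration of always-anchored matching, the sketch below uses Python's re module. Python is not Lucene: re.fullmatch only approximates Lucene's behavior, while re.search behaves like an unanchored engine. The helper name lucene_style_match is hypothetical, and the approximation is only meant for the standard operators described in this article.

```python
import re

def lucene_style_match(pattern: str, text: str) -> bool:
    """Approximate Lucene's implicit anchoring: the whole string must match."""
    return re.fullmatch(pattern, text) is not None

# An unanchored search finds "abcd" inside "abcde"...
unanchored = re.search("abcd", "abcde") is not None
# ...but an always-anchored match requires the pattern to cover the full string.
anchored_prefix = lucene_style_match("abcd", "abcde")
anchored_full = lucene_style_match("ab.*", "abcde")

print(unanchored, anchored_prefix, anchored_full)  # True False True
```

In practice, this means a pattern meant to find a fragment anywhere in a value needs .* on both sides, for example .*abcd.* instead of abcd.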

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

Any reserved character can be escaped with a backslash, for example \*, including a literal backslash character: \\

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"
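
When a literal value such as a URL fragment has to go into a pattern, escaping by hand is error-prone. The sketch below shows one possible way to automate it. The function name escape_reserved is hypothetical, and it only covers the standard reserved characters listed above.

```python
# Standard reserved characters listed above: . ? + * | { } [ ] ( ) " \
RESERVED = set('.?+*|{}[]()"\\')

def escape_reserved(text: str) -> str:
    """Prefix every standard reserved character with a backslash."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

print(escape_reserved("example.com/page?id=1"))  # example\.com/page\?id=1
print(escape_reserved("a\\b"))                   # a\\b
```

The escaped result can then be combined with operators, for example by adding .* before and after it to match the literal anywhere in the string.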

Match any character

The period . can be used to represent any character. For string abcde:

ab...   # match
a.c.e   # match

One-or-more

The plus sign + can be used to repeat the preceding shortest pattern one or more times. For string aaabbb:

a+b+        # match
aa+bb+      # match
a+.+        # match
aa+bbb+     # match

Zero-or-more

The asterisk * can be used to match the preceding shortest pattern zero-or-more times. For string aaabbb:

a*b*        # match
a*b*c*      # match
.*bbb.*     # match
aaa*bbb*    # match

Zero-or-one

The question mark ? makes the preceding shortest pattern optional. It matches zero or one times. For string aaabbb:

aaa?bbb?    # match
aaaa?bbbb?  # match
.....?.?    # match
aa?bb?      # no match
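
The +, * and ? examples can be checked with a short script. This is a sketch that uses Python's re.fullmatch as a stand-in for Lucene's always-anchored matching, which is an assumption that holds for these simple operators but not for every Lucene feature.

```python
import re

text = "aaabbb"
patterns = ["a+b+", "a*b*c*", "aaa?bbb?", "aa?bb?"]

# Full-string matching approximates Lucene's implicit anchoring.
results = {p: re.fullmatch(p, text) is not None for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern:10} {'match' if matched else 'no match'}")
```

The last pattern fails because aa?bb? can cover at most four characters, while the string has six.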

Min-to-max

Curly brackets {} can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5}     # repeat exactly 5 times
{2,5}   # repeat at least twice and at most 5 times
{2,}    # repeat at least twice

For string aaabbb:

a{3}b{3}        # match
a{2,4}b{2,4}    # match
a{2,}b{2,}      # match
.{3}.{3}        # match
a{4}b{4}        # no match
a{4,6}b{4,6}    # no match
a{4,}b{4,}      # no match
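
To see which repetition counts each form accepts, the sketch below tests runs of 0 to 6 letters against the three forms. It relies on Python's re.fullmatch as an approximation of Lucene's anchored matching, and the helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

lengths = range(0, 7)  # test "", "a", "aa", ... "aaaaaa"

exact = [n for n in lengths if full("a{5}", "a" * n)]
bounded = [n for n in lengths if full("a{2,5}", "a" * n)]
open_ended = [n for n in lengths if full("a{2,}", "a" * n)]

print(exact)       # [5]
print(bounded)     # [2, 3, 4, 5]
print(open_ended)  # [2, 3, 4, 5, 6]
```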

Grouping

Parentheses () can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string ababab:

(ab)+       # match
ab(ab)+     # match
(..)+       # match
(...)+      # match
(....)+     # no match
(ab)*       # match
abab(ab)?   # match
ab(ab)?     # no match
(ab){3}     # match
(ab){1,2}   # no match
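
The difference between quantifying a single character and quantifying a group can be sketched as follows. Python's re.fullmatch stands in for Lucene's anchored matching here, and the helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# Without parentheses, + applies only to the preceding character "b".
print(full("ab+", "ababab"))        # False
print(full("ab+", "abbb"))          # True

# With parentheses, + applies to the whole group "ab".
print(full("(ab)+", "ababab"))      # True

# {1,2} allows at most two repetitions, but the string needs three.
print(full("(ab){1,2}", "ababab"))  # False
print(full("(ab){3}", "ababab"))    # True
```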

Alternation

The pipe symbol | acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string aabb:

aabb|bbaa   # match
aacc|bb     # no match
aa(cc|bb)   # match
a+|b+       # no match
a+b+|b+a+   # match
a+(b|c)+    # match
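
Because alternation applies to the longest pattern on each side, parentheses are needed to restrict it to part of the expression. A minimal sketch, again assuming Python's re.fullmatch as an approximation of Lucene's anchored matching (the helper name full is hypothetical):

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

text = "aabb"

# Read as (aacc) OR (bb): neither side covers the whole string.
print(full("aacc|bb", text))    # False

# Parentheses limit the alternation to the suffix: aa followed by cc or bb.
print(full("aa(cc|bb)", text))  # True

# Each side must match the full string on its own, so a+ alone is not enough.
print(full("a+|b+", text))      # False
print(full("a+b+|b+a+", text))  # True
```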

Character classes

Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. A leading ^ negates the character class. The allowed forms are:

[abc]   # 'a' or 'b' or 'c'
[a-c]   # 'a' or 'b' or 'c'
[-abc]  # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc]  # any character except 'a' or 'b' or 'c'
[^a-c]  # any character except 'a' or 'b' or 'c'
[^-abc]  # any character except '-' or 'a' or 'b' or 'c'
[^abc\-] # any character except '-' or 'a' or 'b' or 'c'

Note that the dash - indicates a range of characters, unless it is the first character in the class or is escaped with a backslash.

For string abcd:

ab[cd]+     # match
[a-d]+      # match
[^a-d]+     # no match
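
The dash and negation rules can be sketched with a few checks. Python's character-class syntax happens to agree with the forms shown above, and re.fullmatch is used as an approximation of Lucene's anchored matching. The helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True only if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# Between two characters, the dash defines a range and is not matched itself.
print(full("[a-c]", "b"))       # True
print(full("[a-c]", "-"))       # False

# As the first character, or when escaped, the dash is a literal.
print(full("[-abc]", "-"))      # True
print(full(r"[abc\-]", "-"))    # True

# A leading ^ negates the class.
print(full("[^a-d]+", "abcd"))  # False
print(full("[^a-d]+", "xyz"))   # True
```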
