Lucene REGEX Cheat Sheet
This article is based on the Elasticsearch article
Written by Rebecca Berbel
Updated over a week ago

Standard operators

Anchoring

Most regular expression engines allow you to match any part of a string. If you want the regexp pattern to start at the beginning of the string or finish at the end of the string, then you have to anchor it specifically, using ^ to indicate the beginning or $ to indicate the end.

However, Lucene's patterns are always anchored. The pattern provided must match the entire string. For string abcde:

ab.*     # match
abcd     # no match
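
The difference between anchored and unanchored matching can be tried out with a minimal sketch in Python. This assumes that Python's re.fullmatch is a reasonable stand-in for Lucene's implicit anchoring; it is not the Lucene engine, and lucene_style_match is a hypothetical helper name used only for illustration.

```python
import re

def lucene_style_match(pattern: str, text: str) -> bool:
    """Approximate Lucene's implicit anchoring: the pattern must cover the whole string."""
    return re.fullmatch(pattern, text) is not None

text = "abcde"
print(lucene_style_match("ab.*", text))      # True: the whole string is covered
print(lucene_style_match("abcd", text))      # False: the trailing "e" is left over
print(re.search("abcd", text) is not None)   # True: an unanchored search accepts a substring
```

To find abcd anywhere inside a longer string with an anchored engine, the pattern would need to be written as .*abcd.*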

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

Any reserved character can be escaped with a backslash, for example \*. This includes a literal backslash character: \\

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"
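
When a pattern is built from user input, escaping by hand is easy to get wrong. The following is a minimal sketch of a helper that backslash-escapes the standard reserved characters listed above. The name escape_lucene_regex is hypothetical, and the helper only covers the characters in that list.

```python
# The standard reserved characters from the list above: . ? + * | { } [ ] ( ) " \
RESERVED = set('.?+*|{}[]()"\\')

def escape_lucene_regex(literal: str) -> str:
    """Prefix every reserved character with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in literal)

print(escape_lucene_regex("1+1=2?"))   # 1\+1=2\?
print(escape_lucene_regex("a.b"))      # a\.b
print(escape_lucene_regex("abc"))      # abc (nothing to escape)
```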

Match any character

The period . can be used to represent any character. For string abcde:

ab...   # match
a.c.e   # match

One-or-more

The plus sign + can be used to repeat the preceding shortest pattern one or more times. For string aaabbb:

a+b+        # match
aa+bb+      # match
a+.+        # match
aa+bbb+     # match

Zero-or-more

The asterisk * can be used to repeat the preceding shortest pattern zero or more times. For string aaabbb:

a*b*        # match
a*b*c*      # match
.*bbb.*     # match
aaa*bbb*    # match

Zero-or-one

The question mark ? makes the preceding shortest pattern optional. It matches zero or one times. For string aaabbb:

aaa?bbb?    # match
aaaa?bbbb?  # match
.....?.?    # match
aa?bb?      # no match
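
A few of the +, * and ? examples above can be checked with a short Python sketch. It assumes that re.fullmatch behaves like Lucene for these three operators; the helper name check is hypothetical.

```python
import re

def check(pattern: str, text: str) -> bool:
    """Whole-string match, mimicking Lucene's always-anchored patterns."""
    return re.fullmatch(pattern, text) is not None

text = "aaabbb"
# Patterns taken from the +, * and ? sections, with the result each should give.
examples = {
    "a+b+": True,
    "aa+bbb+": True,
    "a*b*c*": True,       # c* matches zero times
    "aaaa?bbbb?": True,   # the fourth a and fourth b are optional
    "aa?bb?": False,      # covers at most four characters
}
for pattern, expected in examples.items():
    result = check(pattern, text)
    print(f"{pattern:<12} {'match' if result else 'no match'}")
    assert result == expected
```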

Min-to-max

Curly brackets {} can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:

{5}     # repeat exactly 5 times
{2,5}   # repeat at least twice and at most 5 times
{2,}    # repeat at least twice

For string aaabbb:

a{3}b{3}        # match
a{2,4}b{2,4}    # match
a{2,}b{2,}      # match
.{3}.{3}        # match
a{4}b{4}        # no match
a{4,6}b{4,6}    # no match
a{4,}b{4,}      # no match
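
One way to read the curly-bracket forms is as shorthand for patterns written with ? and *. The sketch below, which uses Python's re module as an approximation and a hypothetical helper named expand_repeat, rewrites a repeat count into that longer form and compares the two.

```python
import re
from typing import Optional

def expand_repeat(atom: str, low: int, high: Optional[int] = None) -> str:
    """Rewrite atom{low,high} using only concatenation, ? and * (hypothetical helper).

    atom must be a single character or a parenthesised group.
    """
    required = atom * low
    if high is None:                      # {low,}: required copies, then any number more
        return required + atom + "*"
    return required + (atom + "?") * (high - low)   # {low,high}: optional extras

with_braces = "a{2,4}b{2,4}"
expanded = expand_repeat("a", 2, 4) + expand_repeat("b", 2, 4)
print(expanded)   # aaa?a?bbb?b?

for text in ["aaabbb", "abbb", "aaaaabbb"]:
    braces_ok = re.fullmatch(with_braces, text) is not None
    expanded_ok = re.fullmatch(expanded, text) is not None
    print(text, braces_ok, expanded_ok)   # the two columns agree for each string
```

The exact form {5} corresponds to expand_repeat(atom, 5, 5), which simply repeats the atom five times.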

Grouping

Parentheses () can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group. For string ababab:

(ab)+       # match
ab(ab)+     # match
(..)+       # match
(...)+      # match
(ab)*       # match
abab(ab)?   # match
ab(ab)?     # no match
(ab){3}     # match
(ab){1,2}   # no match
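
The effect of grouping on a quantifier is easiest to see by comparing a pattern with and without parentheses. This is a minimal sketch using Python's re.fullmatch as a stand-in for Lucene's anchored matching; matches is a hypothetical helper name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mimicking Lucene's always-anchored patterns."""
    return re.fullmatch(pattern, text) is not None

print(matches("ab+", "ababab"))         # False: + repeats only the "b"
print(matches("ab+", "abbb"))           # True
print(matches("(ab)+", "ababab"))       # True: + repeats the whole group "ab"
print(matches("(ab){1,2}", "ababab"))   # False: the string needs three repetitions
print(matches("(ab){3}", "ababab"))     # True
```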

Alternation

The pipe symbol | acts as an OR operator. The match will succeed if the pattern on either the left-hand side OR the right-hand side matches. The alternation applies to the longest pattern, not the shortest. For string aabb:

aabb|bbaa   # match
aacc|bb     # no match
aa(cc|bb)   # match
a+|b+       # no match
a+b+|b+a+   # match
a+(b|c)+    # match
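
Because the pipe applies to the longest pattern on each side, parentheses are needed to limit its scope. The sketch below illustrates this with Python's re.fullmatch, assumed here to treat alternation the same way as Lucene; matches is a hypothetical helper name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mimicking Lucene's always-anchored patterns."""
    return re.fullmatch(pattern, text) is not None

text = "aabb"
print(matches("aacc|bb", text))       # False: the alternatives are "aacc" and "bb"
print(matches("(aacc)|(bb)", text))   # False: same pattern with the scope made explicit
print(matches("aa(cc|bb)", text))     # True: the group limits | to "cc" or "bb"
print(matches("a+|b+", text))         # False: neither side alone covers the whole string
```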

Character classes

Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. A leading ^ negates the character class. The allowed forms are:

[abc]   # 'a' or 'b' or 'c'
[a-c]   # 'a' or 'b' or 'c'
[-abc]  # '-' or 'a' or 'b' or 'c'
[abc\-] # '-' or 'a' or 'b' or 'c'
[^abc]  # any character except 'a' or 'b' or 'c'
[^a-c]  # any character except 'a' or 'b' or 'c'
[^-abc]   # any character except '-' or 'a' or 'b' or 'c'
[^abc\-]  # any character except '-' or 'a' or 'b' or 'c'

Note that the dash - indicates a range of characters, unless it is the first character in the class (after any leading ^) or it is escaped with a backslash.

For string abcd:

ab[cd]+     # match
[a-d]+      # match
[^a-d]+     # no match
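
The handling of the dash inside a character class can be checked with a short Python sketch. It assumes Python's re module treats these class forms the same way as Lucene; matches is a hypothetical helper name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Whole-string match, mimicking Lucene's always-anchored patterns."""
    return re.fullmatch(pattern, text) is not None

print(matches("[a-c]+", "abcabc"))    # True: every character is in the range a to c
print(matches("[a-c]+", "a-c"))       # False: the dash defines a range, it is not a member
print(matches("[-abc]+", "a-c"))      # True: a leading dash is a literal dash
print(matches(r"[abc\-]+", "a-c"))    # True: an escaped dash is a literal dash
print(matches("[^a-d]+", "abcd"))     # False: every character is excluded by the class
print(matches("[^a-d]+", "wxyz"))     # True
```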
