Oncrawl's Data Explorer is very powerful: it contains all the data collected and calculated by Oncrawl for all datasets (pages, links, logs, ...).
Data Explorer uses the Oncrawl Query Language, which lets you create filters to select data and report the matching URLs.
These filters can use regular expressions. This language can be complex, yet it is very powerful.
This article gives an overview of what you can do with REGEX (Regular Expressions) and provides good practices to make exploring your data easier.
Oncrawl uses the Lucene REGEX language.
You can:
Read our Lucene Cheat Sheet to learn more about rule options, or
Check out the Elasticsearch documentation for a full reference guide.
Principle of Regular Expressions
The basic principle of regular expressions is the search for patterns within a group of characters.
For this purpose, you have access to a set of rules that use two main categories of objects: characters and quantifiers.
Characters represent letters, numbers, or special characters (or sets of them).
Quantifiers specify how many times a character must occur in the pattern.
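To make the two categories concrete, here is a minimal sketch in Python. It uses Python's own `re` module rather than Lucene, on the assumption that simple character sets and quantifiers like these behave the same way in both. The helper name `matches_whole` is hypothetical.

```python
import re

# Character set: any lowercase letter or digit.
# Quantifier "+": one or more characters from the preceding set.
word = re.compile(r"[a-z0-9]+")

# Quantifier "{5}": exactly five characters from the preceding set.
five_digits = re.compile(r"[0-9]{5}")


def matches_whole(pattern, text):
    """Return True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


print(matches_whole(word, "product42"))     # True
print(matches_whole(word, "Product42"))     # False: "P" is not in the set
print(matches_whole(five_digits, "03938"))  # True
print(matches_whole(five_digits, "3938"))   # False: only four digits
```

`fullmatch` is used because Lucene patterns are anchored: the pattern has to match the whole value, not just part of it.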
Use REGEX in Filters
You will find the filters in two important Oncrawl interfaces:
Data Explorer - to dig into your data
Segmentations - to group your pages
The interface is simple but very powerful. There you will find:
the field to which the filter is applied
the type of match you want: is, is not, contains, starts/ends with
two modifiers: "aZ" case insensitive, " * " use regex
the value field: this can be a plain string or, with the regex modifier enabled, a regex
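The way a match type and the modifiers combine with a regex can be sketched as follows. This is an illustration only, not Oncrawl's implementation: the function `apply_filter` and its parameters are hypothetical, and it assumes that "contains", "starts with" and "ends with" simply allow any extra characters around the pattern.

```python
import re


def apply_filter(value, match_type, pattern, case_insensitive=False):
    """Hypothetical helper: emulate one regex filter on one field value."""
    flags = re.IGNORECASE if case_insensitive else 0
    core = f"(?:{pattern})"  # group the pattern so alternations stay intact
    templates = {
        "is": core,
        "is not": core,
        "contains": f".*{core}.*",
        "starts with": f"{core}.*",
        "ends with": f".*{core}",
    }
    # The whole value must match, as with anchored Lucene patterns.
    matched = re.fullmatch(templates[match_type], value, flags) is not None
    return not matched if match_type == "is not" else matched


print(apply_filter("/Blog/Post", "is", r"[a-z\/]+"))                         # False
print(apply_filter("/Blog/Post", "is", r"[a-z\/]+", case_insensitive=True))  # True
print(apply_filter("/page.html", "ends with", r"\.html"))                    # True
```

Seen this way, "is" is the strictest match type, because the pattern alone has to account for every character of the value.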
Regex Examples
Most common URL pattern (matches all URLs)
Field: URL path
Matching type: is
Modifier: case insensitive + regex
Pattern: [a-z0-9\-\_\/]+
Only lowercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_\/]+
Only uppercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [A-Z0-9\-\_\/]+
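The difference between the three patterns above can be checked with a short Python sketch. It assumes that the "case insensitive" modifier behaves like Python's `re.IGNORECASE` and that the "is" match type requires the whole path to match. The function name `classify` is hypothetical.

```python
import re

ANY_CASE = re.compile(r"[a-z0-9\-\_\/]+", re.IGNORECASE)
LOWER = re.compile(r"[a-z0-9\-\_\/]+")
UPPER = re.compile(r"[A-Z0-9\-\_\/]+")


def classify(path):
    """Report which of the three patterns match the whole path."""
    return {
        "any_case": ANY_CASE.fullmatch(path) is not None,
        "lowercase": LOWER.fullmatch(path) is not None,
        "uppercase": UPPER.fullmatch(path) is not None,
    }


print(classify("/blog/my-post"))
# {'any_case': True, 'lowercase': True, 'uppercase': False}
print(classify("/Blog/My-Post"))
# {'any_case': True, 'lowercase': False, 'uppercase': False}
print(classify("/BLOG/POST-1"))
# {'any_case': True, 'lowercase': False, 'uppercase': True}
```

Two details are worth noting. A path made only of digits and separators, such as `/123/`, matches all three patterns. None of the character sets includes a dot, so a path such as `/blog/post.html` matches none of them unless `\.` is added to the set.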
Only first directory pattern (case insensitive a-z & A-Z)
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-zA-Z0-9\-\_]+\/
Only subdirectory URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_]+\/[a-z0-9\-\_]+\/
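The two directory patterns above can be tried out in Python as a rough check. The sample paths are written without a leading slash, which is an assumption about how the field value is stored. The function name `directory_depth` is hypothetical.

```python
import re

FIRST_DIR = re.compile(r"[a-zA-Z0-9\-\_]+\/")
SUB_DIR = re.compile(r"[a-z0-9\-\_]+\/[a-z0-9\-\_]+\/")


def directory_depth(path):
    """Return 1 or 2 when the whole path matches a pattern, else None."""
    if FIRST_DIR.fullmatch(path):
        return 1
    if SUB_DIR.fullmatch(path):
        return 2
    return None


print(directory_depth("blog/"))          # 1
print(directory_depth("blog/seo/"))      # 2
print(directory_depth("blog/seo/post"))  # None: no trailing slash
print(directory_depth("/blog/"))         # None: leading slash is not in the set
```

Because the character sets do not contain `/`, each `[...]+` block covers exactly one directory name. If the field value begins with a slash, the patterns would need a leading `\/` to match.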
Only URLs that start with a number pattern
Field: URL path
Matching type: starts with
Modifier: regex
Pattern: [0-9]+
Only URLs that contain an ID number pattern
Example: /lorem/ipsum/product-name-03938.html
Field: URL path
Matching type: contains
Modifier: regex
Pattern: -[0-9]{5}\.html
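The ID pattern can be checked against the example URL with a small Python sketch. Here `re.search` stands in for the "contains" match type, which is an assumption about how that match type behaves. The function name `contains_id` is hypothetical.

```python
import re

# A hyphen, exactly five digits, then a literal ".html".
HAS_ID = re.compile(r"-[0-9]{5}\.html")


def contains_id(path):
    """Return True when the ID pattern occurs anywhere in the path."""
    return HAS_ID.search(path) is not None


print(contains_id("/lorem/ipsum/product-name-03938.html"))   # True
print(contains_id("/lorem/ipsum/product-name-3938.html"))    # False: four digits
print(contains_id("/lorem/ipsum/product-name-039381.html"))  # False: six digits
print(contains_id("/lorem/ipsum/product-name.html"))         # False: no ID
```

The hyphen in front and the `\.html` behind act as boundaries, so `{5}` really means five digits and not "at least five". The backslash before the dot matters too: an unescaped `.` would match any character.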