How to use REGEX in Oncrawl

Use pattern detection in fields to get to the essentials faster. Use regular expressions to create filters in the Data Explorer and in Segmentations.


Oncrawl's Data Explorer is very powerful: it contains all the data collected and calculated by Oncrawl for all datasets (pages, links, logs, etc.).

The Data Explorer uses the Oncrawl Query Language, which lets you create filters to select data and report the matching URLs.

These filters can use regular expressions. Regular expressions can be complex, but they are very powerful.

This article gives an overview of what you can do with REGEX (Regular Expressions) and suggests good practices to make exploring your data easier.

Oncrawl uses the Lucene REGEX language.

You can read our Lucene Cheat Sheet to learn more about the rule options.

Principle of Regular Expressions

The basic principle of regular expressions is the search for patterns within a group of characters.

For this purpose you have access to a set of rules that use two main categories of objects: characters and quantifiers.

  • Characters represent letters, digits or special characters (or sets of them)

  • Quantifiers indicate how many times a character (or set of characters) must appear in the pattern
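
As a rough illustration of these two categories, the sketch below uses Python's re module rather than Oncrawl's Lucene engine. The character classes and the "+" and "{n}" quantifiers shown here are written the same way in both, but the sample strings are invented for this example and other parts of the two syntaxes differ.

```python
import re

# Character class: exactly one lowercase letter or digit.
one_char = re.compile(r"[a-z0-9]")

# Quantifier "+": one or more characters from the class.
one_or_more = re.compile(r"[a-z0-9]+")

# Quantifier "{5}": exactly five characters from the class.
exactly_five_digits = re.compile(r"[0-9]{5}")


def whole_string_matches(compiled, text):
    """Return True when the entire text matches the compiled pattern."""
    return compiled.fullmatch(text) is not None


print(whole_string_matches(one_char, "a"))
print(whole_string_matches(one_char, "ab"))
print(whole_string_matches(one_or_more, "ab12"))
print(whole_string_matches(exactly_five_digits, "03938"))
print(whole_string_matches(exactly_five_digits, "0393"))
```

The character class decides which characters are allowed, and the quantifier decides how many of them are required.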

Use REGEX in Filters

You will find the filters in two important Oncrawl interfaces:

  • Data Explorer - to dig into your data

  • Segmentations - to group your pages 

The interface is simple but very powerful. In it you will find:

  • the field to which the filter is applied

  • the type of match you want: is, is not, contains, starts/ends with

  • two modifiers: "aZ" for case-insensitive matching and " * " to treat the value as a regex

  • the value field: the string to match, which is read as a regex when the regex modifier is on
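
To make the role of each element concrete, here is a minimal Python sketch of how a matching type, the two modifiers and a value could combine. The function name matches_filter and its parameters are hypothetical, and the logic is an assumption for illustration only: it is not Oncrawl's implementation, and it uses Python's re module instead of Lucene.

```python
import re


def matches_filter(value, match_type, pattern,
                   case_insensitive=False, use_regex=False):
    """Hypothetical model of a filter; not Oncrawl's actual behavior."""
    # Without the regex modifier, the pattern is treated as a literal string.
    body = pattern if use_regex else re.escape(pattern)
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)

    # Each matching type wraps the pattern differently.
    wrapped = {
        "is": f"(?:{body})",
        "is not": f"(?:{body})",
        "contains": f".*(?:{body}).*",
        "starts with": f"(?:{body}).*",
        "ends with": f".*(?:{body})",
    }[match_type]

    found = re.fullmatch(wrapped, value, flags) is not None
    # "is not" is modeled as the negation of "is".
    return not found if match_type == "is not" else found


print(matches_filter("/Blog/", "is", "/blog/", case_insensitive=True))
print(matches_filter("/blog/post-1", "starts with", "/blog/"))
print(matches_filter("/ab", "contains", "."))
print(matches_filter("/ab", "contains", ".", use_regex=True))
```

The last two calls show why the regex modifier matters: a dot is a plain character when the modifier is off, and a wildcard for any character when it is on.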

Regex Examples

Most used URL pattern (matches URLs made of letters, digits, hyphens, underscores and slashes)
Field: URL path
Matching type: is
Modifier: case insensitive + regex
Pattern: [a-z0-9\-\_\/]+

Only lowercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_\/]+

Only uppercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [A-Z0-9\-\_\/]+
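
The three patterns above differ only in letter case, so it can help to try them on a few paths before saving a filter. The sketch below does this with Python's re module as a stand-in for Lucene. The sample paths and the helper is_match are made up for this example, and full-string matching is assumed to correspond to the "is" matching type.

```python
import re

LOWER = r"[a-z0-9\-\_\/]+"   # pattern from the lowercase example
UPPER = r"[A-Z0-9\-\_\/]+"   # pattern from the uppercase example


def is_match(pattern, path, case_insensitive=False):
    """Return True when the whole path matches the pattern."""
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(pattern, path, flags) is not None


samples = ["/blog/my-post_1/", "/Blog/My-Post/", "/BLOG/POST-1/"]
for path in samples:
    print(
        path,
        "any case:", is_match(LOWER, path, case_insensitive=True),
        "lowercase only:", is_match(LOWER, path),
        "uppercase only:", is_match(UPPER, path),
    )
```

A path made only of digits and separators, such as /123/, contains no letters at all and therefore matches both the lowercase and the uppercase pattern.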

Only first directory pattern (accepts both a-z and A-Z)
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-zA-Z0-9\-\_]+\/

Only subdirectory URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_]+\/[a-z0-9\-\_]+\/

Only URLs that start with a number pattern
Field: URL path
Matching type: starts with
Modifier: regex
Pattern: [0-9]+

Only URLs that contain an id number pattern
Ex: /lorem/ipsum/product-name-03938.html
Field: URL path
Matching type: contains
Modifier: regex
Pattern: -[0-9]{5}\.html
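
To check what the id pattern accepts, the sketch below runs it against the example URL and a few variations. It uses Python's re.search as an approximation of the "contains" matching type, which is an assumption, and every path other than the one from the example above is invented.

```python
import re

# Hyphen, exactly five digits, then a literal ".html".
ID_PATTERN = re.compile(r"-[0-9]{5}\.html")


def contains_id(path):
    """Return True when the pattern is found anywhere in the path."""
    return ID_PATTERN.search(path) is not None


for path in [
    "/lorem/ipsum/product-name-03938.html",   # five digits
    "/lorem/ipsum/product-name-0393.html",    # four digits
    "/lorem/ipsum/product-name-103938.html",  # six digits
    "/lorem/ipsum/product-name-03938xhtml",   # no literal dot
]:
    print(path, contains_id(path))
```

The backslash before the dot makes it a literal dot; without it, the dot would stand for any character and the last path would also match.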
