OnCrawl's Data Explorer is a powerful tool: it contains all the data collected and calculated by OnCrawl for every dataset (pages, links, logs, etc.).

The Data Explorer uses the OnCrawl Query Language, which lets you create filters to select data and report the matching URLs. These filters can use regular expressions. The language can be complex, but it is very powerful.

This article reviews what you can do with regex (regular expressions) and gives you good practices for exploring your data further and more easily.

OnCrawl uses the Lucene regex syntax. Read our Lucene Cheat Sheet to learn more about the available rule options.

Principle of Regular Expressions

The basic principle of regular expressions is to search for patterns within a string of characters. For this purpose, you have access to a set of rules that use two main categories of objects: characters and quantifiers.

  • Characters represent letters, digits or special characters (or sets of them)
  • Quantifiers indicate how many times a character must appear in the pattern
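To make the two categories concrete, here is a minimal sketch in Python. It uses Python's `re` module as a stand-in, which is an assumption: OnCrawl relies on Lucene regex, whose syntax differs in places, but character classes and the `+` and `{n}` quantifiers behave the same way in both. The sample strings are made up for illustration.

```python
import re

# Character class: one lowercase letter or digit.
# Quantifier "+": the class must appear one or more times.
ONE_OR_MORE = re.compile(r"[a-z0-9]+")

# Quantifier "{5}": the digit class must appear exactly five times.
EXACTLY_FIVE = re.compile(r"[0-9]{5}")


def matches_whole(pattern, text):
    """Return True when the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


print(matches_whole(ONE_OR_MORE, "product42"))  # True
print(matches_whole(ONE_OR_MORE, "Product42"))  # False: "P" is not in [a-z0-9]
print(matches_whole(EXACTLY_FIVE, "03938"))     # True
print(matches_whole(EXACTLY_FIVE, "0393"))      # False: only four digits
```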

Use REGEX in filters

You will find the filters in two important OnCrawl interfaces:

  • Data Explorer - to dig into your data
  • Segmentations - to group your pages 

The interface is simple but powerful. There you will find:

  • the field to which the filter is applied
  • the type of match you want: is, contains, starts/ends with
  • two modifiers: "aZ" case insensitive, " * " use regex
  • the value field: this can be a plain string or, with the regex modifier, a regex
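The four parts of a filter can be pictured as a small data structure. The sketch below is purely illustrative: the `Filter` class and the `applies` function are hypothetical names, not part of OnCrawl, and Python's `re` module only approximates how the match types might behave with Lucene regex.

```python
import re
from dataclasses import dataclass


@dataclass
class Filter:
    """Hypothetical model of a filter: field, match type, modifiers, value."""
    field: str
    match_type: str  # "is", "contains", "starts with" or "ends with"
    value: str
    case_insensitive: bool = False  # the "aZ" modifier
    use_regex: bool = False         # the " * " modifier


def applies(f: Filter, text: str) -> bool:
    """Check a single field value against the filter (approximation)."""
    # Without the regex modifier, the value is treated as a literal string.
    pattern = f.value if f.use_regex else re.escape(f.value)
    flags = re.IGNORECASE if f.case_insensitive else 0
    if f.match_type == "is":
        return re.fullmatch(pattern, text, flags) is not None
    if f.match_type == "contains":
        return re.search(pattern, text, flags) is not None
    if f.match_type == "starts with":
        return re.match(pattern, text, flags) is not None
    if f.match_type == "ends with":
        return re.search(rf"(?:{pattern})\Z", text, flags) is not None
    raise ValueError(f"unknown match type: {f.match_type}")


id_filter = Filter("URL path", "contains", r"-[0-9]{5}\.html", use_regex=True)
print(applies(id_filter, "/lorem/ipsum/product-name-03938.html"))  # True
```

Keeping the modifiers as separate booleans mirrors the interface, where case insensitivity and regex can be switched on independently.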

Regex examples

Most used URL pattern (matches any URL path made of letters, digits, hyphens, underscores and slashes)
Field: URL path
Matching type: is
Modifier: case insensitive + regex
Pattern: [a-z0-9\-\_\/]+

Only lowercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_\/]+

Only uppercase URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [A-Z0-9\-\_\/]+
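The three patterns above differ only in letter case. The sketch below compares them on made-up paths, using Python's `re.fullmatch` to approximate the "is" match type and `re.IGNORECASE` to approximate the "aZ" modifier. Both mappings are assumptions, since OnCrawl uses Lucene regex rather than Python.

```python
import re

LOWER = r"[a-z0-9\-\_\/]+"
UPPER = r"[A-Z0-9\-\_\/]+"


def is_match(pattern, path, case_insensitive=False):
    """Approximate the 'is' match type: the whole path must match."""
    flags = re.IGNORECASE if case_insensitive else 0
    return re.fullmatch(pattern, path, flags) is not None


print(is_match(LOWER, "/blog/my-post/"))                         # True
print(is_match(LOWER, "/Blog/My-Post/"))                         # False
print(is_match(LOWER, "/Blog/My-Post/", case_insensitive=True))  # True
print(is_match(UPPER, "/BLOG/MY-POST/"))                         # True
print(is_match(UPPER, "/blog/my-post/"))                         # False
# Characters outside the class, such as ".", make the match fail.
print(is_match(LOWER, "/blog/page.html"))                        # False
```

Note that a path containing only digits and separators, such as `/2024/`, matches both the lowercase and the uppercase pattern.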

Only first directory pattern (case insensitive, because the pattern lists both a-z and A-Z)
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-zA-Z0-9\-\_]+\/

Only subdirectory URLs pattern
Field: URL path
Matching type: is
Modifier: regex
Pattern: [a-z0-9\-\_]+\/[a-z0-9\-\_]+\/
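The two directory patterns can be checked side by side. This is a sketch under assumptions: Python's `re.fullmatch` stands in for the "is" match type, and the sample paths are invented. As written, neither pattern accepts a leading slash, so if the stored path begins with "/", a `\/` would need to be added at the start of the pattern.

```python
import re

FIRST_DIR = re.compile(r"[a-zA-Z0-9\-\_]+\/")
SUB_DIR = re.compile(r"[a-z0-9\-\_]+\/[a-z0-9\-\_]+\/")


def depth_label(path):
    """Label a path by which directory pattern matches it entirely."""
    if FIRST_DIR.fullmatch(path):
        return "first directory"
    if SUB_DIR.fullmatch(path):
        return "subdirectory"
    return "no match"


print(depth_label("blog/"))       # first directory
print(depth_label("Blog/"))       # first directory (A-Z is in the class)
print(depth_label("blog/seo/"))   # subdirectory
print(depth_label("blog/seo"))    # no match: the trailing slash is required
print(depth_label("/blog/"))      # no match: the leading slash is not in the pattern
```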

Only URLs that start with a number pattern
Field: URL path
Matching type: starts with
Modifier: regex
Pattern: [0-9]+
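A "starts with" filter only constrains the beginning of the value. The sketch below approximates it with Python's `re.match`, which anchors at the start of the string and ignores what follows. This mapping is an assumption, and the sample paths are invented.

```python
import re

STARTS_WITH_NUMBER = re.compile(r"[0-9]+")


def starts_with_number(path):
    """True when the path begins with one or more digits."""
    # re.match only looks at the beginning of the string.
    return STARTS_WITH_NUMBER.match(path) is not None


print(starts_with_number("2024-report/"))   # True
print(starts_with_number("report-2024/"))   # False: digits are not at the start
print(starts_with_number("/2024-report/"))  # False: the first character is "/"
```

As the last line shows, a path stored with a leading slash would need the slash in the pattern as well, for example `\/[0-9]+`.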

Only URLs that contain an ID number pattern
Example: /lorem/ipsum/product-name-03938.html
Field: URL path
Matching type: contains
Modifier: regex
Pattern: -[0-9]{5}\.html
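The ID pattern can be tried against the example URL from this article. The sketch below uses Python's `re.search` to approximate the "contains" match type, which is an assumption about how the filter behaves. The second and third paths are invented counter-examples.

```python
import re

ID_PATTERN = re.compile(r"-[0-9]{5}\.html")


def contains_id(path):
    """True when a hyphen, exactly five digits and '.html' appear in the path."""
    return ID_PATTERN.search(path) is not None


print(contains_id("/lorem/ipsum/product-name-03938.html"))   # True
print(contains_id("/lorem/ipsum/product-name-0393.html"))    # False: four digits
print(contains_id("/lorem/ipsum/product-name-039381.html"))  # False: six digits
```

The escaped dot (`\.`) matters: without the backslash, the dot would match any character instead of a literal ".".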


