How to define rules of inclusion/exclusion for URL patterns in your crawl setup

How to tell the crawler what part of your site to analyze by including/excluding groups of URLs

When you set up a crawl, you can define which URLs the crawler should crawl or skip by matching URL patterns.

For example, you might only want to crawl pages with β€œ/product/” in the URL, or you might want to skip any pages including a certain type of ID in the URL.
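For instance, assuming the rules are regular expressions (the patterns below are hypothetical examples, not defaults in the product):

    Include rule:  /product/        keeps only URLs containing "/product/"
    Exclude rule:  sessionid=\d+    skips URLs carrying a numeric session ID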

Setting up crawl filters based on patterns in URLs

To crawl only some of your URLs, based on a pattern in the URL, go to Set up a new crawl.

In your crawl configuration, there is a section called URL pattern filtering.

Click on this section and then check Enable these filters.

Next, click on the button Add rules to add regex rules that will include or exclude the URLs that match them.

  1. A new window opens where you can enter your different rules of inclusion or exclusion.

  2. Click the + icon to add another rule, or the - icon to remove an existing one.

  3. You can then provide a sample (a list of URLs) to check how your rules will work.

  4. Once you have listed some URLs, click on the button Check filters. This will show you the results for each sample URL: an icon indicates whether the URL is taken into account in the crawl or ignored.

When defining the rule list, note that exclusions are evaluated first:

  1. A URL will be included in this crawl if no rules exclude it.

  2. If you provide any include rules, then the URL must also match at least one include rule (see the sketch below).
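To make this evaluation order concrete, here is a minimal sketch in Python. It assumes the rules are standard regular expressions matched anywhere in the URL; the function name and sample rules are hypothetical illustrations, not part of the product:

    import re

    def is_crawled(url, include_rules, exclude_rules):
        # Exclusions are evaluated first: any match removes the URL.
        if any(re.search(rule, url) for rule in exclude_rules):
            return False
        # If include rules exist, the URL must match at least one.
        if include_rules:
            return any(re.search(rule, url) for rule in include_rules)
        # No include rules: every URL that is not excluded is crawled.
        return True

    # Mirrors the "Check filters" preview on a small sample of URLs.
    include_rules = [r"/product/"]
    exclude_rules = [r"sessionid=\d+"]
    for url in [
        "https://example.com/product/123",
        "https://example.com/product/123?sessionid=42",
        "https://example.com/blog/post",
    ]:
        status = "crawled" if is_crawled(url, include_rules, exclude_rules) else "ignored"
        print(url, "->", status)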

Effects of URL pattern filtering

What types of crawl can URL pattern filtering be applied to?

URL pattern filtering works with all modes of crawl.

However, your pattern filter will not apply to any Start URL when in Spider mode.

When in the crawl process are pattern filtering rules applied?

URL pattern filtering rules are applied before robots.txt or virtual robots.txt rules and before a page is requested by the crawler.

Can you see URLs excluded by pattern filtering in crawl results?

When you use URL patterns to filter, only URLs that are crawled are recorded in crawl results. The crawler has no knowledge of the URLs that were filtered out.

URLs that were not crawled will not be visible in the Data Explorer page or in links datasets unless you have connected an additional data source that contains information about them (such as sitemaps, GA4 or GSC).
