Regular expressions (regex) let you find text that matches a pattern. Instead of searching for an exact string, you describe the types of characters that make up the expression you're looking for.

You can use regex to create page groups.

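We'll walk you through creating page groups for the following examples:

  1. Pages with a numerical slug
  2. Pages containing a numerical product ID
  3. Pages with htm or html extensions, all other extensions, and no extensions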

You can adapt these examples to the types of expressions on your own website.

Any metric with text content can be used to create a page group using regex. OnCrawl uses Lucene Regex (here's a cheat sheet for the syntax).

Example 1: Page group for pages with a numerical slug

Our site contains a section of pages whose URLs contain a slug composed of numbers, rather than human-readable text. Here are some examples:

We can create a page group for these pages.

Prepare the page group:

  1. From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
  2. Click on the "+ Create segmentation" button at the top of the page.
  3. Select "From scratch" and click "Continue".
  4. Enter a name for the segmentation. Let's call it "Numerical slugs".
  5. Click "Create segmentation." This will take you to the segmentation page for your new segmentation.
  6. Click on "+ Create page group" at the top right of the page.
  7. Enter a name for the group of pages. Let's call it "Has numerical slug".
  8. Choose a color that will represent this group in all of the OnCrawl charts.
  9. Place the group last in the series of page groups.
  10. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.

Set the definition using a regular expression:

  1. Choose the metric to apply the regular expression to. Since we're looking for a series of numbers anywhere in the URL, select "Full URL".
  2. Choose the operator. Since we're looking for a part of the URL, choose "contains". If the numerical slug only occurs at the end of the URL, you can also choose "ends with".
  3. Click the ".*" button to activate regular expressions.
  4. Enter the following regular expression in the final field: /[0-9]+/
    This expression searches for a slash followed by one or more digits, followed by another slash.
  5. Click "Refresh matching URLs" to make sure everything is working right.
  6. Click "Save changes".
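If you'd like to sanity-check this pattern before entering it in OnCrawl, you can try it against a few URLs locally. The sketch below uses Python's `re` module, which is not Lucene Regex but should behave the same way for a pattern this simple. It models the "contains" operator as a search anywhere in the URL, which is an assumption about how the operator works. The URLs and the function name `has_numerical_slug` are hypothetical.

```python
import re

# Same pattern as in step 4: a slash, one or more digits, another slash.
NUMERICAL_SLUG = re.compile(r"/[0-9]+/")


def has_numerical_slug(url):
    """Approximate 'Full URL contains <regex>': match anywhere in the URL."""
    return NUMERICAL_SLUG.search(url) is not None


# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://www.example.com/articles/48213/",
    "https://www.example.com/articles/how-to-crawl/",
    "https://www.example.com/articles/48213",
]

for url in sample_urls:
    print(url, has_numerical_slug(url))
```

In this sketch, the last URL is not matched because the pattern requires a closing slash after the digits. If your URLs have no trailing slash, the pattern would need to be adapted.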

Example 2: Page group for pages containing a numerical product ID

Our site contains a section of pages whose URLs contain a product name, followed by a hyphen and a product ID composed of five digits. Here are some examples:

We can create a page group for these pages.

Prepare the page group:

  1. From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
  2. Click on the "+ Create segmentation" button at the top of the page.
  3. Select "From scratch" and click "Continue".
  4. Enter a name for the segmentation. Let's call it "Numerical product IDs".
  5. Click "Create segmentation." This will take you to the segmentation page for your new segmentation.
  6. Click on "+ Create page group" at the top right of the page.
  7. Enter a name for the group of pages. Let's call it "Is Product".
  8. Choose a color that will represent this group in all of the OnCrawl charts.
  9. Place the group last in the series of page groups.
  10. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.

Set the definition using a regular expression:

  1. Choose the metric to apply the regular expression to. Since we're looking for a pattern anywhere in the URL, select "Full URL".
  2. Choose the operator. Since we're looking for a part of the URL, choose "contains".
  3. Click the ".*" button to activate regular expressions.
  4. Enter the following regular expression in the final field: /[a-zA-Z\-\_]+-[0-9]{5}/
    This expression searches for a slash, followed by a product name made up of uppercase or lowercase letters, underscores or hyphens, followed by a hyphen, followed by five digits, followed by another slash.
  5. Click "Refresh matching URLs" to make sure everything is working right.
  6. Click "Save changes".
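As with the first example, the pattern can be tried locally before you use it. This is a minimal sketch in Python, assuming that "contains" means a match anywhere in the URL and that Python's `re` module handles this pattern the same way Lucene Regex does. The URLs and the function name `is_product` are hypothetical.

```python
import re

# Same pattern as in step 4: slash, product name, hyphen, five digits, slash.
PRODUCT_ID = re.compile(r"/[a-zA-Z\-\_]+-[0-9]{5}/")


def is_product(url):
    """Approximate 'Full URL contains <regex>': match anywhere in the URL."""
    return PRODUCT_ID.search(url) is not None


# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://www.example.com/shop/blue-widget-12345/",   # five digits
    "https://www.example.com/shop/Red_Chair-00042/",     # mixed case, underscore
    "https://www.example.com/shop/blue-widget-1234/",    # only four digits
    "https://www.example.com/shop/blue-widget-123456/",  # six digits
]

for url in sample_urls:
    print(url, is_product(url))
```

Only the first two URLs match in this sketch. Because the product name part of the pattern does not allow digits, and the ID must be exactly five digits between a hyphen and a slash, IDs that are shorter or longer are left out.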

Example 3: Page groups for pages with htm or html extensions, all other extensions, and no extensions

Our site contains a section of pages whose URLs sometimes contain file extensions.

We'll create a segmentation with a page group for each of the following cases:

  1. URLs with an htm or html extension
  2. URLs with any other extension
  3. URLs with no extension

Prepare the first page group

  1. From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
  2. Click on the "+ Create segmentation" button at the top of the page.
  3. Select "From scratch" and click "Continue".
  4. Enter a name for the segmentation. Let's call it "Extensions".
  5. Click "Create segmentation." This will take you to the segmentation page for your new segmentation.
  6. Click on "+ Create page group" at the top right of the page.
  7. Enter a name for the first group of pages. Let's call it "html / htm".
  8. Choose a color that will represent this group in all of the OnCrawl charts.
  9. Place the group last in the series of page groups.
  10. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.

Set the definition of the first page group using a regular expression

  1. Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
  2. Choose the operator. Since we're looking for a pattern at the end of the URL, choose "ends with".
  3. Click the ".*" button to activate regular expressions.
  4. Enter the following regular expression in the final field: \.html?
    This expression searches for a period, followed by htm, followed by an optional l.
  5. Click "Refresh matching URLs" to make sure everything is working right.
  6. Click "Save changes".
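To see what this definition would catch, here is a small Python sketch. It approximates the "ends with" operator by requiring the pattern to match at the very end of the URL. That is an assumption about the operator, and Python's `re` module stands in for Lucene Regex. The helper `ends_with` and the URLs are hypothetical.

```python
import re


def ends_with(pattern, url):
    """Approximate 'ends with' in regex mode: the pattern must match
    at the very end of the URL."""
    return re.search(r"(?:" + pattern + r")\Z", url) is not None


HTML_OR_HTM = r"\.html?"  # same pattern as in step 4

# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://www.example.com/about.html",
    "https://www.example.com/legacy.htm",
    "https://www.example.com/guide.pdf",
    "https://www.example.com/about.html?page=2",
]

for url in sample_urls:
    print(url, ends_with(HTML_OR_HTM, url))
```

Under this model, the last URL is not matched: once a query string follows the extension, the URL no longer ends with it.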

Create a second page group and set its definition using a regular expression

  1. Click on "+ Create page group" at the top right of the page.
  2. Enter a name for the second group of pages. Let's call it "other extensions".
  3. Choose a color that will represent this group in all of the OnCrawl charts.
  4. Place the group last in the series of page groups.
  5. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.
  6. Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
  7. Choose the operator. Since we're looking for a pattern at the end of the URL, choose "ends with".
  8. Click the ".*" button to activate regular expressions.
  9. Enter the following regular expression in the final field: \.[a-zA-Z]+
    This expression searches for a period, followed by one or more upper- or lowercase letters.

At this point, the "other extensions" page group matches every URL with an extension, including htm and html extensions. If you click "Refresh matching URLs" and then click "Conflicting URLs" above the URL lists, you will see "Conflicts with html / htm". All of the pages in our "html / htm" page group are listed here.
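The overlap can be reproduced with a short Python sketch. It treats a conflict as a URL matched by both definitions, which is our reading of the "Conflicting URLs" list rather than a description of how OnCrawl computes it. The helper `ends_with` and the URLs are hypothetical, and Python's `re` module stands in for Lucene Regex.

```python
import re


def ends_with(pattern, url):
    """Approximate 'ends with' in regex mode: the pattern must match
    at the very end of the URL."""
    return re.search(r"(?:" + pattern + r")\Z", url) is not None


HTML_OR_HTM = r"\.html?"        # "html / htm" page group
ANY_EXTENSION = r"\.[a-zA-Z]+"  # "other extensions" page group, as defined so far

# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://www.example.com/about.html",
    "https://www.example.com/legacy.htm",
    "https://www.example.com/guide.pdf",
    "https://www.example.com/contact/",
]

# A URL claimed by both definitions is what we call a conflict here.
conflicts = [
    url for url in sample_urls
    if ends_with(HTML_OR_HTM, url) and ends_with(ANY_EXTENSION, url)
]
print(conflicts)
```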

Let's fix that.

Correct the definition of this page group to exclude html and htm extensions:

  1. In the OnCrawl Query Language block, click "Add field."
  2. Make sure the "And" operator at the top is selected.
  3. Choose "Full URL" from the "Select field" drop-down menu.
  4. Choose the operator. Since we're looking for a pattern at the end of the URL that we want to exclude, choose "not ends with".
  5. Click the ".*" button to activate regular expressions.
  6. Enter the following regular expression in the final field: \.html?
    This expression searches for a period, followed by htm, followed by an optional l.
  7. Click "Refresh matching URLs" to make sure everything is working right.

At this point, you should no longer see the list "Conflicting URLs".

  8. Click "Save changes".

Create the third page group and set its definition using a regular expression

Technically, all of the pages with an extension are already accounted for, so the pages left over in "other" should be pages with no extensions.

However, to remove all doubt and to limit the use of the automatic "other" category, we can create a page group for pages with no extensions:

  1. Click on "+ Create page group" at the top right of the page.
  2. Enter a name for the third group of pages. Let's call it "no extension".
  3. Choose a color that will represent this group in all of the OnCrawl charts.
  4. Place the group last in the series of page groups.
  5. Click "Create page group". This will take you to the page where you can indicate which pages to put in this group.
  6. Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
  7. Choose the operator. Since we're looking for a pattern at the end of the URL that we don't want to find, choose "not ends with".
  8. Click the ".*" button to activate regular expressions.
  9. Enter the following regular expression in the final field: \.[a-zA-Z]+
    This expression searches for a period, followed by one or more upper- or lowercase letters.
  10. Click "Refresh matching URLs" to make sure everything is working right. You should not see any conflicts.
  11. Click "Save changes".

In the page repartition preview on the right, all of your pages should be accounted for by a colored filter. No pages should be found in the "other" category.
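The three definitions can be checked together with a short Python sketch that verifies each URL lands in exactly one page group. It assumes that "ends with" requires a match at the very end of the URL and that "not ends with" is its negation. The helper names and URLs are hypothetical, and Python's `re` module stands in for Lucene Regex.

```python
import re


def ends_with(pattern, url):
    """Approximate 'ends with' in regex mode: the pattern must match
    at the very end of the URL."""
    return re.search(r"(?:" + pattern + r")\Z", url) is not None


HTML_OR_HTM = r"\.html?"
ANY_EXTENSION = r"\.[a-zA-Z]+"

# One rule per page group, mirroring the definitions built in this example.
GROUPS = {
    "html / htm": lambda u: ends_with(HTML_OR_HTM, u),
    "other extensions": lambda u: ends_with(ANY_EXTENSION, u)
    and not ends_with(HTML_OR_HTM, u),
    "no extension": lambda u: not ends_with(ANY_EXTENSION, u),
}


def matching_groups(url):
    """Return the names of every page group whose rule matches the URL."""
    return [name for name, rule in GROUPS.items() if rule(url)]


# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://www.example.com/about.html",
    "https://www.example.com/guide.pdf",
    "https://www.example.com/contact/",
    "https://www.example.com/song.mp3",
]

for url in sample_urls:
    print(url, matching_groups(url))
```

One thing this sketch brings out: because the pattern only allows letters after the period, an extension containing a digit, such as mp3, is not recognized and the URL falls into "no extension". If your site has such files, you may want to allow digits in the pattern.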

Using these segmentations

Return to your project home page and click on any analysis to view the analysis results.

At the top of the page, use the "Segmentation" drop-down menu to select one of the segmentations you just created.

You can also focus on one of the page groups in the segmentation by changing the group selected in the second drop-down menu, "Base filter".

Going further

If you still have questions, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!