Regular expressions (regex) find text that matches a pattern describing the types of characters in the expression you're looking for.
You can use regex to create page groups.
We'll walk you through creating page groups for the following examples:
Example 1: Pages with a numerical slug, such as https://www.mysite.com/articles/5473/
Example 2: Pages with a product ID after their title, such as https://www.mysite.com/products/shoes-12345/
Example 3: Pages with an html or an htm extension, such as https://www.mysite.com/page.htm
You can adapt these examples to the types of expressions on your own website.
Any metric with text content can be used to create a page group using regex. Oncrawl uses Lucene Regex (here's a cheat sheet for the syntax).
Example 1: Page group for pages with a numerical slug
Our site contains a section of pages whose URLs contain a slug composed of numbers, rather than human-readable text. For example: https://www.mysite.com/articles/5473/
We can create a page group for these pages.
Prepare the page group:
From the project home page, scroll down to the "Analysis" section and click on the "Configure Segmentation" button. This will take you to the Segmentation page.
Click on the "+ Create a segmentation" button at the top of the page.
Select "Manual".
Enter a name for the segmentation. Let's call it "Numerical slugs".
Click "Create segmentation". This will take you to the segmentation page for your new segmentation.
Click on the new group icon at the top of the list of Groups on the left of the page.
Enter a name for the group of pages. Let's call it "Has numerical slug".
When you click "Apply", the group will show up. The right-hand side of the page now lists the rules that determine which pages to put in this group.
Set the definition using a regular expression:
Choose the metric to apply the regular expression to. Since we're looking for a series of numbers anywhere in the URL, select "Full URL".
Choose the operator. Since we're looking for a part of the URL, choose "contains". If the numerical slug only occurs at the end of the URL, you can also choose "ends with".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: /[0-9]+/
This expression searches for a slash followed by one or more digits, followed by another slash.
Click "Refresh matching URLs" to make sure everything is working right.
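If you want to sanity-check the pattern before entering it, the sketch below tries it against a few URLs in Python. This is only an approximation: Python's re module is not Lucene Regex, and treating "contains" as "match anywhere in the URL" is an assumption about how the operator behaves. The function name and every sample URL except the first are hypothetical.

```python
import re

# Same pattern as in the rule: a slash, one or more digits, another slash.
NUMERIC_SLUG = re.compile(r"/[0-9]+/")


def has_numerical_slug(url: str) -> bool:
    """Approximate the "contains" operator: look for the pattern anywhere in the URL."""
    return NUMERIC_SLUG.search(url) is not None


sample_urls = [
    "https://www.mysite.com/articles/5473/",      # example from this article
    "https://www.mysite.com/articles/about-us/",  # hypothetical: text slug
    "https://www.mysite.com/articles/5473",       # hypothetical: no trailing slash
]

for url in sample_urls:
    print(url, has_numerical_slug(url))
```

In this sketch only the first URL matches. The third one does not, because the pattern requires a slash after the digits.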
Example 2: Page group for pages containing a numerical product ID
Our site contains a section of pages whose URLs contain a product name, followed by a hyphen and a product ID composed of 5 digits. For example: https://www.mysite.com/products/shoes-12345/
We can create a page group for these pages.
Create a new segmentation, or just add a new group to an existing one.
Click on your new group to create the rules using a regular expression:
On the right-hand side where you can set the rules for this group, choose the metric to apply the regular expression to. Since we're looking for a pattern anywhere in the URL, select "Full URL".
Choose the operator. Since we're looking for a part of the URL, choose "contains".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: /[a-zA-Z\-\_]+-[0-9]{5}/
This expression searches for a slash, followed by a product name made up of upper- or lowercase letters, underscores or hyphens, followed by a hyphen, followed by exactly five digits, followed by another slash.
Click "Refresh matching URLs" to make sure everything is working right.
Click "Save changes".
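As a rough check of the product ID pattern, here is a minimal Python sketch. It assumes that Python's re module treats this particular pattern the same way Lucene Regex does, and that "contains" means "match anywhere in the URL"; neither is confirmed by Oncrawl. The function name and all sample URLs other than the shoes-12345 example are hypothetical.

```python
import re

# Same pattern as in the rule: slash, product name (letters, hyphens,
# underscores), hyphen, exactly five digits, slash.
PRODUCT_ID = re.compile(r"/[a-zA-Z\-\_]+-[0-9]{5}/")


def has_product_id(url: str) -> bool:
    """Approximate the "contains" operator: look for the pattern anywhere in the URL."""
    return PRODUCT_ID.search(url) is not None


sample_urls = [
    "https://www.mysite.com/products/shoes-12345/",          # example from this article
    "https://www.mysite.com/products/running_shoes-54321/",  # hypothetical: underscore in name
    "https://www.mysite.com/products/shoes-1234/",           # hypothetical: only four digits
    "https://www.mysite.com/products/12345/",                # hypothetical: no product name
]

for url in sample_urls:
    print(url, has_product_id(url))
```

The first two URLs match in this sketch. The last two do not, because the pattern needs a name before the hyphen and exactly five digits before the closing slash.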
Example 3: Page groups for pages with htm or html extensions, all other extensions, and no extensions
Our site contains a section of pages whose URLs sometimes contain file extensions.
We'll create a segmentation with a page group for each of the following cases:
Pages with an html or an htm extension, such as https://www.mysite.com/page.htm
Pages with an extension other than htm or html, such as https://www.mysite.com/media.jpg
Pages with no extension, such as https://www.mysite.com/page/
First, create your segmentation. You can use the "Configure Segmentation" button on the project home page.
You'll want to give it a meaningful name, like "Extensions".
Prepare the first page group
On the left, click the blue button next to the list of Groups to create your first group.
Enter a name for the first group of pages. Let's call it "html / htm".
Click "OK". This will create the group. Its information will show up on the right, where you can indicate which pages to put in this group.
Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
Choose the operator. Since we're looking for a pattern at the end of the URL, choose "ends with".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: \.html?
This expression searches for a period, followed by htm, followed by an optional l.
Click "Refresh matching URLs" to make sure everything is working right.
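The "ends with" rule can be approximated in Python by anchoring the pattern to the end of the string. This is a sketch under assumptions: the $ anchor stands in for the "ends with" operator, and Python's re module stands in for Lucene Regex. The function name is hypothetical; the sample URLs are taken from this article, except the last one.

```python
import re

# The rule's pattern, with "$" added to imitate the "ends with" operator.
HTML_EXTENSION = re.compile(r"\.html?$")


def ends_with_html(url: str) -> bool:
    """True when the URL ends in .htm or .html."""
    return HTML_EXTENSION.search(url) is not None


sample_urls = [
    "https://www.mysite.com/page.htm",
    "https://www.mysite.com/media.jpg",
    "https://www.mysite.com/page/",
    "https://www.mysite.com/page.html",  # hypothetical
]

for url in sample_urls:
    print(url, ends_with_html(url))
```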
Create a second page group and set its definition using a regular expression
Click on the new group button in the Groups list on the left.
Enter a name for the second group of pages. Let's call it "other extensions".
Click "Apply". This will create the group. Its information will show up on the right, where you can indicate which pages to put in this group.
Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
Choose the operator. Since we're looking for a pattern at the end of the URL, choose "ends with".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: \.[a-zA-Z]+
This expression searches for a period, followed by one or more upper- or lowercase letters.
At this point, we have a page group that lists all of the URLs with extensions, including htm and html extensions. If you click "Refresh matching URLs" and then click on "Group Overlaps" above the URL lists, you will see "Overlaps with html / htm". All of the pages in our "html / htm" page group are listed here.
Let's fix that.
Correct the definition of this page group to exclude html and htm extensions:
In the Oncrawl Query Language block, click "Add field."
Make sure the "And" operator at the top is selected.
Choose "Full URL" from the "Select field" drop-down menu.
Choose the operator. Since we're looking for a pattern at the end of the URL that we want to exclude, choose "not ends with".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: \.html?
This expression searches for a period, followed by htm, followed by an optional l.
Click "Refresh matching URLs" to make sure everything is working right.
At this point, you should no longer see any URLs in the list "Group overlaps".
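To see why the extra rule removes the overlap, the following Python sketch applies both versions of the "other extensions" definition to a small set of URLs. It assumes $ is a fair stand-in for "ends with" and that "And" simply requires both rules to hold; the function names and the exclude_html parameter are hypothetical, and Python's re module is not Lucene Regex.

```python
import re

ANY_EXTENSION = re.compile(r"\.[a-zA-Z]+$")  # "ends with" \.[a-zA-Z]+
HTML_EXTENSION = re.compile(r"\.html?$")     # "ends with" \.html?


def in_html_group(url: str) -> bool:
    return HTML_EXTENSION.search(url) is not None


def in_other_extensions(url: str, exclude_html: bool = True) -> bool:
    """First rule alone, or first rule AND "not ends with" \\.html?."""
    has_extension = ANY_EXTENSION.search(url) is not None
    if not exclude_html:
        return has_extension
    return has_extension and not in_html_group(url)


urls = [
    "https://www.mysite.com/page.htm",
    "https://www.mysite.com/page.html",  # hypothetical
    "https://www.mysite.com/media.jpg",
    "https://www.mysite.com/page/",
]

# URLs that fall into both groups, before and after adding the second rule.
overlap_before = [u for u in urls if in_html_group(u) and in_other_extensions(u, exclude_html=False)]
overlap_after = [u for u in urls if in_html_group(u) and in_other_extensions(u)]

print("Overlap before:", overlap_before)
print("Overlap after:", overlap_after)
```

With these sample URLs, the two html / htm pages overlap before the correction and the overlap list is empty afterwards.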
Create the third page group and set its definition using a regular expression
Technically, all of the pages with an extension are already accounted for, so the pages left over in "other" should be pages with no extensions.
However, to remove all doubt and to limit the use of the automatic "other" category, we can create a page group for pages with no extensions:
Click on the new group button in the Groups list on the left.
Enter a name for the third group of pages. Let's call it "no extension".
Click "OK". This will create the group. Its information will show up on the right, where you can indicate which pages to put in this group.
Choose the metric to apply the regular expression to. Since we're looking for a pattern in the URL, select "Full URL".
Choose the operator. Since we're looking for a pattern at the end of the URL that we don't want to find, choose "not ends with".
Click the ".*" button to activate regular expressions.
Enter the following regular expression in the final field: \.[a-zA-Z]+
This expression searches for a period, followed by one or more upper- or lowercase letters.
Click "Refresh matching URLs" to make sure everything is working right. You should not see any conflicts.
Click "Save changes".
In the crawl breakdown preview on the left, all of your pages should be accounted for by a colored filter. No pages should be found in the gray "other" category.
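The three definitions can be checked together with a short Python sketch that assigns each URL to every group whose rules it satisfies. It relies on the same assumptions as before: $ approximates "ends with" and "not ends with", and Python's re module approximates Lucene Regex. The groups_for function and the sample list are hypothetical.

```python
import re

ANY_EXTENSION = re.compile(r"\.[a-zA-Z]+$")
HTML_EXTENSION = re.compile(r"\.html?$")


def groups_for(url: str) -> list[str]:
    """Return the names of all page groups whose rules the URL satisfies."""
    has_extension = ANY_EXTENSION.search(url) is not None
    is_html = HTML_EXTENSION.search(url) is not None
    groups = []
    if is_html:
        groups.append("html / htm")
    if has_extension and not is_html:
        groups.append("other extensions")
    if not has_extension:
        groups.append("no extension")
    return groups


urls = [
    "https://www.mysite.com/page.htm",
    "https://www.mysite.com/media.jpg",
    "https://www.mysite.com/page/",
]

for url in urls:
    matched = groups_for(url)
    # Exactly one group means no overlap and nothing left in "other".
    print(url, matched, "OK" if len(matched) == 1 else "CHECK")
```

One thing this sketch brings out: an extension that contains a digit, such as a hypothetical video.mp4, is not matched by [a-zA-Z]+ and therefore lands in "no extension". If your site uses such extensions, you may want to widen the character class.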
Using these segmentations
Return to your project home page and click on any analysis to view the analysis results.
At the top of the page, use the "Segmentation" drop-down menu to select one of the segmentations you just created.
You can also focus on one of the page groups in the segmentation by changing the group selected in the second drop-down menu, "Base filter".