URLs appear in many formats. The OnCrawl crawler resolves certain types of URLs encountered during a crawl in order to process them and to avoid errors in identifying duplicate content.

Relative URL paths

OnCrawl will regularize slashes in the URL where they are meant to indicate a relative path, according to the following rules:

  • Replace /xx/../ with / where xx does not contain / and is not .
  • Replace leading /../ with /
  • Replace /./ with /
  • Replace // with / except when it follows the protocol (https://)
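The rules above can be sketched in Python as follows. This is only an illustration, not OnCrawl's actual implementation: the function names are hypothetical, the rules are applied to the path portion of the URL only, and the sketch assumes that xx in the first rule may be neither . nor ..

```python
import re
from urllib.parse import urlsplit, urlunsplit


def regularize_path(path: str) -> str:
    """Apply the four slash rules repeatedly until the path stops changing."""
    previous = None
    while previous != path:
        previous = path
        # Replace // with / (the protocol separator is not part of the path).
        path = re.sub(r"//+", "/", path)
        # Replace /./ with /
        path = re.sub(r"/\./", "/", path)
        # Replace a leading /../ with /
        path = re.sub(r"^/\.\./", "/", path)
        # Replace /xx/../ with / where xx has no slash.
        # Assumption: xx is neither "." nor "..".
        path = re.sub(r"/(?!\.\.?/)[^/]+/\.\./", "/", path, count=1)
    return path


def regularize_slashes(url: str) -> str:
    """Regularize the path of an absolute URL and leave the rest untouched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=regularize_path(parts.path)))


print(regularize_slashes("https://www.example.com/a//b/../c/./d"))
# https://www.example.com/a/c/d
```

Working on the parsed path rather than on the raw string keeps the double slash after the protocol out of reach of the // rule.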

OnCrawl resolves all relative URLs. A relative and a complete URL are counted as a single URL, rather than two different URLs on your site.

For example:

/contact/

will be recorded as:

https://www.example.com/contact/
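As a rough illustration of this resolution step, Python's standard urljoin function resolves a relative link against the URL of the page on which it was found. The page URL below is a hypothetical example; this shows the general mechanism, not OnCrawl's internal code.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links were found.
page_url = "https://www.example.com/about/team.html"

# A root-relative link, a document-relative link and a complete URL.
print(urljoin(page_url, "/contact/"))
# https://www.example.com/contact/
print(urljoin(page_url, "contact/"))
# https://www.example.com/about/contact/
print(urljoin(page_url, "https://www.example.com/contact/"))
# https://www.example.com/contact/
```

The first and third links resolve to the same complete URL, so they count as a single URL.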

URLs containing session IDs

OnCrawl's built-in system removes session IDs if they are present in your URLs. This cannot be disabled.

URLs with parameters

By default, parameters are not filtered or removed, except for session IDs mentioned above.

However, in the crawl profile settings, you can configure the crawler to filter on or to remove certain parameters.

Additionally, OnCrawl may change the order of the parameters in a URL's query string for the purpose of comparison and analysis. The order of the parameters' values will not be changed. This can be disabled if necessary.

By default, this means that you might find URLs with parameters in the crawl results that don't exist on your site in exactly the same form, character for character. However, the URLs used in the crawl results will always be functionally equivalent to the actual URLs found on your site.

For example:

https://www.example.com/product?q=search&utm=email&parameter1=value1

may be treated as:

https://www.example.com/product?parameter1=value1&q=search&utm=email

This allows us to treat all of the URLs with the same parameters in a different order (q=search&utm=email&parameter1=value1 or parameter1=value1&q=search&utm=email or utm=email&q=search&parameter1=value1 or ...) as the same page.

Note: We always keep multiple values for the same parameter in their original order. For example, if your query string contains parameter1=value1&parameter1=value2 and we change the order of the parameters, the re-ordered URL will still contain the exact string parameter1=value1&parameter1=value2.
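One way to picture this re-ordering is the Python sketch below. It assumes that parameters are sorted alphabetically by name, which matches the example above but is not confirmed as OnCrawl's exact rule. The function name sort_query is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Sort query parameters by name, keeping repeated names in original order."""
    parts = urlsplit(url)
    pairs = parts.query.split("&") if parts.query else []
    # list.sort is stable, so pairs with the same name keep their relative order.
    pairs.sort(key=lambda pair: pair.split("=", 1)[0])
    return urlunsplit(parts._replace(query="&".join(pairs)))


print(sort_query("https://www.example.com/product?q=search&utm=email&parameter1=value1"))
# https://www.example.com/product?parameter1=value1&q=search&utm=email
print(sort_query("https://www.example.com/product?b=2&parameter1=value1&a=1&parameter1=value2"))
# https://www.example.com/product?a=1&b=2&parameter1=value1&parameter1=value2
```

Sorting on the name alone with a stable sort is what keeps parameter1=value1 ahead of parameter1=value2 in the second example.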

URLs with page anchors or hash symbols (#)

OnCrawl truncates (removes and ignores) the content in a URL following a hash (#).

For example:

https://www.example.com/product#specs

will be recorded as:

https://www.example.com/product

Note: We keep hashbangs (#!) and the content that follows a hashbang. This is used as a render indicator in certain types of JavaScript.
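A minimal Python sketch of this behavior, for illustration only (strip_fragment is a hypothetical name, and the hashbang URL is an invented example):

```python
def strip_fragment(url: str) -> str:
    """Drop everything after the first #, unless the fragment is a hashbang (#!)."""
    base, _, fragment = url.partition("#")
    if fragment.startswith("!"):
        # Hashbang: keep the URL exactly as it is.
        return url
    return base


print(strip_fragment("https://www.example.com/product#specs"))
# https://www.example.com/product
print(strip_fragment("https://www.example.com/#!/product"))
# https://www.example.com/#!/product
```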

URLs with special characters vs percent-encoded characters

We percent-encode special and non-ASCII characters in the URL: each such character is replaced by a sequence starting with a % symbol. For example, a space becomes %20.

For example:

https://www.example.com/5ways-to-encode- URL

will be treated as the same as:

https://www.example.com/5ways-to-encode-%20URL
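The equivalence above can be reproduced with Python's standard quote function, which percent-encodes characters as UTF-8 bytes. This is a sketch under assumptions: encode_path is a hypothetical name, only the path is encoded, and % is treated as safe so that sequences that are already encoded are not encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url: str) -> str:
    """Percent-encode special and non-ASCII characters in the URL path."""
    parts = urlsplit(url)
    # "/" keeps the path separators; "%" leaves existing escapes untouched.
    encoded = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded))


print(encode_path("https://www.example.com/5ways-to-encode- URL"))
# https://www.example.com/5ways-to-encode-%20URL
print(encode_path("https://www.example.com/5ways-to-encode-%20URL"))
# https://www.example.com/5ways-to-encode-%20URL
print(encode_path("https://www.example.com/café"))
# https://www.example.com/caf%C3%A9
```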

Trailing slashes

OnCrawl adds a missing trailing slash.

For example:

https://www.example.com

will be treated as the same as:

https://www.example.com/
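The case shown in this example can be sketched in Python as follows. The sketch covers only a URL with no path after the host name, as in the example; whether the rule also applies to deeper paths is not stated here, so it is not assumed. The name add_root_slash is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url: str) -> str:
    """Add the missing slash when an absolute URL has an empty path."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


print(add_root_slash("https://www.example.com"))
# https://www.example.com/
print(add_root_slash("https://www.example.com/"))
# https://www.example.com/
```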

Default ports

OnCrawl removes default ports from the URL. Default ports are 80 for HTTP and 443 for HTTPS. All other ports are left in the URL.

For example:

https://www.example.com:443/

will be treated as the same as:

https://www.example.com/
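As an illustration, the port rule could be written in Python like this. The names DEFAULT_PORTS and drop_default_port are hypothetical, and the sketch assumes a port is removed only when it is the default for the URL's own protocol.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports as listed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def drop_default_port(url: str) -> str:
    """Remove the port when it is the default one for the URL's protocol."""
    parts = urlsplit(url)
    if parts.port is not None and parts.port == DEFAULT_PORTS.get(parts.scheme):
        # The port is the text after the last colon of the network location.
        netloc = parts.netloc.rsplit(":", 1)[0]
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)


print(drop_default_port("https://www.example.com:443/"))
# https://www.example.com/
print(drop_default_port("https://www.example.com:8443/"))
# https://www.example.com:8443/
```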

Lower- and upper-case characters

OnCrawl's analysis is not case sensitive where host names are concerned.

For example:

https://www.mysite.com/

is the same as:

https://www.MySite.com/

However, uppercase and lowercase characters do make a difference in the URL path after the domain name.

For example:

https://www.example.com/EBooks/

is different from:

https://www.example.com/ebooks/
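The difference between the two cases can be sketched in Python by lowercasing only the host portion of the URL before comparing. The function name lowercase_host is hypothetical, and this is an illustration rather than OnCrawl's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host(url: str) -> str:
    """Lowercase the host name and leave the path, query and fragment as they are."""
    parts = urlsplit(url)
    # Note: this also lowercases any user:password@ prefix, which is
    # acceptable for a sketch but may not be wanted in real code.
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


print(lowercase_host("https://www.MySite.com/"))
# https://www.mysite.com/
print(lowercase_host("https://www.example.com/EBooks/"))
# https://www.example.com/EBooks/
```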