URLs appear in many formats.
The Oncrawl crawler resolves certain types of URLs encountered during a crawl, both to process them and to avoid errors when identifying duplicate content.
Relative URL paths
Oncrawl will regularize slashes in the URL where they are meant to indicate a relative path, according to the following rules:
Replace /xx/../ with / where xx does not contain / and is not .
Replace leading /../ with /
Replace /./ with /
Replace // with / except when it follows the protocol (https://)
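As an illustration only (this is not Oncrawl's actual code), the four rules above could be sketched as follows. The function name regularize_path is hypothetical, and the sketch assumes it receives only the path part of a URL, so the protocol exception never comes into play. It also skips .. segments in the first rule, which is an assumption made so that /../../ is not collapsed by mistake.

```python
import re

# Matches "/xx/../" where xx contains no slash and is not "." (per the rule).
# Skipping ".." as well is an assumption of this sketch.
_PARENT_SEGMENT = re.compile(r"/(?!\.{1,2}/)[^/]+/\.\./")


def regularize_path(path: str) -> str:
    """Apply the slash rules repeatedly until the path stops changing."""
    previous = None
    while previous != path:
        previous = path
        path = path.replace("//", "/")                   # "//" becomes "/"
        path = path.replace("/./", "/")                  # "/./" becomes "/"
        path = _PARENT_SEGMENT.sub("/", path, count=1)   # "/xx/../" becomes "/"
        if path.startswith("/../"):                      # leading "/../" becomes "/"
            path = path[3:]
    return path
```

With this sketch, /a/b/../c becomes /a/c and /../contact/ becomes /contact/.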
Oncrawl resolves all relative URLs. A relative and a complete URL are counted as a single URL, rather than two different URLs on your site.
For example:
/contact/
will be recorded as:
https://www.example.com/contact/
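To see how a relative link and its complete form end up as the same URL, here is a minimal sketch using Python's standard urljoin. The function name resolve_link and the page URL https://www.example.com/about/ are assumptions made for the example; this is not Oncrawl's implementation.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into a complete URL."""
    return urljoin(page_url, href)


# A relative link and a complete link found on the same (assumed) page
# resolve to one and the same URL.
page = "https://www.example.com/about/"
relative = resolve_link(page, "/contact/")
complete = resolve_link(page, "https://www.example.com/contact/")
```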
URLs containing session IDs
Oncrawl's built-in system removes session IDs if they are present in your URLs.
This cannot be disabled.
URLs with parameters
By default, parameters are not filtered or removed, except for session IDs mentioned above.
However, in the crawl profile settings, you can configure the crawler to filter on or to remove certain parameters.
Additionally, Oncrawl may change the order of the parameters in a URL's query string for the purpose of comparison and analysis.
The order of the parameters' values will not be changed. This can be disabled if necessary.
By default, this means that you might find URLs with parameters in the crawl results that don't exist on your site in exactly the same form, character for character.
However, the URLs used in the crawl results will always be functionally equivalent to the actual URLs found on your site.
For example:
https://www.example.com/product?q=search&utm=email&parameter1=value1
may be treated as:
https://www.example.com/product?parameter1=value1&q=search&utm=email
This allows us to treat all of the URLs with the same parameters in a different order (q=search&utm=email&parameter1=value1 or parameter1=value1&q=search&utm=email or utm=email&q=search&parameter1=value1 or ...) as the same page.
Note: We always keep multiple values for the same parameter in their original order. For example, if your query string contains parameter1=value1&parameter1=value2 and we change the order of the parameters, the re-ordered URL will still contain the exact string parameter1=value1&parameter1=value2.
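One way to picture this re-ordering is a stable sort on parameter names. The sketch below is an assumption about how such a step could work, not a description of Oncrawl's code. The function name sort_query_parameters is hypothetical, and alphabetical order is inferred from the example above.

```python
from urllib.parse import urlsplit, urlunsplit


def sort_query_parameters(url: str) -> str:
    """Sort query parameters by name, leaving each name=value pair untouched."""
    parts = urlsplit(url)
    pairs = parts.query.split("&") if parts.query else []
    # sorted() is stable: pairs sharing the same name keep their original
    # relative order, so repeated parameters are never shuffled.
    pairs = sorted(pairs, key=lambda pair: pair.partition("=")[0])
    return urlunsplit(parts._replace(query="&".join(pairs)))
```

Because the pairs are compared by name only, a query string such as z=1&parameter1=value2&parameter1=value1 becomes parameter1=value2&parameter1=value1&z=1, with the two values still in their original order.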
URLs with page anchors or hash symbols (#)
Oncrawl truncates (removes and ignores) the content in a URL following a hash (#).
For example:
https://www.example.com/product#specs
will be recorded as:
https://www.example.com/product
Note: We keep hashbangs (#!) and the content that follows a hashbang. This is used as a render indicator in certain types of JavaScript.
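The truncation rule and its hashbang exception can be sketched as below. The function name strip_fragment and the hashbang URL used in the note are assumptions for illustration, and the code is not Oncrawl's own.

```python
def strip_fragment(url: str) -> str:
    """Drop everything from the first '#' onward, unless it starts a hashbang."""
    base, hash_sign, fragment = url.partition("#")
    # A hashbang ("#!") is kept together with the content that follows it.
    if hash_sign and fragment.startswith("!"):
        return url
    return base
```

Under this sketch, https://www.example.com/product#specs is shortened to https://www.example.com/product, while a URL such as https://www.example.com/#!/product is left unchanged.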
URLs with special characters vs percent-encoded characters
We percent-encode special and non-ASCII characters in the URL: each one is replaced by one or more sequences starting with a % symbol.
For example:
https://www.example.com/5ways-to-encode- URL
will be treated as the same as:
https://www.example.com/5ways-to-encode-%20URL
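For illustration, Python's standard quote function produces the same kind of %-prefixed encoding. The sketch below is an assumption about the general technique, not Oncrawl's code, and the function name encode_path is hypothetical. It works on the path only and leaves existing % signs alone so that already-encoded paths are not encoded twice, which is a simplification.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode special and non-ASCII characters in a URL path."""
    # "/" is kept as the path separator; "%" is kept so that sequences
    # which are already encoded (such as %20) are not encoded again.
    return quote(path, safe="/%")
```

With this sketch, the path /5ways-to-encode- URL and the path /5ways-to-encode-%20URL both come out as /5ways-to-encode-%20URL.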
Trailing slashes
Oncrawl adds a missing trailing slash.
For example:
https://www.example.com
will be treated as the same as:
https://www.example.com/
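A minimal sketch of this rule, covering only the case shown in the example (a URL with a host name and no path at all). The function name add_root_slash is hypothetical, and whether Oncrawl applies the rule in other situations is not assumed here.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url: str) -> str:
    """Add "/" when a URL has a host name but an empty path."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```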
Default ports
Oncrawl removes default ports from the URL.
Default ports are 80 for HTTP and 443 for HTTPS.
All other ports are left in the URL.
For example:
https://www.example.com:443/
will be treated as the same as:
https://www.example.com/
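The port rule could be sketched as follows. This is an illustration rather than Oncrawl's implementation; the names DEFAULT_PORTS and drop_default_port are hypothetical, and port 8443 in the note is an assumed example of a non-default port.

```python
from urllib.parse import urlsplit, urlunsplit

# Default port for each protocol, as listed above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def drop_default_port(url: str) -> str:
    """Remove the port only when it is the default one for the protocol."""
    parts = urlsplit(url)
    if parts.port is not None and DEFAULT_PORTS.get(parts.scheme) == parts.port:
        # Cut the trailing ":port" from the host part of the URL.
        host_without_port = parts.netloc.rsplit(":", 1)[0]
        parts = parts._replace(netloc=host_without_port)
    return urlunsplit(parts)
```

Any other port, such as 8443, is left in place by this sketch.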
Lower- and upper-case characters
Oncrawl's analysis is not case sensitive where host names are concerned.
For example:
https://www.mysite.com/
is the same as:
https://www.MySite.com/
However, uppercase and lowercase characters do make a difference in the URL path after the domain name.
For example:
https://www.example.com/EBooks/
is different from:
https://www.example.com/ebooks/
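The difference between host names and paths can be shown with a short sketch that lower-cases only the host part. The function name lowercase_host is hypothetical, the sketch assumes URLs without a user name or password before the host, and it is not Oncrawl's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_host(url: str) -> str:
    """Lower-case the host name; the path keeps its original case."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=parts.netloc.lower()))


# The two host-name variants compare equal; the two paths do not.
same_host = lowercase_host("https://www.MySite.com/") == lowercase_host("https://www.mysite.com/")
same_path = lowercase_host("https://www.example.com/EBooks/") == lowercase_host("https://www.example.com/ebooks/")
```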