During a crawl, a bot moves through a website. Some URLs won't show up in the crawl results at all; some pages are known but aren't crawled; and some are fetched by the OnCrawl bot but not analyzed.
Below, we'll talk about how to know what pages the OnCrawl bot can find, what pages it will crawl, and what pages it analyzes.
What types of pages does the OnCrawl bot know about?
The OnCrawl bot knows the following pages exist:
Any pages that are linked to from a known page, even if the link is a nofollow link.
Together, these pages make up the site structure created by internal links.
Any pages that are listed in additional sources: sitemaps, data ingestion files, or connected datasets (Google Analytics, Adobe Analytics, AT Internet, Majestic backlinks, Google Search Console, log data...).
Some of these pages might not be part of the site structure created by internal links. In this case, they are called orphan pages.
Unless they appear in additional sources, the OnCrawl bot does not know about pages in directories that are denied to robots in the robots.txt file.
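To make the distinction between known pages and orphan pages concrete, here is a minimal Python sketch. It is not OnCrawl's implementation: the link map, the sitemap set, the start URL, and the function name `pages_in_structure` are all hypothetical, chosen only to illustrate the set logic described above.

```python
# Hypothetical data for illustration only; none of these names come from OnCrawl.
# Each key is a page, each value is the list of pages it links to.
# Nofollow links are included, since linked pages are known either way.
internal_links = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}
# URLs listed in an additional source, such as a sitemap.
sitemap_urls = {"https://example.com/a", "https://example.com/orphan"}
start_url = "https://example.com/"


def pages_in_structure(start, links):
    """Return every URL reachable from `start` by following internal links."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


structure = pages_in_structure(start_url, internal_links)

# Known pages: the site structure plus anything listed in additional sources.
known = structure | sitemap_urls

# Orphan pages: listed in an additional source but absent from the structure.
orphans = sitemap_urls - structure
```

In this example, `https://example.com/orphan` is known only because the sitemap lists it, so it ends up in `orphans`.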
What types of pages are crawled by the OnCrawl bot?
Just because the OnCrawl bot has learned a page exists doesn't mean it will be crawled.
We crawl pages that fall within the scope of the crawl setup and that don't forbid bots via robots.txt (or a virtual robots.txt, if you use one). This might include:
HTML pages in your domains' site structure, up to the page depth and number of URLs used as crawl limits
HTML pages in subdomains, if you checked "crawl subdomains"
HTML pages with a meta "noindex" tag (as a reminder, these pages can be crawled, but not indexed by search engine bots), pages with HTTP error status codes, canonicalized pages...
This behavior can be modified when setting up a crawl, in the Crawler behavior section.
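As a rough illustration of how a crawl scope can combine robots.txt rules with a depth limit, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user agent name `ExampleBot`, the `max_depth` parameter, and the function `in_crawl_scope` are assumptions made for this example; they do not describe how the OnCrawl bot is actually built or configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the site being crawled.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def in_crawl_scope(url, depth, max_depth, user_agent="ExampleBot"):
    """Return True if the URL is within the depth limit and allowed by robots.txt.

    Both the depth rule (depth <= max_depth) and the user agent name are
    assumptions for this sketch, not documented OnCrawl behavior.
    """
    return depth <= max_depth and parser.can_fetch(user_agent, url)


allowed = in_crawl_scope("https://example.com/blog/post", depth=2, max_depth=5)
denied = in_crawl_scope("https://example.com/private/report", depth=1, max_depth=5)
too_deep = in_crawl_scope("https://example.com/blog/archive", depth=9, max_depth=5)
```

A page with a meta "noindex" tag would still pass this check, because robots.txt and crawl limits govern crawling, while "noindex" only affects indexing.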
We fetch, or retrieve information for, all of the pages we crawl. In the Data Explorer, you can find information about the crawl by adding columns related to the OnCrawl bot, such as:
Fetch status: whether or not the bot received a response
Fetch date: the date and time of the bot hit on the page
What types of pages are analyzed in the crawl results dashboards and charts?
Most of the OnCrawl charts are based on compliant pages:
The page's meta robots tags allow the page to be indexed.
The page's HTTP status code is OK (200).
The page has no canonical URL or is its own canonical URL.
The page is an HTML page (as opposed to a resource, such as a CSS file or an image file).
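The four conditions above can be sketched as a single check. This is a simplified illustration, not OnCrawl's actual logic: the `Page` class, its field names, and the `is_compliant` function are hypothetical, and the meta robots handling only looks for the `noindex` and `none` directives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical record of what a crawler might store about a fetched URL."""
    url: str
    status_code: int
    content_type: str
    meta_robots: str = ""            # e.g. "noindex, follow"
    canonical: Optional[str] = None  # None means no canonical URL is declared


def is_compliant(page: Page) -> bool:
    """Return True if the page meets all four conditions listed above."""
    directives = {d.strip().lower() for d in page.meta_robots.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    indexable = "noindex" not in directives and "none" not in directives
    status_ok = page.status_code == 200
    self_canonical = page.canonical is None or page.canonical == page.url
    is_html = page.content_type.lower().startswith("text/html")
    return indexable and status_ok and self_canonical and is_html


ok_page = Page("https://example.com/a", 200, "text/html; charset=utf-8")
noindex_page = Page("https://example.com/b", 200, "text/html", meta_robots="noindex, follow")
canonicalized_page = Page("https://example.com/c", 200, "text/html",
                          canonical="https://example.com/a")
```

Here only `ok_page` passes the check; the other two are crawled but would be left out of charts based on compliant pages.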
You can check which pages are included in a chart by clicking on the chart: this opens the Data Explorer with the OnCrawl Query Language filter that corresponds to the chart.
What information is retrieved for an analyzed page?
All of the information OnCrawl obtains about a page is available on the URL details page. You can find this page under Tools at the bottom of the sidebar in any crawl report.
If you still have questions, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.
This article can also be found by searching for:
pages connues, pages encontrées, combien de pages, pages dans résultats
páginas conocidas, paginas encontradas, cuántos páginas