Orphan pages are pages that aren't linked to in your site structure. They can’t be reached from anywhere on the website, even if they appear in your XML sitemap. This means that users can’t find them unless they know the URL.
We use the word "orphan" to indicate the lack of parent pages. A parent page is a page with an outgoing link to another page, called a child page. An orphan page, therefore, has no parents. Orphan pages are often impossible to find for search engines, as bots follow links when they crawl a website.
A lack of internal links won't always keep search engine bots away, however: bots that already know an orphan page exists may continue to visit it.
OnCrawl is well aware of this, and this is one reason you may want to combine your crawl data with log data using our cross-analysis. You can discover all your pages, both the ones present in your website structure (discovered by the crawl) and the orphan pages (often not crawled by OnCrawl or by Google).
OnCrawl also displays your orphan page distribution by page group so that you can determine trends in the location of your orphan pages.
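To make the parent/child definition concrete, here is a minimal Python sketch that finds pages with no parents in a small link graph and counts them per group. It is not how OnCrawl computes orphans. The function names, the `start_url` parameter, the example.com URLs, and the use of the first path segment as the "page group" are all assumptions made for this illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def find_orphans(links, known_urls, start_url):
    """Return the known URLs that no other page links to.

    links: dict mapping a parent URL to the URLs it links to.
    known_urls: every URL you know about, from any source.
    start_url: the crawl entry point, which is never treated as an orphan.
    """
    children = set()
    for parent, targets in links.items():
        for target in targets:
            if target != parent:  # a self-link does not make a page its own parent
                children.add(target)
    return {u for u in known_urls if u not in children and u != start_url}


def group_of(url):
    """Hypothetical grouping rule: the first path segment of the URL."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[0] if path else "(home)"


def orphans_by_group(orphans):
    """Count orphan pages per group to expose trends in their location."""
    return Counter(group_of(u) for u in orphans)


# Illustrative data only.
links = {
    "https://example.com/": ["https://example.com/blog/", "https://example.com/shop/"],
    "https://example.com/blog/": ["https://example.com/blog/post-1"],
}
known_urls = {
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/shop/",
    "https://example.com/blog/post-1",
    "https://example.com/blog/old-post",      # known from a sitemap, no parent
    "https://example.com/shop/discontinued",  # known from logs, no parent
}

orphans = find_orphans(links, known_urls, start_url="https://example.com/")
distribution = orphans_by_group(orphans)
```

Note that this sketch only checks for direct parents. A page whose only parent is itself an orphan would not be flagged here, even though a crawler starting from the home page could not reach it.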
Why do we get orphan pages?
Here are a few reasons you might end up with orphan pages:
Pages linked to from external websites. When Google has indexed a page that isn't part of your site structure, it's often because of a link from an external website. This produces an active page (a page that receives SEO hits) that is also an orphan page.
Redirected pages. When you redirect a page, you remove it from your site structure. Internal links should always go directly to the correct page.
Non-canonical pages. When you successfully tell Google to index a different page using rel=canonical, the non-canonical page can become an orphan page.
Expired pages on a website with many short-lived pages. These pages often expire while the crawl is still running, and leaving them as orphans for too long can be harmful.
Pages that returned errors and have since been corrected, but that Google still crawls for a while.
How to find orphan pages
To find your orphan pages you will need:
A list of pages with parents.
You can obtain this list by running a crawl. Make sure your crawl settings (max URLs and max depth) will allow our bots to discover all of your site. Don't run your crawl yet, because you'll also need:
A list of all pages, with or without parents.
You can find almost all of these by adding non-crawl data to your crawl analysis. This might include things like:
- Adding sitemap analysis in the crawl settings before running your crawl. This will find pages you listed in your sitemaps but don't include in your site structure.
- Adding server log data to your crawl. You'll need to enable this option and upload as many days of log data as you can. (We suggest 45 days.) This will show pages requested by bots and users that aren't included in your site structure.
- Adding traffic data, such as Google Analytics, by enabling the option in the crawl settings. This will show pages with a Google Analytics tracker on them that receive visits by users, but that aren't included in your site structure.
- Adding ranking data through Google Search Console by enabling the option in the crawl settings. This will show pages that can be found in Google search results but that aren't included in your site structure.
- Adding backlink data by enabling the option in the crawl settings. This will show pages with "follow" backlinks that aren't part of your site structure.
- Using data ingestion to add other page lists that you might have (business indicators, Excel lists...). This will show you pages you track that aren't included in your site structure.
Some of these are paid options. Use as many or as few as work for your site.
Once you've added other sources, you can launch the crawl. We'll take care of comparing the results from different sources. You can discover the orphan pages we found in your crawl results.
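The comparison itself is conceptually a set difference: every URL known from another source, minus the URLs found by the crawl. The Python sketch below illustrates that idea only and is not OnCrawl's implementation. The function names, the normalization rules, and the sample URLs are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase the scheme and host and drop the fragment so that
    equivalent URLs from different sources compare equal.
    (These normalization rules are an assumption; adapt them to your site.)"""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def orphans_by_source(crawled, sources):
    """Map each URL missing from the crawl to the sources that know about it.

    crawled: URLs discovered by following links (pages with parents).
    sources: dict of source name -> iterable of URLs (sitemaps, logs, ...).
    """
    in_structure = {normalize(u) for u in crawled}
    found = {}
    for name, urls in sources.items():
        for url in urls:
            key = normalize(url)
            if key not in in_structure:
                found.setdefault(key, set()).add(name)
    return found


# Illustrative data only.
crawled = ["https://example.com/", "https://example.com/a"]
sources = {
    "sitemap": ["https://example.com/a", "https://example.com/b"],
    "logs": ["https://EXAMPLE.com/b#top", "https://example.com/c"],
}

result = orphans_by_source(crawled, sources)
for url in sorted(result):
    print(url, sorted(result[url]))
```

Keeping track of which sources report each orphan helps with prioritization: a page seen in both logs and a sitemap is probably worth reattaching before one that appears in a single list.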
If you need a single report with all of your orphan pages, you can create a custom dashboard and add the charts in the "Orphans" category.
Best practices for orphan pages
Link all pages that could possibly generate traffic to your website’s structure (like category pages or internal search result pages).
Avoid syntax errors when creating canonical tags, as they produce incorrect URLs (which may return HTTP 200 or errors).
Make sure that your expired content delivers the appropriate status code (a 404 or a redirection to a newer version).
Be careful when setting up your sitemap in order to avoid any syntax errors.
Reattach known orphan pages to your website structure, starting with the pages that bring the most value.
Be aware that when you correct an orphan page by redirecting traffic, it may take a while for Google's bots to stop testing it.
Make sure you are not wasting any valuable organic traffic!
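For the sitemap advice above, a quick syntax check can catch problems before a sitemap is submitted. The following Python sketch uses only the standard library and assumes a standard `urlset` sitemap in the sitemaps.org 0.9 namespace. The `check_sitemap` function and the sample documents are hypothetical, and the check covers XML well-formedness and absolute `<loc>` URLs only, not full schema validation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap(xml_text):
    """Return (urls, problems) for a sitemap given as a string."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The document is not well-formed XML, so nothing can be extracted.
        return [], [f"XML syntax error: {exc}"]

    urls, problems = [], []
    for loc in root.iter(SITEMAP_NS + "loc"):
        value = (loc.text or "").strip()
        parts = urlsplit(value)
        # A usable <loc> must be an absolute http(s) URL.
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(value)
        else:
            problems.append(f"Invalid <loc>: {value!r}")
    return urls, problems


# Illustrative data only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>example.com/b</loc></url>
</urlset>"""
broken = "<urlset><url><loc>https://example.com/a</loc></urlset>"

urls, problems = check_sitemap(sample)
broken_urls, broken_problems = check_sitemap(broken)
```

In this example the second `<loc>` is reported because it lacks a scheme, and the `broken` document is rejected outright because its `<url>` element is never closed.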
If you have any questions regarding orphan pages, feel free to drop us a line @OnCrawl_CS or click on the blue Intercom button at the bottom of the page to chat with us.