Duplicate content leads to SEO issues that can hurt your rankings.
Duplicate content refers to content that appears in more than one place, whether on a website or on multiple websites.
Duplicate content causes trouble for crawlers, since it is difficult to tell which URL is the most relevant for a given query.
To preserve the user experience, search engines rarely display multiple versions of the same content for a single query and are forced to choose the version most likely to be the best result.
This leads to a significant loss of relevant results on search engine results pages, and therefore to a loss of traffic.
Duplicate content can lead to three main issues:
confusion about which version to index
difficulty directing link metrics (authority, trust, anchor text, link juice) to the right page, rather than splitting them between versions
inability to rank the right version for relevant queries
Our data
With Oncrawl, you can easily find your groups of duplicate pages and near-duplicates.
You will also be able to see whether the canonical strategy you have put in place manages your duplicate content, or whether problems remain.
We split your problematic duplicate content based on whether there are multiple canonical URLs for the cluster, or whether canonical URLs are simply not set.
You can filter your clusters by number of pages and also by content similarity.
By clicking on a specific cluster, you will access further details about the URLs in this cluster.
You can also examine which types of content are duplicated: in the report above, for instance, you can see that 3,738 pages have a duplicate description.
General considerations about duplicate content
Contrary to popular belief, there isn't a specific penalty for duplicate content, at least not in the way most people envision it.
However, there are certain practices related to content duplication that can incur penalties.
For example, scraping and republishing content from other sites without adding value, or creating multiple pages with almost identical content, are discouraged by Google's Webmaster Guidelines.
Also, many concerns around duplicate content revolve around instances like having multiple URLs on a domain pointing to the same content, which is often a result of how Content Management Systems (CMS) handle content by default.
While this type of non-malicious duplication is common, it doesn't lead to penalties. Instead, it's more about how search engines like Google handle and present content.
Search engines aim to provide diverse search results, filtering out duplicate documents to reduce redundancy for users.
What are the best practices?
To avoid these duplicate content issues, there are several best practices you can follow:
Redirecting duplicate content to the canonical URL
Adding a canonical link element to the duplicate page
Adding an HTML link from the duplicate page to the canonical page
Using parameter handling tools in Google Webmaster Central
Let's look at each of them in more detail.
301 redirect
A 301 redirect is in most cases the most relevant solution, especially for URL issues. It tells search engines which version of a page is the original and permanently points the duplicates to it.
Moreover, when multiple well-ranked pages are redirected to a single one, they no longer compete with each other and instead combine into a stronger relevancy and popularity signal. The target page thus tends to rank better.
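For example, on an Apache server, a non-www to www redirect can be set up in the .htaccess file. This is a minimal sketch, assuming mod_rewrite is enabled and using the placeholder domain from this article's examples:

RewriteEngine On
# Permanently (301) redirect the non-www host to the www version,
# preserving the requested path
RewriteCond %{HTTP_HOST} ^mywebsite\.com$ [NC]
RewriteRule ^(.*)$ http://www.mywebsite.com/$1 [R=301,L]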
Rel=canonical
Rel=canonical works in much the same way as a 301 redirect, except that it is easier to implement and visitors still land on the duplicate URL.
It can also be used for content copied from other websites.
It tells search engines that the copied article was intentionally placed on your website and that all the ranking weight of that page should pass to the original one.
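As a minimal sketch, using the example URLs from later in this article, the tag is placed in the head of the duplicate page and points to the version you want indexed:

<!-- In the <head> of the duplicate page, e.g. www.mywebsite.com/red-item?color=red -->
<link rel="canonical" href="http://www.mywebsite.com/red-item" />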
If you need further details about how rel=canonical works, we previously wrote an article on that subject.
NoIndex, NoFollow
This combination of tags is useful for pages that should not appear in a search engine's index.
Bots can still crawl the pages, but they will neither index them nor follow the links they contain.
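As a minimal sketch, the tag is placed in the head of the page you want to keep out of the index. Note that noindex alone is enough to prevent indexing; adding nofollow also tells bots not to follow the links on the page:

<!-- Keep this page out of the index and do not follow its links -->
<meta name="robots" content="noindex, nofollow">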
Parameter handling
Google Webmaster Tools (now Google Search Console) offers several services.
One of them lets you set a preferred domain for your site and tell Google how to handle URL parameters.
However, this only applies to Google.
Your settings will not be taken into account by Bing or other search engines.
Further methods which can be implemented
Preferred domain
This is a very basic setting that should be implemented on every site. It simply tells search engines whether your site should be displayed with or without the www in the search engine results pages.
Internal linking
Be careful when linking internally. If you decide that the canonical version of your website is www.mywebsite.com/, then all internal links should go to http://www.mywebsite.com/page.html and not to http://mywebsite.com/page.html.
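For example (with a hypothetical page name), every internal link should consistently use the canonical www version:

<!-- Good: points to the canonical www version -->
<a href="http://www.mywebsite.com/page.html">My page</a>

<!-- Bad: points to the duplicate non-www version -->
<a href="http://mywebsite.com/page.html">My page</a>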
Merging content
When regrouping content, be sure to add a link back to the original piece.
Write unique product descriptions
It might take more time, but if you write your own descriptions instead of using the manufacturers' versions, it can help you rank above the other sites that use duplicated descriptions.
How to improve your content and avoid duplicate content issues?
Here are the main situations where duplicate content happens.
This is what you should avoid:
URL issues
Parameters like click tracking or analytics codes can lead to duplicate content issues. More generally, distinct URLs pointing to identical pages cause problems: Google regards the www and non-www, .com and .com/index.html, or http and https versions of a URL as different pages, even if their content is the same. They are therefore seen as duplicate content.
Example:
www.mywebsite.com/red-item?color=red
www.mywebsite.com/red-item
Printer-friendly
Printer-friendly versions of content can cause duplicate content issues when multiple versions of the pages get indexed.
Example:
www.mywebsite.com/red-item
www.mywebsite.com/print/red-item
Session IDs
This common issue happens when each user who visits a website is assigned a different session ID that is stored in the URL.
Example:
www.mywebsite.com/red-item?SESSID=142
www.mywebsite.com/red-item
Copied or syndicated information
If you republish an article, a quote or a comment from someone you admire, or simply use it to illustrate your own articles, it can be seen as duplicate content, even if you have linked back to the source website or URL.
Google will assign little value to these pieces of content, and this can lead to a drop in the perceived quality of your overall domain.
Duplicate product information
If you own an ecommerce website, you have probably encountered this problem. It occurs when you use manufacturers' item descriptions, hosted on their websites, to describe your products.
The problem is that these manufacturers may sell this product to many different sellers and thus the description is appearing on many different websites. This is just pure duplicate content.
Sorting and multi-pages lists
An ecommerce website like Amazon offers sorting and filtering options that generate distinct URLs. Most categories contain a large number of product pages whose order changes depending on how the list is sorted.
For example, if you order 30 items by price or alphabetically, you end up with two pages with the same content but different URLs.
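One common way to handle this, sketched here with hypothetical URLs, is to add a rel=canonical tag on each sorted version pointing back to the default list, so that only one version is indexed:

<!-- In the <head> of www.mywebsite.com/items?order=price -->
<link rel="canonical" href="http://www.mywebsite.com/items" />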