How does Oncrawl’s Duplicate Content Cluster Map work?
How we detect duplicate blocks of text inside your pages and across your site, what we show in the cluster map, and how to read it.
Written by Rebecca Berbel
Updated over a week ago

Oncrawl’s Content Cluster Map lets you find blocks of similar content across your website, analyze their uniqueness ratio, evaluate your current strategy for managing duplicate content, and decide on a plan of action to improve your editorial quality.

How we build the content cluster map

Once we are done crawling your website, each web page is split into smaller blocks of text.

What's a block?

  • To qualify as a block, text in an HTML node or its children must be at least 50 characters long. Anchor nodes are treated as blocks, regardless of text length.

  • For each block of text, we compute an occurrence ratio to let you understand how often a block of text appears across your website.

  • These text blocks can be displayed on top of your page through our Chrome extension. The same data is also available in your Crawl Report under the Content tab and in the URL Details.
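To make the block rules above concrete, here is a minimal sketch in Python. It is not Oncrawl's implementation: the function names, the input format (a list of tag and text pairs per URL), and the definition of the occurrence ratio as "share of crawled pages that contain the block" are all assumptions made for illustration.

```python
from collections import Counter

MIN_BLOCK_CHARS = 50  # threshold stated in the article


def is_block(tag, text):
    """Anchor nodes always qualify; other nodes need at least 50 characters."""
    return tag == "a" or len(text.strip()) >= MIN_BLOCK_CHARS


def occurrence_ratios(pages):
    """Return {block text: share of pages containing it}.

    `pages` maps a URL to a list of (tag, text) pairs (hypothetical format).
    Assumption: the ratio is the fraction of pages on which the block appears.
    """
    counts = Counter()
    for nodes in pages.values():
        # Use a set so a block repeated on one page is counted once for that page.
        blocks = {text.strip() for tag, text in nodes if is_block(tag, text)}
        counts.update(blocks)
    return {block: n / len(pages) for block, n in counts.items()}
```

With this definition, a footer that appears on every page gets a ratio of 1.0, while a paragraph found on a single page out of four gets 0.25.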

We then use these blocks of text, n-grams, HTML tags, and other indicators to calculate each page's similarity fingerprint, or unique profile, using the SimHash algorithm. Google has described using the same algorithm to detect near-duplicate pages when crawling the web.
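For readers curious about how a SimHash fingerprint works in general, the sketch below shows the textbook version of the algorithm. It is an illustration only: the choice of MD5 as the feature hash, word 3-grams as the only features, equal feature weights, and a 64-bit fingerprint are assumptions, not a description of Oncrawl's feature set or weighting.

```python
import hashlib

BITS = 64  # fingerprint width (assumed)


def word_ngrams(text, n=3):
    """Split text into overlapping word n-grams (one simple kind of feature)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))]


def simhash(features):
    """Textbook SimHash: every feature votes +1 or -1 on each fingerprint bit."""
    votes = [0] * BITS
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # keep 64 bits of the hash
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(fp_a, fp_b):
    """Share of identical bits between two fingerprints (1.0 = identical)."""
    return 1 - bin(fp_a ^ fp_b).count("1") / BITS
```

Pages that share most of their features end up with fingerprints that differ in only a few bits, which is why comparing fingerprints is a fast way to estimate how similar two pages are.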

We group pages into page clusters based on similarity. Each cluster of similar pages appears as one unit in the cluster map. The size of the unit and the number in it indicate how many pages are included in the cluster.
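One way to picture the grouping step is the sketch below, which links any two pages whose fingerprints are close enough and then collects the linked pages into clusters. The `min_similarity` threshold, the pairwise comparison, and the choice to drop single-page groups are assumptions for illustration; the article does not say how Oncrawl forms its clusters internally.

```python
from itertools import combinations

BITS = 64  # fingerprint width (assumed)


def hamming_similarity(fp_a, fp_b):
    """Share of identical bits between two integer fingerprints."""
    return 1 - bin(fp_a ^ fp_b).count("1") / BITS


def cluster_pages(fingerprints, min_similarity=0.9):
    """Group URLs whose fingerprints are at least `min_similarity` alike.

    `fingerprints` maps URL -> integer fingerprint (hypothetical input).
    Uses a small union-find so that chains of similar pages share a cluster.
    """
    parent = {url: url for url in fingerprints}

    def find(url):
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path compression
            url = parent[url]
        return url

    for a, b in combinations(fingerprints, 2):
        if hamming_similarity(fingerprints[a], fingerprints[b]) >= min_similarity:
            parent[find(a)] = find(b)

    clusters = {}
    for url in fingerprints:
        clusters.setdefault(find(url), []).append(url)
    # Assumption: a page with no similar neighbour does not form a cluster.
    return [sorted(urls) for urls in clusters.values() if len(urls) > 1]
```

The length of each returned list corresponds to the number shown inside a unit on the cluster map. Comparing every pair of pages is fine for a sketch but would be too slow for a large crawl.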

We analyze each cluster's canonicalization: is a rel="canonical" link set? Is it the same for each page in the cluster?

And we look at the use of hreflang in each cluster: are the pages in the cluster translations or localizations of one another?

Units are coded by color based on these analyses.

How to use the content cluster map

The cluster map is designed to give you one chart that helps to answer questions such as:

  • How bad is my duplicate content?

  • Is it spread all over my website or concentrated on just a few pages? 

  • Is my content too thin if we only count unique blocks?

  • Does my current duplicate content strategy (canonicalization and hreflang use) fit the types of duplication on my site?

The number of clusters, the size of the clusters, their degree of similarity, and the coverage of your current duplicate content strategy provide answers to these questions.

Use the cluster map to concentrate on the aspects that are most impactful on your site.

Two sliders at the top of the cluster map allow you to focus the cluster map:

  • Cluster size: by default, all clusters are shown, from the smallest clusters at the left to the largest clusters at the right of the slider. Use the slider to adjust the view to concentrate on only small clusters, only large clusters, or any other range of cluster sizes.

  • Average cluster similarity: by default, all clusters are shown, no matter how similar the pages within them are. The degree of similarity is expressed as a percentage averaged across all of the pages in the cluster. You may want to focus first on clusters with a high degree of similarity. If so, use the slider to narrow the view to that range.

The number at the top next to the sliders shows how many clusters are visible out of the total number of clusters found by Oncrawl.
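If you export cluster data and want to reproduce the effect of the two sliders outside the interface, the logic can be sketched as follows. The field names `size` and `avg_similarity` are hypothetical and do not refer to actual Oncrawl export columns.

```python
def visible_clusters(clusters, size_range, similarity_range):
    """Keep clusters inside both slider ranges and report 'shown / total'.

    `clusters` is a list of dicts with hypothetical keys:
      'size'            number of pages in the cluster
      'avg_similarity'  average similarity as a percentage (0-100)
    """
    min_size, max_size = size_range
    min_sim, max_sim = similarity_range
    shown = [
        c for c in clusters
        if min_size <= c["size"] <= max_size
        and min_sim <= c["avg_similarity"] <= max_sim
    ]
    return shown, f"{len(shown)} / {len(clusters)}"
```

For example, passing `size_range=(10, 500)` and `similarity_range=(90, 100)` keeps only the larger clusters whose pages are nearly identical.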

Hover over a cluster to find out how similar its pages are.

Clusters are grouped by size and by color.

  • Green clusters show managed duplicate content. They already indicate to Google that the pages within them are similar to one another. All pages within the cluster indicate the same canonical page, or they all reference each other through hreflang declarations.

  • Red clusters show problematic duplicate content. They might indicate to Google that the pages within them share content with other pages, but the pages in the cluster indicate different canonical pages (cluster canonical conflicts) or have a problem with their hreflang implementation (hreflang errors). Or they might not indicate to Google that their content is similar, even though it is (canonical not set).
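The color rules above can be summarized as a small decision function. This is a sketch under assumptions: the page format (`url`, `canonical`, `hreflang`) is hypothetical, the order in which the checks are applied is a guess, and a cluster where only some pages declare a canonical is treated here as a canonical conflict.

```python
def classify_cluster(pages):
    """Return (color, reason) for one cluster of similar pages.

    Each page is a dict with hypothetical keys:
      'url'        the page URL
      'canonical'  declared canonical URL, or None if not set
      'hreflang'   set of URLs referenced through hreflang (may be empty)
    """
    urls = {page["url"] for page in pages}
    canonicals = {page.get("canonical") for page in pages}

    # Managed: every page declares the same canonical page.
    if None not in canonicals and len(canonicals) == 1:
        return ("green", "same canonical")

    # hreflang is in use: every page must reference all other pages.
    if any(page.get("hreflang") for page in pages):
        complete = all(
            urls - {page["url"]} <= set(page.get("hreflang") or ())
            for page in pages
        )
        return ("green", "hreflang") if complete else ("red", "hreflang errors")

    # Nothing tells search engines that these pages are similar.
    if canonicals == {None}:
        return ("red", "canonical not set")

    return ("red", "canonical conflicts")
```

A cluster of two translations that reference each other through hreflang comes back as green, while the same cluster with one missing return reference comes back as red with the reason "hreflang errors".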

Click on a cluster to go to the Data Explorer and view the pages in that cluster.

Best practices

  • Examine edge cases first: use the similarity slider to look at clusters with less than 20% similarity and with maximum similarity.

  • Hide green clusters to focus on duplicate content that isn't correctly handled by your duplicate content strategy.

  • Resolve giant clusters.

  • Pay special attention to clusters with canonical conflicts and hreflang errors.

Going further

Use our Chrome extension to view a page's blocks.

If you still have questions about using the duplicate content cluster map, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.

Happy crawling!

You can also find this article by searching for:
contenu dupliqué, near-duplicates, contenido duplicado, pages similaires
