OnCrawl’s Content Cluster Map lets you find blocks of similar content across your website, analyse their uniqueness ratio, evaluate your current strategy for managing duplicate content, and determine a plan of action to improve your editorial quality.
How we build the content cluster map
Once we are done crawling your website, each web page is split into smaller blocks of text.
What's a block?
- To qualify as a block, text in an HTML node or its children must be at least 50 characters long. Anchor nodes are treated as blocks, regardless of text length.
- For each block of text, we compute an occurrence ratio that shows how often the block appears across your website.
- Our Chrome extension displays these text blocks as an overlay on your page. The same data is also available in your Crawl Report, under the Content tab, and in the URL Details.
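To make the block rules concrete, here is a minimal Python sketch. It is not OnCrawl's implementation: the function names are hypothetical, the input is assumed to be (tag, text) pairs already extracted from each page's HTML, and the occurrence ratio is assumed to be the share of crawled pages on which a block appears.

```python
from collections import Counter

MIN_BLOCK_CHARS = 50  # threshold stated above


def qualifies_as_block(tag: str, text: str) -> bool:
    """Anchor nodes always qualify; other nodes need at least 50 characters."""
    return tag == "a" or len(text.strip()) >= MIN_BLOCK_CHARS


def occurrence_ratios(pages: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Map each block's text to the share of pages it appears on.

    `pages` maps a URL to the (tag, text) pairs found on that page.
    Assumption: ratio = pages containing the block / total pages.
    """
    page_counts = Counter()
    for nodes in pages.values():
        # Count a block once per page, however often it repeats there.
        blocks = {text.strip() for tag, text in nodes if qualifies_as_block(tag, text)}
        page_counts.update(blocks)
    total = len(pages)
    return {block: count / total for block, count in page_counts.items()}
```

A footer that appears on every page would get a ratio of 1.0, while a paragraph found on a single page of a 200-page site would get 0.005.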
We then use these blocks of text, n-grams, HTML tags, and other indicators to calculate each page's similarity fingerprint, or unique profile, using the SimHash algorithm. Google has published research on using SimHash to detect near-duplicate web pages.
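For readers who want to see the idea behind SimHash, the sketch below computes a fingerprint from word n-grams only. The 3-word shingles, the MD5 feature hash, and the 64-bit fingerprint size are all assumptions made for illustration; the features and weights OnCrawl actually uses are not described here.

```python
import hashlib
from collections import Counter


def shingles(text: str, n: int = 3) -> list[str]:
    """Split text into overlapping word n-grams (the features to hash)."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint: similar texts yield similar bit patterns."""
    votes = [0] * bits
    for feature, weight in Counter(shingles(text)).items():
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +weight or -weight on every bit position.
            votes[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Share of bit positions on which two fingerprints agree (0.0 to 1.0)."""
    distance = bin(a ^ b).count("1")  # Hamming distance
    return 1 - distance / bits
```

Because each fingerprint is a single integer, two pages can be compared by counting differing bits instead of comparing their full text.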
We group pages into page clusters based on similarity. Each cluster of similar pages appears as one unit in the cluster map. The size of the unit and the number in it indicate how many pages are included in the cluster.
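One way to picture the grouping step is shown below. This is a sketch under stated assumptions, not a description of OnCrawl's clustering: it treats two pages as similar when their fingerprints differ by at most `max_distance` bits (a hypothetical threshold), merges similar pages transitively with a union-find structure, and ignores pages that are similar to no other page.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions on which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster_pages(fingerprints: dict[str, int], max_distance: int = 3) -> list[list[str]]:
    """Group URLs whose fingerprints are within `max_distance` bits of each other."""
    parent = {url: url for url in fingerprints}

    def find(url: str) -> str:
        # Follow parent links to the cluster's representative URL.
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path compression
            url = parent[url]
        return url

    # Pairwise comparison is quadratic; fine for a sketch, slow on large sites.
    for a, b in combinations(fingerprints, 2):
        if hamming(fingerprints[a], fingerprints[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for url in fingerprints:
        groups.setdefault(find(url), []).append(url)
    # Keep only real clusters (2+ pages), largest first.
    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    return sorted(clusters, key=len, reverse=True)


pages = {"/a": 0b1111, "/b": 0b1110, "/c": 0b1100, "/d": 0xFFFF0000}
print(cluster_pages(pages, max_distance=1))  # [['/a', '/b', '/c']]
```

In this example, `/a` and `/c` differ by two bits but still end up together because each is within one bit of `/b`. The length of each returned list corresponds to the number shown in a unit on the cluster map.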
We analyze each cluster's canonicalization: is a rel="canonical" link set? Is it the same for each page in the cluster? Units are coded by color based on this analysis.
How to use the content cluster map
The cluster map is designed to give you one chart that helps to answer questions such as:
- How bad is my duplicate content?
- Is it spread all over my website or concentrated on just a few pages?
- Is my content too thin if we only count unique blocks?
- Does my current duplicate content strategy fit the types of duplication on my site?
The number of clusters, the size of the clusters, their degree of similarity, and the coverage of your current canonicalization strategy provide answers to these questions.
Use the cluster map to concentrate on the aspects that are most impactful on your site.
Two sliders at the top of the cluster map allow you to focus the cluster map:
- Cluster size: by default, all clusters are shown, from the smallest clusters at the left to the largest clusters at the right of the slider. Use the slider to adjust the view to concentrate on only small clusters, only large clusters, or any other range of cluster sizes.
- Average cluster similarity: by default, all clusters are shown, no matter how similar the pages within them are. The degree of similarity is expressed as a percentage averaged across all of the pages in the cluster. You may want to focus first on clusters with a high degree of similarity. If so, use the slider to adjust the view.
The number at the top right shows how many clusters are visible out of the total number of clusters found by OnCrawl.
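If you export cluster data and want to reproduce the effect of the two sliders yourself, the logic amounts to a range filter on size and on average similarity. The `Cluster` class and its field names below are hypothetical and are not part of any OnCrawl export format.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    size: int              # number of pages in the cluster
    avg_similarity: float  # percentage, 0 to 100


def visible_clusters(clusters, size_range, similarity_range):
    """Keep clusters whose size and average similarity fall inside both ranges."""
    min_size, max_size = size_range
    min_sim, max_sim = similarity_range
    return [
        c for c in clusters
        if min_size <= c.size <= max_size and min_sim <= c.avg_similarity <= max_sim
    ]


clusters = [Cluster(2, 98.0), Cluster(40, 65.5), Cluster(350, 91.0)]
shown = visible_clusters(clusters, size_range=(10, 500), similarity_range=(90, 100))
print(f"{len(shown)} / {len(clusters)} clusters")  # 1 / 3 clusters
```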
Hover over a cluster to find out how similar its pages are.
Clusters are grouped by size and by color.
- Green clusters already indicate to Google that the pages within them are similar to one another. All pages within the cluster indicate the same canonical page.
- Orange clusters indicate to Google that the pages within them share content with other pages. However, the pages in the cluster indicate different canonical pages.
- Red clusters don't indicate to Google that their content is similar.
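The color rules can be summarized in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, and it assumes that a cluster in which no page declares a canonical is red, and that a cluster where only some pages declare a canonical is treated as orange. The article does not specify how such mixed clusters are colored.

```python
from typing import Optional


def cluster_color(canonicals: list[Optional[str]]) -> str:
    """Classify a cluster from the rel="canonical" URL of each of its pages.

    `canonicals` holds one entry per page; None means no canonical is set.
    """
    declared = {c for c in canonicals if c}
    if all(canonicals) and len(declared) == 1:
        return "green"   # every page points to the same canonical page
    if declared:
        return "orange"  # canonicals are set but disagree (or are partly missing: assumption)
    return "red"         # no page declares a canonical (assumption)


print(cluster_color(["/shoes", "/shoes", "/shoes"]))  # green
print(cluster_color(["/shoes", "/shoes?page=2"]))     # orange
print(cluster_color([None, None]))                    # red
```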
Click on a cluster to go to the Data Explorer and view the pages in that cluster.
- Examine edge cases first: use the similarity slider to look at clusters with less than 20% similarity and with maximum similarity.
- Hide green clusters to focus on duplicate content that isn't correctly handled by your canonical strategy.
- Resolve giant clusters.
- Pay special attention to orange clusters.
Use our Chrome extension to view a page's blocks.
If you still have questions about using the duplicate content cluster map, drop us a line at @oncrawl_cs or click on the Intercom button at the bottom right of your screen to start a chat with us.
You can also find this article by searching for:
contenu dupliqué, near-duplicates, contenido duplicado