Oncrawl’s Management of pages with content duplication issues chart lets you find groups of similar content across your website, analyse their uniqueness ratio, evaluate your current strategy for managing duplicate content, and determine a plan of action to improve your editorial quality.

How does Oncrawl decide if pages are similar?

Once Oncrawl is done crawling your website, each web page is split into smaller blocks of text.

What's a block?

To qualify as a block, text in an HTML node or its children must be at least 50 characters long. Anchor text is treated as its own block, regardless of text length.
For each block of text, we compute an occurrence ratio to let you understand how often a block of text appears across your website.
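
As a rough illustration of the occurrence ratio, the sketch below assumes each page has already been reduced to a list of text chunks (HTML parsing and the anchor-text rule are left out). It keeps chunks of at least 50 characters and reports the share of pages on which each block appears. The function names and the exact formula are assumptions made for this example; they are not taken from Oncrawl.

```python
from collections import Counter

MIN_BLOCK_LENGTH = 50  # minimum block length described above


def extract_blocks(chunks):
    """Keep only the text chunks that are long enough to qualify as blocks."""
    return [c.strip() for c in chunks if len(c.strip()) >= MIN_BLOCK_LENGTH]


def occurrence_ratios(pages):
    """Map each block to the fraction of pages that contain it.

    `pages` maps a URL to the list of text chunks found on that page
    (a hypothetical input format chosen for this sketch).
    """
    page_count = len(pages)
    counts = Counter()
    for chunks in pages.values():
        # Count each block once per page, even if it repeats on that page.
        counts.update(set(extract_blocks(chunks)))
    return {block: n / page_count for block, n in counts.items()}


footer = "Copyright Example Ltd. All rights reserved worldwide, 2024."
pages = {
    "/a": [footer, "A long, unique paragraph about red shoes and how to size them."],
    "/b": [footer, "Short text"],
}
ratios = occurrence_ratios(pages)
print(ratios[footer])  # the footer appears on both pages
```

In this toy data the footer is on every page, so its ratio is the highest possible, while "Short text" is too short to count as a block at all.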

Viewing the blocks on a page

If you're interested in visualizing these blocks, you can see them in the URL details for each individual page, under the Content section.

This lists blocks on the page by how many other pages on the website have the same block. Click on a percentage to view a cached version of the page with the corresponding blocks highlighted.

Calculating a similarity fingerprint

Oncrawl then uses these blocks, along with n-grams, HTML tags, and other indicators, to calculate each page's similarity fingerprint, or unique profile, using the SimHash algorithm. Google has described using the same algorithm to detect near-duplicate pages.
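
To give an idea of how a SimHash fingerprint behaves, here is a minimal sketch that uses only word n-grams as features. The 64-bit size, the MD5-based feature hash, the 3-word n-grams, and the function names are all assumptions for illustration; Oncrawl's actual features and weights are not described here.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint size


def shingles(text, n=3):
    """Split text into overlapping word n-grams (the features to hash)."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def simhash(text, n=3):
    """Compute a SimHash fingerprint from the text's word n-grams."""
    weights = [0] * FINGERPRINT_BITS
    for feature in shingles(text, n):
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


page = "Red running shoes on sale this week with free delivery"
print(hamming_distance(simhash(page), simhash(page)))  # identical text: 0
```

The useful property of SimHash is that pages sharing most of their features tend to produce fingerprints that differ in only a few bits, so similarity can be estimated by comparing fingerprints rather than full page text.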

What are clusters?

When each page's similarity fingerprint has been determined, Oncrawl then groups pages into page clusters based on similarity. A cluster is a group of pages whose similarity fingerprints are identical or nearly identical. This means that clusters are simply groups of similar pages.

Each cluster of similar pages appears as one rectangle in the Management of pages with content duplication issues chart. The size of the rectangle and the number in it indicate how many pages are included in the cluster.
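
One possible way to form such clusters is to link any two pages whose fingerprints differ by only a few bits, then collect the linked pages into groups. The sketch below does this with a small union-find structure. The threshold of 3 bits and the pairwise comparison are assumptions for illustration, not Oncrawl's documented method.

```python
def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def cluster_pages(fingerprints, max_distance=3):
    """Group URLs whose fingerprints are within `max_distance` bits.

    Pages connected through a chain of near matches end up in the
    same cluster. `max_distance` is a hypothetical threshold.
    """
    urls = list(fingerprints)
    parent = {url: url for url in urls}

    def find(url):
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path halving
            url = parent[url]
        return url

    # Compare every pair of pages (fine for a sketch, slow for large sites).
    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            if hamming_distance(fingerprints[first], fingerprints[second]) <= max_distance:
                parent[find(first)] = find(second)

    groups = {}
    for url in urls:
        groups.setdefault(find(url), []).append(url)
    # A page with no near match does not form a cluster.
    return [sorted(group) for group in groups.values() if len(group) > 1]


fingerprints = {
    "/shoes?color=red": 0b1111_0000_1111_0000,
    "/shoes?color=blue": 0b1111_0000_1111_0001,
    "/shoes?color=green": 0b1111_0000_1111_0011,
    "/about": 0b0000_1111_0000_1111,
}
clusters = cluster_pages(fingerprints)
print([len(cluster) for cluster in clusters])  # [3]
```

Here the three color variants form one cluster of 3 pages, which would correspond to one rectangle labeled 3, and "/about" belongs to no cluster.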

How does Oncrawl analyze clusters?

Oncrawl examines each cluster to determine whether or not there are signals in place that explain why the pages in the cluster are similar.

Oncrawl colors the rectangle green if there is a clear explanation. Otherwise, the rectangle remains red.

Canonical declarations

Oncrawl looks at each cluster's canonicalization: is a rel="canonical" link set? Is it the same for each page in the cluster?

If the answers are "yes", Oncrawl considers that the use of canonicals explains why these pages look so similar, and the rectangle for that cluster is green.

Hreflang declarations

Oncrawl also looks at the use of hreflang in each cluster: are the pages in the cluster translations or localizations of one another?

If the answer is "yes", Oncrawl considers that the use of hreflang explains why these pages look so similar, and the rectangle for that cluster is green.
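
The two checks above can be sketched as a single function that decides a cluster's color. The page dictionaries and their keys (url, canonical, hreflang) are hypothetical and chosen for this example only; the real evaluation may consider more signals, such as hreflang errors.

```python
def cluster_color(pages):
    """Return "green" if canonicals or hreflang explain the similarity.

    `pages` is a list of dicts with hypothetical keys:
      url       - the page's URL
      canonical - the declared canonical URL, or None if not set
      hreflang  - the set of URLs referenced through hreflang
    """
    canonicals = {page.get("canonical") for page in pages}
    # Green: every page declares a canonical, and it is the same one.
    if None not in canonicals and len(canonicals) == 1:
        return "green"

    urls = {page["url"] for page in pages}
    # Green: every page references all other pages in the cluster via hreflang.
    if all(urls - {page["url"]} <= set(page.get("hreflang", ())) for page in pages):
        return "green"

    return "red"


managed = [
    {"url": "/p?sort=asc", "canonical": "/p"},
    {"url": "/p?sort=desc", "canonical": "/p"},
]
conflict = [
    {"url": "/x", "canonical": "/x"},
    {"url": "/y", "canonical": "/y"},
]
translations = [
    {"url": "/en/", "canonical": "/en/", "hreflang": {"/fr/"}},
    {"url": "/fr/", "canonical": "/fr/", "hreflang": {"/en/"}},
]
print(cluster_color(managed), cluster_color(conflict), cluster_color(translations))
# green red green
```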

How to use this chart

This chart is designed to answer lots of questions in one place:

How bad is my duplicate content?
Is it spread all over my website or concentrated on just a few pages?
Is my unique content too thin, compared to templated content on the page?
Does my current duplicate content strategy (canonicalization and hreflang use) fit the types of duplication on my site?

The number of clusters, the size of the clusters, their degree of similarity, and the coverage of your current duplicate content strategy provide answers to these questions.

Use the cluster chart to concentrate on the aspects that are most impactful on your site.

Adjusting the type of cluster shown on the chart

Two sliders at the top of the cluster map let you narrow down which clusters it shows:

Cluster size: by default, all clusters are shown, from the smallest clusters at the left to the largest clusters at the right of the slider. Use the slider to adjust the view to concentrate on only small clusters, only large clusters, or any other range of cluster sizes.
Average cluster similarity: by default, all clusters are shown, no matter how similar the pages within them are. The degree of similarity is expressed as a percentage averaged across all of the pages in the cluster. You may want to focus first on clusters with a high degree of similarity. If so, use the slider to adjust the view.

The number at the top next to the sliders shows how many clusters are visible out of the total number of clusters found by Oncrawl.
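
If you work with cluster data outside the interface, the effect of the two sliders can be approximated with a simple range filter. This is only a sketch: the dictionary keys and the label format are assumptions, not an Oncrawl export format.

```python
def visible_clusters(clusters, size_range, similarity_range):
    """Keep clusters whose size and average similarity fall within both ranges."""
    min_size, max_size = size_range
    min_similarity, max_similarity = similarity_range
    return [
        cluster
        for cluster in clusters
        if min_size <= cluster["size"] <= max_size
        and min_similarity <= cluster["similarity"] <= max_similarity
    ]


# Hypothetical clusters: size in pages, similarity as an average percentage.
clusters = [
    {"id": 1, "size": 2, "similarity": 98.0},
    {"id": 2, "size": 150, "similarity": 91.5},
    {"id": 3, "size": 40, "similarity": 18.0},
]
shown = visible_clusters(clusters, size_range=(10, 200), similarity_range=(90, 100))
print(f"{len(shown)} / {len(clusters)} clusters")  # 1 / 3 clusters
```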

Reading information about a specific cluster

Hover over a cluster to find out how similar its pages are.

Clusters are grouped by size and by color.

Green clusters show managed duplicate content. They already indicate to Google that the pages within them are similar to one another. All pages within the cluster indicate the same canonical page, or they all reference each other through hreflang declarations.
Red clusters show problematic duplicate content. They might indicate to Google that the pages within them share content with other pages, but the pages in the cluster indicate different canonical pages (cluster canonical conflicts) or have a problem with their hreflang implementation (hreflang errors). Or they might not indicate to Google that their content is similar, even though it is (canonical not set).

Click on a cluster to go to the Data Explorer and view the pages in that cluster.

Exploring duplicate content in the Data Explorer

The Data Explorer includes many fields that relate to duplicate content:

Cluster ID: a unique ID for this group of pages assigned by Oncrawl
Has near-duplicate content: is true if the page's similarity fingerprint is close to that of another page
Has problematic near-duplicate content: is true if no canonical or hreflang strategy is correctly in place for this page and ones similar to it
Near-duplicate status: gives the status of the page's cluster in the Management of pages with content duplication issues chart
- Cluster canonical conflict: this page belongs to a cluster where different pages declare different (conflicting) canonical URLs.
- Managed with canonicals: this page belongs to a cluster that explains its similarity by indicating a single canonical URL for all pages in the cluster.
- No management strategy: this page belongs to a cluster that doesn't indicate a reason for its similarity through canonical or hreflang declarations.
- Managed with hreflangs: this page belongs to a cluster that explains its similarity through correct hreflang declarations.
- Hreflang errors: this page belongs to a cluster where hreflang declarations are used, but hreflang errors do not make it possible to explain the similarity.
- No duplication: this page does not belong to a cluster.
Cluster canonical evaluation: whether the canonicals declared in this page's cluster match, conflict with one another, or are missing on some pages.
Page canonical evaluation: whether this page's canonical declaration matches its URL, provides a different URL, or isn't present.
Content similarity ratio: the average similarity between the pages in this page's cluster (given as a percentage).
Similar pages: provides a quick link to pages that seem to be identical to this one.
Canonicals: lists all of the page's rel canonical declarations
Hreflang cluster ID: the unique ID assigned by Oncrawl to a group of pages that reference one another through hreflang declarations.
Hreflang error details and hreflang errors: explain errors in hreflang declarations involving this page.

You can also look at duplicate status and content for H1s, titles, and meta descriptions.

Best practices for duplicate content

Examine edge cases first: use the similarity slider to look at clusters with less than 20% similarity and at clusters with maximum similarity.
Hide green clusters to focus on duplicate content that isn't correctly handled by your duplicate content strategy.
Resolve giant clusters.
Pay special attention to clusters with canonical conflicts and hreflang errors.
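
These practices can be turned into a simple review order. The sketch below assumes a list of clusters with a size and a near-duplicate status (using the status names listed in the Data Explorer section). It drops managed clusters, puts canonical conflicts and hreflang errors first, and sorts the largest clusters to the top within each group. The data structure and the ordering rule are assumptions for illustration.

```python
MANAGED = {"Managed with canonicals", "Managed with hreflangs"}
URGENT = {"Cluster canonical conflict", "Hreflang errors"}


def review_order(clusters):
    """Hide managed (green) clusters and order the rest for review."""
    unmanaged = [c for c in clusters if c["status"] not in MANAGED]
    # Urgent statuses first (False sorts before True), then largest clusters.
    return sorted(unmanaged, key=lambda c: (c["status"] not in URGENT, -c["size"]))


# Hypothetical clusters for the example.
clusters = [
    {"id": "A", "size": 500, "status": "Managed with canonicals"},
    {"id": "B", "size": 300, "status": "No management strategy"},
    {"id": "C", "size": 20, "status": "Cluster canonical conflict"},
    {"id": "D", "size": 45, "status": "Hreflang errors"},
]
ordered = [cluster["id"] for cluster in review_order(clusters)]
print(ordered)  # ['D', 'C', 'B']
```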

Dealing with Duplicate Content: SEO Best Practices

Hreflangs and translated pages

Glossary of terms used in Oncrawl