What is Duplicate Content?
Duplicate content is content that appears on two or more unique URLs. "Same content" is defined as blocks of content that are "appreciably similar," which can range from exact copies to content that contains chunks of copied text.
Duplicate content can refer to content published all on one domain or across multiple domains. So you could have accidentally duplicated pages on your own site, or someone could take content from your page and publish it as their own. Or both.
Now, "appreciably similar" can be a bit tricky to define. Luckily, we can look to Google’s Search Quality Evaluator Guidelines to see how they view copied content.
Google has 3 things it considers "duplicate content":
Content copied from an identifiable source. The easiest type of duplicate content to catch, this is text that’s just copied and pasted, word for word, from one page to another.
Content that’s changed slightly from the original. A bit harder to catch, this content has been slightly rewritten from the original, usually by using a "find and replace" function for individual words or even whole sentences. Google refers to this as content “copied with minimal alteration”.
Content copied from a source that has changed, or changes frequently. Again, a bit harder for you or Google to catch, this content is copied from a page that updates or contains dynamic content. Think of a news site or a Wikipedia page.
Google considers copied content that doesn’t provide some sort of added value for users to merit its Lowest page quality rating.
Why Does Google Care About Duplicate Content?
Google has issues with duplicate content for 3 main reasons:
It can be hard to figure out which page is the original.
They don’t want to show content more than once in the search results.
It can confuse them when trying to follow links or crawl and index URLs.
Think about it from the perspective of Google and its users. When you’re searching for something on Google, if they show 3 or 4 different pages all hosting the same article, that’s pretty frustrating for you.
That’s why Google will show its famous "we have omitted some entries very similar to the 150 already displayed."
How to Curb Plagiarized Content
You can’t stop people from scraping your content, but there are steps you can take to tell Google that you’re the original source.
If you are concerned with people copying your content, you can sign your name into a piece of content using structured data markup (or Schema) since rel=author is no longer supported. You can also verify your Google My Business listing and link it to your website.
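Since rel=author is no longer supported, one common approach is JSON-LD Article markup with an author property. A minimal sketch (the names, dates, and URLs below are placeholders, not a required format):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "What is Duplicate Content?",
  "author": {
    "@type": "Person",
    "name": "Jane Author",
    "url": "https://example.com/about"
  },
  "datePublished": "2021-06-01",
  "mainEntityOfPage": "https://example.com/duplicate-content"
}
</script>
```

This goes in the page’s head (or anywhere in the HTML) and gives Google a machine-readable statement of who wrote the piece and where it was first published.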
The content can become more visible in search results, and research by Catalyst Search Marketing shows that rich snippets increase CTR by up to 150%.
It is also a good way to make your content more authoritative and trustworthy.
Accidental Duplicate Content
Content deliberately copied and/or scraped from other websites is the most obvious instance of duplicate content. However, there are other ways you could wind up accidentally duplicating your own content across your site.
This duplicate content is often caused by problems in setting up a website’s content management or ecommerce platform.
a) Ecommerce product pages

People running online retail shops tend to use the product descriptions provided by the manufacturer. At first glance this makes sense for ecommerce retailers: it saves a ton of time, and manufacturers know their products best. However, there are an estimated 2-3 million ecommerce companies around the world.
That’s a lot of websites all thinking the same way about reusing manufacturer product descriptions.
Your ecommerce platform can also cause duplicate content due to products with different color and size options. Some platforms will use the same descriptions but create different pages for each variation of the product.
b) Paginated content

Content that is continued onto a second page can cause Google to see content that is "appreciably similar". Paginated content will often use the same title tags and meta descriptions for each page. While this makes perfect sense from a user perspective, from Google’s point of view, that’s 2 or more pages that look the same.
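One common mitigation is to give each page in the series its own title and a self-referencing canonical tag, so each page is treated as distinct on purpose. A hypothetical head snippet for page 2 of a series (the URLs are placeholders):

```html
<!-- Page 2 of a paginated article -->
<title>How to Fix Duplicate Content – Page 2</title>
<meta name="description" content="Part 2: fixing duplicate content caused by pagination.">
<link rel="canonical" href="https://example.com/fix-duplicate-content?page=2">
```

Google has said it no longer uses rel="prev"/"next" as an indexing signal, so distinct titles, descriptions, and canonicals do most of the work here.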
c) WWW resolve
Also referred to as a preferred domain, the WWW resolve is when a website redirects visitors from the non-WWW version to the WWW version of the domain or vice versa.
The WWW resolve matters because search engines don’t necessarily know that the two versions of the URL are the same website, so they wind up seeing copies of the pages at two unique URLs.
This can also result in losing link juice, because not everyone will link to the same version of your URLs.
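On an Apache server, the WWW resolve is typically enforced with a site-wide 301 redirect. A minimal .htaccess sketch, assuming you prefer the WWW version (swap the hosts to go the other way):

```apache
# Hypothetical rule: 301-redirect all non-WWW requests to the WWW host
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]
```

The permanent (301) redirect tells search engines the two hosts are one site, and it consolidates links pointing at either version.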
Duplicate content and social media
The activity on social media can affect your SEO. It’s common practice to post the same content on different social platforms. But does this count as duplicate content?
There are very diverse views on this issue. One school of thought maintains that cross-posting is a bad idea, since the audiences on different platforms may have different interests.
A second school of thought is that sharing the same content on different platforms helps you reach wider audiences. Another point is that content has a longer "shelf life" on some platforms (like LinkedIn) than on others (Twitter for instance). This helps make your content more findable for your audience.
It’s probably ok for you to post the same content across multiple social media channels.
The fact that most social media platforms use nofollow links means that Google isn’t too concerned with people trying to manipulate PageRank this way.
If you’re still concerned about posting the same content on different social platforms, you may want to get a bit creative and repackage your content. You may post an article on Facebook, an image or video representing it on Instagram, and an infographic containing the same information on Twitter.
Usually, different content types work differently on different social platforms. Over time, you will be able to tell what content works best for what platform.
Is there a Duplicate Content Penalty?
There is no duplicate content penalty applied by Google, at least not in the way people typically understand a Google penalty. Instead, Google simply chooses not to rank pages it detects are copied from other places, showing just the original version of the content instead.
Google understands that, in a way, nothing is 100% unique. Matt Cutts mentioned that Google knows somewhere around 25 to 30% of the web is duplicate content. It’s simply not feasible for every site on the internet to publish nothing but 100% unique content.
However, it’s one of the most common and enduring myths in SEO that if your site has duplicate content, you will be automatically penalized by Google.
This myth stems from the first days of Google’s Panda filter. When Panda first went live, a lot of sites were relying on copied and short content. This caused a lot of websites, including some really big, well-known brands, to lose a huge amount of traffic in a very dramatic fashion.
These sites relied heavily on duplicated and/or "spun" (slightly rewritten) content to build huge websites with lots of pages. Other sites would use these "content farms" as a way to promote their articles and build some links.
You can ignore scrapers
Scrapers are sites that literally copy and paste another site’s content, including links. Google tends to see these scrapers as irrelevant, understands you don’t control them and therefore won’t hold it against you. Google’s Penguin 4.0 algorithm also generally ignores the low-value links from these sites.
Therefore, you don’t need to spend all your time tracking known scrapers on Google Search Console.
You should still care about duplicate content
Even though Google doesn’t penalize instances of duplicate content on your site, you can’t quite sit back and totally relax.
Let’s look at why you need to deal with issues brought about by duplicate content:
a) Duplicate content dilutes the benefits of link building
Having the same content available on multiple URLs disperses potential link juice instead of concentrating it in one place. People who want to share your content aren’t going to go looking for the original version, they’re going to link to the one they found.
So instead of having 1 page with lots of links, you can have lots of pages with only 1 or 2 links, and go from ranking at the top of Google search results to being buried on page 20.
The other potential issue is that even if a version of your content manages to rank well, it might not be the best page. All the visitors in the world aren’t worth much if they’re visiting a page that doesn’t convert.
b) Duplicate content discourages regular crawling
When you have duplicate content, search engine bots "waste" their resources crawling the same content. Those resources could have been used to crawl other pages instead. Google doesn’t want to waste its resources like this so it decides your website doesn’t need to be crawled very often.
What does that mean for your site’s SEO?
It’s going to make it harder for your new pages to appear in search results. If a page isn’t crawled, it isn’t indexed, and it might never be found despite whatever optimization efforts you’ve put in. It’s hard to be successful at marketing your business if it takes Google a week (or longer) to find your new page.
How to Identify Duplicate Content Issues
For you to fix a problem, you need to identify it. Let’s dive into how to identify duplicate content.
1. Use Google Search Console
To do this, log in to Google Search Console. Since the report we’re going to be looking at isn’t available in the new Search Console, you’ll click the "go back to the old version" link at the bottom of the sidebar navigation.
Then click on "Search Appearance" and then "HTML Improvements".
This report details instances of duplicate title tags and meta descriptions. Since titles and meta descriptions are supposed to summarize the page’s content, identical HTML tags could be interpreted as duplicate content by Google.
Something else you need to look out for is crawler metrics. You will find these under the "Crawl" option on your dashboard. Choose "Crawl Stats" from the expanded menu. These show you the number of pages crawled on your site. If far more pages are being crawled than your site actually has, that suggests some duplicate content is being crawled over and over.
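If you export URL/title pairs from a crawl or from this report, spotting duplicate titles is a simple grouping exercise. A rough Python sketch (the URLs and titles below are made up for illustration):

```python
from collections import defaultdict

def find_duplicate_titles(pages):
    """Group URLs that share an identical <title>.

    `pages` maps URL -> title text (e.g. exported from a crawl);
    any grouping with 2+ URLs is a duplication flag.
    """
    by_title = defaultdict(list)
    for url, title in pages.items():
        by_title[title.strip().lower()].append(url)
    return {t: urls for t, urls in by_title.items() if len(urls) > 1}

pages = {
    "https://example.com/shirt?color=red":  "Cotton Shirt | Example Shop",
    "https://example.com/shirt?color=blue": "Cotton Shirt | Example Shop",
    "https://example.com/about":            "About Us | Example Shop",
}
# the two color-variant URLs share one title, so they get flagged together
print(find_duplicate_titles(pages))
```

The same grouping works for meta descriptions; only the exported field changes.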
2. Use Crawler Tools
Depending on the crawler, you’ll be able to detect:
- Duplicate titles
- Duplicate descriptions
- Duplicate main body text
With Site Crawl, you’ll find all 3 indications of content duplicated across your domain.
All crawlers are limited to just your domain, meaning they won’t find places where other people have copied your content, or times when you’ve (accidentally, of course) published something a bit too similar to someone else.
For that, you’ll need a plagiarism detection tool. One such tool is Copyscape. It will show you the exact source of the duplicate content and tell you whether it is internal or not. It’s easier to fix if the duplicate content is internal.
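Google doesn’t publish how it measures "appreciably similar," but you can approximate the idea yourself with word-shingle overlap. A rough Python sketch, not Copyscape’s or Google’s actual method:

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word windows) from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "duplicate content is content that appears on more than one url"
copied   = "duplicate content is content that appears on more than one page"
print(round(similarity(original, copied), 2))  # → 0.8
```

A "find and replace" rewrite changes only a few shingles, so its score stays close to 1.0, while genuinely different pages score near 0.0.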
3. Manual Google search
If you don’t have access to a crawler or Google Search Console (in which case we recommend SEOPrompts to help get you set up), you can use manual Google searches to find duplicate pages.
Use the site: search operator, specifying your domain and any keyword or chunk of text you want to find. Something like
site:example.com string of text I think might be duplicated.
The more specific you are (up to 32 words, the limit Google places on search queries), the more accurate your results.
Duplicate Content and Thin Content
It’s impossible to discuss duplicate content without also covering the issue of thin content. These 2 issues are closely linked, since they’re both part of how Google views a website’s quality, and both were dealt with by the Panda update.
Take category pages, for instance product category pages: you may not describe each product fully enough to give the page sufficient content to be indexed on its own. Google may see such pages as having thin content, which can cause issues with rankings.
There’s no official word count to define thin content. Some in the SEO industry advise that you need at least 250 words per page to avoid thin content issues, and studies have shown top Google search results average more than 1,000 words per article.
You can’t take this as gospel though, as some pages with fewer than 100 words take the top spot, outranking content that’s 10 times longer.
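If you still want a quick quantitative screen, a word-count check is easy to script. A minimal Python sketch; the 250-word threshold is just the rule of thumb mentioned above, not an official limit:

```python
def flag_thin(pages, min_words=250):
    """Return URLs whose body text falls under a chosen word threshold.

    `pages` maps URL -> extracted body text; the 250-word default is a
    common SEO rule of thumb, not anything Google has published.
    """
    return [url for url, text in pages.items() if len(text.split()) < min_words]

pages = {
    "https://example.com/red-shirt": "Red shirt. One size. Nice.",
    "https://example.com/guide":     "word " * 400,  # stand-in for a long article
}
print(flag_thin(pages))  # flags only the short product page
```

Treat the flagged list as a starting point for the qualitative questions below, not as a verdict.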
The real test you should apply when looking for thin content is to ask yourself "does this cover what is needed?" and "does this meet the needs of users?".
If you answer yes to those questions, you probably don’t have much to worry about.