Duplicate Content
Sunday, 14 January 2007

– One more time

Why do engines care? - To make search results more relevant to the user, search engines use a filter that removes duplicate content pages from the results. Another reason is that they don’t want to spend resources indexing pages that are substantially similar.

That said, there still seems to be some confusion in the SEO world over ‘duplicate content’ and how search engines treat it. Right away I would like to say: RELAX. If you are doing sneaky things like filling up a site with dodgy content that YOU KNOW is duplicate, then worry. Most people who may have duplicate content issues are honest website owners and aren’t at risk of any penalization.

Is it really a Penalty?

That question is the crux of the main misconception. Generally speaking, it is a ‘duplicate content filter’ and not a ‘penalty’ per se, which is what many in the SEO world seem to agonize over. At various points in the indexation and retrieval process, documents (web pages) are scored, and those judged to be duplicates are ultimately removed from the results for a given query.
Certainly, if a site satisfies other ‘spam’ factors or consists entirely of duplicated (aggregated) content, the filtering can migrate into a penalty. As I said, the average website owner shouldn’t have to worry about penalties; most who migrate onto this ground know the path they are on. Penalties can also be incurred for what is known as a "mirror": a site that more or less substantially duplicates another single site.


How does it work?

When a search engine robot crawls the web, it reads the pages and stores the information in its database. At various stages of the indexing and retrieval process, the engine checks each document against the existing index(es) for potential duplication issues. The document is scored on a variety of factors including descriptions, authority, document age, content structuring (phrase scoring) and more.

For example, when Google uses phrases to detect duplication, one method is outlined below:

“For a given document, each sentence of the document has a count of how many related phrases are present in the sentence. The sentences of document can be ranked by this count, and a number of the top ranking sentences (e.g., five sentences) are selected to form a document description.
This description is then stored in association with the document. During indexing, a newly crawled document is processed in the same manner to generate the document description. The new document description can be matched (e.g.,hashed) against previous document descriptions, and if a match is found, then the new document is a duplicate. Similarly, during preparation of the results of a search query, the documents in the search result set can be processed to eliminate duplicates.”
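
To make the quoted idea a little more concrete, here is a minimal Python sketch of it. It is only an illustration under some loud assumptions, and certainly not Google's actual implementation: the sentence splitter is naive, "related phrases" are matched by plain substring search, and the names `document_description`, `description_hash` and `DuplicateIndex` are made up for this example.

```python
import hashlib
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def document_description(text, related_phrases, top_n=5):
    """Pick the top_n sentences containing the most related phrases."""
    phrases = [p.lower() for p in related_phrases]
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        lowered = sentence.lower()
        count = sum(1 for phrase in phrases if phrase in lowered)
        scored.append((count, position, sentence))
    # Highest phrase count first; ties keep their original document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return " ".join(sentence for _, _, sentence in scored[:top_n])


def description_hash(description):
    # Normalize case and whitespace before hashing so trivial edits still match.
    normalized = " ".join(description.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Hypothetical index that remembers one description hash per document."""

    def __init__(self, related_phrases, top_n=5):
        self.related_phrases = related_phrases
        self.top_n = top_n
        self.seen = {}  # description hash -> URL of the first document seen

    def add(self, url, text):
        """Return the URL of an earlier duplicate, or None if the document is new."""
        description = document_description(text, self.related_phrases, self.top_n)
        digest = description_hash(description)
        if digest in self.seen:
            return self.seen[digest]
        self.seen[digest] = url
        return None
```

Feeding the same article in under two different URLs makes the second `add` call return the first URL, which is the "if a match is found, then the new document is a duplicate" step from the quote.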

When a user (searcher) queries the index, the engine attempts to filter out any remaining duplication and serve up the document it feels is the best resource/authority for the submitted query.

Types of Dupes and what to do

There are a variety of ways the average website can run into duplicate content filtering problems without even knowing it.

Here are some common ones:

Websites with Identical Pages – Sometimes a company/individual will actually try to compete with themselves by creating other versions of their site on a different domain name. Not a good idea. Affiliate sites with the same look and feel that contain identical content are certainly not a good idea either. Whether it’s one site or many, create unique content throughout.

Scraped Content – This is content taken ‘verbatim’ from another site. This is obviously not a good idea.

E-Commerce Product Descriptions – Another common problem is ecommerce sites that use the same product descriptions as their manufacturer’s site. Once again: unique content. Write your own product descriptions. Also, if you have product pages with nothing substantially different from other pages, then add fresh content.

Distribution of Articles – Do not publish on your own site the articles you are distributing elsewhere. Some people will say to let the SEs index it first and then distribute it; this will not work. If you have 2 articles, put one on your site and the other into circulation.

Mirror sites: 1 website, 2 domains – If you are trying to utilize 2 domains, simply forward the secondary domain to the primary at the registrar level. Do not build 2 sites that are identical.

Home Page URLs – Having multiple home page naming conventions, with back links pointing to those multiple versions of the root URL. The best way to tackle these is via 301 redirects to a single version.

Here are some examples of what I mean:

http://www.example.com
https://www.example.com
http://www.example.com/index.htm (asp/php)
https://www.example.com/index.htm
http://example.com
https://example.com
http://example.com/index.htm
https://example.com/index.htm
http://www.example.com/home/
https://example.com/home/
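
As a rough illustration of the 301 approach, the sketch below decides where each of those home page variants should be redirected. It is only a sketch: picking `https://www.example.com/` as the one canonical home page is my assumption (pick whichever version you prefer, then stick to it), and `HOME_ALIASES` and `redirect_target` are hypothetical names. In practice you would usually configure this in your web server rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for this example: the one canonical form we want indexed.
CANONICAL_SCHEME = "https"
CANONICAL_HOST = "www.example.com"
SITE_HOSTS = {"example.com", "www.example.com"}

# Paths that all mean "the home page" (taken from the variants listed above).
HOME_ALIASES = {
    "/", "/index.htm", "/index.html", "/index.asp", "/index.php",
    "/home", "/home/",
}


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in SITE_HOSTS:
        return None  # not one of our domains, leave it alone

    # An empty path is requested as "/" over HTTP, so treat the two alike.
    requested_path = parts.path or "/"
    path = "/" if requested_path.lower() in HOME_ALIASES else requested_path

    canonical = urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, parts.query, ""))
    requested = urlunsplit((parts.scheme.lower(), host, requested_path, parts.query, ""))
    return None if canonical == requested else canonical
```

Every variant in the list above ends up at the same single address, while a request that is already in canonical form gets `None` and is simply served.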

Canonicalization issues (confusing the bot) – These are caused by dynamic URLs. Often the bot finds the same content under several different URLs. These are the long URLs with parameter strings that you see with many dynamic applications. These are not good either; use the 301 methodology to ensure page naming conventions are tight.
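
Here is a small sketch of what "tight" naming can look like for parameter-laden URLs: drop the parameters that do not change the content, put the rest in a fixed order, and 301 anything else to that form. The parameter names in `IGNORED_PARAMS` and the function name `canonical_url` are hypothetical; which parameters are safe to ignore depends entirely on your own application.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change the URL but not the page content.
IGNORED_PARAMS = {"sessionid", "sid", "sort", "ref"}


def canonical_url(url):
    """Return one stable URL for every parameter variation of the same page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # a fixed order, so ?a=1&b=2 and ?b=2&a=1 become one URL
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

If the requested URL differs from what `canonical_url` returns, that is the cue to issue the 301 to the canonical form.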

Print-friendly pages – Believe it or not, our little bot friends can follow the links to the printer-friendly page and consider it duplicate content. For this, be sure to use robots.txt to forbid them from said pages.
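
A quick way to sanity-check such a rule before you upload it is Python's standard `urllib.robotparser` module. The sketch below assumes the printer-friendly versions all live under a `/print/` directory, which is purely an assumption for the example; adjust the `Disallow` line to wherever yours actually live.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the /print/ directory is an assumed location.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def crawl_allowed(url, user_agent="*"):
    """Return True if the rules above let the given robot fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `crawl_allowed` reports the `/print/` copy of an article as off limits while the normal article URL stays crawlable.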


Boilerplate – Pages with too many common elements that are very similar, including titles, meta descriptions, headings, navigation and globally occurring text/copy. Boilerplate content is a considerable amount of text repeated on a substantial number of pages on a site.


… and I shall let Google’s Adam Lasnik finish up

"Our algorithms take a look at their pages and (computerwise) ask, 'What value is this site providing that users can't get from other sites or even the "mothership" (originator of content)?'"

"The fact that duplicate content isn't very cut and dry for us either (e.g., it's not "if more than [x]% of words on page A match page B...") makes this a complicated prospect."

I hope this has at the very least given you a better idea of what all the fuss over duplicate content is about, and that you have learned a few ways to avoid it. A fun tool for checking a site/document for duplicate content is www.copyscape.com. Give it a whirl if you are concerned about content you may put on your site.

Also read up on Google and Duplicate content - guidelines right from the Plex

Related Google patent applications: Detecting duplicate and near-duplicate files, Detecting query-specific duplicate documents, and the more recent Methods and apparatus for estimating similarity

 

 
