Duplicate content is the latest buzzword making the rounds. Many fear duplicate content penalties because they have the potential to devalue your web pages. Chances are, while a few legitimate webmasters may get caught in the crossfire, Google is mainly trying to make the returned pages more relevant to searchers and is targeting the people who use duplicate content as a way to manipulate the search engines.
What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. …
What isn’t duplicate content?
…you shouldn’t worry about occasional snippets (quotes and otherwise) being flagged as duplicate content.
What does Google do about it?
…if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. …in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.…
Don’t worry be happy: Don’t fret too much about sites that scrape (misappropriate and republish) your content. Though annoying, it’s highly unlikely that such sites can negatively impact your site’s presence in Google.
If you properly structure your site, provide a valid sitemap for Google (and all other robots) to follow, and use your robots.txt file correctly, the likelihood of being tagged for duplicate content is slim.
Site Structure
Certain elements of your site will help human, as well as robot, visitors easily get around your site and distinguish each element of the site.
Have clear navigation throughout each page of the site. Make sure to maintain continuity and organization, leading your visitors where they need to go.
Be mindful of internal linking because not everyone will land on your site’s homepage or start reading your site from the very first entry. Link to other parts of your site within the content.
Shorten your descriptions on archive, category, and search pages. None of these pages require that you have the full content displayed, just a brief excerpt and a link to continue reading.
Respond with proper header codes. The HTTP status code in the response header tells browsers and robots what kind of page they’ve landed on: 404 Not Found (very common, unfortunately) means the page doesn’t exist, 200 OK (more common than 404) means the page does exist, and 301 Moved Permanently means the page has moved to a new address for good.
Code with some standards. Your code will reveal the type of page you’re serving up, if you tell it to. Make sure that each of your HTML pages has a proper DOCTYPE and that your feed pages are served in proper XML format for the type of feed (Atom or RSS).
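To make the header codes above a little more concrete, here is a minimal Python sketch of how a site might decide which status to send. The page list, the redirect table, and the function name are all made up for illustration; your CMS or web server will have its own way of doing this.

```python
from http import HTTPStatus

# Hypothetical site data, for illustration only.
EXISTING_PAGES = {"/", "/about.html", "/page-xyz.html"}
MOVED_PAGES = {"/old-about.html": "/about.html"}


def choose_status(path):
    """Return (status, redirect_target) for a requested path."""
    if path in MOVED_PAGES:
        # The page moved permanently: point visitors and robots at the new URL.
        return HTTPStatus.MOVED_PERMANENTLY, MOVED_PAGES[path]
    if path in EXISTING_PAGES:
        # The page exists at this address.
        return HTTPStatus.OK, None
    # Anything else does not exist on the site.
    return HTTPStatus.NOT_FOUND, None


if __name__ == "__main__":
    for path in ("/about.html", "/old-about.html", "/missing.html"):
        status, target = choose_status(path)
        print(path, int(status), status.phrase, target or "")
```

The point of the sketch is only that each URL gets exactly one honest answer: a moved page says where it went instead of serving the same content at two addresses.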
Create Site Maps
Site maps aren’t just for search engines. They help your visitors find individual pages on your site with ease, and they make it just as easy for search engine robots to find those pages.
There are a couple of different types of site maps, but first and foremost, you need a human readable sitemap: a page with links to each individual page on your site. Link to the sitemap from each of your pages using descriptive anchor text (sitemap, site overview, and similar phrases), and give the page itself a distinct structure.
Once you have your human readable sitemap in place, you can consider adding an XML sitemap. Google launched the sitemaps program, and Yahoo! and MSN later added support for the same protocol. The XML sitemap gives the search engines specific information about each page, including a priority hint for how much weight to give it relative to the other pages on your site.
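If your CMS doesn’t produce an XML sitemap for you, one can be generated with a few lines of code. The following is a minimal Python sketch, assuming you already have a list of page URLs with a priority for each; the function name and the example URLs are hypothetical, and only the loc and priority elements of the sitemap protocol are shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from an iterable of (url, priority) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, priority in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # Priority is a value from 0.0 to 1.0, relative to your other pages.
        ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    print(build_sitemap([
        ("http://www.yoursite.com/", 1.0),
        ("http://www.yoursite.com/page-xyz.html", 0.5),
    ]))
```

Listing only the clean URL for each page, rather than every address that happens to reach it, keeps the sitemap consistent with the goal of having one version of each page indexed.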
To have all your sitemap bases covered, you could also use a text sitemap. It’s basically just a list of all your site’s URLs, one per line, in a plain text document that can be submitted to the search engines. Google and Yahoo! both support the plain text sitemap format, but you’re probably all right if you have a human readable and an XML sitemap on your site.
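A text sitemap is simple enough to write by hand, but here is a small Python sketch in case you’d rather generate it. It assumes you have your URLs in a list; the function name is hypothetical, and dropping repeated URLs is a choice made for this example, not a requirement of the format.

```python
def build_text_sitemap(urls):
    """Return a plain-text sitemap: one unique URL per line."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        # Skip blanks and any URL that has already been listed.
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(build_text_sitemap([
        "http://www.yoursite.com/",
        "http://www.yoursite.com/page-xyz.html",
        "http://www.yoursite.com/",
    ]), end="")
```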
Update Your Robots File
The robots file tells robots what they’re not allowed to view on your site. If you don’t have a robots.txt file, they have free rein to follow and index every link on your site. (You can override this on individual pages by adding the robots meta tag, but it’s not something you should rely on exclusively.)
Creating a robots file is fairly simple. You start with a blank plain text document, and you can give instructions for viewing your site to all robots or to specific ones.
User-agent: *
Disallow: /
The first line of the file, User-agent, names the robot that should follow the instructions beneath it. In the example, the asterisk signifies that all robots should heed the instructions. The second line, Disallow, tells the robots what they’re not allowed to crawl and index. In the example, the lone slash means that no robot is allowed to follow or index anything on the site.
If you plan to create different instructions for different robots, you will need to specify a new User-agent line with a new set of instructions. (You can set basic instructions for all robots, then get more specific with individual robots.)
User-agent: *
Disallow: /stats/
Disallow: /cgi-bin/

User-agent: psbot
Disallow: /images/

User-agent: googlebot
Disallow: /nogoogle/
Disallow: /nogoogle.html
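In the above example, there are rules for all robots, followed by specific instructions for psbot and googlebot. You can disallow entire directories (/dir/) or individual pages (page.html). Be careful not to omit the final element, whether it’s a slash or a page extension, because partial matches count: disallowing /a will stop robots from crawling or indexing every URL whose path begins with /a, such as /about.html and /archive/. Also keep in mind that Google’s crawler, for one, obeys only the most specific section that matches it, so googlebot here would follow its own two rules and ignore the general ones. Repeat any general rule you want a named robot to honor.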
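Before you upload a robots file, it’s worth checking that the rules do what you expect. One way to do that, sketched below, is Python’s built-in urllib.robotparser module. The rules are the ones from the example above; the helper name load_rules and the test paths are made up for illustration, and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article.
RULES = """\
User-agent: *
Disallow: /stats/
Disallow: /cgi-bin/

User-agent: psbot
Disallow: /images/

User-agent: googlebot
Disallow: /nogoogle/
Disallow: /nogoogle.html
"""


def load_rules(text):
    """Parse robots.txt text and return a parser ready for can_fetch()."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = load_rules(RULES)
    # Hypothetical robot names and paths to try against the rules.
    checks = [
        ("somebot", "/stats/report.html"),
        ("psbot", "/images/logo.png"),
        ("googlebot", "/nogoogle.html"),
    ]
    for agent, path in checks:
        print(agent, path, parser.can_fetch(agent, path))

    # Partial matches: "/a" also blocks /about.html, "/a/" does not.
    loose = load_rules("User-agent: *\nDisallow: /a")
    strict = load_rules("User-agent: *\nDisallow: /a/")
    print(loose.can_fetch("somebot", "/about.html"))
    print(strict.can_fetch("somebot", "/about.html"))
```

Note that this parser checks a robot with its own section only against that section, so a rule you want every robot to obey has to appear in each named section as well.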
Ultimately, you’ll want to disallow any directories or files that aren’t important in the search index. For instance, if your CMS uses query strings in the URL (/?p=xyz) and you’ve set it to display clean URLs using .htaccess (/page-xyz.html), then you’ll want to tell the search engines not to index any pages that start with /?p= should they come across any. Also, consider disallowing feed, print, archive, and search pages. The goal is to have your individual pages in the search engines, not pages that may trigger a duplicate content flag.
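As a rough sketch of what that might look like, the Python snippet below writes out a robots file from a list of duplicate-prone paths. Only /?p= comes from the example above; the other paths and the function name are assumptions, so substitute whatever addresses your own CMS actually uses for feeds, print versions, archives, and search results.

```python
def build_robots_txt(disallowed, user_agent="*"):
    """Return robots.txt text with one Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed]
    return "\n".join(lines) + "\n"


# "/?p=" is from the article; the remaining paths are hypothetical.
DUPLICATE_PRONE = ["/?p=", "/feed/", "/print/", "/archive/", "/search/"]

if __name__ == "__main__":
    print(build_robots_txt(DUPLICATE_PRONE), end="")
```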
After you’ve created your robots.txt file, upload it to your website’s root folder so that it is reachable at www.yoursite.com/robots.txt. While Google is making some groundbreaking changes, most search engines will ignore robots files in subdirectories, so the rules for every directory on a domain need to be in a single file at the root of that domain.
Listen To Google
After explaining what duplicate content actually is, the Google team gave some additional tips for making sure your site doesn’t fall victim. Two of the most important are to syndicate carefully and to understand your CMS. Make sure the sites syndicating your content link back to the original source. Then note any pages where your CMS shows the same content over and over again, and limit them with your robots.txt file or .htaccess rewrites.
Bottom line: if you take proper care of your website, tell the robots what they’re allowed to index, and follow the webmaster guidelines, you shouldn’t have much to worry about.





