Is your web analytics solution causing duplicate content?

Analytics, SEO September 10th, 2008

Duplicate content is a pervasive problem for a lot of websites, and can be difficult to understand for the average webmaster. Recently, I’ve come across problems caused by a couple of big name web analytics packages.

The issue stems from the use of various URL parameters to track a visitor’s session on a website. This is slightly different to the old session ID parameter problem, in that it doesn’t create quite as many duplicate pages; but it is important nonetheless.

I’ll work through a couple of examples. First, I’ll look at Omniture’s SiteCatalyst product, using Channel 4’s website as a guinea pig. SiteCatalyst’s tracking code appears in a URL as intcmp, which stands for ‘Internal Campaign’, and can be identified quickly in the SERPs using a search like this:

Search results page for Channel 4

As you can see from the snapshot above, there are nearly 10,000 results with this parameter added to the URL. If you visit any of these URLs you can delete the ‘?intcmp=blah’ section and get exactly the same page. Internal campaigns come and go. When a campaign ends, the link to it disappears, but the indexed URL stays behind as an extra address for the same content.

A nice example for Channel 4 is this Internal campaign URL:

  • http://www.channel4.com/bigbrother/?intcmp=homepage_flash (PageRank 3)

Clearly this link was indexed through a Flash-based link on the home page. When this link disappears from the home page, it leaves this URL behind, sapping authority from the original page and duplicating its content:

  • http://www.channel4.com/bigbrother/ (PageRank 6)

The second culprit, WebTrends, has a very similar issue. This time the nasty campaign parameter is WT.[something] – this is broken down in the following image:

This time the chosen victim is a site called LuxuryLink – let’s take a look at the SERP:

LuxuryLink search results

Again, a couple of examples from this – the tracked version:

  • http://www.luxurylink.com/LL/home_win_trip.php?WT.ac=MOENON03 (PageRank 5)

And the untracked:

  • http://www.luxurylink.com/LL/home_win_trip.php (PageRank 4)

So this time the tracked URL has actually gathered more authority than the non-tracked version, but still the page’s authority is being split and duplicate content is created again.
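
To make the duplication concrete, here is a minimal Python sketch that strips the two kinds of tracking parameter discussed above and shows the tracked and untracked URLs collapsing to the same address. The function names and the parameter list are my own assumptions for illustration; they are not part of SiteCatalyst or WebTrends.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these are the only tracking parameters we care about,
# taken from the two examples in this post.
TRACKING_NAMES = {"intcmp"}      # Omniture internal campaign parameter
TRACKING_PREFIXES = ("wt.",)     # WebTrends parameters such as WT.ac


def is_tracking_param(name):
    """Return True if the query parameter name looks like a tracking code."""
    lowered = name.lower()
    return lowered in TRACKING_NAMES or lowered.startswith(TRACKING_PREFIXES)


def strip_tracking_params(url):
    """Return the URL with tracking parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking_param(name)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


tracked = "http://www.luxurylink.com/LL/home_win_trip.php?WT.ac=MOENON03"
untracked = "http://www.luxurylink.com/LL/home_win_trip.php"
print(strip_tracking_params(tracked) == untracked)  # True
```

Note that urlencode may re-encode the remaining parameters slightly differently from the original URL, so treat this as a way of spotting duplicates rather than a production-ready canonicaliser.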

How to fix it

Now this is not necessarily the fault of the web analytics companies, and definitely not that of the webmasters. There are many reasons why URL parameter tracking needs to be employed, especially in large companies where the split between the marketing team and the web development team is too pronounced for HTML code changes to be implemented with ease.

However, the ideal solution is generally to use a combination of on-page JavaScript and browser session cookies in order to track visitors to your site. This doesn’t interfere with your URLs – remember, cool URIs don’t change!

If that is not possible for whatever reason (or if it is and you have URLs with the parameters above indexed in Google), use your robots.txt file, and use it wisely. I’d always recommend using Google Webmaster Central’s robots.txt syntax checking tool before implementing any major changes to your robots.txt file, but these rules should clear out the duplicate URLs:

Omniture:

User-agent: *
Disallow: /*intcmp=

WebTrends:

User-agent: *
Disallow: /*WT.
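
If you want to sanity-check a wildcard rule before deploying it, the following Python sketch approximates the matching Google documents for robots.txt: ‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL, and matching is case-sensitive. This is a hand-rolled matcher with hypothetical function names, not a library API and not Google’s actual implementation, so use it alongside the Webmaster Central tool rather than instead of it.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the match at the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path_and_query):
    """Return True if a Disallow pattern matches the given path and query."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


print(is_blocked("/*intcmp=", "/bigbrother/?intcmp=homepage_flash"))  # True
print(is_blocked("/*intcmp=", "/bigbrother/"))                        # False
print(is_blocked("/*WT.", "/LL/home_win_trip.php?WT.ac=MOENON03"))    # True
print(is_blocked("/*wt.", "/LL/home_win_trip.php?WT.ac=MOENON03"))    # False
```

The last line shows why the case of the pattern matters: a lowercase rule does not match the uppercase WT.ac parameter as it appears in the LuxuryLink URL.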

2 Responses to “Is your web analytics solution causing duplicate content?”

  1. smeyler Says:

    Thanks for the post. We see this same problem repeatedly with clients using HBX, Omniture, and WebTrends. We typically 301 these if and when they show up, but it is a pain. We would rather avoid them.

    Your instructions, however, to use the robots.txt file seem like they would preclude Google, et al. from following any link that included this parameter (which might be multiple links on any given page). Am I missing something?

  2. rob Says:

    Yes, the robots.txt file solution *would* block search engines from indexing the pages that include those parameters. It’s not the ideal solution, but quite often these things are far down developers’ to-do lists, so sometimes a compromise is the best short-term solution.

    Unless *all* of your URLs contain these parameters, it’s generally preferable to have these pages eliminated from the index to having them cause massive duplicate content.