A simple method for tracking unique visitors
It takes one month to set up any Web site for an accurate statistics snapshot. Why? Because every statistics package I have used or reviewed organizes its data into monthly reports. Of course you can configure your statistics to make daily or weekly data captures, and some applications (like Google Analytics, Sitemeter, et al.) automatically perform daily captures anyway. Nonetheless, when you look at aggregate reports, you cannot get away from the monthly paradigm (although some packages will allow you to specify date ranges for on-demand reporting). Waiting out that month doesn’t have to be frustrating. You can use the time to tinker with your configuration and decide what you want to report on. But of course it’s virtually impossible to get 100% accurate data, especially from a very busy Web site or server. There are so many factors beyond your control that you can be more certain your server log will fail to capture all activity accurately than that your statistics software will analyze the captured data accurately.
So here is one example of how you can experiment with your analytics. You’ll need the ability to create subdomains (or get a dedicated domain) and you’ll have to move some files around.
Do you have a graphical image that you use on every page of your site? If so, set up a subdomain like images.example.com and move that site-wide graphic image to the sub-domain. If you’re using Webalizer, make sure your Webalizer configuration is properly set up to provide separate reporting for the sub-domain. You absolutely do NOT want to mix all your sub-domains’ statistics in with your primary domain’s statistics.
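Exactly how you split the reporting depends on how your host separates the logs, but a dedicated Webalizer configuration for the sub-domain might look roughly like this (the log and output paths here are placeholders, not real server paths):
# webalizer.conf for images.example.com -- keep it separate from the main site's config
LogFile   /var/log/apache2/images.example.com-access.log
OutputDir /var/www/stats/images.example.com
HostName  images.example.com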
Block robots from the sub-domain using robots.txt.
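A minimal robots.txt for the sub-domain simply disallows everything (well-behaved robots will honor it; badly behaved ones will not):
# http://images.example.com/robots.txt
User-agent: *
Disallow: /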
Block all sites but your own from accessing any image files on that sub-domain (this also helps you deal with hotlinking, not only by saving bandwidth but also by moving all hotlink referral data to the sub-domain). One very common method for protecting images from hotlinks (where someone else links to your image from their site) is to use .htaccess (I don’t know how to do this in ISAPI):
# .htaccess for http://images.example.com/
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example.com(/)?.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?google.com(/)?.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yahoo.com(/)?.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?live.com(/)?.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?ask.com(/)?.*$ [NC]
RewriteRule .*\.(gif|jpe?g|png|bmp)$ - [F,NC]
Notice how I allowed the major search engines into the site in this example. Unless you’re looking for image search referrals, you should block the search engines too and just restrict access to people looking at pages on your site.
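If you do decide to lock it down that far, the same block works with just your own domain left in the referrer conditions (again, substitute your own domain for example.com):
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com(/)?.*$ [NC]
RewriteRule .*\.(gif|jpe?g|png|bmp)$ - [F,NC]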
Now, if you want to know how many unique visitors your primary site receives, you have to restrain yourself. This is not a typical image server where you offload your bandwidth from your primary domain. This is a tracking subdomain and you can really only put one file on it: the site-wide graphic you’re using as a tracking image.
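Every page on the primary domain then pulls that one graphic from the sub-domain, along these lines (the file name is just the example used below):
<img src="http://images.example.com/sitewide-masthead.jpg" alt="Example.com masthead">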
If you’re using Webalizer you’ll have to define a PageType extension for your image. So if the file is sitewide-masthead.jpg your PageType definition looks like:
PageType jpg
Be sure you delete or comment out any HideURL directive for JPG types.
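So the relevant lines in the sub-domain’s Webalizer configuration end up looking something like this (the sample webalizer.conf typically ships with HideURL entries for image types, so check for them):
# treat the tracking image as a page
PageType jpg
# leave any line like this commented out (or delete it)
#HideURL *.jpg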
Set this up before the end of the month so that you can run a test or two. You want to make sure you’re just capturing traffic for the graphic. Of course, you cannot prevent curious people from trying to get into the subdomain (you’ll have to set up a blank index file to prevent them from crawling your directory structure). You may get some incidental traffic (although there are ways to detect that traffic and filter it out).
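The blank index file can literally be an empty index.html in the sub-domain’s root; if your host permits it, you can also turn off directory listings in the same .htaccess (this requires AllowOverride Options on the server):
# http://images.example.com/.htaccess
Options -Indexes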
This tracking subdomain will help you determine:
- How many unique IP addresses request pages on your site
- How many real page views your site receives
- Page views per page
- Visits per page
All your referrer data will be from your own site. You’ll know which pages people are visiting. There are few if any robots currently grabbing images along with HTML content, but some caching browsers will use prefetch directives to pull down data prematurely. These prefetches not only waste your bandwidth and server resources, they inflate your traffic statistics. But you can block them with .htaccess as well:
RewriteEngine On
# flag requests that come through a proxy (set here but not used by the rules below)
SetEnvIfNoCase X-Forwarded-For .+ proxy=yes
# flag prefetch requests, which announce themselves with an "X-moz: prefetch" header
SetEnvIfNoCase X-moz prefetch no_access=yes
# return 403 Forbidden for any request flagged as a prefetch
RewriteCond %{ENV:no_access} yes
RewriteRule .* - [F,L]
That forces people to actually view your pages before their browsers can hammer your server. Now, there are some caveats. For example, this won’t prevent Xenu Link Sleuth from hammering your server. You’ll have to block Xenu by IP address (and there are obviously a LOT of potential Xenu addresses) or by .htaccess/ISAPI based on the agent field.
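Here is a sketch of the agent-field approach for .htaccess, assuming the tool identifies itself with “Xenu” somewhere in its User-Agent string (check your raw logs for the exact string before relying on it):
# refuse requests from agents identifying as Xenu
SetEnvIfNoCase User-Agent "Xenu" block_agent
Order Allow,Deny
Allow from all
Deny from env=block_agent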
I don’t like Xenu Link Sleuth and have no need or desire to obtain links from obtuse people who insist on using Xenu to check their link partners’ integrity. If you feel you have to use Xenu Link Sleuth to make sure other people are linking to you, get out of the SEO business. You don’t know enough to be more than just plain dangerous and irresponsible.
In fact, I would regard anyone who uses Xenu Link Sleuth on sites other than their own as crossing the line from “Best Practices SEO” into “doesn’t care in the least about the integrity of other people’s sites”.
As I pointed out above, there is no way to be 100% sure of how many visitors your site receives, or what they are actually looking at. This technique should filter out screen readers, for example (because they won’t download the images), and if your site is receiving a lot of traffic from people who use screen reading software you won’t count them in your images subdomain’s statistics.
A very small percentage of people still use text-only browsers for reasons we cannot fully fathom. I am sure they would use something more robust if they could, but they cannot. Hence, those people won’t show up in your images statistics, either.
You can use image domains/subdomains for a variety of purposes: image servers can help you manage your bandwidth; image tracking servers can help you manage advertising account statistics as well as analyze visitor traffic; if you run a forum you can set up a registered members gallery on a protected subdomain or domain to limit their access to the rest of your site; and there are other things you can do.
Keep in mind that I am only using Webalizer as a convenient reference. There are many analytics packages out there, and you should be able to do something similar with any package that actually reads your server logs. You cannot do this with JavaScript-based analytics.