Webalizer Configuration - Configure Webalizer for SEO
Several people have asked me (some more than once) to explain my preferences for deep server log analysis. I am not exaggerating when I say that I prefer to analyze the raw log files with my own tools. Not everyone is happy with that answer, though, so I do have some recent experience with setting up Webalizer configuration files. Here are some tips on how to configure Webalizer for search engine optimization analysis. Let me make it clear, however, that there is no one right way to do this. You have to have clear objectives in mind. Some sites are more complex than others, and the amount of traffic a site gets affects how efficiently (or effectively) Webalizer can work.
  
In fact, I feel the standard Webalizer configuration is virtually useless. It just doesn’t clean up the data well enough.
Now, Webalizer will tell you some useful data in some areas and then it will just frustrate the heck out of you in other areas because it teases you. It says, “Hey, you got some errors” but it won’t tell you WHERE you got the errors. This is particularly annoying when you receive a lot of 404 errors. You have to look at your error log file to see what the problem is.
Now, for those people who want to say, “That’s why I don’t use Webalizer”, keep in mind you don’t always have a choice. If your client comes to you with Webalizer pre-installed and using the default configuration, telling them to buy your favorite analytics package is probably not a good idea. Telling them to install Google Analytics may make your life easier, but Webalizer actually tells you some things that Google Analytics cannot. I’ve found that life is easier when I use the two packages together instead of choosing one over the other.
Again, MY PREFERENCE is to just look at the raw server logs. But I’ve learned to tweak Webalizer so that it provides some useful information.
That said, there are choices you have to make. These tips are not comprehensive, and I’m not going through ALL the options. The webalizer.conf file does include comments to help you figure out what you may want to do. UNIX/Linux users beware: if you download the file onto a Windows PC, be sure you edit the file in WordPad, not Notepad. WordPad will not insert DOS-style CR/LF (CTRL+M) line terminators. Notepad is not so UNIX-friendly. There are other text editors out there but WordPad works just fine for editing these configuration files.
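If you are not sure which editor touched the file last, a quick check for stray carriage returns settles it. This is a minimal Python sketch of my own, not something that ships with Webalizer, and the sample bytes are made up for illustration.

```python
def count_crlf(data: bytes) -> int:
    """Return the number of DOS-style CR/LF line terminators in data."""
    return data.count(b"\r\n")


def to_unix_line_endings(data: bytes) -> bytes:
    """Convert CR/LF terminators to plain LF, leaving all other bytes alone."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical configuration fragment saved with DOS line endings.
sample = b"PageType htm*\r\nPageType cgi\r\n"
print(count_crlf(sample))            # 2
print(to_unix_line_endings(sample))  # b'PageType htm*\nPageType cgi\n'
```

To use it on a real file, read the file in binary mode (for example with `open(path, "rb")`), run the check, and write the converted bytes back only if the count is non-zero.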
Use a separate configuration for each domain/sub-domain
If you’re hosting a single domain on a shared server your hosting ISP should configure the Webalizer to put your server log in a directory for your account. I don’t recall having to do this myself, but it’s been many years since I’ve configured analytics on a shared server.
If you’re creating multiple domains on a dedicated server, then whatever tools you’re using to add domains will probably be absolutely worthless for this. You’ll be lucky if they turn on Webalizer for you. You may have to create the configuration files by hand, or at least edit them for each domain.
Using one configuration file for all your domains or sub-domains is a recipe for disaster. I simply don’t recommend this. You may have to manually create the separate files. You may have to dig into your account or server management interface to see where to tell Webalizer to look for those configuration files. There are command line options but no one should be invoking Webalizer manually (except for troubleshooting). If you have to do it manually, you use “webalizer -c” followed by the path to the configuration file (keep in mind you may have to specify a full path to webalizer).
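If you end up creating the per-domain files by hand, a small script can at least keep them consistent. The following is a rough Python sketch under some assumptions: the host names and paths are placeholders, and the directive names (LogFile, OutputDir, HostName, HistoryName, Incremental, IncrementalName) are the ones I know from the commented sample configuration, so check them against the comments in your own webalizer.conf before relying on them.

```python
from pathlib import PurePosixPath


def render_config(hostname: str, log_file: str, output_root: str) -> str:
    """Build the domain-specific part of a Webalizer configuration.

    Each host gets its own output directory plus its own history and
    incremental-state file names, so reports never share state.
    """
    out_dir = PurePosixPath(output_root) / hostname
    lines = [
        f"LogFile {log_file}",
        f"OutputDir {out_dir}",
        f"HostName {hostname}",
        f"HistoryName {hostname}.hist",
        "Incremental yes",
        f"IncrementalName {hostname}.current",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical host and paths, for illustration only.
print(render_config(
    "blog.example.com",
    "/var/log/httpd/blog.example.com-access_log",
    "/var/www/stats",
))
```

Write each result to its own file and pass that file to Webalizer with the -c option. The rest of your settings (PageType, HideURL and so on) can be appended from a shared template.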
Know where your log files are located
Webalizer will try to figure out where the server logs are stored. If you have only one domain on one server, you’re probably good to go. However, as soon as you get into a multiple domain and/or multiple sub-domain scenario you need to make some choices.
When you’re tracking sub-domains with one domain, you have to understand that if you use only one log file (or one set of log files) for the entire domain, you won’t see separate traffic for your sub-domain root pages. Every root page request is recorded in the raw log file as “/”, whichever sub-domain served it. Some people get around this by using 301-redirects to a secondary URL or by including a tracking bug (usually just a .GIF or .JPG, but sometimes something else).
The 301-redirect method tells you which domain or sub-domain root page was visited but it adds overhead to the server’s operation. I don’t like this approach but it’s an option that many people take.
The tracking bug technique is not easy to implement elegantly because usually people do NOT want to track robot fetches (or, they may want to identify rogue robots by seeing who does NOT fetch the tracking bug). Although there are other methods for trapping rogue robots, you can consider embedding your tracking bugs in JavaScript. Most robots won’t execute the JavaScript so they won’t fetch the bugs.
I prefer to keep my raw server log files for sub-domains separate from the main domain. That’s just me. It doesn’t help identify robots but then I tend to filter out robot fetches when I analyze the raw log data. What the separate log file approach does do for you, however, is resolve ambiguities over similar URLs (and there could be MANY on sites with multiple sub-domains, especially where they organize similar data by region or category sub-domains).
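If your server already writes everything to one log, you can only split it after the fact when each entry records which host was requested. Here is a minimal Python sketch, assuming the log was written with a format that puts the virtual host first on each line, optionally followed by ":port" (in Apache, a LogFormat beginning with %v does this). The standard common and combined formats do not carry the host, so this will not work on them. The sample lines are invented.

```python
from collections import defaultdict


def split_by_vhost(lines):
    """Group log lines by virtual host.

    ASSUMPTION: the first whitespace-delimited field of every line is the
    virtual host, optionally with ":port". The remainder of the line is
    kept unchanged so it can be written out as a per-host log.
    """
    per_host = defaultdict(list)
    for line in lines:
        vhost, sep, rest = line.partition(" ")
        if not sep:
            continue  # skip malformed lines that have no second field
        host = vhost.split(":")[0].lower()
        per_host[host].append(rest)
    return dict(per_host)


# Invented entries: both are requests for "/", on different sub-domains.
sample = [
    'www.example.com:80 192.0.2.1 - - [04/Sep/2008:13:00:00 +0000] "GET / HTTP/1.1" 200 512',
    'blog.example.com:80 192.0.2.1 - - [04/Sep/2008:13:00:05 +0000] "GET / HTTP/1.1" 200 734',
]
for host, entries in split_by_vhost(sample).items():
    print(host, len(entries))
```

Each per-host list can then be written to its own file and fed to the matching Webalizer configuration.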
Make sure your history file matches your configuration file
Webalizer stores its processed results in a file called “webalizer.hist”. This is just a simple text file summarizing data that is used to build new HTML report files. If you are using separate configuration files you’ll need separate history files.
Also, there is a file called “webalizer.current” (used with incremental processing — see below). I found out the hard way that if Webalizer stops working it may be because this file is corrupted. You just have to delete it and Webalizer will figure out the rest for you.
Use incremental processing wherever possible
Incremental processing reduces the processing time for Webalizer. On very active sites, incremental processing will keep the Webalizer activity productive far longer than the default options. A daily report works fine for a lot of sites. It also gives you better feedback than once-a-week or once-a-month (but understand that much of the reported data is cumulative within the month).
Specify the PageType data
There is a section in the configuration file that includes lines that look like:
PageType phtml
 PageType html
You need to create an entry for each one of the HTML file extensions that your server uses. If you generate dynamic pages and the base URLs (URLs without any trailing parameters) are terminated with “.cgi”, “.php”, “.pl”, etc. then you need to include these extensions in your list of PageTypes.
Webalizer uses the PageType definitions as a canonical list of page file names for determining its page counts. You don’t want to include image files in any page counts (usually), for example, although some sites do allow people to view images separately (in which case you have to decide whether to include those images in page counts).
I would be reluctant to count images as pages, but if I did do so I would be sure to use a distinctive image file extension for the isolated images. That is, I would NOT use both .GIF and .JPG (as an example) for both separately viewable images and normal page dressing images.
If your site publishes RSS feeds, you have to decide if you want to include their fetches in your page counts. If you do, include a PageType definition for “rss”. However, a lot of robots now hammer .RSS files and your page counts could be hyperinflated. I’m afraid the current version of Webalizer doesn’t support separate RSS reporting (but that would be a good feature for any analytics package — especially if it attempted to distinguish between robots and readers).
CSS files also have to be considered. I don’t normally include .css extensions in my PageType definitions, as I consider .css files to be supplemental content, similar to images used to dress up the appearance of a page.
You can use wildcards in your PageType definitions:
PageType htm*
 PageType shtm*
 PageType cgi
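To see which of your URLs a given PageType list would count as pages, you can approximate the matching yourself. This is a hedged Python sketch, not Webalizer’s actual code: it assumes the extension is compared case-sensitively against each pattern, and it treats URLs with no extension (such as directory URLs) as pages by default, which you can switch off with the hypothetical extensionless_is_page parameter.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# The wildcard list from the example above.
PAGE_TYPES = ["htm*", "shtm*", "cgi"]


def is_page(url, page_types=PAGE_TYPES, extensionless_is_page=True):
    """Return True if the URL's file extension matches a PageType pattern."""
    path = urlsplit(url).path            # drop any query string
    name = path.rsplit("/", 1)[-1]       # last path segment
    if "." not in name:
        return extensionless_is_page     # ASSUMPTION: "/docs/" counts as a page
    ext = name.rsplit(".", 1)[-1]
    return any(fnmatchcase(ext, pattern) for pattern in page_types)


for url in ["/index.html", "/about.shtml?x=1", "/images/logo.gif", "/docs/"]:
    print(url, is_page(url))
```

Running a sample of your own URLs through something like this before you change the configuration is a cheap way to catch a missing extension.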
Keep your DNSChildren low
By default, Webalizer sets the “DNSChildren” value to 0, which disables DNS cache file creation. If you want reverse DNS lookups in your analytics, however, you need to set a limit to the number of child processes that can be spawned. I don’t recommend more than 10, but Webalizer’s notes say “reasonable values should be between 5 and 20”.
You should test this limit to see what impact it has on your server by monitoring server performance with the limit set to a non-zero value and make adjustments accordingly.
Use Quiet mode judiciously
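There is a boolean option, “Quiet”, that suppresses Webalizer’s informational messages if set to “yes”. According to the comments in the sample configuration file, warnings and errors are still printed to stderr unless you also set “ReallyQuiet”. You should set “Quiet” to “yes” for automated runs, but when you’re troubleshooting you may want to disable it so you can see everything Webalizer reports.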
Use GMTTime consistently
If you’re managing multiple Webalizer reports, you may be confused about when reporting occurs. One option to help you track everything is to turn on “GMTTime” (with “yes”). You can use the default setting if you’re only managing one report (one domain or sub-domain).
Adjust the “Top …” section
I usually allow the following options for my “Top” section reports:
TopSites 200
 TopKSites 100
 TopURLs 200
 TopKURLs 200
 TopReferrers 200
 TopAgents 50
 TopCountries 200
 TopEntry 200
 TopExit 200
 TopSearch 200
 TopUsers 50
The default values are much smaller. You can go with larger values or you can comment out these values and just use ALL or DUMP reporting (see below).
Enable ALL reporting
If you want to see all of the captured data (not just the “Top” entries) in HTML format, set these options to “yes”.
AllSites yes
 AllURLs yes
 AllReferrers yes
 AllAgents yes
 AllSearchStr yes
 AllUsers yes
Index aliasing
If you’ve got canonical URL issues in external links and you’re tired of sifting through 301-redirects, you can tell Webalizer to canonicalize the index files (in ALL your directories) with these definitions:
IndexAlias home.htm
 IndexAlias main.htm
If a page URL ends with “index.” Webalizer will strip that off automatically, but if you once had non-index file names set as your directory default pages, this option removes some of the clutter.
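The effect of index aliasing is easy to picture with a small example. This Python sketch only imitates the outcome and is not how Webalizer implements it: it looks at the last path segment only and treats each alias as a prefix, which is a simplification I am assuming for illustration.

```python
# The aliases from the example above.
INDEX_ALIASES = ["home.htm", "main.htm"]


def canonical_url(url, aliases=INDEX_ALIASES):
    """Strip "index.*" and any alias from the end of a URL.

    ASSUMPTION: only the final path segment is examined, and a segment
    matches when it starts with "index." or with one of the aliases.
    """
    head, slash, name = url.rpartition("/")
    if name.startswith("index.") or any(name.startswith(a) for a in aliases):
        return head + slash
    return url


for url in ["/docs/index.html", "/docs/home.html", "/docs/page.html"]:
    print(url, "->", canonical_url(url))
# /docs/index.html -> /docs/
# /docs/home.html -> /docs/
# /docs/page.html -> /docs/page.html
```

Note that the alias “home.htm” also catches “home.html” here, so keep your aliases specific enough that they cannot match ordinary page names.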
Use hide options carefully
You have the option of hiding internal references to your content, but I don’t recommend you use them. Knowing which of your pages is sending the most traffic to the rest of your site (and, conversely, which pages send the least traffic) helps you evaluate your conversions and related issues. Nonetheless, I do recommend you use the following hide definitions:
HideURL *.gif
 HideURL *.GIF
 HideURL *.jpg
 HideURL *.JPG
 HideURL *.png
 HideURL *.PNG
 HideURL *.ra
 HideURL *.XML
 HideURL *.xml
 HideURL *.CSS
 HideURL *.css
 HideURL *.RSS
 HideURL *.rss
 HideURL *.JS
 HideURL *.js
Keep in mind that if you ARE counting .RSS or .JS or .CSS files in your page counts, you do NOT want to hide them. Be sure you maintain consistency between your options.
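A consistency check between the two lists is simple to automate. The following Python sketch is my own helper, not a Webalizer feature. It assumes the configuration uses one “keyword value” pair per line, that comment lines start with “#”, and that the hide patterns of interest have the “*.ext” shape used above.

```python
from fnmatch import fnmatchcase


def conflicting_extensions(config_text):
    """Return extensions that are both counted as pages and hidden."""
    page_types, hidden = [], []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue                      # keyword with no value
        keyword, value = parts[0], parts[1].strip()
        if keyword == "PageType":
            page_types.append(value)
        elif keyword == "HideURL" and value.startswith("*."):
            hidden.append(value[2:])      # keep just the extension
    return sorted({ext for ext in hidden
                   if any(fnmatchcase(ext, p) for p in page_types)})


# Hypothetical fragment: rss is counted as a page but also hidden.
sample = """
PageType htm*
PageType rss
HideURL *.gif
HideURL *.rss
"""
print(conflicting_extensions(sample))  # ['rss']
```

An empty list means the PageType and HideURL settings do not contradict each other, at least for simple extension patterns.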
Keep your search engine keywords up to date
Search engine keywords tell Webalizer how to extract those referrer strings that are so useful for your keyword research. The default list will kill you as it is SO outdated. I am currently using this list:
SearchEngine alexa.com q=
 SearchEngine alltheweb.com q=
 SearchEngine alot.com q=
 SearchEngine altavista.com q=
 SearchEngine aolsearch.com query=
 SearchEngine ask.com q=
 SearchEngine baidu.com wd=
 SearchEngine blingo.com q=
 SearchEngine business.com query=
 SearchEngine chiff.com q=
 SearchEngine clusty.com query=
 SearchEngine comcast.net q=
 SearchEngine cuil.com q=
 SearchEngine dmoz.org search=
 SearchEngine dogpile.com web/
 SearchEngine euroseek.com string=
 SearchEngine exalead.com q=
 SearchEngine excite.com web/
 SearchEngine ezilon.com q=
 SearchEngine gigablast.com q=
 SearchEngine blogsearch.google q=
 SearchEngine www.google. q=
 SearchEngine images.google q=
 SearchEngine hakia.com q=
 SearchEngine hotbot.com query=
 SearchEngine joeant.com keywords=
 SearchEngine juno.com query=
 SearchEngine live.com q=
 SearchEngine lycos.com query=
 SearchEngine mamma.com query=
 SearchEngine metacrawler.com web/
 SearchEngine msn.com q=
 SearchEngine mytelus.com q=
 SearchEngine mywebsearch.com searchfor=
 SearchEngine netscape.com query=
 SearchEngine netzero.net query=
 SearchEngine northernlight.com qrh=
 SearchEngine pch.com q=
 SearchEngine rambler.ru words=
 SearchEngine scour.com web/
 SearchEngine search.com q=
 SearchEngine searching.uk.com query=
 SearchEngine snap.com query=
 SearchEngine webcrawler.com web/
 SearchEngine webfetch.com web/
 SearchEngine search.yahoo.com p=
 SearchEngine verizon.net q=
 SearchEngine yandex.com text=
 SearchEngine yodao.com q=
There are many more search engines than that and I may update the list from time to time. In fact, you need to update your list because the search tools change their referral data. Ixquick no longer sends you its query string, so I have removed it from my list.
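Before adding a new entry, it helps to test the host fragment and key against a few real referrers from your logs. Here is a hedged Python sketch of the idea. It is not Webalizer’s matching code: I am assuming that keys ending in “=” are query-string parameters and that path-style keys such as “web/” are followed by the search phrase as the next path segment. The referrer URLs are invented.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit

# A few (host fragment, key) pairs taken from the list above.
SEARCH_ENGINES = [
    ("www.google.", "q="),
    ("search.yahoo.com", "p="),
    ("dogpile.com", "web/"),
]


def extract_search_terms(referrer, engines=SEARCH_ENGINES):
    """Return the search phrase found in a referrer URL, or None."""
    parts = urlsplit(referrer)
    for host_fragment, key in engines:
        if host_fragment not in parts.netloc:
            continue
        if key.endswith("="):
            # Query-string style: look up the named parameter.
            values = parse_qs(parts.query).get(key[:-1])
            if values:
                return values[0]
        else:
            # ASSUMPTION: path style, phrase is the segment after the key.
            _, found, rest = parts.path.partition("/" + key)
            if found and rest:
                return unquote_plus(rest.split("/")[0])
    return None


print(extract_search_terms("http://www.google.com/search?hl=en&q=webalizer+config"))
# webalizer config
```

If a referrer that you know came from a search engine returns nothing, either the host fragment or the key in your list is out of date.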
Consider using DUMP file reporting
You can have Webalizer dump its data to tab-separated files that can be loaded into spreadsheets. You may want to allow the dump options alongside or in lieu of the ALL reporting (which sends the same data to HTML pages you view in your browser).
Using dump files is about as close as you can get to analyzing raw server data through Webalizer. It makes for good practice.
Be sure you keep your dump files separated by domain/sub-domain.
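If you would rather script your analysis than open a spreadsheet, the dump files are easy to read. This is a minimal Python sketch that assumes the dump was written with a header row (as far as I know the DumpHeader option controls that). The column names “Hits” and “URL” in the sample are made up, so substitute whatever headers appear in your own files.

```python
import csv
import io


def load_dump(text):
    """Parse tab-separated dump text into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def top_rows(rows, column, limit=10):
    """Return the rows with the largest integer value in the given column."""
    return sorted(rows, key=lambda row: int(row[column]), reverse=True)[:limit]


# Invented sample data with hypothetical column names.
sample = "Hits\tURL\n12\t/\n40\t/about.html\n3\t/contact.html\n"
for row in top_rows(load_dump(sample), "Hits", limit=2):
    print(row["Hits"], row["URL"])
# 40 /about.html
# 12 /
```

Quoting is turned off because URLs and referrers in a log can contain stray quote characters that would otherwise confuse the parser.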
Understanding the columns
People occasionally ask me what the column headers in the reports mean. The following information is available in the Webalizer README file (that is an FTP link).
Hits
  Any request made to the server which is logged, is considered a ‘hit’.
 The requests can be for anything… html pages, graphic images, audio
 files, CGI scripts, etc…  Each valid line in the server log is
 counted as a hit.  This number represents the total number of requests
 that were made to the server during the specified report period.
Files
  Some requests made to the server, require that the server then send
 something back to the requesting client, such as a html page or graphic
 image.  When this happens, it is considered a ‘file’ and the files
 total is incremented.  The relationship between ‘hits’ and ‘files’ can
 be thought of as ‘incoming requests’ and ‘outgoing responses’.
Pages
  Pages are, well, pages!  Generally, any HTML document, or anything
 that generates an HTML document, would be considered a page.  This
 does not include the other stuff that goes into a document, such as
 graphic images, audio clips, etc…  This number represents the number
 of ‘pages’ requested only, and does not include the other ‘stuff’ that
 is in the page.  What actually constitutes a ‘page’ can vary from
 server to server.  The default action is to treat anything with the
 extension ‘.htm’, ‘.html’ or ‘.cgi’ as a page.  A lot of sites will
 probably define other extensions, such as ‘.phtml’, ‘.php3’ and ‘.pl’
 as pages as well.  Some people consider this number as the number of
 ‘pure’ hits… I’m not sure if I totally agree with that viewpoint.
 Some other programs (and people :) refer to this as ‘Pageviews’.
Sites
  Each request made to the server comes from a unique ‘site’, which can
 be referenced by a name or ultimately, an IP address.  The ‘sites’
 number shows how many unique IP addresses made requests to the server
 during the reporting time period.  This DOES NOT mean the number of
 unique individual users (real people) that visited, which is impossible
 to determine using just logs and the HTTP protocol (however, this
 number might be about as close as you will get).
Visits
  Whenever a request is made to the server from a given IP address
 (site), the amount of time since a previous request by the address
 is calculated (if any).  If the time difference is greater than a
 pre-configured ‘visit timeout’ value (or has never made a request before),
 it is considered a ‘new visit’, and this total is incremented (both
 for the site, and the IP address).  The default timeout value is 30
 minutes (can be changed), so if a user visits your site at 1:00 in
 the afternoon, and then returns at 3:00, two visits would be registered.
 Note: in the ‘Top Sites’ table, the visits total should be discounted
 on ‘Grouped’ records, and thought of as the “Minimum number of visits”
 that came from that grouping instead.  Note: Visits only occur on
 PageType requests, that is, for any request whose URL is one of the
 ‘page’ types defined with the PageType and PagePrefix option, and not
 excluded by the OmitPage option.  Due to the limitation of the HTTP
 protocol, log rotations and other factors, this number should not be
 taken as absolutely accurate,  rather, it should be considered a pretty
 close “guess”.
KBytes
  The KBytes (kilobytes) value shows the amount of data, in KB, that
 was sent out by the server during the specified reporting period.  This
 value is generated directly from the log file, so it is up to the
 web server to produce accurate numbers in the logs  (some web servers
 do stupid things when it comes to reporting the number of bytes).  In
 general, this should be a fairly accurate representation of the amount
 of outgoing traffic the server had, regardless of the web servers
 reporting quirks.
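The visit rule quoted above under “Visits” is the one that confuses people most, so here is a rough Python sketch of it. It follows the README’s description (a new visit starts when an address has not been seen before or has been idle longer than the timeout) but it is my own simplification, not Webalizer’s code. The IP addresses and timestamps are invented, and the input is assumed to contain only page requests in log order.

```python
from datetime import datetime, timedelta


def count_visits(requests, timeout=timedelta(minutes=30)):
    """Count visits from (ip, datetime) page requests given in log order."""
    last_seen = {}
    visits = 0
    for ip, when in requests:
        previous = last_seen.get(ip)
        # New visit: first request from this address, or idle past timeout.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[ip] = when
    return visits


# Invented example: one address at 1:00, 1:10 and 3:00 in the afternoon.
day = datetime(2008, 9, 4)
sample = [
    ("192.0.2.1", day.replace(hour=13, minute=0)),
    ("192.0.2.1", day.replace(hour=13, minute=10)),
    ("192.0.2.1", day.replace(hour=15, minute=0)),
]
print(count_visits(sample))  # 2
```

The 1:10 request falls inside the 30-minute window and is folded into the first visit, while the 3:00 request starts a second one, which matches the README’s own example.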
www.seo-theory.com
published @ September 4, 2008