All Hits Are Not Created Equal


'Checklog' log files

by Sandy Antunes

To start off, I'll work backwards and say what the logs do not contain. We do not provide the total number of hits, because this number is meaningless. The number of "accesses", not raw hits, more accurately reflects visitor numbers. This is because a single access to a home page that has one GIF or JPG image is actually recorded as 2 hits (one for the page, one for the image). For example, a page with 5 images would score 6 hits per access! We cull out the image hits, because they artificially inflate the hit rate. Knowing how many hits there were tells you little; knowing how many people visited you is far more useful.
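As a rough illustration of the difference between hits and accesses, here is a minimal Python sketch. It is not the checklog source; the function name, the list of image suffixes, and the sample paths are all assumptions made for the example.

```python
# Hypothetical sketch, not checklog's actual code.
# Assumption: an "image hit" is any request whose path ends in one of these.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg")

def count_accesses(paths):
    """Return (raw_hits, accesses) for a list of requested paths."""
    raw_hits = len(paths)
    # An access is any hit that is not an image request.
    accesses = sum(1 for p in paths if not p.lower().endswith(IMAGE_SUFFIXES))
    return raw_hits, accesses

# One visit to a home page that carries one image: two hits, one access.
requests = ["/myhome.html", "/logo.gif"]
print(count_accesses(requests))
```

The raw hit count grows with every image on the page, while the access count stays at one per page viewed.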

The first log gives a breakdown of how often each individual page was hit during that month. The most popular page is almost always your home page (e.g. myhome.html, sometimes appearing as index.html). The important information is which other pages people hit the most-- these are, essentially, the content of your site, and where your customers are going.

The second log gives a daily summary. This provides the time period covered, the number of accesses, the number of sites that visited you, and an estimate of the number of people that visited. Finally, we list the total number of pages that were hit. This last number may be slightly high, as some synonyms exist (e.g. "mysite/" and "mysite/index.html" are the same page, but are counted as two).
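If you work from the raw data yourself, synonyms like these can be folded together before counting pages. The sketch below is one possible way to do it and is not something the summary does for you; the function name and the assumption that the directory index is called index.html are ours.

```python
# Hypothetical sketch: fold a directory URL and its index file into one name.
# Assumption: the server's default page for a directory is "index.html".
def canonical_page(path, index_name="index.html"):
    """Map 'mysite/' and 'mysite/index.html' to the same key."""
    if path.endswith("/"):
        return path + index_name
    return path

pages = ["mysite/", "mysite/index.html", "mysite/rpg.html"]
distinct = {canonical_page(p) for p in pages}
print(len(set(pages)), len(distinct))  # raw names vs. distinct pages
```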

The actual username of people accessing your page is not recorded by the system, so this estimate is found by assuming that consecutive hits from a single site represent one person. Similarly, if there is a cluster of hits from one site and then a cluster of hits from the same site at a later time, we consider them to be from different people (for the purposes of determining your popularity).
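The estimate can be pictured with a short Python sketch. This is only an illustration of the idea, not the checklog code: the 30-minute gap that separates one "cluster" from the next is an assumed value (the actual cutoff is not stated here), and the host names are made up.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the people estimate described above.
# Assumption: a silence longer than `gap` from a site starts a new person.
def estimate_people(hits, gap=timedelta(minutes=30)):
    """hits: iterable of (host, datetime). Returns the estimated head count."""
    last_seen = {}
    people = 0
    for host, when in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(host)
        # First hit from this site, or a return after a long silence.
        if previous is None or when - previous > gap:
            people += 1
        last_seen[host] = when
    return people

t0 = datetime(1996, 9, 12, 9, 0)
hits = [
    ("a.example.com", t0),
    ("a.example.com", t0 + timedelta(minutes=2)),  # same cluster as above
    ("b.example.com", t0 + timedelta(minutes=3)),
    ("a.example.com", t0 + timedelta(hours=5)),    # later cluster, new person
]
print(estimate_people(hits))  # 2 sites, 3 estimated people
```

This is why the number of people in the summary is usually a little higher than the number of sites.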

Sample:
>27/Aug/96 to 02/Oct/96: 868 accesses (226 sites, 275 people) on 89 pages.
>   Date       Total    People   Greatest   Most      Page Hit Most
>              access            depth      hits/pg
>12/Sep/1996   53       15       6          22        myhome

There is one line per day. The total number of accesses is listed, as well as the estimate of the number of people that visited the page. The "Greatest Depth" of any search that day is shown, where "depth" refers to how many different pages a single user accessed. The page that was accessed most often is listed, along with the number of times it was visited.
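For those who like to see how such a line could be put together, here is a minimal Python sketch that builds one day's figures from a list of (site, page) pairs. It is an assumption-laden illustration, not the checklog source: the function and field names are invented, each site is treated as one user when measuring depth, and the sample data is made up.

```python
from collections import Counter, defaultdict

# Hypothetical sketch of one daily summary line.
def daily_summary(hits):
    """hits: non-empty list of (host, page) accesses for a single day."""
    pages_by_host = defaultdict(set)
    page_counts = Counter()
    for host, page in hits:
        pages_by_host[host].add(page)
        page_counts[page] += 1
    top_page, top_hits = page_counts.most_common(1)[0]
    return {
        "accesses": len(hits),
        # Depth: how many different pages a single visitor accessed.
        "greatest_depth": max(len(p) for p in pages_by_host.values()),
        "most_hits": top_hits,
        "page_hit_most": top_page,
    }

day = [
    ("a.example.com", "myhome"), ("a.example.com", "rpg"),
    ("a.example.com", "links"), ("b.example.com", "myhome"),
    ("c.example.com", "myhome"),
]
print(daily_summary(day))
```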

Interpreting this is easy. You want a large number of people visiting your site, and you'd prefer for each one to access many pages. Just dividing the number of accesses by the number of people gives you a rough idea of how many pages the average visitor accesses before wandering off. Depth is a good indication of value-- it implies that your site was able to hold a person's attention for many different pages. (Note however that if you have someone who hit _every_ page, it is most likely a search engine or 'bot, not a human.) Finally, you will be able to tell what the single most popular page is, and how often it was visited.
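Using the numbers from the sample above, the division works out as follows. The function name is our own; the figures are the ones shown in the sample.

```python
# Rough pages-per-visitor figure: accesses divided by estimated people.
def pages_per_visitor(accesses, people):
    return accesses / people if people else 0.0

print(round(pages_per_visitor(868, 275), 1))  # whole period: 3.2
print(round(pages_per_visitor(53, 15), 1))    # 12/Sep/1996 alone: 3.5
```

So the average visitor in the sample looked at a little over three pages.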

We screen out "self-hits", which is to say that hits from the machine you typically log on from are not included in the summary. So your 1,000 accesses while editing your page are not part of the summary-- the totals given are for visitors only!

The file includes an error log of sorts-- if people mistype your site, that is listed as a hit, albeit an unfocused one. In the listing of "pages and accesses per page", there are frequently entries with only 1 access, with typographical errors. This lets you know what the most common typographical errors are. While one typo is meaningless, a record of ten accesses on "rgp.html" would point to an incorrect link to your site. If requested, we can remove this information, but we recommend keeping it in because the numbers are low (so they do not affect the total statistics very much) and it provides a useful diagnostic.

The concept of "caching" has become popular with many service providers, including AOL. What this means is the ISP (for example, AOL) will make a complete copy, in memory, of a popular site's pages. Then, when its users click to go to http://www.mysite.net, they really are just grabbing the memory image (cache copy) of the pages. It saves the big ISP line charges, since the users think they are reaching out to the Web, but really are just hitting the local cache. What this means for you is, once your pages get cached, the number of hits recorded drops way down. AOL users are hitting the cache, not us directly, so our access logs don't have any knowledge of that.

Some major sites that cache you if you become popular are AOL, Prodigy, Xerox, and others. If you see an entry marked "proxy.aol.com" (or similar, for other ISPs), that is an indicator of a cache being set up. The AOL one seems to update every day; the others are more sporadic. So, the access totals may be much lower than you would expect, and you couldn't be happier-- it means much of your site is popular enough for the larger ISPs to make caches of you!

Another kind of unusual hit comes from search engines and robots. These go by different names for the different search engines, and typically hit every single page of yours. If you've seeded some search engines, you can expect a visit from them. Once they look at all your pages, you are on their engine, so their visits are a good thing. If a more complete analysis of caches and robots is desired, let us know.
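One simple way to spot a likely robot in the raw data is to look for a site that requested every page you have. The Python sketch below illustrates that rule of thumb only; it is not how checklog identifies robots, and the function name, host names, and page names are invented for the example.

```python
from collections import defaultdict

# Hypothetical sketch: flag sites that touched every page.
def likely_robots(hits, all_pages):
    """hits: iterable of (host, page). Returns hosts that hit every page."""
    seen = defaultdict(set)
    for host, page in hits:
        seen[host].add(page)
    wanted = set(all_pages)
    # A host qualifies when its visited pages cover the whole site.
    return sorted(host for host, pages in seen.items() if wanted <= pages)

site_pages = ["myhome", "rpg", "contact"]
log = [
    ("crawler.example.com", "myhome"),
    ("crawler.example.com", "rpg"),
    ("crawler.example.com", "contact"),
    ("visitor.example.com", "myhome"),
    ("visitor.example.com", "rpg"),
]
print(likely_robots(log, site_pages))
```

On a very small site a keen human visitor could trip this test too, so treat the result as a hint rather than proof.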

Please look through the logs, and do not hesitate to ask me questions (sandy@clark.net). I am happy to make more information available (for example, the wwwstat pages, or the entire log file for your site) if that would be useful. At this point, we feel that our summary of the statistics is more useful than simply providing raw data, but we are happy to provide the raw data as well.

The tool that makes the log is called checklog, and is offered as freeware to the community. A talk presented on access logs is also available.