Are your readers human?
March 10, 2003
I’ve written a lot of articles on the market shares of different browsers. I couldn’t do this without careful analysis of MacEdition’s logs. I make sure that even semi-spoofed browsers are being counted correctly. I separately identify all manner of minor browsers, not only OmniWeb and iCab, but AvantGo, WannaBe and NetPositive for BeOS. But what’s most important, and what few others seem to do, is that I get the denominator right when I work out those market share percentages.
The first thing I do is make sure I use pageviews, not hits, as the metric of choice. Otherwise, if someone has graphics turned off, or if they’re using a non-CSS browser, they will count for less than people using browsers that load all those auxiliary files. You don’t want that. Fortunately, most log analysis packages can be configured to sort by pageviews, or at least to report them.
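The distinction matters because a single pageview can generate many hits. As a minimal sketch (the log lines and file extensions here are illustrative, not taken from MacEdition’s actual logs), here is how you might count pageviews rather than hits from an Apache-style combined log:

```python
import re

# Hypothetical sample lines in Apache combined log format: one page
# plus its stylesheet and an image -- three hits, one pageview.
LOG_LINES = [
    '1.2.3.4 - - [10/Mar/2003:12:00:00 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.7 [en] (Win98; I)"',
    '1.2.3.4 - - [10/Mar/2003:12:00:01 +0000] "GET /style.css HTTP/1.0" 200 812 "-" "Mozilla/4.7 [en] (Win98; I)"',
    '1.2.3.4 - - [10/Mar/2003:12:00:01 +0000] "GET /logo.gif HTTP/1.0" 200 2048 "-" "Mozilla/4.7 [en] (Win98; I)"',
]

PAGE_RE = re.compile(r'"GET ([^ ]+) HTTP')
AUX_EXTENSIONS = ('.css', '.js', '.gif', '.jpg', '.png', '.ico')

def count_pageviews(lines):
    """Count only page requests, ignoring auxiliary files (mere hits)."""
    pages = 0
    for line in lines:
        m = PAGE_RE.search(line)
        if not m:
            continue
        path = m.group(1).split('?')[0]  # strip any query string
        if not path.endswith(AUX_EXTENSIONS):
            pages += 1
    return pages

print(count_pageviews(LOG_LINES))  # 1
```

A graphics-off visitor and a graphics-on visitor both score 1 under this metric, which is exactly the point.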
The next thing I do is ensure that semi-spoofed browsers are correctly identified. These are browsers with tags designed to look like MSIE or Netscape but that have their true name somewhere in the tag. The Analog log analysis package correctly identifies browsers like WebTV, Konqueror, Galeon and more recently, Chimera and Phoenix. To these, I add lines to my Analog config files to identify OmniWeb, iCab, AvantGo, Safari and a few really obscure things like the Sega DreamCast browser. Since most other “sources” for browser stats don’t take this extra step (or publish what they do), they are almost certainly overstating the market shares of the big two browsers, since those are the browsers that the little guys pretend to be.
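The key to unmasking semi-spoofed browsers is to check for the true name before falling back to the claimed one. A rough sketch of the idea (the user-agent strings below are illustrative examples, not a complete or authoritative table):

```python
# Order matters: the genuine names come first, because spoofers embed
# "Mozilla" or "MSIE" in their strings. This list is illustrative only.
REAL_BROWSERS = ["iCab", "OmniWeb", "Konqueror", "Galeon", "Chimera",
                 "Phoenix", "Safari", "WebTV", "AvantGo", "MSIE", "Mozilla"]

def identify(user_agent):
    """Return the first genuine browser name found in the tag."""
    for name in REAL_BROWSERS:
        if name in user_agent:
            return name
    return "Unknown"

# iCab claims to be Mozilla but names itself later in the tag:
print(identify("Mozilla/4.5 (compatible; iCab 2.9; Macintosh; I; PPC)"))  # iCab
print(identify("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"))     # MSIE
```

A naive analyser that matches "Mozilla" first would file that iCab visit under Netscape, which is how the big two end up overstated.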
Thirdly – and this is the important thing – I separately identify all the search engine spiders, spambots, link checkers, validators, blog tools, RSS feed readers like NetNewsWire or Syndirella, and pretty much everything that is not controlled by a person using a desktop browser.
Mainly I do this by going through each week’s logs and looking for user agent names I don’t recognise. I then investigate how that agent behaves – for example, whether it accesses robots.txt, is obviously spidering through the site, or hammers the front page more than ten times in rapid succession. I also consult other resources devoted to identifying robots, spiders and spammers; some of these are listed below.
There are plenty of sites that purport to identify bots, spiders and other things you might want to exclude from your stats. None of them are truly comprehensive, but here are a few I have found useful.
- Robotstxt.org aims to be the main resource about the Robot Exclusion Standard
- Free Conversant has a list of robots
- Psychedelix maintains a list of user-agent strings
- ABC Electronic compiles a list of exclusions they use to audit traffic numbers
- SendFakeEmail.com has a list of email harvesting bots that you probably want to block
- If you still aren’t sure if it is a robot, ask around at the WebmasterWorld forums
If that new user agent looks like a spider or other non-human user, I use Analog’s BROWALIAS command to lump them all together as SearchEngineSpider, Validator, ProxyServer or Spammer. These user agents need structural markup, accessible site design and meaningful page titles; but I don’t need to include them if I’m wondering whether to use a particular CSS property or just want to know which browsers are the most popular.
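The lumping-together step amounts to a second lookup table mapping known robot signatures onto a handful of categories. A sketch of the idea in Python rather than Analog config syntax (the patterns and category names here mirror the article’s scheme but the table itself is hypothetical):

```python
# Hypothetical pattern-to-category table, in the spirit of BROWALIAS.
CATEGORIES = {
    "Googlebot": "SearchEngineSpider",
    "Slurp": "SearchEngineSpider",
    "Validator": "Validator",
    "EmailSiphon": "Spammer",
    "Squid": "ProxyServer",
}

def classify(user_agent):
    """Collapse known non-human agents into a category; pass browsers through."""
    for pattern, category in CATEGORIES.items():
        if pattern in user_agent:
            return category
    return user_agent  # real browsers keep their own name

print(classify("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))  # SearchEngineSpider
```

With the non-humans collapsed into four buckets, the remaining lines in the browser report are the actual audience.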
Sometimes, looking for spiders isn’t a matter of just identifying new browser names and putting them in the robot pile. Even my extensive (and ever-growing) config file isn’t enough if the robot is successfully masquerading as a regular browser. That’s why it helps to compile this data as a regular time series. Over time, one gets a sense of when a series looks “wrong”. My many years of experience analysing economic data probably help here, too. Browser shares bump around week to week, but they don’t change by an order of magnitude in that time without some identifiable reason. When we published a Naked Mole Rat report about Apple testing AMD chips, it attracted more Windows users than we normally get, so the share of IE 6 jumped up for that week. When Safari came out, it substantially changed market shares within the OS X slice of our readership. But other than situations like that, the trends should be fairly smooth week to week. So when old browsers like IE3 or Netscape 4 suddenly jump up in the figures for no apparent reason, one should be suspicious. More often than not, a bump in Netscape 4 from 2 percent to 4 percent, or IE 3 from 0.1 percent to 0.5 percent, will be a masquerading spider.
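That smell test can even be automated crudely: flag any week where a browser’s share jumps by some multiple of the previous week. A minimal sketch, assuming weekly share figures and an arbitrary threshold factor of my own choosing:

```python
def flag_suspicious(shares, factor=3.0):
    """Return week indices where a share jumps by `factor` or more
    over the previous week -- a hint of a masquerading spider."""
    alerts = []
    for prev, (week, cur) in zip(shares, list(enumerate(shares))[1:]):
        if prev > 0 and cur / prev >= factor:
            alerts.append(week)
    return alerts

# Hypothetical Netscape 4 share by week (percent); week 3 quadruples
# for no apparent reason.
netscape4 = [2.1, 2.0, 1.9, 7.8, 2.2]
print(flag_suspicious(netscape4))  # [3]
```

A flagged week isn’t proof of anything – a news event can do the same – but it tells you which raw logs to go and read.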
Working out how much to exclude as spiders isn’t easy. Some spiders use a user agent tag that omits a platform or OS designation, for example “Mozilla/4.7” without an OS name in parentheses. All legitimate versions of Netscape 4 contain an indication of the OS used, as do those of IE. So if there are partial user agent strings in your logs without the OS, you can exclude them automatically. Sometimes, though, there’s no option but to exclude a particular IP address after the fact. It’s a good idea to leave the hosts report on in Analog or your analysis tool, so you can check if a particular IP address is accounting for a disproportionate amount of pageviews, especially relative to the bandwidth it consumes. You can then go to the raw logs and check whether it’s a robot you already know about, or a user agent that’s incorrectly classified.
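The missing-OS rule is mechanical enough to automate. A minimal sketch of the check described above (the tags shown are illustrative):

```python
def looks_spoofed(user_agent):
    """A bare 'Mozilla/4.x' tag with no parenthesised platform is
    almost certainly a spider, since legitimate Netscape 4 and IE
    tags always name the OS in parentheses."""
    return user_agent.startswith("Mozilla/4.") and "(" not in user_agent

print(looks_spoofed("Mozilla/4.7"))                  # True
print(looks_spoofed("Mozilla/4.7 [en] (Win98; I)"))  # False
```

Anything this function flags can be aliased straight into the spider bucket without waiting for it to show up in a weekly review.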
Of course, there are always a few things that don’t fall into my neat categories. I’ve never really known whether I should count Java/1.4 or the various user agent strings starting with Perl and Python as robots or not, so currently I leave them in my figures.
In general, if the unclassified user agents account for more than 2 percent of the total (including spiders), I re-think my classifications and investigate further to see whether these are identifiable robots I’ve missed. Every so often, especially before I write a new article on the subject, I go back and rerun the back data so that everything is consistent.
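The 2 percent check is just a ratio over the weekly totals. A small sketch, with made-up counts:

```python
def unclassified_share(counts, known):
    """Percentage of pageviews from agents not in the known set."""
    total = sum(counts.values())
    unknown = sum(v for k, v in counts.items() if k not in known)
    return 100.0 * unknown / total

# Hypothetical week of 1000 pageviews; 20 come from an agent
# I haven't classified yet.
counts = {"MSIE": 800, "Netscape": 150, "SearchEngineSpider": 30,
          "WeirdAgent/1.0": 20}
known = {"MSIE", "Netscape", "SearchEngineSpider"}

print(unclassified_share(counts, known))  # 2.0 -- right at the threshold
```

Anything at or over the threshold means it’s time to go hunting for new robots before quoting the figures.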
On the other hand, I exclude from my usual analysis the pages viewed by blog aggregators like Radio and privacy tools like WebWasher. Sure, these are pages viewed by humans not robots, but we have no way of knowing what their actual browser is. Far better to assume they are distributed the same way as the browser population you can identify, and thus exclude them, than to bias down everyone else’s market shares by an amount that varies week to week. You can see how that bias could mislead you if you imagine that a new browser like Safari was introduced at the same time a few robots indexed your site. Suppose you have 1000 pages – easy if you have some sort of discussion board – and your site gets a few thousand pageviews a week, as is typical for many weblogs and small business sites. The spidering will add a lot to your traffic and reduce the share accounted for by existing browsers like IE or Netscape. If you don’t analyse carefully, you might overestimate the amount of Safari switching that has occurred.
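The arithmetic behind that bias is easy to see with round numbers. A sketch, using made-up counts in the spirit of the example above – 3,000 human pageviews in a week, plus one spider crawling all 1,000 pages:

```python
def shares(counts):
    """Market share (percent, one decimal) for each agent."""
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

# Hypothetical week: 3000 human pageviews...
human = {"MSIE": 2000, "Netscape": 700, "Safari": 300}
# ...plus a spider fetching every one of the site's 1000 pages.
with_spider = dict(human, spider=1000)

print(shares(human))        # Safari at 10.0 percent
print(shares(with_spider))  # Safari diluted to 7.5 percent
```

Leave the spider in and every real browser’s share drops by a quarter that week; take it out the following week and they all bounce back. If Safari launched in between, the bounce looks like switching that never happened.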
Of course, MacEdition gets a lot more traffic than that example, so robot traffic is a lower share of the total here than on some other sites I run. But excluding the non-human users still makes sense. It makes for more meaningful analyses of the human behaviour behind your logs, especially for smaller sites. It’s only through knowing your audience that you can cater to their particular needs and tastes.