First results of unique visitor tracking, the bots and crawlers are here - Céondo - Fluid Phase Equilibria, Chemical Properties & Databases

So the unique visitor tracking test is running. At the time of writing, I have 159 unique visitors in my visitor table. From the excerpt of the results shown below, it is clear that I need to flag the bots and crawlers and exclude them from the page tracking.

Part of the visitor table

id | User agent
82 | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_7; en-us) AppleWeb [...]
83 | Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/530.5 [...]
84 | msnbot/2.0b (+http://search.msn.com/msnbot.htm)
85 | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; [...]
86 | Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.2) Gec[...]

Part of the log table

id  | Visitor | Page
341 |      83 | /
342 |      83 | /tour.html
343 |      56 | /refund/
344 |      56 | /privacy/
345 |      84 | /robots.txt
346 |      84 | /
347 |      85 | /robots.txt
348 |      85 | /
349 |      84 | /refund/

The good thing is that the robots and crawlers are good Internet citizens: as you can see for the MSN bot with the id 84, they always request the robots.txt file first. This means that one can directly flag a new visitor as a bot if its first action is to grab the robots.txt file.
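As a minimal sketch of this rule, here it is in Python (not necessarily the language the site runs on). The function name `flag_robots_txt_first` and the tuple layout `(log id, visitor id, page)` are my own assumptions for illustration. The rows are copied from the log excerpt above. Since it is only an excerpt, "first request" here means the first request visible in it.

```python
from typing import Dict, Iterable, Set, Tuple

LogRow = Tuple[int, int, str]  # (log id, visitor id, requested page)


def flag_robots_txt_first(log_rows: Iterable[LogRow]) -> Set[int]:
    """Return the ids of visitors whose first logged request is /robots.txt."""
    first_page: Dict[int, str] = {}
    # Sorting the tuples orders them by log id, i.e. chronologically.
    for _log_id, visitor_id, page in sorted(log_rows):
        # setdefault keeps only the first page seen for each visitor.
        first_page.setdefault(visitor_id, page)
    return {v for v, page in first_page.items() if page == "/robots.txt"}


# Rows taken from the log table excerpt.
log_excerpt = [
    (341, 83, "/"),
    (342, 83, "/tour.html"),
    (343, 56, "/refund/"),
    (344, 56, "/privacy/"),
    (345, 84, "/robots.txt"),
    (346, 84, "/"),
    (347, 85, "/robots.txt"),
    (348, 85, "/"),
    (349, 84, "/refund/"),
]

print(sorted(flag_robots_txt_first(log_excerpt)))  # [84, 85]
```

On this excerpt the rule catches the MSN bot (84) and the mobile Googlebot (85) and leaves the two browser visitors alone.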

Now, this will kick out most of the bots and crawlers, but not ones like this:

70 | Mozilla/5.0

A very minimal user agent string.

303 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
304 | 70 | /
305 | 70 | //?_SERVER[DOCUMENT_ROOT]=http://www.[...]
306 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
307 | 70 | /  
308 | 70 | /doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]
309 | 70 | //?_SERVER[DOCUMENT_ROOT]=http://www.[...]
310 | 70 | /

This visitor is clearly looking to trash my site. I am already not logging the visitors without a user agent string, but it looks like I will need to use the heuristics of AWStats to mark more of the visitors as bots.
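For requests like the ones from visitor 70, a small check on the request string may already be enough. The following Python sketch is only an illustration. The name `looks_like_scanner` and the list of markers are my own assumptions, derived from the requests in the excerpt above, and a real list would need tuning to avoid false positives. The sample request is kept truncated exactly as it appears in the log.

```python
# Hypothetical markers derived from the requests shown above:
# an attempt to overwrite a PHP superglobal, or a parameter whose
# value is a remote URL (typical of remote file inclusion probes).
SUSPICIOUS_MARKERS = ("_server[", "=http://", "=https://")


def looks_like_scanner(request_path: str) -> bool:
    """Return True if the request string contains a suspicious marker."""
    lowered = request_path.lower()
    return any(marker in lowered for marker in SUSPICIOUS_MARKERS)


# Request string copied from the log, truncation marker included.
probe = "/doc.html//?_SERVER[DOCUMENT_ROOT]=http://www.[...]"

print(looks_like_scanner(probe))         # True
print(looks_like_scanner("/tour.html"))  # False
```

Note that visitor 70 also requests plain `/`, so the flag has to be stored on the visitor, not only on the single request, if all of its log entries are to be dropped.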

What to do next?

Add a field in the visitor table to mark a visitor as a bot.
Mark a visitor whose first request is for /robots.txt as a crawler/bot.
Do not log the requests of the bots.
Merge the AWStats robots definitions into a simpler regex/substring match to catch the robots.
Add small heuristics for the stupid security scanners. One could perform a small check on the request string to mark them and drop the corresponding logs.
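The bot field and the substring matching could be sketched as follows in Python. Everything named here is an assumption for illustration: the `Visitor` class stands in for a row of the visitor table, `is_bot` for the new field, and `BOT_TOKENS` is a short hand-picked list, not the actual AWStats robots definitions. The user agent strings are copied, truncation included, from the visitor table excerpt.

```python
from dataclasses import dataclass

# Hypothetical, hand-picked tokens; NOT the real AWStats robots list.
BOT_TOKENS = ("bot", "crawl", "spider", "slurp")


@dataclass
class Visitor:
    id: int
    user_agent: str
    is_bot: bool = False  # the new field in the visitor table


def mark_if_bot(visitor: Visitor) -> Visitor:
    """Set is_bot when the user agent contains one of the known tokens."""
    user_agent = visitor.user_agent.lower()
    if any(token in user_agent for token in BOT_TOKENS):
        visitor.is_bot = True
    return visitor


visitors = [
    Visitor(83, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/530.5 [...]"),
    Visitor(84, "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"),
    Visitor(85, "DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; [...]"),
    Visitor(70, "Mozilla/5.0"),
]
for visitor in visitors:
    mark_if_bot(visitor)

print([v.id for v in visitors if v.is_bot])  # [84, 85]
```

Visitor 70, with its bare `Mozilla/5.0` string, slips through this kind of matching, which is why the separate check on the request string is still needed.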

I am going to work on that this afternoon and will report the results.