In my previous post, I wrote about unique user session tracking. Here is what I ended up building to implement it in practice. The approach is currently being tested by tracking the unique visitors on www.indefero.net. I will then cross-check the results against the Google Analytics data of the account to assess the quality of the idea.
The storage is composed of two tables, one for the visitors and one for the logs. The visitor table is needed because the goal is to track the unique visitors in real time. To reduce the number of lookups in this visitor table, the information is cached using Memcached.
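As a rough illustration of the caching idea, here is a minimal cache-aside sketch in Python. It is not the code running on the site: a plain dict stands in for the Memcached client, and the names `VisitorCache` and `load_visitor_id`, the `visitor:` key prefix and the 30 minute expiry are all assumptions made for the example.

```python
import time


class VisitorCache:
    """Cache-aside lookup of visitor ids; a dict stands in for Memcached."""

    def __init__(self, load_visitor_id, ttl=30 * 60, clock=time.time):
        # load_visitor_id is a hypothetical function that queries the
        # visitor table and returns an id, or None if nothing matches.
        self._load = load_visitor_id
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expires_at, visitor_id)

    def get(self, cookie):
        key = "visitor:" + cookie
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no query on the visitor table
        visitor_id = self._load(cookie)  # cache miss: ask the database
        if visitor_id is not None:
            self._entries[key] = (now + self._ttl, visitor_id)
        return visitor_id
```

With something like this in front of the table, a returning visitor who keeps the same cookie only costs one database lookup per expiry period.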
The visitor table stores:
- IP address;
- User agent;
- Cookie value;
- Creation time stamp;
- Last seen date, updated at most once every 30 minutes.
The log table stores:
- visitor (foreign key to the visitor table);
- page seen;
- time stamp.
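The two tables above could look something like the following sketch, written in Python with SQLite only to keep it self-contained. The table names, column names, indexes and the `touch_last_seen` helper are assumptions for illustration; only the list of stored fields and the 30 minute rule come from the description above.

```python
import sqlite3

# Hypothetical names: the post only lists which fields are stored.
SCHEMA = """
CREATE TABLE visitor (
    id         INTEGER PRIMARY KEY,
    ip         TEXT NOT NULL,
    user_agent TEXT NOT NULL,
    cookie     TEXT,
    created_at INTEGER NOT NULL,  -- Unix timestamp
    last_seen  INTEGER NOT NULL   -- refreshed at most once every 30 minutes
);
CREATE INDEX visitor_cookie ON visitor (cookie);
CREATE INDEX visitor_ua_ip ON visitor (user_agent, ip);
CREATE TABLE log (
    id         INTEGER PRIMARY KEY,
    visitor_id INTEGER NOT NULL REFERENCES visitor (id),
    page       TEXT NOT NULL,
    seen_at    INTEGER NOT NULL   -- Unix timestamp
);
"""

LAST_SEEN_INTERVAL = 30 * 60  # seconds


def create_tables(conn):
    conn.executescript(SCHEMA)


def touch_last_seen(conn, visitor_id, now):
    """Refresh last_seen only if the stored value is at least 30 minutes old.

    Returns True when the row was actually updated.
    """
    cur = conn.execute(
        "UPDATE visitor SET last_seen = ? WHERE id = ? AND last_seen <= ?",
        (now, visitor_id, now - LAST_SEEN_INTERVAL),
    )
    return cur.rowcount == 1
```

Putting the 30 minute condition in the `WHERE` clause keeps the throttle to a single statement, so a busy visitor does not trigger a write on every page view.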
To find a visitor in the visitor table, I first search by cookie and, if no cookie is available, by the user agent/IP address combination. The real trick is the handling of the missing cookie. In my case, I log just before sending the response, which means that for a new visitor or a visitor without a cookie, I already have a new cookie at hand. When checking for the visitor in the table, if the user agent/IP matches but the cookie does not, I update the cookie in the table, because I have no idea whether the visitor will accept the cookie this time or not. This update could be a performance problem.
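The lookup order described above can be sketched as follows. This is a simplified Python/SQLite version under assumed names (`open_store`, `find_or_create_visitor`, the column names), with a random UUID standing in for however the real cookie value is generated.

```python
import sqlite3
import time
import uuid


def open_store():
    """In-memory store with a minimal visitor table (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visitor (id INTEGER PRIMARY KEY, ip TEXT, user_agent TEXT,"
        " cookie TEXT, created_at INTEGER, last_seen INTEGER)"
    )
    return conn


def find_or_create_visitor(conn, cookie, ip, user_agent, now=None):
    """Return (visitor_id, cookie_to_send).

    cookie is the value received with the request, or None if there was none.
    """
    now = int(time.time()) if now is None else now

    # 1. Cookie lookup, when the browser sent one.
    if cookie is not None:
        row = conn.execute(
            "SELECT id FROM visitor WHERE cookie = ?", (cookie,)
        ).fetchone()
        if row is not None:
            return row[0], cookie

    # No usable cookie: a fresh one goes out with the response.
    new_cookie = uuid.uuid4().hex

    # 2. Fall back on the user agent/IP address combination.
    row = conn.execute(
        "SELECT id FROM visitor WHERE user_agent = ? AND ip = ?"
        " ORDER BY last_seen DESC LIMIT 1",
        (user_agent, ip),
    ).fetchone()
    if row is not None:
        # Store the fresh cookie, since the visitor may accept it this time.
        conn.execute(
            "UPDATE visitor SET cookie = ? WHERE id = ?", (new_cookie, row[0])
        )
        return row[0], new_cookie

    # 3. Unknown visitor: create the row.
    cur = conn.execute(
        "INSERT INTO visitor (ip, user_agent, cookie, created_at, last_seen)"
        " VALUES (?, ?, ?, ?, ?)",
        (ip, user_agent, new_cookie, now, now),
    )
    return cur.lastrowid, new_cookie
```

The `UPDATE` in step 2 is the write that runs on every request from a visitor who refuses cookies, which is where the possible performance cost comes from.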
Basically, I first perform a cookie check and then fall back on the user agent/IP address combination. This is running at the moment on indefero.net (only the presentation website, not the hosted forges) and I will compare the results with the Google Analytics results in 24 or 48 hours. One thing that is already better than GA is that I can see the bots. Maybe I should add a bot flag to the visitor table to easily exclude them when doing reports.