The Visitors of Cheméo Changed

Last week I pulled just under six days of access logs off the Cheméo server. I only wanted to check whether my handling of the bot spamming the subscription form was effective. I came away with something I did not expect: the audience of the site has almost entirely changed.

The boring part first

Cheméo runs on a single box. Caddy terminates TLS and serves the static files. Behind it are two Go binaries, a boring web server that answers the requests and a custom indexer that runs quietly in the background, plus a Python compute node running the property prediction code. SQLite is the store underneath. No Kubernetes, no autoscaling, nobody on call. One server, in one rack.

Over a full month that box serves about 10.6 million requests and nearly 1 TB of data. That includes roughly 5.6 million HTML pages and works out to a steady 3 to 4 Mbps, day and night. Uptime is close to 100%, and I have not had to log in and fix anything in a long time.
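The monthly totals and the bandwidth figure can be cross-checked with a few lines of arithmetic. This is only a rough sketch: it assumes 1 TB means 10**12 bytes and a month means 30 days, neither of which is stated above, and the function names are made up for the illustration.

```python
SECONDS_PER_DAY = 86_400


def average_mbps(total_bytes: float, days: float) -> float:
    """Average throughput in megabits per second for total_bytes sent over days."""
    return total_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6


def requests_per_second(total_requests: float, days: float) -> float:
    """Average request rate over the same period."""
    return total_requests / (days * SECONDS_PER_DAY)


# Assumed inputs: 1 TB taken as 10**12 bytes, a month taken as 30 days.
mbps = average_mbps(1e12, 30)
rps = requests_per_second(10.6e6, 30)
print(f"{mbps:.2f} Mbps, {rps:.1f} requests/s")
```

Under those assumptions the average lands at about 3 Mbps and about four requests per second, which fits the 3 to 4 Mbps range.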

And the box is doing real work the whole time. The busiest path on the whole site is /similar, the chemical similarity search, with about 150,000 hits in the window. Every one of those runs a fingerprint comparison against the full database, exercising the chemistry code I hand-rolled in assembly more than ten years ago. /predict, the property prediction, takes another 66,000, this time exercising the Python compute backend. The bots are driving the actual machinery, not grazing on static HTML, and it holds up.

This is the part I am quietly happy about. Every year there is a new way to deploy a web application, and every year I am glad I did not rewrite Cheméo to use it. Boring technology, chosen with care, keeps running while you go and do something else.

Almost nobody out there is human

Now the logs themselves. Roughly 99% of the traffic is automated. The real (or direct) human audience is small: somewhere between 50,000 and 100,000 page views a month, call it two to three thousand a day.

You can read it straight off the assets. The main stylesheet, the one a real browser loads once per page, was fetched about 110,000 times a month. That is the generous ceiling on real page renders, because some rendering bots pull the CSS too. About 98% of the so-called page views never asked for a single asset of the page they supposedly loaded. They took the HTML and left.
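The 98% follows directly from the two monthly counts. A minimal sketch, taking the 5.6 million HTML pages and the 110,000 stylesheet fetches as given, and treating every stylesheet fetch as one rendered page (the generous assumption):

```python
def share_without_assets(html_pages: int, stylesheet_fetches: int) -> float:
    """Fraction of HTML page views with no matching stylesheet fetch.

    Assumes at most one stylesheet fetch per rendered page, so the result
    is a lower bound on the share of page views that were never rendered.
    """
    return 1 - stylesheet_fetches / html_pages


share = share_without_assets(5_600_000, 110_000)
print(f"{share:.1%}")
```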

Some of the bots try to dress up to hide themselves. The traffic carrying a normal browser user-agent averaged 0.67 asset requests per page, where a real browser pulls five to fifteen. It was dominated by a handful of fixed user-agent strings, with 231,519 requests coming from one single Windows and Chrome string. That is not a person. That is a scraper wearing a browser costume, and a bad one!
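The assets-per-page ratio is simple to compute once the log is reduced to (user-agent, path) pairs. The sketch below is one way to do it, not the script actually used for the numbers above: the list of asset extensions, the sample paths and the user-agent labels are all hypothetical, and parsing the raw log lines is left out.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: these extensions count as "assets"; everything else is a page.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".svg", ".ico", ".woff2")


def is_asset(path: str) -> bool:
    """True if the request path (ignoring any query string) looks like an asset."""
    return urlsplit(path).path.lower().endswith(ASSET_EXTENSIONS)


def assets_per_page(requests):
    """Map each user-agent to its asset-requests-per-page ratio.

    requests: iterable of (user_agent, path) pairs already parsed from the log.
    User-agents that never requested a page are skipped.
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    for user_agent, path in requests:
        if is_asset(path):
            assets[user_agent] += 1
        else:
            pages[user_agent] += 1
    return {ua: assets[ua] / n for ua, n in pages.items()}


# Hypothetical sample: one browser-like client and one scraper-like client.
sample = [
    ("browser-a", "/compound/1"),
    ("browser-a", "/static/main.css"),
    ("browser-a", "/static/app.js"),
    ("browser-a", "/favicon.ico"),
    ("scraper-b", "/compound/1"),
    ("scraper-b", "/compound/2"),
    ("scraper-b", "/compound/3"),
    ("scraper-b", "/static/main.css?v=2"),
]
ratios = assets_per_page(sample)
print(ratios)
```

In this toy sample the browser-like client pulls three assets for its one page, while the scraper-like client averages a third of an asset per page, the same kind of gap as between 0.67 and five to fifteen.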

A few thousand real readers a day, sustained for years with no marketing and no newsletter, is more than fine by me. The site does its job for the people who need a boiling point or a heat of formation.

The part that actually changed

Here is what I find interesting. Split the bots by what they are for:

Who	Share of requests
AI crawlers	37%
Browser-UA (mostly scrapers)	30%
Search engines	20%
SEO and marketing bots	8%
Everything else	5%
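A split like the one in the table can be approximated by matching substrings of the user-agent. The following is a hedged sketch rather than the classification actually used: the AI and search tokens are the crawler names mentioned in this post, the two SEO tokens are examples that the post does not name, and the "starts with Mozilla/" test for browser-like traffic is a simplification that also catches real human browsers.

```python
from collections import Counter

# Order matters: bot tokens are checked before the generic browser test,
# because many crawlers also send a "Mozilla/5.0" prefix.
CATEGORIES = [
    ("AI crawlers", ("amazonbot", "chatgpt-user", "oai-searchbot",
                     "claudebot", "gptbot", "bytespider")),
    ("Search engines", ("sogou", "bingbot", "googlebot",
                        "yandexbot", "baiduspider")),
    # Assumption: example SEO bot tokens, not named in the post.
    ("SEO and marketing bots", ("ahrefsbot", "semrushbot")),
]


def classify(user_agent: str) -> str:
    """Return the category name for one user-agent string."""
    ua = user_agent.lower()
    for name, tokens in CATEGORIES:
        if any(token in ua for token in tokens):
            return name
    if ua.startswith("mozilla/"):
        return "Browser-UA (mostly scrapers)"
    return "Everything else"


def shares(user_agents):
    """Share of requests per category, given one user-agent per request."""
    counts = Counter(classify(ua) for ua in user_agents)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


print(classify("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(shares(["GPTBot/1.1", "curl/8.5.0"]))
```

Checking the named bots first is the one design choice that matters here: a crawler that announces itself inside a Mozilla-style string would otherwise be counted as a browser.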

The single biggest consumer of Cheméo, by requests and by bandwidth, is no longer a search engine. It is the AI crawlers, the ones doing training and retrieval.

AI crawler	Requests in the window	Operator
Amazonbot	257,925	Amazon
ChatGPT-User / OAI-SearchBot	174,846	OpenAI
ClaudeBot	115,945	Anthropic
GPTBot	102,617	OpenAI
Bytespider	36,849	ByteDance

For comparison, the classic search crawlers over the same window were Sogou, Bingbot, Googlebot, YandexBot and Baiduspider. Together they came to about a fifth of the traffic.

I have looked at these logs before, a few years back. Back then the story was search, almost entirely. Googlebot, Bingbot and the rest were something like 80% of the bot traffic, and the AI crawlers did not exist yet.

Today they are the largest single category, and search is down to a fifth. That is the pattern change, and it happened fast. In a few years, the dominant reason a machine visits Cheméo went from "index this page for a search result" to "ingest this data for a model".

What it means for the data

This changes what being found means. For fifteen years, being useful meant ranking in Google so an engineer could land on the right compound page. More and more, the same engineer gets the number straight from a chatbot, and the chatbot got it by crawling Cheméo. The data still comes from here. The reader in the middle is now a model.

So I am doing the obvious thing. I am writing a clear page on where Cheméo's data comes from and how it is collected and validated. Not for SEO, but so the models pulling the data have the provenance right there and can represent and cite it correctly. If the machines are going to read the site more than people do, they should at least get the story right.

I built Cheméo for chemical engineers. Most of its readers are now machines, and the little server in the rack handled the whole switch without me touching it.