The Problem with Performance Logging

To run a service like Indefero, you need to log a long list of metrics to follow the load on the system, find the bottlenecks and predict the future needed capacity. To do that, a very powerful system is Graphite, the only issue is that it is only storing and graphing numerical values. Of course, you cannot do different, but the problem is: correlation.

Basically: Once I see that every now and then component is not performing well, how can I drill down in my data to find the reason?

Graphite tells you: this day from 14:05 to 14:07, the rendering of a git tree view was slow. Good to know, the following question is of course: why? If you store more metrics, you can maybe find that I/O was slow on the server X, you can graph together many metrics and visually correlate them. But then, why was I/O slow?

At this point, you need to go one level deeper and take a look at the logs coming from server X from 14:05 to 14:07. This can bring you up to the application level where you figure out that a client repeatedly accessed a page which triggered a git command with a large output, thus loading the server. But to do that you need to access the logs too.

So, Graphite is wonderful, but what I need is that after identifying the subsystem and time range where we have an issue, being able to simply scan through all the corresponding logs in the time range. This would be a kind of integration between Graphite and Graylog2.

My problem now is that Graylog2 is overkill. That is, it tries to provide full text search on the logs, the result is that it requires a very big machinery where I just need aggregation of the logs and the equivalent of a time base search range with a filtering by component, for example webapp.backend.git.

This annoys me, I do not want to build a system by myself.