A few days ago Cheméo's laboratories went live. The labs run software experiments in the field of chemical and physical properties. They are a kind of sandbox where ideas can be tried without disturbing the main Cheméo website.
The labs are running on top of Céondo's private Platform as a Service (PaaS). This platform will soon host all the services we deliver, from our products Cheméo and Indefero to simpler websites like ceondo.com. Just in case, a status website will be kept independent, using a different technology with a different provider. I will soon write a bit more about this private PaaS.
These are exciting times, the best way to close 2011 and start 2012.
To run a service like Indefero, you need to log a long list of metrics to follow the load on the system, find the bottlenecks and predict future capacity needs. Graphite is a very powerful system for this; the only issue is that it only stores and graphs numerical values. Of course, it cannot do otherwise, but the problem is: correlation.
Basically: once I see that, every now and then, a component is not performing well, how can I drill down into my data to find the reason?
Graphite tells you: on this day, from 14:05 to 14:07, the rendering of a git tree view was slow. Good to know, but the next question is of course: why? If you store more metrics, you may find that I/O was slow on server X; you can graph many metrics together and visually correlate them. But then, why was I/O slow?
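Getting such metrics into Graphite is simple: its Carbon daemon accepts a plaintext line protocol, one data point per line, on TCP port 2003 by default. A minimal sketch in Python (the metric path is a made-up example):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    # Graphite's plaintext protocol: "<metric path> <value> <unix timestamp>\n"
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="localhost", port=2003):
    # Carbon listens for the plaintext protocol on TCP port 2003 by default.
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(format_metric(path, value).encode("ascii"))
    finally:
        sock.close()

# e.g. send_metric("indefero.git.treeview.render_ms", 120)
```

One such call per timed operation is enough; Carbon aggregates the points into its Whisper files and Graphite does the graphing.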
At this point, you need to go one level deeper and look at the logs coming from server X from 14:05 to 14:07. This can bring you up to the application level, where you figure out that a client repeatedly accessed a page which triggered a git command with a large output, thus loading the server. But to do that, you need access to the logs too.
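The "scan the logs between 14:05 and 14:07" step is mechanically simple if the log lines carry a timestamp prefix. A sketch, assuming lines starting with "YYYY-mm-dd HH:MM:SS":

```python
from datetime import datetime

def lines_in_range(lines, start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Yield log lines whose leading timestamp falls within [start, end]."""
    for line in lines:
        stamp = line[:19]  # assumed "YYYY-mm-dd HH:MM:SS" prefix
        try:
            t = datetime.strptime(stamp, fmt)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if start <= t <= end:
            yield line
```

Point it at the aggregated logs of server X with the time range Graphite gave you, and you have your drill-down.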
So, Graphite is wonderful, but what I need is, after identifying the subsystem and time range with an issue, to simply scan through all the corresponding logs in that time range. This would be a kind of integration between Graphite and Graylog2.
My problem now is that Graylog2 is overkill. It tries to provide full-text search on the logs, and the result is that it requires very big machinery, where I just need aggregation of the logs and the equivalent of a time-range search with filtering by component.
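To make the scope concrete, what I need amounts to something this small: a store of log entries grouped by component, queryable by time range. A hypothetical in-memory sketch, not Graylog2:

```python
import bisect
from collections import defaultdict

class LogStore(object):
    """Minimal aggregated log store: append per component,
    query by component and time range. A sketch only."""

    def __init__(self):
        # component -> sorted list of (timestamp, message)
        self._logs = defaultdict(list)

    def append(self, component, timestamp, message):
        # Keep each component's entries sorted by timestamp on insert.
        bisect.insort(self._logs[component], (timestamp, message))

    def query(self, component, start, end):
        # Return all entries for the component with start <= t <= end.
        return [(t, m) for (t, m) in self._logs[component]
                if start <= t <= end]
```

No indexing of message contents, no full-text search: just the two operations the drill-down workflow actually needs.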
This annoys me; I do not want to build such a system myself.
Decane is a simple molecule, but also the name of our new database server. It is a 24 GB RAM / 240 GB SSD server with a lot of power to provide blazing fast data processing. In the next few days it will go through the standard Ganeti setup and the Indefero PostgreSQL database will be migrated over. Depending on the performance, we may migrate more database VMs to it.
This is the first time I am putting a server with SSD drives in production. I have been a heavy user of SSD drives for my desktop/laptop systems over the past two years and I must say, I will never go back to traditional drives. Of course, the amount of data stored is not the same on a desktop as on a server.
So, yes, a performance increase for Indefero is on the way!
Update: Decane just joined the Ganeti cluster:
# gnt-node list
Node             DTotal  DFree  MTotal MNode MFree Pinst Sinst
node1.ceondo.net   2.7T   2.2T   11.8G  4.0G  8.2G     4     0
node2.ceondo.net   2.7T   2.4T   11.8G  2.9G  9.2G     5     0
node3.ceondo.net 194.3G 194.3G   23.6G  147M 23.4G     0     0
Update 2011.11.09: The base backup of the PostgreSQL database is on the way. This is a huge rsync job and it is of course slowing down the system. Please be patient... thank you.
Update 2011.11.09 20:00 UTC: Ok, the new server is now acting as a warm standby for the database. This will allow a fast "failover" to the new database server after the testing period.
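For the curious, a PostgreSQL warm standby is driven by WAL shipping: the master archives each completed WAL segment, and the standby replays the segments continuously until promoted. A minimal sketch of the two config sides (paths and hostname are made up for illustration, not our actual setup):

```
# postgresql.conf on the master
wal_level = archive
archive_mode = on
archive_command = 'rsync -a %p standby:/var/lib/postgresql/wal_archive/%f'

# recovery.conf on the standby
restore_command = 'pg_standby /var/lib/postgresql/wal_archive %f %p'
```

pg_standby waits for each next segment to appear, which is what keeps the standby "warm"; removing recovery.conf conditions and restarting promotes it to master.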
Update 2011.11.10 10:28 UTC: The main application server will be unavailable for a short time while it is connected to its virtual LAN, so that it can communicate directly with the warm standby over a private network. Done.
Update 2011.11.10 11:46 UTC: Now that the connection at the switch level supports the VLAN, it needs to be configured at the host and VM level. This will again trigger short downtimes here and there.
Update 2011.11.10 12:50 UTC: Ok, now setting up a second warm standby, which will replace the current one once the new SSD-powered server starts acting as master. Done.
Update 2011.11.10 13:34 UTC: Ok, things are running as expected. Around 21:00 UTC today, Indefero will be stopped for about 15 minutes; this will ensure that the warm standby has the latest version. The warm standby will be brought online as master, and the web app will then connect to the new master on the new server. Immediately after, I will start to populate the new warm standby. Basically, a bit of cascading.
Update 2011.11.10 19:20 UTC: Too tired to do the cascading; it is never good to do this kind of work when not really fresh. I will perform it tomorrow. Yes, it will be during the day, but it should only be about 15 minutes of downtime. So expect a downtime of at most 30 minutes between 09:00 and 12:00 UTC on November 11.