If you noticed a slowdown in the past minutes, one of our provider's routers had some issues. This slowed down the services for a short period of time. As you can see on the following graph, the GET requests we use to monitor the response time of the services suddenly went through the roof. A 20 second response time is the equivalent of dead...
At the moment, we are moving around the VMs (Virtual Machines) powering our PaaS (Platform as a Service). The platform has one virtual machine receiving the code updates, and the Cheméo application is updated from it. The application itself runs on another VM behind the web server. The work we are doing is to decouple the web server from the application. So, at the end we get:
This seems a bit complicated, but the advantage is that it is very easy to add new application server VMs to handle more requests.
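To make the idea concrete, here is a minimal sketch (hypothetical hostnames, plain Python, not the actual Cheméo setup) of how the front web server can spread the incoming requests over several application VMs; adding capacity then boils down to adding one more entry to the list.

```python
from itertools import cycle

# Hypothetical application server VMs sitting behind the front web server.
# Scaling out is just a matter of adding one more entry to this list.
APP_SERVERS = ["app-vm-1.internal:8000", "app-vm-2.internal:8000"]

_backends = cycle(APP_SERVERS)

def pick_backend():
    """Round-robin choice of the application VM that handles the next request."""
    return next(_backends)

if __name__ == "__main__":
    for _ in range(4):
        print(pick_backend())
```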
Anyway, the split requires moving components here and there and synchronizing the data. This does not always work as expected, which is why one thing you can expect for sure is a bit of downtime during the next few days.
As a nice end-of-year present, Cheméo is now running on Céondo's private Platform as a Service (PaaS). This is a huge change compared to the previous system. After the usual period of ironing out the system, the infrastructure will be more robust and flexible, allowing fast iterative development. The goal is simple: Cheméo must become the reference for chemical engineering data. We are not targeting biological activity; we focus only on chemical and process engineers. 2012 is going to be fun.
A few days ago, Cheméo's laboratories went live. The labs run software experiments in the field of chemical and physical properties. They are a kind of sandbox where ideas can be tried without disturbing the main Cheméo website.
The labs are running on top of Céondo's private Platform as a Service (PaaS). This platform will soon host all the services we deliver, from our products Cheméo and Indefero to simpler websites like ceondo.com. Just in case, a status website will be kept independent, using another technology with a different provider. I will soon write a bit more about this private PaaS.
These are exciting times, the best way to close 2011 and start 2012.
With his thinking about Redis, Kenne Jima explores the idea that the next holy grail for the NoSQL trend will be to evaluate how a system behaves when the amount of data to manage is 10 times bigger than the available RAM of the system.
This big picture view is really interesting, but from my limited experience of "web scale" there is one big issue: this approach assumes that all web scale problems are similar in nature. Let me explain this a bit more with some examples of web scale:
These four examples operate easily at web scale but with very different usage patterns, which makes the development of a generic answer to the problem very difficult. Do not forget that in the following string of thoughts, the scale is the web, that is, at least hundreds of GB of data.
The meta web scale, whose goal is to make sense of the complete web, can be split into two parts:
When you think about it, storing the index in RAM is totally different from storing raw data and supporting data on disk. This means that one system covering both needs is nearly unfeasible. Also, you need to have the index fully in RAM if you want millisecond response times. It was said that one search request could hit up to 1000 servers in the Google farm; if you hit them in a tree fashion, depending on your buckets, you may have, let's say, a 5 to 10 server depth. If each level requires 3 seeks (10 ms each), and you add the read of the data, maybe 2 ms of communication latency, and the need to aggregate the data, you would very often end up waiting something like 3 to 5 seconds for an answer.
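A quick back-of-envelope sketch with the numbers assumed above (they are assumptions, not measurements) shows how fast an on-disk index eats the latency budget:

```python
# Back-of-envelope latency of a search query when the index lives on disk.
# All figures are the assumptions from the text, not measured values.
tree_depth = 10             # servers traversed one after the other for a query
seeks_per_level = 3         # disk seeks per server
seek_time = 0.010           # 10 ms per seek
network_latency = 0.002     # 2 ms of communication per hop
read_and_aggregate = 0.015  # assumed time to read and merge the data per level

per_level = seeks_per_level * seek_time + network_latency + read_and_aggregate
total = tree_depth * per_level
print(f"per level: {per_level * 1000:.0f} ms, total: {total:.2f} s")
# Around half a second in the best case; add queueing and contention on the
# disks and you quickly land in the multi-second range, against a few
# milliseconds when the index is fully in RAM.
```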
So, the index is in RAM, fully customized to do one thing and do it well.
In the post about the Twitter streaming API, we learn that Twitter produces about 35 KB of data per second (including the packaging of the network data in JSON format). But when you really think about the Twitter usage, it is:
So, they only need to be very efficient with 48 × 3600 × 35 KB ≈ 6 GB. Yes, only about 6 GB of data. You can increase that to 600 GB if you want: as you can get a system with 12 GB of RAM for 100 € per month, you can keep all of it in RAM for about 5000 €/month. Of course, at such a scale you have staff and go colocation, and it is even cheaper.
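The arithmetic is easy to check (the 35 KB/s rate and the 48-hour hot window are the assumptions from above):

```python
# Size of the "hot" Twitter data for a 48-hour window at ~35 KB/s.
rate_kb_per_s = 35
hot_window_s = 48 * 3600
hot_data_gb = rate_kb_per_s * hot_window_s / (1024 * 1024)
print(f"{hot_data_gb:.1f} GB of hot data")  # roughly 6 GB
```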
So, Twitter can keep only a fraction of its data hot in RAM, with nearly no need to access the old data. This makes the access performance of the old data a non-problem; one or two seeks will be fine.
The blog hosting platform is basically disk bound. The general usage pattern, outside of the editors, is: "search something on Google, click, see the page, go somewhere else". You may have a search engine (mostly used by editors) and a timeline (for the feeds), but these fall into the Twitter/search engine problem.
In this case, only an RDBMS (or a simple file system based storage) with asynchronous updates of the indexes/feeds is needed. Simple, and any kind of well designed "bulk storage" will work. You will be able to go a long way with Lucene for search, so you do not really need more.
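A minimal sketch of that pattern (toy in-memory stand-ins for the RDBMS and for Lucene, not production code): the write path only touches the primary storage, and the index update is queued and applied later by a background worker.

```python
import queue
import threading

posts = {}                   # stand-in for the RDBMS / file system storage
index_queue = queue.Queue()  # work queue for asynchronous index updates
search_index = {}            # toy inverted index standing in for Lucene

def publish_post(post_id, text):
    """Write path: store the post, then defer the index update."""
    posts[post_id] = text
    index_queue.put(post_id)

def index_worker():
    """Background worker keeping the search index eventually up to date."""
    while True:
        post_id = index_queue.get()
        for word in posts[post_id].lower().split():
            search_index.setdefault(word, set()).add(post_id)
        index_queue.task_done()

threading.Thread(target=index_worker, daemon=True).start()
publish_post(1, "Distillation column sizing notes")
index_queue.join()
print(search_index.get("distillation"))  # {1}
```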
Here it gets interesting: you have a lot of data, and when you process it you really need to process it fast. Here, the column oriented design, with the capacity to compress data, is really needed to minimize seek time. We are really in the case where an extremely efficient 1:10 ratio is required. But you cannot use a generic NoSQL system, because they are not column oriented and do not offer the per column compression needed to really achieve high performance. You end up using a system designed for column based access.
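A tiny illustration of why the column layout helps the compressor (synthetic data, zlib as a stand-in for the real per-column codecs used by column stores):

```python
import json
import zlib

# Synthetic rows: a repetitive categorical column and a numeric column.
rows = [{"country": "DE" if i % 2 else "FR", "value": i} for i in range(10000)]

# Row-oriented layout: whole records stored one after the other.
row_blob = json.dumps(rows).encode()

# Column-oriented layout: each column serialized (and compressed) on its own,
# so similar bytes sit next to each other and usually compress noticeably better.
countries = json.dumps([r["country"] for r in rows]).encode()
values = json.dumps([r["value"] for r in rows]).encode()

print("row layout   :", len(zlib.compress(row_blob)), "bytes")
print("column layout:", len(zlib.compress(countries)) + len(zlib.compress(values)), "bytes")
```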
So, different data access patterns coming from different business needs and different constraints are the reasons why we have so many alternatives. These alternatives were mostly developed by a company to solve a specific problem when it was already at web scale (or on the trajectory to reach it); they answer the sweet spot requirements of that company to match its business goals.
But which one does not really matter. The real point is that you need to use a system that is reasonably well designed, with good base performance (nearly all of them provide that now), and that lets you develop fast and answer your business goals, because the day you approach web scale you will need to go custom. So, your building blocks need to play well and integrate themselves with others for the transition to web scale.
Thrift, Protocol Buffers: why are the big players developing such systems? Because the real key to achieving web scale is integrating many components efficiently. The same way you, as a person, work better when you integrate all the data coming from your five senses into one picture of your world, to give one answer to your visitors you will need to tap into many different sources with different constraints.
The future is integrated, and the systems built from the ground up to be integrated and to communicate well with others will win the race. Today I see three very promising components:
I do not think silver bullet (and when you set a challenge like the 1:10 ratio, you tend to ask for a silver bullet answer), I think well integrated diversity. It is a bit harder to manage, but it adds flexibility and freedom. The challenges are here, not really in finding another perfect hammer.
The last step, from the Joback descriptors to the property prediction, was not that complicated and is now available. Cheméo allows you to predict the properties of about 66,000 molecules and chemicals using the Joback and Reid property prediction method. This method is not necessarily the best one, but it is one of the most well known.
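For the curious, the Joback and Reid method is nothing more than summing group contributions and plugging the sum into a simple equation. Here is a minimal sketch for the normal boiling point of acetone (the two group values are copied from the published Joback tables; double check them before any serious use):

```python
# Joback and Reid estimate of the normal boiling point:
#   Tb = 198.2 K + sum(n_i * tb_i) over the groups of the molecule.
# Group contributions (tb, in K) copied from the published Joback tables.
TB_CONTRIB = {
    "-CH3": 23.58,
    ">C=O (non-ring)": 76.75,
}

def joback_tb(groups):
    """groups: mapping Joback group -> number of occurrences in the molecule."""
    return 198.2 + sum(n * TB_CONTRIB[g] for g, n in groups.items())

# Acetone decomposes into two -CH3 groups and one non-ring >C=O group.
print(f"Tb(acetone) ~ {joback_tb({'-CH3': 2, '>C=O (non-ring)': 1}):.1f} K")
# About 322 K against an experimental value of roughly 329 K.
```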
At the moment it is not yet possible to use the Joback and Reid descriptors and equations to regress your own parameters and build your own custom models, but this is planned. Last week I have been experimenting with a high performance computing system to allow instant property prediction with confidence intervals using your own models. It will take a couple of weeks to implement in production, but once done, it should be really interesting, especially for research purposes.
Cheméo now provides, for a bit more than 66,000 molecules, the corresponding Joback and Reid decomposition. The development of the Joback and Reid decomposition engine is an important step towards the setup of a property prediction engine.
As you can see for the Propanidid molecule, we sometimes have no publicly available data. Adding group contribution method calculations will give researchers and engineers better screening and help them not miss a possible candidate for their process.
The next step is to allow you to search by decomposition, for example, all the molecules with -CH3 and -CH2- groups. That would be interesting.
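Searching by decomposition is essentially a filter over the stored group counts; a toy sketch of what such a query looks like (illustrative molecules and group labels only):

```python
# Toy decomposition table: molecule -> {group: count}.
decompositions = {
    "propane":  {"-CH3": 2, "-CH2-": 1},
    "methanol": {"-CH3": 1, "-OH": 1},
    "butane":   {"-CH3": 2, "-CH2-": 2},
}

def molecules_with(*groups):
    """Return the molecules whose decomposition contains every requested group."""
    return [name for name, d in decompositions.items() if all(g in d for g in groups)]

print(molecules_with("-CH3", "-CH2-"))  # ['propane', 'butane']
```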
As last week was a great week for Cheméo, I now need to take care of a lot of things I left a bit behind while rushing to get everything ready for the annual meeting. The best thing to do is again a new NNW, so here it is, Nothing New Week 4.
My goals are (and for once they are in the right order of priority):
At the moment, I am still really impressed by the synergies I get from developing Cheméo and Indefero at the same time. I learn a lot on one project that I then apply to the other. In terms of quality, they are now definitely better than if I had developed only one. But I must say, I do not think it is possible to go beyond two projects, and I have no idea whether this can last. Anyway, if it cannot last, it will be the result of having many customers. With customers comes money to hire the right people too, so...
Left to do:
What a week! I went to Denmark for the traditional annual meeting of CAPEC. My goal was to gather feedback on Cheméo, especially with respect to its modelling part. The other part of Cheméo is the documentation part, which is basically a convenient way to search and access physical and chemical data.
This documentation part was really well received. Many engineers currently use Google when trying to get information about chemical structures, so the ability to have a specialized, one stop place to access this information was really appreciated. One of the critical factors was speed, but also the direct link to the data source. They are going to use Cheméo as a way to search and then double check with the original source. They think of Cheméo as a Google for chemical properties.
The modelling part is more complex, as you can read in the overview, but I was able to run an informal session showing how it works. With one of the company engineers, we created a model and refined it in about 15 minutes; at the end he told me that doing the same with the tools he has takes him nearly a full day. A future happy user, ready to become a customer.
It is always extremely hard to evaluate the value you bring to the users of your software tools, but it is also a critical input to set the price point and derive a sustainable business from it. This is why the modelling/analysis part of Cheméo is going to be free from now until October this year. The goal is to get people to use the software on real problems, to get their feedback on where we need to improve, and to let them really evaluate the value we bring. This is the non scientific part of the business, but it needs to be done right to keep the scientific part happy.
The next 6 months are going to be fun.
It is now possible to create your own property prediction method with Cheméo. Ok, not fully, but you can now create linear models based on a UNIFAC decomposition of the molecules. At the moment, the way to select the data points used for the regression is a bit rudimentary, but it is working nicely. You get the full statistical results and you can see the corresponding predictions.
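Under the hood, such a linear model is an ordinary least-squares fit of a property against the group counts. A minimal sketch with made-up numbers (numpy only, not the actual Cheméo regression code):

```python
import numpy as np

# Group counts per molecule, columns = [-CH3, -CH2-, -OH]; made-up data.
X = np.array([
    [2, 1, 0],
    [2, 2, 0],
    [1, 0, 1],
    [2, 3, 1],
], dtype=float)
# Made-up property values (boiling-point-like numbers, in K) for these molecules.
y = np.array([268.3, 291.2, 314.7, 407.0])

# Ordinary least squares: property ~ intercept + sum(contribution_g * count_g).
A = np.hstack([np.ones((X.shape[0], 1)), X])
coefs, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and group contributions:", coefs)

# Predict a new molecule with 3 x -CH3 and 1 x -CH2-.
new_molecule = np.array([1.0, 3, 1, 0])
print("predicted property:", new_molecule @ coefs)
```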
The next steps are:
This work will be presented to the attendees of the CAPEC annual meeting in Copenhagen early June. I really think it can trigger some waves in the way people are working at the moment.
This week, the main work will be a sprint to get some functionality out for Cheméo. A sprint week is a week fully focused on a given project. I have found this approach to be the best way to move my projects ahead.
The focus will be on:
The challenge with Cheméo is that it ingests a large quantity of data which needs to be curated, deduplicated and merged. I am really happy to see the work of IUPAC on the InChI gaining more and more weight, because the current approach of relying most of the time on the CAS number is a nightmare: you cannot derive the CAS number from the molecular structure. Basically, you are always a "typo away" from a bad identifier for your components, and you need to deal with all the copyright nonsense from CAS (if you are living in the US). In Europe, you do not have copyright on the data in the databases, and the copyright on numbers that CAS tries to claim has never been tested in court.
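This is the practical difference: an InChI is computed from the structure itself, so a typo shows up as a mismatch instead of silently pointing at the wrong compound. A minimal sketch with RDKit (assuming RDKit is installed; it is not part of the Cheméo code base):

```python
from rdkit import Chem

# The identifier is derived from the structure, unlike a CAS registry number.
mol = Chem.MolFromSmiles("CC(=O)C")  # acetone
print(Chem.MolToInchi(mol))          # e.g. InChI=1S/C3H6O/c1-3(2)4/h1-2H3
print(Chem.MolToInchiKey(mol))       # fixed-length hash of the InChI
```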
Anyway, we are manipulating too many molecules, too fast, in too many different fields; the CAS registry cannot scale by manually giving each structure an arbitrary number. This is good, as it means the CAS monopoly will fail by itself.
Oh, a very nice 3D molecular viewer: CanvasMol.
It is with great pleasure that I am launching Cheméo today. Cheméo is a search engine for chemical properties, developed to serve the chemical and process industry.
At the moment the Cheméo database stores more than 100,000 components and will grow based on the availability of high quality experimental data.
I am especially pleased to release Cheméo as it is my first project in close collaboration with Dr Claude Leibovici. Claude has more than 30 years of experience in the field of chemical properties, starting with a thesis in quantum mechanics and followed by a career in the oil and gas industry.
You can try it and do not hesitate to tell me what you think about it.
This was a dream I had for a long time, but my free time and my funding did not match the requirements needed to accomplish it. Now I have the time, the money and the right collaborators, so it is with great pleasure that I am announcing ChemHQ.
The goal of ChemHQ is to become the Google of chemical properties. What does that mean? It means you will be able to search through hundreds of thousands of molecules by property, name, CAS number or any molecular descriptor.
For each molecule you will get a comprehensive list of properties, including predicted properties using the best QSAR models.
The ChemHQ portal (maybe the name will change) will be launched during Spring 2010 for a limited number of users and opened to a wider audience in early Summer 2010. The pricing will not break the bank, making high quality data finally accessible for research centers, students and small companies, while staying high enough to keep the quality excellent.
If you are interested in beta testing the offer, just let me know.