It may seem that we are relatively quiet at the moment, and yes, we are: in fact, we are like little bees, working hard without making noise.
Maybe you are not aware of it, but the hosted Indefero offer runs a single Indefero installation for the 3000 forges we are hosting. What happens is that when you request a page of a forge, the domain is matched against a list and the configuration is automatically set to use the right repositories, the right database, etc.
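A minimal sketch of that dispatch step, with made-up domain names and configuration keys (the real installation's lookup is of course more involved):

```python
# Map each hosted forge domain to its own configuration.
# Domains, paths, and database names here are hypothetical.
FORGE_CONFIGS = {
    "projectx.example.com": {"db": "forge_projectx", "repos": "/srv/repos/projectx"},
    "projecty.example.com": {"db": "forge_projecty", "repos": "/srv/repos/projecty"},
}

def configure_for_request(host):
    """Pick the right database and repository path for the requested domain."""
    config = FORGE_CONFIGS.get(host)
    if config is None:
        raise KeyError("unknown forge: %s" % host)
    return config

print(configure_for_request("projectx.example.com")["db"])  # forge_projectx
```

One table lookup per request is what makes a single installation able to serve thousands of forges, and it is also why everything ends up tied to that one monolithic system.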
When the system was set up, it followed the motto: "Do the simplest thing which can possibly work." Nice, but now we are suffering growing pains, as the simplest way to go was to build everything in a pretty monolithic system. Ouch.
So, we are hard at work on setting up a new system to break everything into more manageable components. This is a long-haul project, but it is needed to recover our agility in getting new versions online several times a day.
The idea is to make all the changes without changing anything for you, that is, everything will be done without touching the system from your point of view; then, once the migration to components is completed, new features will be rolled in. Small incremental changes are better in the long run.
This year Fall and Winter are going to be very busy seasons.
In his thoughts about Redis, Kenne Jima explores the idea that the next holy grail for the NoSQL trend will be to evaluate how a system behaves when the amount of data to manage is 10 times bigger than the available RAM of the system.
This big-picture view is really interesting, but from my limited experience of "web scale" there is one big issue: this approach assumes that all web-scale problems are similar in nature. Let me explain this a bit more with some examples of web scale:
These 4 examples operate easily at web scale but with very different usage patterns, which makes the development of a generic answer to the problem very difficult. Do not forget that in the following string of thoughts, the scale is the web, that is, at least 100's of GB of data.
The meta web scale, as its goal is to make sense of the complete web, can be composed of two parts:
When you think about it, storing the index in RAM is totally different from storing the raw data and support data on disk. This means that one system covering both needs is nearly unfeasible. Also, you need to have the index fully in RAM if you want millisecond response times. It was said that one search request could hit up to 1000 servers in the Google farm; if you hit them in a tree fashion, depending on your buckets, you may have, say, a 5 to 10 server depth. If each depth requires 3 seeks (10 ms each), and you add the read of the data, the maybe 2 ms latency of the communication, then the need to aggregate the data, you would very often reach something like 3 to 5 seconds of waiting time to get an answer.
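A back-of-envelope version of that disk-bound scenario, using the rough figures from the paragraph (the read time is an assumption I added; real aggregation, queueing, and slowest-server effects push the total much higher, toward the multi-second range):

```python
SEEK_MS = 10         # one disk seek
SEEKS_PER_LEVEL = 3  # seeks needed at each depth level of the server tree
NETWORK_MS = 2       # communication latency between levels
READ_MS = 5          # assumed time to actually read the data after seeking

def tree_latency_ms(depth):
    """Best-case serial latency if every level of the tree must hit disk."""
    return depth * (SEEKS_PER_LEVEL * SEEK_MS + NETWORK_MS + READ_MS)

for depth in (5, 10):
    print(depth, tree_latency_ms(depth), "ms")
# depth 5  -> 185 ms, depth 10 -> 370 ms, before any aggregation or tail latency
```

Even this optimistic lower bound is two orders of magnitude above a millisecond budget, which is the whole argument for keeping the index fully in RAM.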
So, the index is in RAM, fully customized to do one thing and do it well.
In the post about the Twitter streaming API, we learn that Twitter produces about 35KB of data per second (including the packaging of the network data in JSON format). But when you really think about the Twitter usage pattern, it is:
So, they only need to be very efficient with 48×3600×35KB ≈ 6GB, that is, the last 48 hours of the stream. Yes, only about 6GB of data. You can increase that 100 times to 600GB if you want: as you can get a system with 12GB of RAM for 100€ per month, you can store all of it in RAM for 5000€/month. Of course, at such a scale you have staff, go colocation, and it is even cheaper.
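The arithmetic, spelled out (48 hours of the stream at 35KB/s, then the 100× scenario):

```python
rate_kb_per_s = 35
seconds = 48 * 3600                       # keep the last 48 hours hot
hot_gb = rate_kb_per_s * seconds / 1024 / 1024
print(round(hot_gb, 1))                   # 5.8, i.e. "about 6GB"

servers = 600 // 12                       # 600GB spread over 12GB-RAM servers
print(servers * 100)                      # 5000 (euros per month at 100€/server)
```

So even the generous 100× scenario fits in RAM for the price of a handful of engineers' coffee budget, which is the point: the hot data set is tiny compared to the total archive.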
So, Twitter can keep only a fraction of its data hot in RAM, with nearly no need to access the old data. This makes the access performance on the old data a non-problem; one or two seeks will be fine.
The blog hosting platform is basically disk bound. The general usage pattern, outside of the editors, is: "search something on Google, click, see the page, go somewhere else". You may have a search engine (mostly used by editors) and a timeline (for the feeds), but these fall into the Twitter/search engine problem.
In this case, only an RDBMS (or a simple file-system-based storage) with asynchronous updates of the indexes/feeds is needed. Simple, and any kind of well-designed "bulk storage" will work. You will be able to go a long way with Lucene for search, so you do not really need more.
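A toy sketch of that asynchronous-update pattern, under the assumption of an in-process job queue (a real platform would use a persistent queue and a proper search engine such as Lucene): the write path stores the post synchronously and only enqueues the indexing work, which a worker performs later.

```python
import queue

index_jobs = queue.Queue()

def save_post(storage, post_id, body):
    """Synchronous path: store the post, then just enqueue the indexing work."""
    storage[post_id] = body
    index_jobs.put(post_id)

def index_worker(storage, index):
    """Asynchronous path: drain the queue and update a naive word index."""
    while not index_jobs.empty():
        post_id = index_jobs.get()
        for word in storage[post_id].split():
            index.setdefault(word.lower(), set()).add(post_id)

storage, index = {}, {}
save_post(storage, 1, "Scaling a blog platform")
index_worker(storage, index)
print(index["scaling"])  # the post is findable once the worker has run
```

The reader-facing write stays a single cheap storage operation; the expensive index and feed updates happen off the request path, which is all a disk-bound read-mostly site needs.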
Here it gets interesting: you have a lot of data, and when you process it, you really need to process it fast. Here, a column-oriented design, with the capacity to compress data, is really needed to minimize seek time. We are really in the case where the 1:10 ratio must be handled extremely efficiently. But you cannot use a generic NoSQL system, because they are not column oriented and do not offer the per-column compression needed to really achieve high performance. You end up using a system designed for column-based access.
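A toy illustration of why the column layout compresses so well: a single column is far more repetitive than whole rows, so even a trivial run-length encoding shrinks it dramatically (the data below is made up):

```python
def run_length_encode(column):
    """Compress a column of repetitive values into (value, count) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

# A "country" column is highly repetitive once sorted, unlike full rows.
country_column = ["FR"] * 6 + ["DE"] * 3 + ["US"]
print(run_length_encode(country_column))
# [('FR', 6), ('DE', 3), ('US', 1)]: 10 values stored as 3 pairs
```

Fewer stored bytes per column means fewer disk seeks and reads per scan, which is exactly where the 1:10 data-to-RAM ratio is won or lost.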
So, different data access patterns from different business needs and different constraints are the reasons why we have so many alternatives. These alternatives were mostly developed by a company to solve a specific problem when it was already at web scale (or on the trajectory to reach it); they answer the sweet-spot requirements of that company's business goals.
But which one does not really matter; the real point is that you need to use a reasonably well-designed system, with good base performance (nearly all of them provide that now), which allows you to develop fast and answer your business goals, because the day you approach web scale you will need to go custom. So, your building blocks need to play well and integrate themselves with others for the transition to web scale.
Thrift, Protocol Buffers: why are the big players developing such systems? Because the real key to achieving web scale is the efficient integration of many components. The same way you, as a person, work better when you integrate all the data coming from your 5 senses into one picture of your world, to give one answer to your visitors you will need to tap into many different sources with different constraints.
The future is integrated, and the systems built from the ground up to be integrated and communicate well with others will win the race. Today I see three very promising components:
I do not think "silver bullet" (and when you set a challenge like the 1:10 ratio, you tend to ask for a silver-bullet answer), I think "well-integrated diversity". It is a bit harder to manage, but it adds flexibility and freedom. The challenges are here, not really in finding another perfect hammer.
If I tell you that on my website I have a download link which can be either:
The link is like in the screenshot below:
What is your answer? I must say, I would have put:
So, just for fun, I ran the test on the download page of Indefero. After a week, the results were not really what I expected, so I let the test run until my confidence percentage was stable. Here are the results:
Yes, bold and italic is the best converter, and by a large margin, 9% better; but what surprised me is that the bold link is not statistically significantly better than the normal link!
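You can check this kind of "is it really better?" question yourself with a two-proportion z-test; the click and view counts below are made up for illustration, not the actual numbers from my test:

```python
import math

def z_score(clicks_a, views_a, clicks_b, views_b):
    """Two-proportion z-test: |z| > 1.96 means significant at the 95% level."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    p = (clicks_a + clicks_b) / (views_a + views_b)      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    return (p_b - p_a) / se

# Hypothetical counts: normal link vs bold+italic link.
z = z_score(100, 1000, 150, 1000)
print(abs(z) > 1.96)  # True: a 10% vs 15% click rate on 1000 views each is significant
```

With small differences or few visitors, |z| stays under 1.96 and the "better" variant may just be noise, which is exactly what happened with my bold link.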
So, my best judgment was basically wrong. What a blow, especially for something as simple as the font style of a link. This small experiment has changed a lot the way I think about improving my software. For my scientific work I always use data; for design I often trusted my feelings. I was wrong, terribly wrong.
Now, the problem is that I cannot test everything, because I am not Google and I do not have thousands of visitors a day. But at least I can test the key points in my application, that is, where actions and conversions are performed.