Sunday, September 18, 2011

Crisps abt scalability Part 3

New to this series? Start with Crisps abt Scalability Part 1 and Part 2.
So, are you convinced about master-slave? Let us move on to sharding, also known as shared-nothing architecture.
What is it?
 Before that, I'll answer a different question.
Who is it for?
 It is for you if the application you built can be partitioned. Here is a scenario: you have 1 billion customers, and all of them are stored in and served from a single data store. But the beauty of your application is that nothing is shared between any two customers. So what is the use of keeping them all together? Just split them, because each customer only ever reads and writes their own data. Put another way, no one is going to read or write your data except you. So you can split your billion customers into two groups of half a billion and bring up a second data store to handle one of the groups. The two data stores share nothing with each other, and that is sharding. I hope you now understand what it is.
That is awesome :)  
 Yes, really, provided your application is of that kind :)
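
The customer split described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular database does it: the `ShardRouter` class, its method names and the use of plain dicts as stand-in data stores are all assumptions made for the example. The idea is that a stable hash of the customer ID decides which data store owns that customer, so every read and write for one customer lands on exactly one shard.

```python
import hashlib


class ShardRouter:
    """Hypothetical router: sends each customer to exactly one shard."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        # Each shard is a plain dict standing in for a real data store.
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, customer_id):
        # A stable hash, so the same customer always maps to the same shard.
        digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_shards

    def write(self, customer_id, data):
        self.shards[self.shard_for(customer_id)][customer_id] = data

    def read(self, customer_id):
        return self.shards[self.shard_for(customer_id)].get(customer_id)


# Split 1000 customers (instead of a billion) across two data stores.
router = ShardRouter(num_shards=2)
for cid in range(1000):
    router.write(cid, {"name": f"customer-{cid}"})
```

Note that a plain modulo like this reshuffles most customers when the number of shards changes, so treat it as a picture of the idea rather than a production scheme.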
Final touch on Consistency, Availability and Partition tolerance [CAP]
Why these three?
 According to Brewer's theorem on distributed systems, a distributed system can guarantee at most two of these three properties at the same time. It is a tradeoff. And the best thing is, this theorem has a formal proof as well, published by Gilbert and Lynch in 2002.
Cool, what are they?
 Nice delayed question.
Consistency - It can be broadly split into read consistency and write consistency. Read consistency is the assurance to the user that he/she reads the most recent data: once there is an update, it should be visible to any request that arrives after the update. Write consistency is about ordering. A scenario: let X be a global variable that P1 and P2 both want to write to. If P1's write comes first on the timeline and P2's comes second, then X should finally hold P2's data. If it doesn't, write consistency has failed. Depending on its needs, a system settles on a level of consistency such as strict, causal or weak. Only causal looks odd. It is nothing but event-based consistency: if e2 is the effect of e1, we write e1->e2, and every node must see e1 before e2. That is the order in which the reads and writes are performed. The problem is this: if the events occur within a single system, everything is fine, because we can determine the causal relationship from time and process execution. Here everything happens across systems. So we need some kind of global clock, for example logical clocks such as Lamport timestamps or vector clocks, or physical clocks kept in sync with a protocol like NTP. Remember, I'm discussing distributed systems here. So the reads and writes I mention can happen at any point in the system, but they should be reflected properly at all points :)
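
To make the e1->e2 idea a bit more concrete, here is a small sketch of a logical clock in Python. It uses a Lamport clock, which is one common way to order events across machines without relying on wall-clock time; the post itself does not name a specific technique, so take the `LamportClock` class and its method names as assumptions for illustration. The rule is simple: every event bumps a counter, and a receiver jumps ahead of whatever timestamp arrives with a message, so an effect always carries a larger timestamp than its cause.

```python
class LamportClock:
    """Hypothetical logical clock, one instance per process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; its timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so the effect is ordered after the cause.
        self.time = max(self.time, msg_time) + 1
        return self.time


p1 = LamportClock()
p2 = LamportClock()

e1 = p1.send()       # P1 writes X and tells P2 about it
e2 = p2.receive(e1)  # P2 sees the write, so e1 -> e2
```

Here e1 always gets a smaller timestamp than e2, no matter what the physical clocks on the two machines say.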
Availability - This is simple but difficult :P What if one of your machines fails, because of a network outage, a flood, and so on? The data held by that machine becomes unreachable or is lost. So how do we handle this? The solution is replica management. Instead of writing data to a single server, write it to n servers, so that the data survives and can still be served after one of them is lost. This n is usually called the replication factor.
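
The replica idea can be sketched like this in Python. It is a toy model under my own assumptions, not any real database's API: the `ReplicatedStore` class, its methods and the rule of placing copies on consecutive nodes are all made up for the example. Each key is written to n nodes, and a read tries those nodes in turn until it finds a live copy.

```python
import zlib


class ReplicatedStore:
    """Hypothetical store that keeps n copies of every key."""

    def __init__(self, num_nodes, replication_factor):
        if replication_factor > num_nodes:
            raise ValueError("replication factor cannot exceed node count")
        self.nodes = [dict() for _ in range(num_nodes)]
        self.alive = [True] * num_nodes
        self.n = replication_factor

    def replicas_for(self, key):
        # Pick a starting node from the key, then take the next n - 1 nodes.
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.n)]

    def write(self, key, value):
        for idx in self.replicas_for(key):
            if self.alive[idx]:
                self.nodes[idx][key] = value

    def read(self, key):
        # Any live replica that holds the key can answer.
        for idx in self.replicas_for(key):
            if self.alive[idx] and key in self.nodes[idx]:
                return self.nodes[idx][key]
        return None

    def fail(self, idx):
        # Simulate a machine going down and losing its data.
        self.alive[idx] = False
        self.nodes[idx].clear()


store = ReplicatedStore(num_nodes=5, replication_factor=3)
store.write("user:1", "alice")
store.fail(store.replicas_for("user:1")[0])  # lose one replica
value_after_failure = store.read("user:1")   # still served by another copy
```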
Partition Tolerance - Consider 5 nodes arranged in a ring. What if nodes 2 and 4 go down? Node 3 is now unreachable from nodes 1 and 5. This is called a network partition. Your application should be able to handle it.
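
The ring example can be checked with a short Python sketch. The `reachable` function is a hypothetical helper written just for this post, and it assumes that each node talks only to its two ring neighbours. It walks the ring from a starting node and collects every live node it can still get to.

```python
def reachable(ring_size, down, start):
    """Live nodes reachable from `start` in a ring numbered 1..ring_size."""
    if start in down:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        left = (node - 2) % ring_size + 1   # previous node on the ring
        right = node % ring_size + 1        # next node on the ring
        for neighbour in (left, right):
            if neighbour not in down and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


side_a = reachable(5, down={2, 4}, start=1)  # {1, 5}
side_b = reachable(5, down={2, 4}, start=3)  # {3}
```

Nodes 1 and 5 can still see each other, while node 3 is alive but cut off on its own side of the partition.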

As I said earlier, choose two and go for it :) Cassandra chooses AP, Google Bigtable chooses CP, and MySQL chooses CA. And hence they all perform well :)