Understanding Cassandra vnodes - marsyoung

http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
 
How does data get placed on these nodes? Every piece of data has a partition key, and a hash function computes a token from that key; many different keys can hash to the same token. Call the set of keys sharing a given token A. If node 1 is assigned that token, then all the data for the keys in A is stored on node 1; nodes and tokens are one-to-one. Concretely: if the token computed from a key is greater than node 1's token but no greater than node 2's token, the data is placed on node 2. The data is then also copied to the next two nodes clockwise, node 3 and node 4; this data corresponds to B in Figure 1.
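To make that rule concrete, here is a minimal Python sketch of token-based placement. It is illustrative only: the node names and token values are made up, and MD5 stands in for the partitioner's hash (Cassandra 1.2 actually defaults to the Murmur3 partitioner).

```python
# Minimal sketch of token-based placement; not Cassandra's implementation.
import bisect
import hashlib

RING_MAX = 2 ** 128  # size of the MD5 output space

def key_to_token(partition_key: str) -> int:
    """Hash a partition key to a position on the ring."""
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

# One token per node, as before Cassandra 1.2 (evenly spaced, made-up values).
ring = [(RING_MAX // 4 * (i + 1) - 1, f"node{i + 1}") for i in range(4)]
tokens = [t for t, _ in ring]

def owner(partition_key: str) -> str:
    """The first node clockwise whose token is >= the key's token owns it."""
    i = bisect.bisect_left(tokens, key_to_token(partition_key))
    return ring[i % len(ring)][1]  # wrap past the largest token to the start

print(owner("user:42"))  # prints one of node1..node4, wherever the hash lands
```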
 
This ring is really a ring of data ranges, in other words a consistent-hash ring. How is it formed? By consistent hashing. Each node is assigned one token, computed by the hash function, and each token stands for a fixed range of hash values, so each node likewise owns a fixed range of hash values. Think of each hash value as a point; a range of hash values is then a line segment. Arrange the segments in order of value and join the points for the largest and smallest values, and they close into the ring.
What is the ring for? First, once a node is assigned a token it owns a segment of the ring, so its position on the ring is fixed; the ring's first job is therefore to pin down each node's position.
Next, data is assigned to the nodes. Ignoring replication for a moment, position A in Figure 1 is node 1's position, B is node 2's, and so on: data for A goes to node 1, data for B goes to node 2, and so forth. In practice the data does get replicated: when A arrives it is stored on node 1, and then the next two nodes found clockwise on the ring each store a copy as well. In Figure 1 every piece of data is stored as 3 replicas.
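A sketch of that clockwise replication walk, in the spirit of Cassandra's SimpleStrategy but with toy tokens, not the real implementation:

```python
# Sketch of SimpleStrategy-style replication: store the row on the node
# owning the key's token, then keep walking clockwise until RF nodes hold it.
import bisect

RF = 3
tokens = [25, 50, 75, 100]  # toy tokens, one per node
nodes = ["node1", "node2", "node3", "node4"]

def replicas(key_token: int, rf: int = RF) -> list[str]:
    start = bisect.bisect_left(tokens, key_token)
    picked = []
    i = start
    while len(picked) < rf:
        node = nodes[i % len(nodes)]
        if node not in picked:  # trivially true here: one token per node
            picked.append(node)
        i += 1
    return picked

print(replicas(30))  # ['node2', 'node3', 'node4'], matching the example above
```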
 
 
Before Cassandra 1.2, each node had exactly one token, which fixed the node's position on the ring and the slice of hash-derived data it owned.
Since then, Cassandra allows each node to hold multiple tokens, and with multiple tokens vnodes appear. A vnode is a virtual concept; at the physical layer there is still just one node. In other words, where one token used to map to one node, one node now maps to many tokens, data is still distributed by consistent hashing, and each token corresponds to one vnode.
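The only change vnodes make to the sketches above is that a ring entry now points back to a physical node, and each physical node appears on the ring once per token it holds. A hypothetical sketch:

```python
# Sketch: with vnodes each physical node claims several random tokens, so the
# ring is a sorted list of (token, physical node) pairs and one node shows up
# many times. Token counts and values here are purely illustrative.
import bisect
import random

random.seed(0)
NUM_TOKENS = 8  # vnodes per physical node; real clusters often use 256

ring = sorted((random.randrange(2 ** 64), node)
              for node in ("nodeA", "nodeB", "nodeC")
              for _ in range(NUM_TOKENS))
tokens = [t for t, _ in ring]

def replicas(key_token: int, rf: int = 3) -> list[str]:
    """Same clockwise walk, but skip vnodes whose physical node was already
    picked, so all rf copies land on distinct machines."""
    start = bisect.bisect_left(tokens, key_token)
    picked: list[str] = []
    i = start
    while len(picked) < rf:
        node = ring[i % len(ring)][1]
        if node not in picked:
            picked.append(node)
        i += 1
    return picked

print(replicas(12345))  # all three physical nodes, in ring order from the key
```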
 


 

One of the new features slated for Cassandra 1.2's release later this year is virtual nodes (vnodes). What are vnodes? If you recall how token selection works currently, there's one token per node, and thus a node owns exactly one contiguous range in the ring space. Vnodes change this paradigm from one token, or range, per node to many per node. Within a cluster these can be randomly selected and non-contiguous, giving us many smaller ranges that belong to each node.

What advantages does this bring to the table? Let’s consider the following scenario: we have 30 nodes and replication factor of 3. A node dies completely, and we need to bring up a replacement. At this point the replacement node needs to get a replica for 3 different ranges to reconstitute not only the data it is the first natural replica for, but also data that it is a secondary/tertiary natural replica for (though do recall no replica has ‘priority’ over another in Cassandra, this terminology is strictly to illustrate placement on the ring.) Since our RF is 3 and we lost a node, we logically only have 2 replicas left, which for 3 ranges means there are up to 6 nodes we can stream from. In current practice though, Cassandra will only use one replica from each range, so we’ll stream from 3 other nodes total.

We want to minimize how long this operation is going to take, because if we lose another node while this is happening there’s a chance we’ll be down to 1 replica for some ranges, and then all operations for that range with a consistency level greater than ONE would fail. Even if we used all 6 possible replica nodes, we’d only be using 20% of our cluster, however.

If instead we have randomized vnodes spread throughout the entire cluster, we still need to transfer the same amount of data, but now it’s in a greater number of much smaller ranges distributed on all machines in the cluster. This allows us to rebuild the node faster than our single token per node scheme.
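A back-of-the-envelope simulation of that effect, under the simplifying assumption that each lost range is streamed from one randomly chosen surviving replica; Cassandra's actual streaming planner is more involved, so treat the numbers as an illustration of the argument, not measured behaviour:

```python
# Toy estimate of how many peers can stream to a replacement node: with one
# token per node only a few neighbours qualify; with vnodes nearly everyone.
import random

random.seed(1)
NODES, RF = 30, 3

def stream_sources(tokens_per_node: int) -> int:
    ring = sorted((random.random(), n)
                  for n in range(NODES) for _ in range(tokens_per_node))
    dead = 0  # the node being replaced
    sources = set()
    for i in range(len(ring)):
        reps, j = [], i  # the RF distinct nodes clockwise from range i
        while len(reps) < RF:
            node = ring[j % len(ring)][1]
            if node not in reps:
                reps.append(node)
            j += 1
        if dead in reps:  # the dead node held a replica of this range
            sources.add(random.choice([n for n in reps if n != dead]))
    return len(sources)

print(stream_sources(1))    # one token per node: only ~3 stream sources
print(stream_sources(256))  # 256 vnodes per node: nearly all 29 survivors
```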

Cassandra has worked toward increasing the amount of data that can be reasonably stored per node in many releases, and of course 1.2 will be no different with its new disk failure handling. One last wrinkle though is if you lose one disk, you’ll have to wait on repair before anything will begin to be restored to the new disk. Repair is two phases, first a validation compaction that iterates all the data and generates a Merkle tree, and then streaming when the actual data that is needed is sent. The validation phase might take an hour, while the streaming only takes a few minutes, meaning your replaced disk sits empty for at least an hour. Much like the node replacement scenario I began with, with vnodes you’ll gain two distinct advantages in this situation. The first is that since the ranges are smaller, data will be sent to the damaged node in a more incremental fashion instead of waiting until the end of a large validation phase. The second is that the validation phase will be parallelized across more machines, causing it to complete faster.

Another nice advantage vnodes bring is easing the use of heterogeneous machines in a cluster. As time goes on, everyone is going to come to a point where it's time to replace older, weaker machines with newer, more powerful ones. While in transition, however, it would be nice if the newer nodes could bear more load immediately. You might be able to do this today with very careful planning and range calculation, but it would be cumbersome and error-prone. With vnodes it becomes much simpler: you just assign a proportional number of vnodes to the larger machines. If you started your older machines with 64 vnodes per node and the new machines are twice as powerful, simply give them 128 vnodes each and the cluster remains balanced even during transition.

As you can see, virtual nodes are a large feature addition for 1.2, but don’t worry if you have an existing cluster, they won’t be forced on you and everything will work the way it did before. If you’d like to upgrade an installation to virtual nodes, that’s possible too, but I’ll save that for a later post. If you want to get started with vnodes on a fresh cluster, however, that is fairly straightforward. Just don’t set the initial_token parameter in your conf/cassandra.yaml and instead enable the num_tokens parameter. A good default value for this is 256.
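Concretely, the relevant lines of conf/cassandra.yaml on a fresh node would look something like this (a sketch of the two settings named above, not a full configuration):

```yaml
# conf/cassandra.yaml: leave initial_token unset and enable num_tokens.
# initial_token:

# 256 is a good default. Per the heterogeneous-hardware example earlier, a
# machine twice as powerful would get a proportionally larger value.
num_tokens: 256
```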

posted @ 2017-02-22 14:26 princessd8251