MySQL High Availability at Yahoo
I was asked to write a blog post about MySQL High Availability at Yahoo, particularly for writes. Our standard practice is not particularly high-tech, but we’ve been using it for over 4 years now and it has become a company-wide standard with a few exceptions.
Let me start by saying that at Yahoo! we consider a datacenter a Single Point of Failure (SPoF). We build and manage many of our own datacenters, and we still don't assume they are invulnerable. Many people can attest to the fact that no matter how you configure your racks, or how many redundant switches, power supplies, drives, etc. you buy, if your leased datacenter has power or network issues, you are at its mercy.
We are assuming out of the gate that we have to have a presence in at least 2 datacenters. This concept is something we call BCP or Business Continuity Planning: we want our business to continue regardless of some event that would completely destroy a datacenter. So, we require that any and all components within Yahoo are run redundantly from at least 2 datacenters that are not in the same geographical region. That way if an earthquake, hurricane, or some other natural (or otherwise) disaster affects one datacenter, the other one should be far enough away to not have been affected.
One big advantage of this way of thinking is that this level of redundancy also nicely satisfies High Availability requirements. We can take downtime in one datacenter for software upgrades or other maintenance, and still happily serve out of the other because we’ve already planned ahead and already built our BCP systems.
Because of the BCP philosophy, we immediately discount most "High Availability" solutions for MySQL that people tend to gravitate towards. MySQL Cluster has its uses, but you cannot form a cluster across a WAN link [that performs well]. DRBD cannot handle WAN replication [that performs well, and is synchronous].
That isn’t to say that WAN replication can’t work with these technologies, it’s just that they aren’t built to make that problem any better than standard MySQL replication does. Ultimately the problem to solve is that writing a piece of data to two redundant locations simultaneously comes at the expense of latency. Until someone comes up with a solution to write data faster than the speed of light, we are stuck with this problem, regardless of the technology solution we try.
So, we stick with good old MySQL replication. It’s still asynchronous, but lots of people already understand it, and it does the job.
Now, what makes our solution different? Not much. The layout is this: two master databases, one in each of our two colocations. These masters replicate from each other, but we would never have more than two masters in this replication loop, for the same reason we don't use token ring networks today: a single master outage breaks replication in any ring larger than two. Our slaves replicate from one of the two masters; typically half of the slaves in a given colocation replicate from one master, and half from the other.
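To make the layout concrete, here is a minimal sketch of how such a two-master loop plus slaves could be wired up with plain MySQL replication statements, issued here through Python with the pymysql driver (any MySQL client would do). The hostnames, credentials, and binlog coordinates are illustrative placeholders, not a real configuration.

```python
# Minimal sketch of the dual-master layout: each master replicates from the
# other, and each slave replicates from one of the two masters. Hostnames,
# credentials, and binlog coordinates are illustrative placeholders.
import pymysql

REPL_USER, REPL_PASS = "repl", "repl_password"

def admin_conn(host):
    return pymysql.connect(host=host, user="admin", password="admin_password",
                           autocommit=True)

def replicate_from(replica, master, log_file, log_pos):
    """Point `replica` at `master`, starting at the given binlog position."""
    conn = admin_conn(replica)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute(
            "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER=%s, MASTER_PASSWORD=%s, "
            "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
            (master, REPL_USER, REPL_PASS, log_file, log_pos))
        cur.execute("START SLAVE")
    conn.close()

# The two masters form a replication loop of exactly two...
replicate_from("db-master.colo1", "db-master.colo2", "mysql-bin.000001", 4)
replicate_from("db-master.colo2", "db-master.colo1", "mysql-bin.000001", 4)
# ...and the slaves split between them, half replicating from each master.
replicate_from("db-slave1.colo1", "db-master.colo1", "mysql-bin.000001", 4)
replicate_from("db-slave2.colo1", "db-master.colo2", "mysql-bin.000001", 4)
```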
Two masters by themselves don’t present a solid solution. If our application can write to either master, we open ourselves up to autoincrement/unique key collisions. Auto_increment_offset does not fix this problem, it just band-aids it; it assumes the auto_increment key order doesn’t matter, that it’s ok for the tables on each master to not be identical, and it doesn’t address any other primary/unique key collisions.
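For reference, this is roughly what that band-aid amounts to (the hostnames and credentials below are made up): it only partitions the auto_increment space, nothing more.

```python
# What the auto_increment band-aid amounts to: interleaved id ranges per master.
# It avoids collisions on the auto_increment column only; other unique keys can
# still collide, and rows get different ids depending on which master took the write.
import pymysql

def set_auto_increment(host, offset, increment=2):
    conn = pymysql.connect(host=host, user="admin", password="admin_password",
                           autocommit=True)
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL auto_increment_increment = %s", (increment,))
        cur.execute("SET GLOBAL auto_increment_offset = %s", (offset,))
    conn.close()

set_auto_increment("db-master.colo1", offset=1)  # this master generates 1, 3, 5, ...
set_auto_increment("db-master.colo2", offset=2)  # this master generates 2, 4, 6, ...
```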
But, key collisions (which are quite visible when they happen, because replication breaks) are only the tip of the iceberg. Writing to multiple masters simultaneously opens you up to race conditions of the invisible and nasty variety. If you have master A and master B in circular replication and the application changes record X on both A and B before the replication for that change makes it to the other side, there is an issue. Imagine that the application changes X to 1 on master A and X to 2 on master B. After replication catches up, A has X=2 and B has X=1. Whoops. Replication hasn't broken, you don't know it happened, but your two masters (and presumably your slaves) are now inconsistent. Can it be designed/coded around? Sure, but we can't assume that it will work for a generic solution.
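If the race is hard to picture, here is the same sequence reduced to a toy Python simulation; the dicts simply stand in for the two masters' copies of the data, and no real MySQL is involved.

```python
# Toy illustration of the race described above: two masters each apply a local
# write to the same record, then replication delivers each change to the other side.
master_a = {"X": 0}
master_b = {"X": 0}

# The application writes to both masters before either change has replicated.
master_a["X"] = 1          # local write on A
master_b["X"] = 2          # local write on B

# Replication then applies each change on the other master, last write wins.
master_b["X"] = 1          # A's change arrives on B
master_a["X"] = 2          # B's change arrives on A

print(master_a, master_b)  # {'X': 2} {'X': 1}: replication did not break, but the data diverged
```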
So, because of this problem, we strongly discourage the use of dual-writers. Can they be useful and not a problem in some cases? Yes, but that is the exception, not the rule.
So, we are left with dual-master, single-writer. We only want our application to ever write to one of our two dual-masters at a given time; we set all our databases read_only except the active master. By doing this we gain consistency: we know if a given record is locked on the active master, no other part of the application can change it at the same time. Most programmers assume consistency when writing code for any RDBMS, so most code can’t cope without it.
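Because every server except the active master is read_only, finding the single writer is just a check of @@global.read_only on each candidate. A rough sketch, with placeholder hostnames and credentials:

```python
# Sketch: find the one writable master by asking each candidate for @@global.read_only.
import pymysql

MASTERS = ["db-master.colo1", "db-master.colo2"]

def find_active_master():
    writers = []
    for host in MASTERS:
        conn = pymysql.connect(host=host, user="app", password="app_password")
        with conn.cursor() as cur:
            cur.execute("SELECT @@global.read_only")
            (read_only,) = cur.fetchone()
        conn.close()
        if not read_only:
            writers.append(host)
    # Exactly one writable master is the healthy state. Zero means writes are
    # paused; more than one means we have accidentally created dual-writers.
    if len(writers) > 1:
        raise RuntimeError("more than one writable master: %s" % writers)
    return writers[0] if writers else None

print(find_active_master())
```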
The trade-off, of course, is latency. If our application is distributed like our databases, some of our application servers will have to cross a WAN link to write to our database. This is a problem, of course, but one that is acceptable for many web applications, which serve far more reads than writes. If those writes are slightly slower, it's not the biggest deal. There are also a variety of handy web tricks to hide write latency, including ajax progress bars, asynchronous/batched writes, etc. Backend daemons that need to connect to the active master typically fail over with the master: when the active master changes colo, the daemon needs to as well.
If we need to change masters in a clean way (say, for maintenance), the process is clean and can be done with no data loss and only a brief writer outage:

1. Set read_only=ON on the active master. This stops writes briefly.
2. Wait for replication to catch up on our secondary master.
3. Set read_only=OFF on the secondary master; writes resume.
4. Flush connections from the old master: the application is redirected to the new master by a system we use internally.
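Sketched in Python with the pymysql driver (again with placeholder hostnames and credentials), the procedure looks roughly like this; the final application redirect is handled by the internal system mentioned above, so it is not shown.

```python
# Sketch of a clean, planned master switch: stop writes, let the secondary catch
# up, then open the secondary for writes. Names and credentials are placeholders.
import pymysql

def admin_conn(host):
    return pymysql.connect(host=host, user="admin", password="admin_password",
                           autocommit=True, cursorclass=pymysql.cursors.DictCursor)

def switch_active_master(old_master, new_master, timeout=60):
    old_conn, new_conn = admin_conn(old_master), admin_conn(new_master)
    with old_conn.cursor() as old_cur, new_conn.cursor() as new_cur:
        # 1. Stop writes on the currently active master.
        old_cur.execute("SET GLOBAL read_only = ON")

        # 2. Find where the old master's binlog ends and wait until the secondary
        #    master has executed everything up to that point.
        old_cur.execute("SHOW MASTER STATUS")
        binlog = old_cur.fetchone()
        new_cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s) AS caught_up",
                        (binlog["File"], binlog["Position"], timeout))
        caught_up = new_cur.fetchone()["caught_up"]
        if caught_up is None or caught_up < 0:
            # Replication is broken or too far behind: re-open the old master
            # rather than risk losing writes, and let a human investigate.
            old_cur.execute("SET GLOBAL read_only = OFF")
            raise RuntimeError("secondary master did not catch up; switchover aborted")

        # 3. Writes resume on the secondary (now active) master.
        new_cur.execute("SET GLOBAL read_only = OFF")
    old_conn.close()
    new_conn.close()
    # 4. Redirecting the application to the new master is handled elsewhere.

switch_active_master("db-master.colo1", "db-master.colo2")
```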
As long as our application can handle the WAN latency, it shouldn't matter which of the two colos has the active master.
If the current active master fails and we want writes to be available immediately, we must take the risk of a data loss. We are running standard MySQL replication, which is asynchronous, so if we need to make the secondary master active while the primary is down, we cannot know whether replication was caught up before the primary went down. We do assume, however, that it was close (we accomplish this by monitoring our replication). At this point our engineers have a choice: either take the (potential) data loss and bring the other master up (set read_only=OFF), or take the outage (which in this scenario only affects writes to the DB) and hope the primary comes back soon. The policy for deciding this is left up to the individual teams and their business requirements.
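As a sketch, the "monitoring our replication" piece boils down to watching the surviving master's slave status: its last known lag behind the downed master bounds how much data a promotion would lose. The snippet below (placeholder names again) only surfaces that number; the promote-or-wait decision stays with the team.

```python
# Sketch: at failover time, the surviving master is itself a slave of the downed
# one, so its replication status tells us roughly how far behind it is.
import pymysql

def admin_conn(host):
    return pymysql.connect(host=host, user="admin", password="admin_password",
                           autocommit=True, cursorclass=pymysql.cursors.DictCursor)

def last_known_lag(host):
    conn = admin_conn(host)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    conn.close()
    # Seconds_Behind_Master goes NULL once the IO thread loses the downed master,
    # which is why this value has to be trended continuously, not sampled only
    # after the failure.
    return status["Seconds_Behind_Master"] if status else None

def promote(host):
    """Accept the potential data loss and open the surviving master for writes."""
    conn = admin_conn(host)
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL read_only = OFF")
    conn.close()

print("last observed lag behind the downed master:", last_known_lag("db-master.colo2"))
# Whether to call promote("db-master.colo2") now, or to wait for the primary to
# come back, is the business decision described above.
```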
Once we pull the failover trigger, we are accepting a possible data loss. If the downed master never recovers, our course is clear: we simply rebuild from the remaining master and its slaves. However, if the downed master does come back online (as it usually does) and we allow it to resume replicating, we are accepting that we may have introduced the race conditions or unique key collisions (described above) into our system; we’ve effectively had dual-writers running for a brief window. The safest course of action would be to never allow the downed master back into replication, but in practice, that tends not to be practical because of the expense of recloning MySQL slaves, particularly across WAN links.
Now, what about automatic failover? This is a requirement for some systems: availability over consistency. But there are a few challenges with automated failover:

- An automated failover solution that does not use some unbiased system to determine which master should be active will be susceptible to split-brain issues (sketched below). Imagine our two colocations each fighting to become the active master because they simply cannot communicate with each other; without some outside authority guaranteeing that a given colo can make itself active, we would end up with two active colos (and two independent databases!).
- An automated failover solution would need to consider a system policy for taking availability over data consistency, and vice versa.
- An automated failover solution might be inclined to cause failover flapping, which would be problematic.

My experience has been that this system is the most adaptable to existing MySQL databases, using existing storage engines and an existing technology to handle replication. It's likely unfashionably boring, conservative, and stable; but it has worked and continues to work for us.
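To make the split-brain point above a little more concrete, an automated promoter would have to win approval from an authority living outside both colos before flipping read_only. The arbiter call in the sketch below is purely hypothetical (it could be backed by something like a quorum-based lock service); it shows the shape of the check, not a system we actually ran.

```python
# Sketch of the "outside authority" idea for automated failover. The arbiter is
# hypothetical; without it, each colo could decide on its own that the other is
# dead and promote itself, leaving two writable, diverging masters.
import pymysql

def arbiter_grants_writer_role(colo):
    """Hypothetical stand-in for an external, unbiased arbiter (e.g. a quorum lock)."""
    raise NotImplementedError("ask an authority that lives outside both colos")

def maybe_promote(colo, master_host):
    if not arbiter_grants_writer_role(colo):
        # No grant: do not promote, even if the other colo looks dead from here.
        return False
    conn = pymysql.connect(host=master_host, user="admin",
                           password="admin_password", autocommit=True)
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL read_only = OFF")
    conn.close()
    return True
```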
In the future, I hope to be able to use technologies that improve on standard MySQL replication by repointing slaves automatically, finding which slave has the most up-to-date data from a downed master, and even mitigating a lot of data loss with support for synchronous writes, at least within a given colo. There are some candidates out there that are close.
Jay Janssen
Percona, Inc.
Jay works for Percona as a MySQL Consulting Lead. He previously worked at Yahoo! for 7 years as an expert in MySQL and High Availability technologies on a team that helps develop and enforce best practices for Business continuity and HA across Yahoo.
