Internet-Scale Service Development: Notes on James Hamilton's "On Designing and Deploying Internet-Scale Services"

I first came across one of James's blog posts (http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx), which argues that hardware is unreliable and networks are unreliable, so checksums are needed to avoid data corruption as much as possible. I then found this best-practice paper on building internet-scale services, written from his experience with the Windows Live Services Platform. Many of its ideas are the same principles Amazon develops by today; clearly Amazon's technology and development processes were not built in a day. I read the paper carefully and took the notes below.

Paper download: http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf

 

First, the three most important tenets for a distributed system:

Expect failures. A component may crash or be stopped at any time. 

Keep things simple. For example, restarting is often faster and simpler than repairing.

Automate everything. The system is too large and too much is going on; automating as much as possible solves many problems. A Google engineer once said something similar: they spend time fixing a problem by hand, then spend even more time automating the fix.

 

Below are the paper's recommendations spanning design, testing, deployment, operations, monitoring, and management; the most important ones are excerpted here:

1. Distributed system design

Design for failure. If a hardware failure requires any immediate administrative action, the service simply won't scale cost-effectively. The best way to test the failure path is never to shut the service down normally. Just hard-fail it.

Redundancy and fault recovery. Nothing special here: assume any component or piece of hardware can be lost.

 

2. Designing operations-friendly services

Quick service health check. Defining the system's health metrics is essential; then collect them periodically to assess whether the service is still working.
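As a minimal sketch of such a check (the indicator names and thresholds here are hypothetical, not from the paper), a quick health check can boil down to comparing a few cheap-to-collect indicators against thresholds:

```python
import time

# Hypothetical indicators and thresholds -- every service defines its own.
MAX_QUEUE_DEPTH = 1000
MAX_HEARTBEAT_AGE_S = 30

def health_check(queue_depth, last_heartbeat_ts, now=None):
    """Cheap check suitable for frequent polling: returns (healthy, reasons)."""
    now = time.time() if now is None else now
    reasons = []
    if queue_depth > MAX_QUEUE_DEPTH:
        reasons.append("queue backlog")
    if now - last_heartbeat_ts > MAX_HEARTBEAT_AGE_S:
        reasons.append("stale heartbeat")
    return (not reasons, reasons)
```

The point is that the check is fast and unambiguous, so a monitor can run it every few seconds without loading the service.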

Develop in the full environment. Test in a realistic environment; anyone building software should be doing this already.

Zero trust of underlying components. Trusting nothing at all makes everything very hard, so trust selectively: the operating system and mature open-source software can usually be trusted, while freshly written software should be treated with suspicion.

One pod or cluster should not affect another pod or cluster. Keeping services from affecting one another is a sound design principle, though not an absolute one.

Allow (rare) emergency human intervention. If you can anticipate an error, you will build an auto-recovery tool for it; if you cannot anticipate it, you cannot build a manual-intervention tool for it either, which makes this hard to do well.

Keep things simple and robust. Absolutely agreed: prefer the simplest workable solution. On where optimization is worth its complexity, the author writes: "Our general rule is that optimizations that bring an order of magnitude improvement are worth considering."

Understand the network design. Understand how load flows between servers in a rack, across racks, and across data centers.

Understand access patterns. For any new feature, consider its whole ecosystem. For example, if a feature touches the database, understand the read and write pressure that feature puts on the database.

Avoid single points of failure. Sometimes they cannot be avoided entirely, but they can be minimized. Moreover, whether something is a single point must be considered from both the software and the hardware angle.

 

3. Automatic service management

Handling emergencies by hand is error-prone:

Many services are written to alert operations on failure and to depend upon human intervention for recovery. The problem with this model is that if operations engineers are asked to make tough decisions under pressure, about 20% of the time they will make mistakes.

Be restartable and redundant. Guarantee that a restart will fix the problem (short of insufficient capacity).

Support geo-distribution. The author considers this important; it benefits both service quality and load sharing, though it obviously costs more.

Configuration and code as a unit (developed and shipped together). A very useful rule; and every configuration that will actually be used must be tested.

The development team delivers the code and the configuration as a single unit; the unit is deployed by test in exactly the same way that operations will deploy it; and operations deploys them as a unit.

A configuration change must produce an audit log record.
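A sketch of what "every config change leaves an audit record" might look like (the function and field names are illustrative, not from the paper):

```python
import time

def apply_config_change(config, key, new_value, operator, audit_log):
    """Apply one configuration change and append a who/what/when audit record."""
    record = {
        "ts": time.time(),
        "operator": operator,
        "key": key,
        "old": config.get(key),   # capture the previous value for rollback
        "new": new_value,
    }
    config[key] = new_value
    audit_log.append(record)
    return record
```

In production the audit log would be an append-only store rather than an in-memory list, but the invariant is the same: no change without a record.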

Multi-system failures are common. Expect failures of many hosts at once (power, network switch, and rollout). This affects both the design and the test plan.

Recover at the service level. It is useless for a component to recover while the service as a whole does not. Saying a component "is not a single point of failure" means it is not a single point in either software or hardware: the component survives a dead switch, machine, or process.

Never rely on local storage for non-recoverable information.

Keep deployment simple and automated. Simplicity and automation avoid many mistakes.

Fail services regularly. Take down data centers, shut down racks, and power off servers. Those unwilling to test in production aren't yet confident that the service will continue operating through failures. It takes real confidence to dare to do this.

 

4. Dependency management

Expect latency. Ensure all interactions have appropriate timeouts to avoid tying up resources for protracted periods, and prevent a repeatedly failing request from consuming ever more system resources. Large latency usually signals a problem, though picking the right threshold is itself a problem.
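The "bounded retries, don't tie up resources forever" advice can be sketched as a small helper (hypothetical names; real code would additionally set a socket or RPC timeout on each attempt):

```python
import time

def retry_with_backoff(call, attempts=3, base_delay_s=0.01):
    """Retry a failing call with exponential backoff, then give up, so a
    repeatedly failing dependency cannot consume ever more resources."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # fail fast instead of retrying indefinitely
            time.sleep(base_delay_s * (2 ** attempt))
```

The cap on attempts is the important part: an unbounded retry loop is exactly the resource leak the paper warns about.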

Isolate failures. Mark failed components as down and stop using them, so threads are not tied up waiting on them. If the component is a single point of failure, there is nothing to be done. The example below is excellent and worth savoring.

Decouple components. For example, rather than reauthenticating on each connect, maintain a session key and refresh it every N hours independent of connection status. On reconnect, just use the existing session key. That way the load on the authenticating server is more consistent and login storms are not driven on reconnect after momentary network failure and related events.
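The session-key example above can be sketched as a small cache (class and parameter names are my own, chosen for illustration):

```python
import time

REFRESH_INTERVAL_S = 3600  # refresh every N hours, independent of connections

class SessionKeyCache:
    """Keep a session key across reconnects and only reauthenticate when the
    key ages out, so auth-server load stays steady and reconnects after a
    brief network blip do not cause a login storm."""

    def __init__(self, authenticate, clock=time.time):
        self._authenticate = authenticate  # the expensive auth-server call
        self._clock = clock
        self._key = None
        self._issued_at = 0.0

    def get_key(self):
        age = self._clock() - self._issued_at
        if self._key is None or age > REFRESH_INTERVAL_S:
            self._key = self._authenticate()
            self._issued_at = self._clock()
        return self._key
```

The design choice is that key lifetime is driven by the clock, not by connection events, which is exactly what decouples the auth server from network churn.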

 

5. Release cycle and testing

Ship often. Make releases fully automated.

Minimize false positives. Frequent false alarms cause fatigue, and eventually the real errors get ignored.

Make the system health highly visible.

Require a globally available, real-time display of service health for the entire organization. Have an internal website people can go to at any time to understand the current state of the service. Every employee should be able to see the service's status at any time.

Single-server deployment. If running the full system is difficult, developers will have a tendency to take a component view rather than a systems view.

Stress test for load. Test at twice the production load and make sure the system still holds up.

Perform capacity and performance testing prior to new releases. Performance testing is also a precondition for release.

Soft delete only. Never delete anything; just mark it deleted. Both users and operators can delete data by mistake.
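The soft-delete idea is simple enough to sketch in a few lines (field names are illustrative):

```python
import time

def soft_delete(record, deleted_by):
    """Mark a record deleted instead of removing it, so a user's or
    operator's mistake can be undone later."""
    record["deleted"] = True
    record["deleted_at"] = time.time()
    record["deleted_by"] = deleted_by
    return record

def visible(records):
    """Normal queries simply filter out soft-deleted rows."""
    return [r for r in records if not r.get("deleted")]
```

Recording who deleted what, and when, also doubles as the audit trail the paper asks for elsewhere.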

 

6. Resource management

Whatever the metric, there must be a direct and known correlation between this measure of load and the hardware resources needed. Work out the relationship between your current metrics and hardware resources; outside of storage, this is hard to do.

Make one change at a time.

 

7. Auditing, monitoring, and alerting

To get alerting levels correct, two metrics can help and are worth tracking:

1) alerts-to-trouble ticket ratio (with a goal of near one), and

2) number of systems health issues without corresponding alerts (with a goal of near zero).
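These two alerting-quality metrics are trivially computable; a sketch (function and key names are my own):

```python
def alert_quality(alert_count, ticket_count, unalerted_issue_count):
    """The paper's two alerting health metrics: the alerts-to-ticket ratio
    (goal: near one) and the count of health issues that never raised an
    alert (goal: near zero)."""
    ratio = alert_count / ticket_count if ticket_count else float("inf")
    return {
        "alerts_to_tickets": ratio,
        "unalerted_issues": unalerted_issue_count,
    }
```

A ratio well above one means noisy alerting (fatigue); a nonzero unalerted-issue count means monitoring gaps.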

Have a customer view of service. Sometimes everything looks fine from inside the system while users cannot reach it, so external monitoring is also necessary.

Latencies are the toughest problem. Examples are slow I/O and not quite failing but processing slowly.

Use performance counters for all operations. Monitor QPS and per-operation latency.
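A minimal sketch of such per-operation counters (in practice you would export these to a monitoring system and track percentiles, not just averages):

```python
from collections import defaultdict

class PerfCounters:
    """Per-operation request count and cumulative latency: enough to derive
    rates (QPS over a window) and average latency for monitoring."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.latency_sum_s = defaultdict(float)

    def record(self, op, latency_s):
        self.counts[op] += 1
        self.latency_sum_s[op] += latency_s

    def avg_latency_s(self, op):
        n = self.counts[op]
        return self.latency_sum_s[op] / n if n else 0.0
```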

Audit all significant operations. "Significant" is hard to pin down; it varies by scenario and comes down to design judgment.

Configurable logging. 

Make all reported errors actionable. A good suggestion: notice how Windows, after every crash, pops up a dialog telling you what you can do. The error message should indicate possible causes for the error and suggest ways to correct it.

Debugging in production. Ensure engineers are well trained in what is allowed on production servers. Debugging in production has real pitfalls to watch for.

Support a "big red switch." Turning off advanced functionality, disabling certain features, even stopping data writes or data import/export can all count as big red switches. Of course, anything you can turn off you must also be able to turn back on.
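A big red switch is essentially a set of feature kill switches; a minimal sketch (names are illustrative):

```python
class BigRedSwitch:
    """Per-feature kill switches: non-essential work can be shut off under
    pressure and, just as important, turned back on afterwards."""

    def __init__(self, features):
        self._enabled = {name: True for name in features}

    def disable(self, feature):
        self._enabled[feature] = False

    def enable(self, feature):
        self._enabled[feature] = True

    def allows(self, feature):
        # Unknown features default to off, the safe side.
        return self._enabled.get(feature, False)
```

In a real service the switch state would live in shared configuration so that flipping it takes effect fleet-wide, not per-process.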

Control admission. For example: refuse all requests, refuse a certain class of requests, or suspend a particular account.

Meter admission. It's vital that each service have a fine-grained knob to slowly ramp up usage when coming back on line or recovering from a catastrophic failure. If the service is not yet fully recovered when a flood of traffic arrives, it may crash all over again.
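One way to sketch that fine-grained ramp-up knob (a linear ramp over a fixed window; class and parameter names are hypothetical):

```python
import time

class AdmissionRamp:
    """A fine-grained knob that ramps admitted traffic from 0% back to 100%
    over a fixed window after recovery, instead of letting the full load hit
    a cold service at once."""

    def __init__(self, ramp_seconds, clock=time.time):
        self._clock = clock
        self._start = clock()
        self._ramp_seconds = ramp_seconds

    def admit_fraction(self):
        elapsed = self._clock() - self._start
        return min(1.0, elapsed / self._ramp_seconds)

    def admit(self, request_id):
        # Hash the request id into [0, 1) for a deterministic per-id decision.
        return (hash(request_id) % 1000) / 1000.0 < self.admit_fraction()
```

Rejected requests would get a retry-later response; the same knob can also ramp per account or per request class to match the admission controls above.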

 

8. Customer communication plan: prepare a response for every foreseeable incident

Certain types of events will bring press coverage. The service will be much better represented if these scenarios are prepared for in advance. Issues like mass data loss or corruption, security breach, privacy violations, and lengthy service down-times can draw the press. Have a communications plan in place. Know who to call when and how to direct calls. The skeleton of the communications plan should already be drawn up. Each type of disaster should have a plan in place on who to call, when to call them, and how to handle communications. When something does go wrong, communicating with users promptly and continuously matters a great deal; pass information (even the root cause) to users transparently and they will be more satisfied.

For some problems, the system can even guide users to resolve the issue themselves, which makes them happier still.

 

 

posted on 2012-03-25 19:02 by RaymondSQ