Michael Nygard on Building Resilient Systems
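Original article @ InfoQ.
- Feature-complete software and production-ready software are not the same thing. Developers often do not know what conditions look like in production, so they do not think through how the software will behave there. For example, in the development environment the load on Server A and Server B may be 1:1, but in production it may be 20:1, in which case Server B can run into trouble. Developers tend to overlook this simply because they are unaware of it.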
- Circuit Breaker
- Logs are very useful. Keep monitoring as separate from the business implementation as possible, because monitoring policies change often. Many monitoring settings should ideally be configurable at runtime.
"Anywhere there is a pool, definitely track who's blocking and how often, high water, low water and some stats about number of times things are being checked in and out. Other kind of health indicators: any place you've got a cache, keep track of how many items are in cache, what the hit rate is, what the eviction rate is; any place you've got the circuit breakers, keep track of how many times the circuit breakers are flipping from an open to a closed state or from closed to open, current state of all of them, of course, and the thresholds that are configured into it. Those are all useful things to expose through a monitoring and management interface.
"It can also be useful to expose controls on these things - for instance, with the circuit breaker, a control to reset it; with a pool a control to change what the high water and low water mark will be. I can think of several cases where we've had an ongoing partial failure mode and we needed to go in and change the maximum number of connections in a connection pool and dial it down, so that the front end system would stop crushing the back end system. That's a very useful kind of control to have at runtime."
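To make the circuit breaker points concrete, here is a minimal sketch in Python of a breaker that counts its state flips, exposes its current state and threshold, and offers a reset control. It is not from the interview: the class name `CircuitBreaker`, the `stats()` and `reset()` methods, and the parameters `failure_threshold` and `reset_timeout` are all hypothetical, and the half-open state of a full circuit breaker is simplified to a single trial call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Illustrative breaker; every name here is an assumption, not a real API."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # can be changed at runtime
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0
        self.opened_count = 0  # closed -> open flips
        self.closed_count = 0  # open -> closed flips

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1
        if self.state == "open" or self._failures >= self.failure_threshold:
            if self.state != "open":
                self.opened_count += 1
            self.state = "open"
            self._opened_at = self._clock()  # (re)start the open period

    def _record_success(self):
        self._failures = 0
        if self.state == "open":
            self.state = "closed"
            self.closed_count += 1

    def reset(self):
        """Operator control: force the breaker closed."""
        if self.state == "open":
            self.closed_count += 1
        self.state = "closed"
        self._failures = 0

    def stats(self):
        """Snapshot suitable for a monitoring or management interface."""
        return {
            "state": self.state,
            "failure_threshold": self.failure_threshold,
            "opened_count": self.opened_count,
            "closed_count": self.closed_count,
        }
```

A management endpoint could return `stats()` as is, and an operator action could call `reset()` or assign a new `failure_threshold`, which keeps the monitoring policy outside the business code that merely wraps its calls in `call()`.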
- Some problems, if solved during development, save a great deal in ongoing maintenance cost. He works through the numbers: for a site with 1 million hits per day, if every page request takes an extra 250 milliseconds, that seemingly trivial 250 milliseconds adds up to about 70 hours of additional computing time, which takes roughly 4 servers to handle. Beyond the cost of buying and maintaining those servers, there are licence fees, contract management, staff to look after the servers, and then managers for that staff... much like the butterfly effect.
"If we do that we will make the decision differently in some cases and we'll make the decision the same way in some cases. By that I mean we'll sometimes choose to incur that ongoing operational cost; we'll sometimes choose to spend some additional development time to avoid the ongoing operations cost. One of the examples that I use when I talk about capacity is if you're handling say, web page requests and you have 1 million hits per day - 1 million hits per day is not all that large these days - and each one takes just an extra 250 milliseconds.
"First of all, that's going to have an impact on your revenues, and companies like Google and Amazon have identified that very clearly, but secondly an extra 250 milliseconds on 1 million hits per day is about 70 hours of additional computing time, which means roughly you need 4 additional servers to handle the load. 4 additional servers draw power every month, they require administration every month, they may or may not require software licensing every month, they probably have support contracts. Once you get enough administrators, you need managers of administrators to keep the organization in check, so really, that 250 milliseconds per page that seems pretty small in development, translates into a pretty substantial ongoing operations cost."
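The arithmetic behind the 70 hours can be checked in a few lines of Python. The hit count and the 250 milliseconds come from the quote; the `utilization` parameter and the `servers_needed` helper are assumptions added for illustration, since the interview does not say how the figure of 4 servers was derived.

```python
import math

hits_per_day = 1_000_000
extra_seconds_per_hit = 0.250

# 1,000,000 hits * 0.25 s = 250,000 extra seconds of work per day
extra_seconds_per_day = hits_per_day * extra_seconds_per_hit
extra_hours_per_day = extra_seconds_per_day / 3600  # roughly 69.4 hours


def servers_needed(extra_hours, utilization):
    """Servers required if each one delivers 24 * utilization busy hours per day.

    `utilization` is a hypothetical planning parameter, not a figure from the talk.
    """
    return math.ceil(extra_hours / (24 * utilization))


print(round(extra_hours_per_day, 1))
print(servers_needed(extra_hours_per_day, 1.0))   # servers running flat out
print(servers_needed(extra_hours_per_day, 0.75))  # leaving some headroom
```

With servers running flat out the result is 3, and with a 75% utilization target it is 4, so one plausible reading of "roughly 4" is that it includes some headroom.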
Reposted from: https://www.cnblogs.com/caff/archive/2010/04/10/1708907.html