Earlier this week we had a service outage. The proper chain of events would be:
But what happened was:

This started around 3am California time, which is why none of the PBwiki team noticed it independently of the SMS alert mechanism. What should have been an isolated transient, simple to resolve and not user-visible, turned into a cascade of unpleasant timeouts that caused the service to slow and eventually halt. We’ve done an extensive internal examination of what happened, and we’re changing some technology, adding some automated checks, and doing a few procedural things more intelligently.
The main process change is something that is probably old hat for old-school ops people: the absence of a page alert is not an indication of systemwide health. We’ve deployed a lot of new infrastructure in the last few weeks, and I’d been getting occasional pages for a while, but none for the prior day or two. I’ve set up the daily equivalent of the Tuesday-at-noon air raid siren test, in which the absence of a message every morning is itself a problem. We’ve also got independent Nextel phones for on-call ops folks, so there are now several routes for the alarm pages to take, plus that funny push-to-talk thing so we can annoy one another at all times.
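For anyone who wants to try the same idea, here is a minimal sketch of that "silence is the alarm" check. It is not our actual setup: the function name, the 25-hour window, and the timestamps are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed window: one expected test message per day, plus an hour of slack.
MAX_SILENCE = timedelta(hours=25)


def heartbeat_missing(
    last_seen: Optional[datetime],
    now: datetime,
    max_silence: timedelta = MAX_SILENCE,
) -> bool:
    """Return True when the daily test message has NOT shown up in time.

    last_seen is the time the most recent test page arrived, or None if
    no test page has ever been received.
    """
    if last_seen is None:
        # Never having heard from the pager path is itself a failure.
        return True
    return now - last_seen > max_silence


if __name__ == "__main__":
    now = datetime(2007, 7, 27, 9, 0)
    # Test page arrived yesterday morning: healthy.
    print(heartbeat_missing(datetime(2007, 7, 26, 9, 0), now))
    # Nothing for two days: the silence should be treated as an alarm.
    print(heartbeat_missing(datetime(2007, 7, 25, 9, 0), now))
```

The point of the design is that the check runs somewhere other than the system sending the pages, so a dead alert path shows up as a missing message instead of as reassuring quiet.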
3 Responses for "Whither pagers?"
7/26, 11:15 EST I keep getting the “slow down” message for robots - at this time, I simply can’t access our wiki at all. Is there an “event” at pb wiki?
I’ve replied over email as well, but here’s some data for reference:
Your browser sends us this User-Agent string, which our software classifies as being a likely robot:
“Mozilla/5.0 (000000000; 0; 00000 000 00 0; 00000; 0000000000) 00000000000000 000000000000000”
Do you have any idea why that would be sent instead of something more common such as:
“Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4”
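To illustrate why a string like the first one can trip a robot check, here is a rough sketch of one possible heuristic. It is not the classifier our software actually uses: the token list and the function name are hypothetical, and a real check would weigh more signals than this.

```python
import re

# Hypothetical list of platform names a typical browser puts in the
# parenthesized part of its User-Agent string.
KNOWN_PLATFORM_TOKENS = ("Windows", "Macintosh", "Linux", "X11")


def looks_like_robot(user_agent: str) -> bool:
    """Guess whether a User-Agent string is unlikely to come from a browser."""
    ua = user_agent.strip()
    if not ua:
        # No User-Agent at all is treated as suspicious.
        return True
    match = re.search(r"\(([^)]*)\)", ua)
    if match is None:
        # No parenthesized platform section to inspect.
        return True
    comment = match.group(1)
    # Suspicious when the platform section names no known platform.
    return not any(token in comment for token in KNOWN_PLATFORM_TOKENS)


if __name__ == "__main__":
    masked = ("Mozilla/5.0 (000000000; 0; 00000 000 00 0; 00000; 0000000000) "
              "00000000000000 000000000000000")
    firefox = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.4) "
               "Gecko/20070515 Firefox/2.0.0.4")
    print(looks_like_robot(masked))
    print(looks_like_robot(firefox))
```

Under this sketch the zeroed-out string is flagged because its platform section names no recognizable platform, while the ordinary Firefox string passes.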