Would be great if the timeline covered the 19 minutes from 06:32 to 06:51. How long did it take to get the right people on the call? How long did it take to identify the deployment as a suspect?
Another massive gap is the rollback: 06:58 – 07:42, 44 minutes! What exactly was going on, and why did it take so long? What were those backup procedures mentioned briefly? Why were engineers stepping on each other's toes? What's the story with reverting the reverts?
Adding more automation and tests, and fixing that specific ordering issue, is of course an improvement. But it also adds more complexity, and any automation will ultimately fail some day.
The technical details are all appreciated, but it is going to be something else next time. Would be great to learn more about the human interactions. That's where the resilience of a socio-technical system actually lives, and I bet there is some room for improvement there.
It's nearly always BGP when this level of failure occurs.
This is a very nice write-up.
Are there any steps that can be taken to test these types of changes in a non-production environment?
Off-topic-ish, but this post on /r/ProgrammerHumor gave me a chuckle:
https://www.reddit.com/r/ProgrammerHumor/comments/vh9peo/jus...
07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.
Ouch
Really appreciate the speed, detail, and transparency of this post-mortem. One of the best in the industry, if not the best.
What's it like to be an engineer designing and working on these systems? Must be sooo fulfilling! #Goals; y'all are my heroes!!
Gotta hand it to them, a shining example of transparency and taking responsibility for mistakes.
Something else that I think would be smart to implement is reorder detection. Have the change-approval tooling specifically point out anything that gets reordered, and require manual approval for each section that gets moved around.
Having a script that walks through the file and points out any obvious mistakes would also be good to have.
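Something like this, as a rough sketch: assume the policy is plain text with `term NAME {` blocks (the file names and term syntax are placeholders, since I don't know Cloudflare's actual tooling), and flag every term whose relative position changed between the old and new revisions for explicit sign-off:

```python
import difflib

def term_order(policy_text: str) -> list[str]:
    """Term names in the order they appear; assumes lines like 'term NAME {'."""
    return [line.split()[1]
            for line in policy_text.splitlines()
            if line.strip().startswith("term ")]

def reordered_terms(old_policy: str, new_policy: str) -> list[str]:
    """Heuristic: terms present in both revisions whose relative order changed."""
    old = [t for t in term_order(old_policy) if t in term_order(new_policy)]
    new = [t for t in term_order(new_policy) if t in term_order(old_policy)]
    moved = []
    for tag, i1, i2, _, _ in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag != "equal":
            moved.extend(old[i1:i2])
    return moved

with open("policy.old") as f_old, open("policy.new") as f_new:
    for term in reordered_terms(f_old.read(), f_new.read()):
        print(f"REORDERED: term '{term}' moved -- flag for explicit approval")
```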
If I use Cloudflare, what can I do — if anything — to avoid disruption when they go down?
Now this is a post-mortem.
One of our sites uses Cloudflare, serves 400k pageviews per month, and generates around $650/day in ad and affiliate revenue. If the site is not up, the business is not making any money.
Looking at the hourly chart in Google Analytics (compared to the previous day), there isn't even a blip during this outage.
So for all the advantages we get from Cloudflare (caching, WAF, security [our WP admin is secured with Cloudflare Teams], redirects, page rules, etc.), I'll take these minor outages that make HN go apeshit.
Of course it helped that most of our traffic is from the US and that this happened when it did, but in the past week alone we served over 180 countries, which Cloudflare helps keep nice and fast :D
How did no one at Cloudflare think that this MCP thing should be part of the staging rollout? I imagine that was part of a // TODO.
It sounds like it's a key architectural part of the system, the one meant to "[...] convert all of our busiest locations to a more flexible and resilient architecture."
25 years of experience, and it's always the things that are supposed to make us "more flexible" and "more resilient" or robust/stable/safer <keyword> that end up royally f'ing us where the light don't shine.
shit dawg i just woke up
Time and time again, this type of response proves that it's the right way to handle a bad situation. Be humble, apologize, own your mistake, and give a transparent snapshot into what went wrong and how you're going to learn from it.
Or you could go the opposite direction and risk turning something like this into a PR death spiral.
Am I the only one who really doesn't think this is a big deal? They had an outage; they fixed it very quickly. Life goes on. Talking about the outage as if it's a reason for us all to ditch CF and buy/run our own hardware (which will be totally better) is so hyperbolic.
Most of the criticisms seem to be around BGP and network management. What I'm seeing here that is also important is that the change was first applied to a DC where the route change didn't trigger the defect. In essence, this is the classic problem of a test dataset giving a false sense of security because it varies from other configurations. For this reason my team prefers to roll out changes to production starting with a test region that most customers don't use, but which will still show some visible impact if any of our assumptions are wrong, such as hard-coded regions or reliance on services that aren't present or as capable across all regions. This practice has caught a number of rather serious errors for us that, while customer-impacting, were nowhere near as bad as if we had rolled out essentially at random, like many teams do. This matters even more the harder it is to perform rollbacks or for rollbacks to take effect, as with DNS and CDN caching changes.
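A rough sketch of that ordering, just to illustrate the idea (stage names, bake time, and the apply/health hooks are all placeholders, not anything Cloudflare actually uses):

```python
import time

# Stages ordered by blast radius: a low-traffic test region first,
# the busiest locations last.
ROLLOUT_STAGES = [
    ["test-region"],                          # canary region few customers use, but enough traffic to surface errors
    ["small-pop-1", "small-pop-2"],
    ["mid-pop-1", "mid-pop-2", "mid-pop-3"],
    ["busy-pop-1", "busy-pop-2"],
]

def apply_change(location: str) -> None:
    """Placeholder: push the config change to one location."""
    raise NotImplementedError

def healthy(location: str) -> bool:
    """Placeholder: check error rates / advertised prefixes for the location."""
    raise NotImplementedError

def rollout(bake_minutes: int = 30) -> None:
    for stage in ROLLOUT_STAGES:
        for loc in stage:
            apply_change(loc)
        time.sleep(bake_minutes * 60)         # let the change bake before widening
        if not all(healthy(loc) for loc in stage):
            raise RuntimeError(f"health check failed in {stage}; halting rollout")
```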
BGP changes should be like display resolution changes on your PC...
They should revert as a failsafe if not confirmed within X minutes.
Naively, it seems to me that there should at least be a warning somewhere if there are declarations after a REJECT-THE-REST.
I'm not familiar with whatever language this is, but wouldn't such a construct always indicate something was being ignored?
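I don't know the exact policy language either, but the check seems cheap regardless of syntax. A sketch, with the term structure invented for illustration: treat any term with no match condition and a reject action as the catch-all, and flag everything after it as unreachable:

```python
def unreachable_terms(terms: list[dict]) -> list[str]:
    """Names of terms that appear after a catch-all reject and can never match."""
    seen_catch_all = False
    dead = []
    for term in terms:
        if seen_catch_all:
            dead.append(term["name"])
        elif term["action"] == "reject" and not term.get("match"):
            seen_catch_all = True
    return dead

# Illustrative policy only; names and fields are made up.
policy = [
    {"name": "adv-default",     "match": "prefix-list default",    "action": "accept"},
    {"name": "reject-the-rest", "match": None,                     "action": "reject"},
    {"name": "adv-site-local",  "match": "prefix-list site-local", "action": "accept"},
]
print(unreachable_terms(policy))  # ['adv-site-local'] -- exactly the kind of term that should trigger a warning
```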
It's interesting that in 2022 we still have network issues caused by the wrong ordering of rules.
Everyone at some point experiences the dreaded REJECT sitting not at the end of the rule stack but just a bit too early.
Kudos to CF for such a good explanation of what caused the issue.
I'm surprised they did not conclude that rollouts should be executed over a longer period with smaller batches. When a system is as complicated as theirs, with so much impact, the only sane strategy is slow rolling updates, so that you can hit the brakes when needed.
Every outage represents an opportunity to demonstrate resilience and ingenuity. Outages are guaranteed to happen. Might as well make the most of them to reveal something cool about their infrastructure.
Been a fan of CF since they were essential for DDoS protection on various WordPress sites I deployed back then.
I buy more NET every time I see posts like this.
Really interesting that 19 cities handle 50% of the requests.
Uh, shouldn't there be a staging environment for this sort of change?
The DNS resolver was also impacted and still seems to have issues. We changed to Google DNS and that solved it.
The problem is, we can't tell all our clients that they should change this :(
The default way that most networking devices are managed is crazy in this day and age.
Like the post-mortem says, they will put mitigations in place, but this is something every network admin has to implement bespoke after learning the hard way that the default management approach is dangerous.
I’ve personally watched admins make routing changes where any error would cut them off from the device they are managing and prevent them from rolling it back — pretty much what happened here.
What should be the default on every networking device is a two-stage commit where the second stage requires a new TCP connection.
Many devices still rely on “not saving” the configuration, with a power cycle as the rollback to the previous saved state. This is a great way to turn a small outage into a big one.
This style of device management may have been okay for small office routers where you can just walk into the “server closet” to flip the switch. It was okay in the era when device firmware was measured in kilobytes and boot times in single digit seconds.
Globally distributed backbone routers are an entirely different scenario but the manufacturers use the same outdated management concepts!
(I have seen some small improvements in this space, such as devices now keeping a history of config files by default instead of a single current-state file only.)
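For what it's worth, some platforms do have a version of this already (Junos's `commit confirmed`, for example, rolls back automatically unless a confirming commit arrives within the timeout, if I remember correctly). A device-agnostic sketch of the pattern, with the apply/rollback hooks left as placeholders:

```python
import threading

class TwoStageCommit:
    """Sketch: apply a candidate config, then revert automatically unless the
    operator confirms -- ideally over a *new* connection -- within the timeout."""

    def __init__(self, apply_config, rollback_config, timeout_s: float = 600):
        self._rollback = rollback_config
        self._timer = threading.Timer(timeout_s, self._expire)
        apply_config()          # stage 1: push the candidate config
        self._timer.start()     # dead man's switch starts ticking

    def _expire(self) -> None:
        print("no confirmation received; reverting to last known-good config")
        self._rollback()

    def confirm(self) -> None:
        """Stage 2: call this only from a fresh session, which proves the change
        didn't cut off management access."""
        self._timer.cancel()
```

The important part is that the confirmation has to come from a session that was established after the change, otherwise you haven't proven anything.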
i'm gonna go with the less popular view here that overly detailed post-mortems do little in the grand scheme of things other than satisfy tech p0rn for a tiny, highly technical audience. does wonders for hiring indeed.
sure, transparency is better than "something went wrong, we take this very seriously, sorry." (although the non-technical crowd couldn't care less)
only people who don't do anything make no mistakes, but making such highly impactful changes so quickly (inside one day!) to the places where 50% of traffic happens seems like a huge red flag to me, no matter the procedure and safety valves.
We use Cloudflare to serve ~20-30 TB of traffic a month where I work. I was the SRE on call and got paged on our blackbox monitoring / third-party web checks failing.
It was very pleasant to find the Cloudflare status page pointing me to the issue right away (minutes after our alerts triggered), even though I couldn't yet replicate the issue myself.
I wish more companies would take note of the transparency and sense of urgency in updating their status page. (Looking at you, Azure.)
Ah, this is why iCloud Private Relay wasn't working this morning.
As others have said, this is a clear and concise write-up of the incident. That is underlined even more when you take into account how quickly they published it. I have seen companies take weeks or even months to publish an analysis half as good as this.
Not trying to downplay the outage; the outage was bad. But the relative speed of the recovery is pretty impressive, in my opinion. Sounds like they could have recovered even faster if not for a bit of toe-stepping.
who will make the abstraction-as-a-service we all need to protect us from config changes?
I read the blog twice and have some thoughts. The root cause seems to be: "While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes."
There was a dry run: "a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure."
And a peer review: "Before it was allowed to go out, it was also peer reviewed by multiple engineers."
I don't doubt the expertise of the Cloudflare engineers reviewing the change, and there was a dry run.
But is it really OK to apply a change to a spine network that affects 50% of traffic on the strength of just a peer review and a dry run? No blue/green, no gray release; maybe those aren't appropriate for a small change like this one, but this "small" change had a very big effect, so I think it would have been worth it.
And in my limited experience, a dry run by definition does nothing to the environment; it's a dry run, after all.
In the end the three offending lines were found. So I wonder: how did this re-ordering happen, and why?
For tiny changes like these, there should be some mechanism to verify their correctness, not just a review and a dry run.
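To make that last point concrete, here is the kind of check I have in mind, as a sketch only: the prefixes are placeholders and a real test would consume the vendor's dry-run output rather than this toy first-match evaluator. Evaluate the candidate policy against the full prefix list offline and assert that no critical prefix would be withdrawn:

```python
# Placeholder prefixes standing in for the anycast/site-local ranges that must never be withdrawn.
CRITICAL_PREFIXES = {"192.0.2.0/24", "198.51.100.0/24"}

def advertised(terms: list[dict], prefixes: set[str]) -> set[str]:
    """Toy first-match evaluator: term["match"] is a set of prefixes, or None for a catch-all."""
    kept = set()
    for prefix in prefixes:
        for term in terms:
            if term["match"] is None or prefix in term["match"]:
                if term["action"] == "accept":
                    kept.add(prefix)
                break  # first matching term wins
    return kept

def test_no_critical_prefix_withdrawn(candidate_policy: list[dict], all_prefixes: set[str]) -> None:
    missing = CRITICAL_PREFIXES - advertised(candidate_policy, all_prefixes)
    assert not missing, f"candidate policy would withdraw critical prefixes: {missing}"
```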
happy solstice everyone
CF is the only company I have ever seen that can have an outage and get pages of praise for it. I don't have any (current) use for Cloudflare's products, but I would love to see the culture that makes them praiseworthy spread to other companies.
Having been on the other side of similar outages, I am very impressed by their response timeline.
They said they ran a dry-run. What did that do, just generate these diffs? I would have expected them to have some way of simulating the network for BGP changes in order to verify that they didn't just fuck up their traffic.
Part of the blog says:
"In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo."
Is the term MCP synonymous with "tier 1 PoPs" (mentioned from time to time in other Cloudflare blogs), or do the two terms refer to different things?
Yet another BGP-caused outage. At some point we should collect all of them:
- Cloudflare 2022 (this one)
- Facebook 2021: https://news.ycombinator.com/item?id=28752131 - this one probably had the single biggest impact, since engineers got locked out of their systems, which made the fixing part look like a sci-fi movie
- (Indirectly caused by BGP: Cloudflare 2020: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)
- Google Cloud 2020: https://www.theregister.com/2020/12/16/google_europe_outage/
- IBM Cloud 2020: https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...
- Cloudflare 2019: https://news.ycombinator.com/item?id=20262214
- Amazon 2018: https://www.techtarget.com/searchsecurity/news/252439945/BGP...
- AWS 2015: https://www.thousandeyes.com/blog/route-leak-causes-amazon-a...
- YouTube 2008: https://www.infoworld.com/article/2648947/youtube-outage-und...
And then there are incidents caused by hijacking: https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...
I lead the platform team at a fairly young startup in the D2C commerce space in the APAC region. This outage happened during peak traffic hours, which made me and the team look like amateurs within the company.
Cloudflare is great, and I would never move away from it. But from a business-continuity standpoint, is there a fallback approach we should be prepared with for cases like this?
One crude approach we were discussing: during an outage, we could change the NS records at the registrar to point to, e.g., Google Cloud DNS, which would already be in sync in terms of the DNS records it holds.
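A very rough sketch of that switch (the health-check URL is a placeholder, and `set_nameservers` stands in for whatever your registrar's API actually exposes; it is not a real SDK). The records have to be mirrored to the secondary provider ahead of time, anything that depends on Cloudflare's proxy (WAF, Workers, caching) won't carry over, and NS changes propagate slowly, so it only helps for long outages:

```python
import urllib.request
from typing import Callable, Sequence

# Example Google Cloud DNS name servers -- yours come from the zone you pre-created there.
FALLBACK_NS = ["ns-cloud-a1.googledomains.com", "ns-cloud-a2.googledomains.com"]

def site_reachable(url: str = "https://www.example.com/health", timeout: int = 5) -> bool:
    """Placeholder health check hitting the Cloudflare-fronted site."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def maybe_failover(domain: str,
                   set_nameservers: Callable[[str, Sequence[str]], None]) -> None:
    """set_nameservers is whatever your registrar's API provides (hypothetical here).
    Only flip after several consecutive failures, to avoid flapping on a blip."""
    failures = sum(1 for _ in range(3) if not site_reachable())
    if failures == 3:
        set_nameservers(domain, FALLBACK_NS)
```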
Would be nice to have some automation for keeping track of the health status of cloud services: a status API, a webhook, something. Maybe even a standard for it. Or a service that monitors all the major cloud providers.
We did get alarms, but our stuff partially worked, so CF was not the first thing we checked.
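In Cloudflare's case the status page looks like a standard Statuspage-hosted page, and those expose a JSON summary endpoint. Assuming that path is right (it's the usual Statuspage one, not something I've confirmed with Cloudflare), a minimal poller could be wired into alerting like this:

```python
import json
import time
import urllib.request

# Standard Statuspage summary endpoint -- assumed, since cloudflarestatus.com appears to be Statuspage-hosted.
STATUS_URL = "https://www.cloudflarestatus.com/api/v2/status.json"

def status_indicator() -> str:
    """Returns Statuspage's overall indicator: typically none/minor/major/critical."""
    with urllib.request.urlopen(STATUS_URL, timeout=10) as resp:
        payload = json.load(resp)
    return payload.get("status", {}).get("indicator", "unknown")

if __name__ == "__main__":
    while True:
        indicator = status_indicator()
        if indicator not in ("none", "unknown"):
            print(f"Cloudflare status: {indicator}")  # hook your paging/webhook here
        time.sleep(60)
```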
This is a great, concise explanation. Thank you for providing it so quickly.
If you forgive my prying, was this an implementation issue with the maintenance plan (operator or tooling error), a fundamental issue with the soundness of the plan as it stood, or an unexpected outcome from how the validated and prepared changes interacted with the system?
I imagine that an outage of this scope wasn’t foreseen in the development of the maintenance & rollback plan of the work.
Feels a little disingenuous to use the first 3/4 of the report to advertise.
I wish computers could stop us from making these kinds of mistakes without turning into Skynet.
Sounds like Cloudflare need a small low-traffic MCP that they can deploy to first.
In a world where it can take weeks for other companies to publish a postmortem after an outage (if they ever do), it never ceases to amaze me how quickly CF manages to get something like this out.
I think it's a testament to their ops/incident-response teams and internal processes, and it builds confidence in their ability to respond quickly when something does go wrong. Incredible work!
Node.js is still having issues. For example, https://nodejs.org/dist/v16.15.1/node-v16.15.1-darwin-x64.ta... doesn't download if you do "n lts".
tl;dr: Another BGP outage due to bad config changes.
Here's a somewhat old (2016) but very impressive system at a major ISP for avoiding exactly this: https://www.youtube.com/watch?v=R_vCdGkGeSk
...and yet they still push so hard for recentralization of the web...
Is there no system to unit test a rule-set?
Where does one even start with learning BGP? It has always seemed super interesting to me, but not really something that can be tackled on a small-scale, lab-type basis. Or am I wrong there?
Seems that after this outage, a lot of websites that are behind Cloudflare NSs gained top positions on Google SERPs with strange links like http://domain/XX/yyyyyyy
Really strange, a coincidence?
Replysite design / logo © 2022 Box Piper