
Cloudflare outage on June 21, 2022

  • 703 points
  • 15 days ago
  • Posted by @jgrahamc



@grenbys 15 days

Replying to @jgrahamc 🎙

Would be great if the timeline covered the 19 minutes from 06:32 to 06:51. How long did it take to get the right people on the call? How long did it take to identify the deployment as a suspect?

Another massive gap is the rollback: 06:58 to 07:42, 44 minutes! What exactly was going on and why did it take so long? What were those back-up procedures mentioned briefly? Why were engineers stepping on each other's toes? What's the story with reverting reverts?

Adding more automation and tests, and fixing that specific ordering issue, is of course an improvement. But it also adds complexity, and any automation will ultimately fail some day.

The technical details are all appreciated, but it is going to be something else next time. Would be great to learn more about the human interactions. That's where the resilience of a socio-technical system comes from, and I bet there is some room for improvement there.

Reply


@edf13 15 days

Replying to @jgrahamc 🎙

It's nearly always BGP when this level of failure occurs.

Reply


@sidcool 15 days

Replying to @jgrahamc 🎙

This is a very nice write up.

Reply


@testplzignore 15 days

Replying to @jgrahamc 🎙

Are there any steps that can be taken to test these types of changes in a non-production environment?

Reply


@malikNF 15 days

Replying to @jgrahamc 🎙

off-topic-ish, this post on /r/ProgrammerHumor gave me a chuckle

https://www.reddit.com/r/ProgrammerHumor/comments/vh9peo/jus...

Reply


@thejosh 15 days

Replying to @jgrahamc 🎙

07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.

Ouch

Reply


@dpz 15 days

Replying to @jgrahamc 🎙

Really appreciate the speed, detail, and transparency of this post-mortem. Really one of the best, if not the best, in the industry.

Reply


@ElectronShak 15 days

Replying to @jgrahamc 🎙

What's it like to be an engineer designing and working on these systems? Must be sooo fulfilling! #Goals; y'all are my heroes!!

Reply


@CodeWriter23 15 days

Replying to @jgrahamc 🎙

Gotta hand it to them, a shining example of transparency and taking responsibility for mistakes.

Reply


@minecraftchest1 15 days

Replying to @jgrahamc 🎙

Something else that I think would be smart to implement is reorder detection. Have the change approval specifically call out anything that gets reordered, and require manual approval for each section that gets moved around.

I also think it would be good to have a script that walks through the file and points out any obvious mistakes.
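
Roughly something like this sketch is what I have in mind, assuming the policy can be dumped as an ordered list of term names before and after the change (the term names below are made up):

    from difflib import SequenceMatcher

    def reordered_terms(before, after):
        """Return term names that exist in both revisions but changed position."""
        moved = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(a=before, b=after).get_opcodes():
            if tag in ("replace", "delete"):
                moved.extend(before[i1:i2])
            elif tag == "insert":
                moved.extend(after[j1:j2])
        common = set(before) & set(after)  # brand-new or deleted terms are a separate review item
        return sorted({t for t in moved if t in common})

    # Hypothetical policy term order, before and after the change.
    before = ["adv-transit", "adv-peers", "adv-site-local", "reject-the-rest"]
    after = ["adv-transit", "adv-peers", "reject-the-rest", "adv-site-local"]

    for term in reordered_terms(before, after):
        print(f"WARNING: term '{term}' changed position -- requires explicit approval")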

Reply


@mproud 15 days

Replying to @jgrahamc 🎙

If I use Cloudflare, what can I do — if anything — to avoid disruption when they go down?

Reply


@junon 15 days

Replying to @jgrahamc 🎙

Now this is a post mortem.

Reply


@weird-eye-issue 15 days

Replying to @jgrahamc 🎙

One of our sites uses Cloudflare and serves 400k pageviews per month and generates around $650/day in ad and affiliate revenue. If the site is not up the business is not making any money.

Looking at the hourly chart in Google Analytics (compared to the previous day) there isn't even a blip during this outage.

So for all the advantages we get from Cloudflare (caching, WAF, security [our WP admin is secured with Cloudflare Teams], redirects, page rules, etc) I'll take these minor outages that make HN go apeshit.

Of course it helped that most of our traffic is from the US and this happened when it did, but in the past week alone we served over 180 countries, which Cloudflare helps keep nice and fast :D

Reply


@keyle 15 days

Replying to @jgrahamc 🎙

How did no one at Cloudflare think that this MCP thing should be part of the staging rollout? I imagine that was part of a // TODO.

It sounds like it's a key architectural part of the system that "[...] convert all of our busiest locations to a more flexible and resilient architecture."

25 years of experience, and it's always the things that are supposed to make us "more flexible" and "more resilient" or robust/stable/safer <keyword> that end up royally f'ing us where the light don't shine.

Reply


@kache_ 15 days

Replying to @jgrahamc 🎙

shit dawg i just woke up

Reply


@rocky_raccoon 15 days

Replying to @jgrahamc 🎙

Time and time again, this type of response proves that it's the right way to handle a bad situation. Be humble, apologize, own your mistake, and give a transparent snapshot into what went wrong and how you're going to learn from it.

Or you could go the opposite direction and risk turning something like this into a PR death spiral.

Reply


@gcau 15 days

Replying to @jgrahamc 🎙

Am I the only one who really doesn't think this is a big deal? They had an outage; they fixed it very quickly. Life goes on. Talking about the outage as if it's a reason for us all to ditch CF and buy/run our own hardware (which will totally be better) is so hyperbolic.

Reply


@devonkim 14 days

Replying to @jgrahamc 🎙

Most of the criticisms seem to be around BGP and network management. What I'm seeing here that is also important is that the change was first applied to a DC where the route change didn't trigger the defect. In essence, this is the classic problem of a test dataset giving a false sense of security because it differs from other configurations. For this reason my team prefers to roll out changes to production via a test region that most customers don't use, but that will still show some visible impact if any of our assumptions are wrong, such as hard-coded regions or reliance on services that aren't present, or as capable, across all regions. This practice has caught a number of rather serious errors for us that, while customer-impacting, were nowhere near as bad as if we had simply rolled out in a random order like many teams essentially do. This matters even more when rollbacks are hard to perform or slow to take effect, as with DNS and CDN caching changes.
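
As a toy illustration of how we think about picking that first region (region names, features, and traffic shares below are all invented), the idea is to prefer a low-traffic region that still exercises as many configuration variations as possible:

    # Region names, features, and traffic shares below are invented.
    REGIONS = {
        "test-west":   {"features": {"mcp", "ipv6", "anycast"},                   "traffic": 0.01},
        "legacy-east": {"features": {"legacy-routing"},                           "traffic": 0.02},
        "eu-hub":      {"features": {"mcp", "ipv6", "anycast", "legacy-routing"}, "traffic": 0.30},
    }

    def pick_canary(regions, max_traffic=0.05):
        """Among low-traffic regions, pick the one exercising the most config variations."""
        candidates = {name: r for name, r in regions.items() if r["traffic"] <= max_traffic}
        if not candidates:
            raise ValueError("no region is below the traffic threshold")
        return max(candidates, key=lambda name: len(candidates[name]["features"]))

    print(pick_canary(REGIONS))  # -> test-west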

Reply


@leetrout 15 days

Replying to @jgrahamc 🎙

BGP changes should be like the display resolution changes on your PC...

It should revert as a failsafe if not confirmed within X minutes.
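
Something like this sketch, where apply_change() and rollback() are stand-ins for whatever actually pushes and reverts the config:

    import threading

    def apply_with_confirm(apply_change, rollback, timeout_s=300.0):
        """Apply a change, then revert automatically unless it is confirmed in time."""
        apply_change()
        timer = threading.Timer(timeout_s, rollback)  # fires only if never cancelled
        timer.start()
        return timer.cancel  # call this to confirm and keep the change

    # Example with stand-in callables:
    confirm = apply_with_confirm(
        apply_change=lambda: print("change applied"),
        rollback=lambda: print("not confirmed in time -- rolling back"),
        timeout_s=5.0,
    )
    # ...verify reachability from an independent path, then:
    confirm()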

Reply


@ransom1538 15 days

Replying to @jgrahamc 🎙

Still seeing failed network calls.

https://i.imgur.com/xHqvOzj.png

Reply


@drfrank 15 days

Replying to @jgrahamc 🎙

Naively, it seems to me that there should at least be a warning somewhere if there are declarations after a REJECT-THE-REST.

I'm not familiar with whatever language this is, but wouldn't such a construct always indicate something was being ignored?
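
As a sketch of the kind of warning I mean, assuming the policy can be exported as an ordered list of term names (names below are made up):

    def terms_after_reject(terms, reject_name="reject-the-rest"):
        """Return any terms that appear after the catch-all reject and can never match."""
        if reject_name not in terms:
            return []
        return terms[terms.index(reject_name) + 1:]

    policy = ["adv-transit", "adv-peers", "reject-the-rest", "adv-site-local"]  # hypothetical order
    for dead in terms_after_reject(policy):
        print(f"WARNING: term '{dead}' follows the catch-all reject and will never match")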

Reply


@Belphemur 15 days

Replying to @jgrahamc 🎙

It's interesting that in 2022 we still have network issues caused by rules being in the wrong order.

Everybody at some point experiences the dreaded REJECT that isn't at the end of the rule stack but comes just a bit too early.

Kudos to CF for such a good explanation of what caused the issue.

Reply


@xiwenc 15 days

Replying to @jgrahamc 🎙

I’m surprised they did not conclude that rollouts should be executed over a longer period, in smaller batches. When a system is as complicated as theirs, with so much impact, the only sane strategy is slow rolling updates so that you can hit the brakes when needed.
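
A rough sketch of what I mean, where apply() and healthy() are stand-ins for the real push and health-check machinery (nothing here reflects how Cloudflare actually sequences changes):

    import time

    def staged_rollout(locations, apply, healthy, batch_size=2, soak_s=600, on_failure=None):
        """Roll a change out in small batches, checking health after each one."""
        applied = []
        for i in range(0, len(locations), batch_size):
            batch = locations[i:i + batch_size]
            for loc in batch:
                apply(loc)
                applied.append(loc)
            time.sleep(soak_s)  # let the change soak before widening the blast radius
            if not all(healthy(loc) for loc in batch):
                if on_failure:
                    on_failure(applied)  # e.g. revert everything applied so far
                raise RuntimeError(f"health check failed after {batch}; rollout stopped")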

Reply


@ttul 15 days

Replying to @jgrahamc 🎙

Every outage represents an opportunity to demonstrate resilience and ingenuity. Outages are guaranteed to happen. Might as well make the most of it to reveal something cool about their infrastructure.

Reply


@asadlionpk 15 days

Replying to @jgrahamc 🎙

Been a fan of CF since they were essential for DDoS protection for the various WordPress sites I deployed back then.

I buy more NET every time I see posts like this.

Reply


@nerdbaggy 15 days

Replying to @jgrahamc 🎙

Really interesting that 19 cities handle 50% of the requests.

Reply


@rubatuga 15 days

Replying to @jgrahamc 🎙

Uh, shouldn’t there be a staging environment for these sort of changes?

Reply


@ggalihpp 15 days

Replying to @jgrahamc 🎙

The DNS resolver was also impacted and still seems to have issues. We changed to Google DNS and that solved it.

The problem is, we can't tell all our clients that they should change this :(

Reply


@jiggawatts 15 days

Replying to @jgrahamc 🎙

The default way that most networking devices are managed is crazy in this day and age.

Like the post-mortem says, they will put mitigations in place, but this is something every network admin has to implement bespoke after learning the hard way that the default management approach is dangerous.

I’ve personally watched admins make routing changes where any error would cut them off from the device they are managing and prevent them from rolling it back — pretty much what happened here.

What should be the default on every networking device is a two-stage commit where the second stage requires a new TCP connection.

Many devices still rely on “not saving” the configuration, with a power cycle as the rollback to the previous saved state. This is a great way to turn a small outage into a big one.

This style of device management may have been okay for small office routers where you can just walk into the “server closet” to flip the switch. It was okay in the era when device firmware was measured in kilobytes and boot times in single digit seconds.

Globally distributed backbone routers are an entirely different scenario but the manufacturers use the same outdated management concepts!

(I have seen some small improvements in this space, such as devices now keeping a history of config files by default instead of a single current-state file only.)
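
Here is a toy model of the device side of that two-stage commit, where the confirmation has to arrive over a fresh TCP connection or the staged candidate is quietly dropped; the protocol, commands, and timeout are all invented for illustration:

    import secrets
    import socketserver
    import threading

    CONFIRM_WINDOW_S = 120.0

    class DeviceState:
        """Running config plus an unconfirmed candidate guarded by a rollback timer."""

        def __init__(self):
            self.lock = threading.Lock()
            self.running_config = "initial-config"
            self.candidate = None  # (token, config) while a change awaits confirmation
            self.timer = None

        def stage(self, config):
            with self.lock:
                if self.timer:
                    self.timer.cancel()
                token = secrets.token_hex(4)
                self.candidate = (token, config)
                self.timer = threading.Timer(CONFIRM_WINDOW_S, self._expire)
                self.timer.start()
                return token

        def confirm(self, token):
            with self.lock:
                if self.candidate and self.candidate[0] == token:
                    self.running_config = self.candidate[1]  # second stage: commit for real
                    self.timer.cancel()
                    self.candidate = None
                    return True
                return False

        def _expire(self):
            with self.lock:
                self.candidate = None  # running_config was never touched, so "rollback"
                                       # just means discarding the unconfirmed candidate

    STATE = DeviceState()

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            line = self.rfile.readline().decode().strip()
            if line.startswith("STAGE "):
                token = STATE.stage(line[len("STAGE "):])
                self.wfile.write(f"STAGED {token} -- confirm from a NEW connection\n".encode())
            elif line.startswith("CONFIRM "):
                ok = STATE.confirm(line[len("CONFIRM "):])
                self.wfile.write(b"COMMITTED\n" if ok else b"UNKNOWN OR EXPIRED TOKEN\n")

    if __name__ == "__main__":
        with socketserver.ThreadingTCPServer(("127.0.0.1", 9999), Handler) as srv:
            srv.serve_forever()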

Reply


@throwaway_uke 15 days

Replying to @jgrahamc 🎙

I'm gonna go with the less popular view here that overly detailed post-mortems do little in the grand scheme of things other than satisfy tech p0rn for a tiny, highly technical audience. Does wonders for hiring, indeed.

Sure, transparency is better than "something went wrong, we take this very seriously, sorry" (although the non-technical crowd couldn't care less).

Only people who don't do anything make no mistakes, but making such highly impactful changes so quickly (inside one day!) to where 50% of traffic happens seems like a huge red flag to me, no matter the procedure and safety valves.

Reply


@llama052 15 days

Replying to @jgrahamc 🎙

We use Cloudflare to serve ~20-30 TB of traffic a month where I work. I was the SRE on call when I got paged on our blackbox monitoring/third-party web checks failing.

It was very pleasant to find the Cloudflare status page pointing me to the issue right away (minutes after our alerts triggered), even though I couldn't replicate the issue myself yet.

I wish more companies would take note of the transparency and the sense of urgency in updating their status page. (Looking at you, Azure.)

Reply


@js2 15 days

Replying to @jgrahamc 🎙

Ah, this is why iCloud Private Relay wasn't working this morning.

Reply


@kylegalbraith 15 days

Replying to @jgrahamc 🎙

As others have said, this is a clear and concise write-up of the incident. That is underlined even more when you take into account how quickly they published it. I have seen some companies take weeks or even months to publish an analysis half as good as this.

Not trying to take the light away from the outage; the outage was bad. But the relative quickness of the recovery is pretty impressive, in my opinion. Sounds like they could have recovered even quicker if not for a bit of toe-stepping.

Reply


@sharps_xp 15 days

Replying to @jgrahamc 🎙

Who will build the abstraction-as-a-service we all need to protect us from config changes?

Reply


@mikewang 15 days

Replying to @jgrahamc 🎙

I read the blog twice and have some thoughts. The root cause seems to be: "While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes."

There was a dry run: "a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure."

And a peer review: "Before it was allowed to go out, it was also peer reviewed by multiple engineers."

It makes me wonder about the review by Cloudflare's engineers, and there was a dry run as well.

But is it really OK to apply a change to a spine network carrying 50% of network traffic on the strength of just a peer review and a dry run? No blue/green, no gray release; maybe those aren't appropriate for a small change like this, but this "small" change had a really big effect. I think they would have been worth it.

And in my limited experience, a dry run never does anything to the environment; that is what makes it a dry run.

In the end, the three offending lines were found. So I wonder: how did this re-ordering happen, and why?

For tiny changes like these, there should be some mechanism to verify their correctness, not just review and a dry run.
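
One concrete shape such a mechanism could take, as a sketch: if the dry run can emit the set of prefixes that would be advertised after the change, the tooling could refuse to proceed whenever a prefix on a critical list would be withdrawn (the prefixes below are documentation ranges, not Cloudflare's):

    from ipaddress import ip_network

    # Made-up "critical" prefixes, standing in for the real list.
    CRITICAL = {ip_network("198.51.100.0/24"), ip_network("203.0.113.0/24")}

    def withdrawn_critical(current, proposed):
        """Critical prefixes advertised today that the change would stop advertising."""
        return (current - proposed) & CRITICAL

    current = {ip_network("198.51.100.0/24"), ip_network("203.0.113.0/24"), ip_network("192.0.2.0/24")}
    proposed = {ip_network("198.51.100.0/24"), ip_network("192.0.2.0/24")}  # e.g. parsed from dry-run output

    lost = withdrawn_critical(current, proposed)
    if lost:
        raise SystemExit(f"refusing to proceed: change withdraws critical prefixes {sorted(map(str, lost))}")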

Reply


@ruined 15 days

Replying to @jgrahamc 🎙

happy solstice everyone

Reply


@psim1 15 days

Replying to @jgrahamc 🎙

CF is the only company I have ever seen that can have an outage and get pages of praise for it. I don't have any (current) use for CloudFlare's products but I would love to see the culture that makes them praiseworthy spread to other companies.

Reply


@badrabbit 15 days

Replying to @jgrahamc 🎙

Having been on the other side of similar outages, I am very impressed by their response timeline.

Reply


@lilyball 15 days

Replying to @jgrahamc 🎙

They said they ran a dry-run. What did that do, just generate these diffs? I would have expected them to have some way of simulating the network for BGP changes in order to verify that they didn't just fuck up their traffic.

Reply


@wondernine 14 days

Replying to @jgrahamc 🎙

Part of the blog says :

"In this time, we’ve converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo."

Is the term MCP synonymous with "tier 1 PoPs" (mentioned elsewhere in other cloudflare blogs from time to time) or are the two terms referring to different things?

Reply


@kurtextrem 15 days

Replying to @jgrahamc 🎙

Yet another BGP caused outage. At some point we should collect all of them:

- Cloudflare 2022 (this one)

- Facebook 2021: https://news.ycombinator.com/item?id=28752131 - this one probably had the single biggest impact, since engineers got locked out of their systems, which made the fixing part look like a sci-fi movie

- (Indirectly caused by BGP: Cloudflare 2020: https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)

- Google Cloud 2020: https://www.theregister.com/2020/12/16/google_europe_outage/

- IBM Cloud 2020: https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...

- Cloudflare 2019: https://news.ycombinator.com/item?id=20262214

- Amazon 2018: https://www.techtarget.com/searchsecurity/news/252439945/BGP...

- AWS: https://www.thousandeyes.com/blog/route-leak-causes-amazon-a... (2015)

- Youtube: https://www.infoworld.com/article/2648947/youtube-outage-und... (2008)

And then there are incidents caused by hijacking: https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...

Reply


@kebab-case 14 days

Replying to @jgrahamc 🎙

I lead the platform team of a fairly young startup in the D2C commerce space in the APAC region. This outage happened during peak traffic hours which made me and the team look like amateurs in the company.

Cloudflare is great, and I would never move away from it. But from a business continuity standpoint, is there a fallback approach that we should be prepared for during such cases?

One crude approach we were discussing: during an outage, we could change the NS records at the registrar to point to, e.g., Google Cloud DNS, which would already be in sync with the same DNS records.
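
The sync half could be as simple as periodically exporting the zone from Cloudflare in BIND format. This sketch assumes the dns_records/export endpoint and a token with DNS read access; pushing the file into the standby provider (e.g. with something like gcloud dns record-sets import) and flipping the NS records are separate, provider-specific steps:

    import requests  # pip install requests

    API = "https://api.cloudflare.com/client/v4"

    def export_zone_file(zone_id, api_token):
        """Fetch the zone's DNS records from Cloudflare in BIND zone-file format."""
        resp = requests.get(
            f"{API}/zones/{zone_id}/dns_records/export",
            headers={"Authorization": f"Bearer {api_token}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.text

    if __name__ == "__main__":
        # Placeholders -- substitute your own zone ID and API token.
        with open("fallback.zone", "w") as f:
            f.write(export_zone_file("YOUR_ZONE_ID", "YOUR_API_TOKEN"))
        # Run this on a schedule, then import fallback.zone into the standby provider
        # so it is already warm if you ever have to switch NS records at the registrar.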

Reply


@AtNightWeCode 14 days

Replying to @jgrahamc 🎙

Would be nice to have some automation that one could use for keeping track of the health status of cloud services. A status API, a webhook solution, something. Maybe even a standard for it. Or a service that monitors all major cloud services.

We did get alarms. But our stuff partially worked, so CF was not the first thing to check.
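
Most of the big providers already expose an Atlassian Statuspage-style JSON endpoint, so a minimal poller could look like this (the URL below follows the standard Statuspage layout; treat it as an assumption and verify it for each provider you care about):

    import requests  # pip install requests

    # Standard Statuspage "status" endpoints; URLs are assumptions, verify per provider.
    STATUS_PAGES = {
        "cloudflare": "https://www.cloudflarestatus.com/api/v2/status.json",
    }

    def check_providers():
        """Return provider -> status indicator ('none', 'minor', 'major', 'critical', ...)."""
        results = {}
        for name, url in STATUS_PAGES.items():
            try:
                data = requests.get(url, timeout=10).json()
                results[name] = data["status"]["indicator"]
            except requests.RequestException:
                results[name] = "unreachable"
        return results

    if __name__ == "__main__":
        for provider, indicator in check_providers().items():
            if indicator not in ("none", "unreachable"):
                print(f"ALERT: {provider} reports '{indicator}' -- check their status page first")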

Reply


@Tsiklon 15 days

Replying to @jgrahamc 🎙

This is a great concise explanation. Thank you for providing it so quickly

If you forgive my prying, was this an implementation issue with the maintenance plan (operator or tooling error), a fundamental issue with the soundness of the plan as it stood, or an unexpected outcome from how the validated and prepared changes interacted with the system?

I imagine that an outage of this scope wasn’t foreseen in the development of the maintenance & rollback plan of the work.

Reply


@xtat 15 days

Replying to @jgrahamc 🎙

Feels a little disingenuous to use the first 3/4 of the report to advertise.

Reply


@DustinBrett 15 days

Replying to @jgrahamc 🎙

I wish computers could stop us from making these kinds of mistakes without turning into Skynet.

Reply


@terom 15 days

Replying to @jgrahamc 🎙

TODO: use commit-confirm for automated rollbacks

Sounds like a good idea!

Reply


@trollied 15 days

Replying to @jgrahamc 🎙

Sounds like Cloudflare need a small low-traffic MCP that they can deploy to first.

Reply


@samwillis 15 days

Replying to @jgrahamc 🎙

In a world where it can take weeks for other companies to publish a postmortem after an outage (if they ever do), it never ceases to amaze me how quickly CF manages to get something like this out.

I think it's a testament to their ops/incident response teams and internal processes, and it builds confidence in their ability to respond quickly when something does go wrong. Incredible work!

Reply


@sschueller 15 days

Replying to @jgrahamc 🎙

Nodejs is still having issues. For example: https://nodejs.org/dist/v16.15.1/node-v16.15.1-darwin-x64.ta... doesn't download if you do "n lts"

Reply


@thomashabets2 15 days

Replying to @jgrahamc 🎙

tl;dr: Another BGP outage due to bad config changes.

Here's a somewhat old (2016) but very impressive system at a major ISP for avoiding exactly this: https://www.youtube.com/watch?v=R_vCdGkGeSk

Reply


@johnklos 15 days

Replying to @jgrahamc 🎙

...and yet they still push so hard for recentralization of the web...

Reply


@philipwhiuk 15 days

Replying to @jgrahamc 🎙

Is there no system to unit test a rule-set?
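
There's no reason there couldn't be. As a sketch, if the rule-set can be exported as an ordered list of (match, action) pairs, even first-match evaluation plus a couple of asserts might catch this class of mistake (everything below is invented for illustration):

    from ipaddress import ip_network

    # Hypothetical ordered rule-set: first matching rule wins.
    RULES = [
        (ip_network("192.0.2.0/24"), "accept"),
        (ip_network("198.51.100.0/24"), "accept"),
        (ip_network("0.0.0.0/0"), "reject"),  # catch-all
    ]

    def evaluate(prefix):
        """First-match evaluation of the rule-set for a given prefix."""
        for match, action in RULES:
            if prefix.subnet_of(match):
                return action
        return "reject"

    def test_critical_prefixes_are_accepted():
        for critical in ("192.0.2.0/24", "198.51.100.0/24"):
            assert evaluate(ip_network(critical)) == "accept"

    def test_catch_all_is_last():
        assert RULES[-1][0] == ip_network("0.0.0.0/0")

    if __name__ == "__main__":
        test_critical_prefixes_are_accepted()
        test_catch_all_is_last()
        print("rule-set tests passed")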

Reply


@thesuitonym 15 days

Replying to @jgrahamc 🎙

Where does one even start with learning BGP? It always seemed super interesting to me, but not really something that could be dealt with on a small scale, lab type basis. Or am I wrong there?

Reply


@loist 12 days

Replying to @jgrahamc 🎙

Seems that after this outage a lot of websites that are behind Cloudflare nameservers have gained top positions in Google SERPs with strange links like http://domain/XX/yyyyyyy

Really strange. A coincidence?

Reply

