Hacker News Re-Imagined

Nov 16 GCP Load Balancing Incident Report

  • 167 points
  • 1 day ago
  • posted by @joshma
  • 73 comments


@stevefan1999 1 day

Replying to @joshma 🎙

one bug fixed, two bugs introduced...

Reply


Replying to @joshma 🎙

This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.

Reply


@m0zg 23 hours

Replying to @joshma 🎙

> customers affected by the outage _may have_ encountered 404 errors

> for the inconvenience this service outage _may have_ caused

Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage (if not exactly how many) of the requests were 404s and for which customers. Why the weasel language? Own it.

Reply


@bullen 17 hours

Replying to @joshma 🎙

This is my experience of the outage: my DNS servers stopped working, but HTTP was operational if I used the IP, so something is rotten with this report.

Lesson learned: I will switch to AWS in Asia and only use GCP in the central US, with GCP as backup in Asia and IONOS as backup in the central US.

Europe is a non-issue for hosting because it's where I live and services are plentiful.

I'm going to pay for a fixed IP on the fiber connection that offers one and host the primary DNS on my own hardware with lead-acid battery backup.

Enough of this external dependency crap!

Reply


@gigatexal 1 day

Replying to @joshma 🎙

I find the post-mortem really humanizing. As a customer of GCP, there’s no love lost on my end.

Reply


@SteveNuts 1 day

Replying to @joshma 🎙

Is there any possibility that data POSTed during that outage could have leaked something pretty sensitive?

For example, I enter my credit card info on Etsy just prior to the issue, and just as I hit send, the payload now gets sent to Google?

At that scale there has to be many examples of similar issues, no?

Reply


@breakingcups 11 hours

Replying to @joshma 🎙

What I would not give for a comprehensive leak of Google's major internal post-mortems.

Reply


@darkwater 20 hours

Replying to @joshma 🎙

"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."

This reminds everyone that even the top-notch engineers who work at Google are still human. A bugfix that didn't really fix the bug is one of the more human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.
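
For illustration, here's a tiny sketch (entirely my own invention, in Python, not Google's actual code or config format) of how a validator hardened against the error shape seen in testing can still accept a differently malformed config produced by a race:

    # Hypothetical example only -- not GCP's configuration format or code.
    # "Patch B" hardens the validator against the malformed input seen in
    # testing, but the race condition produces a different malformed shape
    # that still passes validation.

    def validate_config(config: dict) -> bool:
        """Reject the error shape observed during testing: a missing 'backends' key."""
        if "backends" not in config:      # what the patch guards against
            return False
        return True

    # Error shape seen in testing: key absent entirely -> correctly rejected.
    assert validate_config({"routes": ["/"]}) is False

    # Error shape the race actually produces (assumed here: key present but
    # empty) -> accepted, and the bad config propagates to the load balancers.
    assert validate_config({"routes": ["/"], "backends": []}) is True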

Reply


@chairmanwow1 1 day

Replying to @joshma 🎙

Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.

My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.

Anyone else have stabilizing anecdata?

Reply


@throwoutway 1 day

Replying to @joshma 🎙

Strange that the race condition existed for 6 months, and yet it manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I’m not good with statistics, but what are the chances?
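
As a rough back-of-envelope (my own numbers, assuming the trigger was equally likely in any window over those 6 months, which the report doesn't claim and the rollout itself probably invalidates):

    # Back-of-envelope only: assumes the race is equally likely to fire in any
    # 30-minute window across the ~6 months it existed, which is almost
    # certainly false -- the patch rollout itself may have changed the odds.
    six_months_minutes = 6 * 30 * 24 * 60   # ~259,200 minutes
    window_minutes = 30                      # the final stretch of the rollout

    p = window_minutes / six_months_minutes
    print(f"p ~ {p:.6f} (about 1 in {round(1 / p):,})")   # ~0.000116, 1 in 8,640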

Reply


@htrp 23 hours

Replying to @joshma 🎙

Did Roblox ever release the incident report from their outage?

Reply


Replying to @joshma 🎙

To me this shows Google hasn't put sufficient monitoring in place to know the scale of a problem and to choose the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, taking maybe 15 minutes (on top of diagnosis and response times).

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter puts a really large load on all surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
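
A minimal sketch of what that decision could look like (my own invention; the thresholds, names, and "nuke everything" primitive are assumptions, not anything Google has described):

    # Hypothetical sketch of scale-aware incident response. Thresholds, names
    # and the global-rollback primitive are assumptions, not GCP's design.

    def choose_remediation(error_rate: float) -> str:
        """Pick a remediation strategy from the measured fraction of failing requests."""
        if error_rate >= 0.50:
            # Near-total outage: fastest possible global rollback; heavy load on
            # surrounding infrastructure, so it must be load-tested ahead of time.
            return "instant-global-rollback"   # ~seconds
        if error_rate >= 0.01:
            # Partial outage: rolling restart to limit blast radius.
            return "rolling-restart"           # ~15 minutes
        # Corner case: investigate before touching production.
        return "investigate-first"

    for rate in (0.001, 0.05, 0.98):
        print(f"{rate:>6.1%} failing -> {choose_remediation(rate)}")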

Reply

