Hacker News Re-Imagined

Inside the longest Atlassian outage

  • 1232 points
  • 2 months ago

  • @andyjohnson0
  • Created a post



@hsnewman 2 months

Replying to @andyjohnson0

Sounds like the continuity planners at Atlassian (the fall guys) will be looking for a new job.

Reply


@bogomipz 2 months

>"The outage is its 9th day, having started on Monday, 4th of April." >"It took until Day 9 for executives at the company to acknowledge the outage."

Just to put this in perspective. These executives would have left on a Friday afternoon to start their weekends without bothering to publicly address an ongoing outage that was by then 5 days old.

This is mind boggling. Like did some C-level exec say something like "Let's just park this whole outage communication discussion until Monday, have a good weekend everyone."?

Reply


@gunapologist99 2 months

Trello seems to still be up?

Reply


@8n4vidtmkvmk 2 months

hahaha, i left atlassian when they deleted everyone's mercurial repos with no way to export either. fuck them. they've proven time and time again that they don't care about their customers.

Reply


@victor9000 2 months

Wow would you look at that, a complete Atlassian puff piece got published in the WSJ just hours ago.

How peculiar that the biggest active outage in the history of this company is not mentioned once in this "article".

I'm left to assume that PR teams can plant whatever they see fit in the WSJ at a moment's notice. I guess that's what passes for journalism these days.

https://www.wsj.com/articles/atlassian-puts-easy-to-use-codi...

Reply


@R0ger 2 months

I guess this is a wake-up call for the people rushing to SaaS solutions.

Reply


@kgeist 2 months

We use on-premises setups for almost everything (we generally avoid cloud solutions so we keep full control of our data). Sometimes (approximately once a month) it goes down for a few minutes, which already feels like torture because all our processes depend on it. I can't imagine having no access to it for several weeks; all our work would grind to a halt... The office of the guy who administers the on-premises servers is literally next door, and all it takes is a visit to him and everything works again after 5 minutes. Reading horror stories like this (Slack being down, Atlassian being down, no one knowing what is happening or when it will end, etc.), I wonder why many companies choose cloud solutions for critical business processes. Is it pricing? Ease of use? I can understand why very small companies would choose it, but I don't understand why a medium/large business would choose anything but an on-premises setup.

Reply


@kingofpandora 2 months

Engineering mistakes happen.

The most inexcusable thing is not communicating with the paying customers who have been affected for over a week.

Atlassian's Global Head of Customer Success probably should have been fired but here she is promoting Atlassian Cloud on LinkedIn three days ago: https://www.linkedin.com/mwlite/in/gertie-rizzo-5b70061

Actually reading a bit more, it seems like their customer team was partying in Las Vegas instead of taking care of business: https://www.linkedin.com/mwlite/feed/hashtag/atlassianteam22

Priorities.

Reply


@xyst 2 months

Atlassian about to dip over the next few years as firms around the world slowly remove themselves from their ecosystem of products.

Reply


@xiaodai 2 months

Comes across as a jerk. How can an outsider say things with such certainty?

Reply


@dirtylowprofile 2 months

Around 4-5 years ago I saw their Facebook page and was surprised by the bad reviews. And here we are, companies are still using Atlassian.

Reply


@bluedino 2 months

Regarding the backup restores:

I once worked at a company that had a data loss issue. There was nothing else we could do; we had exhausted every option we had over almost 40 hours. At the end of the second day, it was decided to restore from backup.

We had done this before, as a test. It took about 12 hours to restore the data and another 12 hours to import the data and get back up and running.

One small thing was different this time, and it had huge consequences. As a cost-saving measure, an engineer had changed the location of our backups to the cold-storage tier offered by our cloud provider. All backups, not just 'old' ones.

This added 2 additional days to our recovery time, for a total of five days. Interestingly enough, even though we offered a full month's refund to all of our customers, not even half of them took us up on it.

Reply


@farseer 2 months

They have recently killed off their on-premises offerings; it's cloud only now. And this makes it harder to trust both the security and integrity of your data.

Reply


@jmondi 2 months

What blows my mind is that Atlassian stock has barely taken a hit...

Reply


@mdoms 2 months

Title is a bit misleading; there's no insider info here. This is all stuff we knew from the official statements, the blog post, Reddit and Twitter.

Reply


@rmbyrro 2 months

Are Confluence pages and Jira tickets built like a GPT-3 300-terabyte model?

I mean, I thought they were text.

5 days to restore text?

They must be generated by a huge complex deep learning voodoo.

Atlassian is working on the bleeding edge of technology. This outage is understandable...

Reply


@napolux 2 months

Yeah, let's centralize the Internet (born decentralized). This is what the Internet has become.

Reply


@sgallant 2 months

Thinking about all the folks with sites powered by Confluence and this quote I heard from a customer today:

“We build to static HTML, deploy to S3 and Cloudfront, and it’s f*ing bomb proof”

Reply


@Alex3917 2 months

A few years ago we didn't renew our subscription on time because we got the email over Christmas break, and iirc they deleted all of our data in less than two weeks. They were eventually able to manually restore it from backups, but they restored it incorrectly so there was a bunch of stuff broken. This whole thing isn't even remotely surprising to me.

Reply


@RomanPushkin 2 months

> I've never seen a product outage last this long

Title should be "Inside the longest outage of all time", without the word "Atlassian" in it.

Reply


@nemothekid 2 months

>Most of them said they won’t leave the Atlassian stack, as long as they don’t lose data. This is because moving is complex and they don’t see a move would mitigate a risk of a cloud provider going down.

I still don't understand the stranglehold Jira has on some clients. I can't quickly think of another SaaS product that could be down for almost 2 weeks and not have most customers leave.

Reply


@faddypaddy34 2 months

It's almost as if entrusting business critical data to offsite cloud providers isn't the best solution.

Reply


@anotherevan 2 months

RT @paularambles: Why say Jira’s down when you can say Atlassian shrugged?

https://twitter.com/paularambles/status/1514268251349569538

Reply


@travisgriggs 2 months

> Atlassian is a tech company, built by engineers, building products for tech professionals.

I am curious if anyone can provide any more insight on this simplification.

I've worked at companies like this. Originally a core of motivated creative individuals make a cool product. As the business grows rapidly, Pournelle's (Iron) Law (of Bureaucracy) takes over. For a variety of reasons, the very capable creators depart and are replaced by less motivated/aware individuals who are glad to have a job and easily compelled to do things to the product that probably should not be done.

My guess is that while Atlassian may have originally been one of those cool founder places, it has probably morphed into the more incompetent version that all too often comes with scale. But I don't know. Hence my question: can anyone speak to the true current tech capabilities of this company?

Reply


@cdjk 2 months

This isn't their longest outage - last time they couldn't recover the data and ended up restoring it from email archives.

Reply


@elesbao 2 months

As a side note that someone else already made: it is interesting that many companies that use Jira also use Slack, but the volume of noise/complaints/mentions is way different compared with when Slack is down. I barely saw people complaining.

Reply


@dynamohk 2 months

Rely on a SaaS, complain, yet don't have BCP/DR yourself. Huh.

Reply


@ineedasername 2 months

Something to consider is that Jira can require a great deal of configuration to tailor it to your needs. If you already have a DevOps team of some capacity (not everyone does), then it may only be a small incremental increase to run things on prem. I did it myself: I'm very much not a DevOps person, mostly unfamiliar with optimizing JVM parameters for apps like this, but it still only took me about 5 hours to get things running stable, and then another 2 hours or so a few weeks later to tweak things like heap size to make things go a bit faster (though it was still somewhat slow).

To be completely open though, I don't know how much DevOps overhead is involved in maintenance or feature updates. I hated the app and used it for less than a year, so I didn't have much exposure. I guess my point is simply that you may not need to use their SaaS option if you have a decent DevOps team already. After the initial setup time I doubt I spent more than half an hour a month managing the internals and updates.

I did spend more than that on configuring the system for use, which you'll need to do regardless.

Reply


@h2odragon 2 months

> However, if they [restore backups], while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point

OK, so you restore the backups to a separate system, and selectively copy the stomped accounts' data back to production. Simple concepts aren't that simple at their scale, sure, but I suspect this is skimping on details of some truly horrendous monolithic architecture choices that they're trying to hide.

Not that I ever thought using their products was a good idea; to be clear about my position... But at this point anyone continuing to rely on them for anything is asking for the suffering they'll get. Signing up for their crap for a vital business function is like offering your tonker to a snapping turtle.

Reply


@ChrisMarshallNY 2 months

Heh. We have a Confluence account.

That no one uses.

So we didn't notice.

Reply


@throwawayHN378 2 months

When in doubt I go on LinkedIn and find an engineer that works for the company and message them directly.

Reply


@Rantenki 2 months

I've repeatedly asked Atlassian:

1. Can they confirm that they have backups of our data (about a thousand stories, substantial Confluence content, Opsgenie history, and three service desks)?

2. Will our integrations, configuration, and customizations also be recovered, or will we need to rebuild those once our data is recovered?

I have received no response, and no human is even willing to acknowledge those questions. The service desk staff ignore them as if I never asked. Repeatedly.

Also, I've been asking around, and haven't been able to find a single story from somebody who was confirmed down and has since had their data recovered.

Reply


@flaviotsf 2 months

I recommend doing disaster recovery steps for your personal data as well, such as Gmail. At one point recently I was creating filters to delete bulk messages, and when the filter got created, it somehow missed the from:@xyz.com domain part and I ended up "delete forever"-ing all emails. I noticed the issue right away, but it was enough to wipe 2-3 months' worth of emails (all of them, even Sent ones).
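For what it's worth, here is a minimal sketch of one way to keep an offline copy of a Gmail account, pulling everything over IMAP with Python's standard library (Google Takeout is the zero-code alternative). It assumes IMAP access is enabled on the account and that GMAIL_USER / GMAIL_APP_PASSWORD are placeholder environment variables holding your address and an app password:

```python
import imaplib
import os

# Minimal sketch of an offline Gmail backup over IMAP. Assumes IMAP access is
# enabled on the account; GMAIL_USER / GMAIL_APP_PASSWORD are placeholders.
USER = os.environ["GMAIL_USER"]
PASSWORD = os.environ["GMAIL_APP_PASSWORD"]

imap = imaplib.IMAP4_SSL("imap.gmail.com")
imap.login(USER, PASSWORD)
imap.select('"[Gmail]/All Mail"', readonly=True)  # read-only: a backup must not mutate anything

_, data = imap.search(None, "ALL")
os.makedirs("gmail_backup", exist_ok=True)
for num in data[0].split():
    _, msg_data = imap.fetch(num, "(RFC822)")
    raw = msg_data[0][1]  # raw RFC822 bytes of the message
    with open(os.path.join("gmail_backup", f"{num.decode()}.eml"), "wb") as f:
        f.write(raw)

imap.logout()
```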

Reply


@Traster 2 months

I remember finding out one of the senior managers from my company ended up as head of software at Atlassian. It was at that point I was convinced Atlassian has no idea what the hell they're doing. I think this demonstrates the point nicely.

Reply


@Cederfjard 2 months

PSA, because I’m seeing a lot of JIRA in this thread: Since the 2017 rebranding, Jira is no longer officially written in all caps: https://community.atlassian.com/t5/Feedback-Forum-articles/A...

(You can argue how successful that was, given people are still using the old style in 2022.)

It also makes more sense, since Jira is not an acronym, it’s a truncation of Gojira, inspired by Bugzilla/Mozilla.

Reply


@Vaslo 2 months

I bet the Shitlassian guy is dancing and singing because of this.

Reply


@a-dub 2 months

i hate deleting things. prefer flags that hide things instead (like a boolean deleted flag in an rdbms table).

prevents data integrity issues in relational databases, makes debugging easier and prevents disasters.

ideally also include a timestamp, both for bookkeeping and for safe tooling that only removes things that have been soft-deleted for some time and can be removed without compromising the integrity of anything that is not deleted (this is especially important in relational data models)
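a minimal sketch of the pattern in Python with SQLite; the `issues` table and 30-day retention window are made up for illustration:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical single-table example of the soft-delete pattern described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE issues (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        deleted_at TEXT  -- NULL means the row is live
    )
""")
conn.execute("INSERT INTO issues (title) VALUES ('Fix login bug'), ('Update docs')")

# "Delete" = set the flag/timestamp; nothing is physically removed.
now = datetime.now(timezone.utc).isoformat()
conn.execute("UPDATE issues SET deleted_at = ? WHERE title = ?", (now, "Update docs"))

# Normal reads simply filter the flag out.
live = conn.execute("SELECT id, title FROM issues WHERE deleted_at IS NULL").fetchall()
print(live)

# A separate, deliberate purge job hard-deletes only rows that have been
# soft-deleted for longer than a retention window (e.g. 30 days).
cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).isoformat()
conn.execute("DELETE FROM issues WHERE deleted_at IS NOT NULL AND deleted_at < ?", (cutoff,))
conn.commit()
```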

Reply


@yabones 2 months

What's a good Jira replacement? Redmine? Phabricator? OpenProject? Just leaving the jira server alone and hoping there's no new and exciting zero-days? One thing is clear, these guys are a bunch of cowboys who can't be trusted with any amount of data.

Reply


@jacquesm 2 months

I suspect - pure speculation - that they can't restore the backups, because if they could then they could easily do this in a way that the affected accounts could be restored selectively. In other words: test your backups; if you don't, they won't be there for you when you need them.

Reply


@digital79 2 months

Wow, this might top the Slack outage. Who knew a company this large could take this long to resolve the issue?

Reply


@Cthulhu_ 2 months

It kinda reads like their users' data is not separated very cleanly. I've never worked at a SaaS before, but reading this, especially given the size of some customers, I'd want each customer to have their own independent instance, with its own backup pipeline. I was thinking of "just" giving them their own database, but there have been plenty of instances where authentication got botched, allowing one user to see another user's data; this should be impossible if things are running on their own instances.

Note that I'm pretty naïve and armchair on this subject, I'll see myself out.

Reply


@ordiel 2 months

All I can say as an Atlassian Server products user is that the moment they said it was Cloud or nothing, I chose nothing.

I'd much rather run Gitea on a Raspberry Pi that I CONTROL than feel the impotence of being able to do nothing for more than a week. Plus, having worked at cloud companies and having been asked to "collect customer data" to hand it over to the government, I would NEVER move critical pieces to anyone else's infra...

(Note: I am not supporting crime, but I'd rather have privacy and criminals than live under an authoritarian regime where a dictator who knows everything about everyone keeps the "peace"... Yes, I am looking at you, China!)

If mistakes will be made, at least I won't pay others to make them for me...

Reply


@kache_ 2 months

return to monke

vi your todolists on an ec2 box

Reply


@N19PEDL2 2 months

What are good alternatives to Jira and Confluence?

Asking for a friend.

Reply


@linsomniac 2 months

Honest question here: The companies impacted by this, are they not taking backups of their Jira/Confluence/Bitbucket instances? Or is this outage impacting the ability to import those backups?

There are some Python scripts that will back up Jira and Confluence. I whipped up a quick script that gets a list of all our bitbucket repos and then it clones those daily as well.
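Something along these lines, as a rough sketch rather than the commenter's actual script: page through the Bitbucket Cloud 2.0 repositories endpoint for a workspace and mirror-clone (or fetch) each repo. The workspace name, backup directory, and BITBUCKET_USER / BITBUCKET_APP_PASSWORD environment variables are placeholders:

```python
import os
import subprocess
import requests

# Placeholders: your Bitbucket workspace and an app password with repo read access.
WORKSPACE = "my-workspace"
USER = os.environ["BITBUCKET_USER"]
APP_PASSWORD = os.environ["BITBUCKET_APP_PASSWORD"]
BACKUP_DIR = "/backups/bitbucket"

url = f"https://api.bitbucket.org/2.0/repositories/{WORKSPACE}?pagelen=100"
while url:
    resp = requests.get(url, auth=(USER, APP_PASSWORD))
    resp.raise_for_status()
    page = resp.json()
    for repo in page["values"]:
        slug = repo["slug"]
        target = os.path.join(BACKUP_DIR, f"{slug}.git")
        clone_url = f"https://{USER}:{APP_PASSWORD}@bitbucket.org/{WORKSPACE}/{slug}.git"
        if os.path.exists(target):
            # Repo already mirrored: just fetch updates.
            subprocess.run(["git", "--git-dir", target, "fetch", "--prune"], check=True)
        else:
            # First run: create a bare mirror clone.
            subprocess.run(["git", "clone", "--mirror", clone_url, target], check=True)
    url = page.get("next")  # pagination: follow the 'next' link until exhausted
```

Run it from cron daily and you have a crude but independent copy of every repo.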

Reply


@passerby1 2 months

Does anyone seriously consider switching from Jira/Confluence to some alternative after this?

I personally stopped using Jira a couple of years ago in projects I lead.

Reply


@mc4ndr3 2 months

They never heard of beta testing, rolling updates, infrastructure as code, federation, customer isolation, or Public Relations. What the heck.

Reply


@parentheses 2 months

A case for reducing complexity of software. Also, given the recent GitHub incident spree, it's almost debilitating. The entire tech industry takes a hit when companies like these fail at operations.

Reply


@fargle 2 months

The lesson learned is that outsourcing at the level of containers or machines and raw compute in the cloud is one thing. It's a pretty fungible open market.

But outsourcing your whole engineering environment to a SaaS on a cloud is just freakin' lunacy. Not only do you get things like this outage, there are also simpler problems, like features and versions of the apps changing all the time with no ability to control that. What if they remove or change a feature you use?

And expensive, vendor-locked-in, closed tools have no place in a modern software workflow anyway, on-prem let alone SaaS. Look at the rug-pull on the on-prem Atlassian Server product.

Reply


@oldshatterhand 2 months

Random guess: this is a "we say we make backups, but we actually take snapshots" issue :)

Reply


@luckydata 2 months

so this is the end of Atlassian as a company right?

Reply


@knbrlo 2 months

My current employer uses Jira but we seem to have not been affected by this. Hopefully the customers who were affected are able to press Atlassian for improvements to notification time, backups, usability, etc.

Reply


@nitinagg 2 months

Selectively restoring data for only certain rows is super hard. But the communication from Atlassian has been the worst I have ever seen in the industry.

Reply


@abraae 2 months

This is extremely poor for a large SaaS company.

A standard RFP question for SaaS should be:

- Can you restore data for a single customer, and if so, what is the RTO for that operation?

A smaller SaaS could be excused for only thinking about full database restores. When you're a scrappy upstart, thinking about hypotheticals is less important than survival.

But for any decent size multi-tenanted SaaS, it's imperative that you have the ability to selectively restore individual customers.

The usual approach is to do a full database restore into a separate instance, then run your pre-prepared "restore customer" scripts to extract a single customer's data from there and pump it across to your prod instance. In Oracle, for example, you might use database links to give your restore code access to prod and the restore instance at the same time.
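As a toy illustration of that pattern (not Atlassian's actual setup), using SQLite in place of Oracle and a made-up `issues` table with a `tenant_id` column: attach the restored backup as a second database and copy only the affected tenant's rows into prod.

```python
import sqlite3

# Toy sketch: full backup restored into a staging DB file, then a single
# tenant's rows copied across into prod. Table/column names are hypothetical;
# a real multi-tenant schema would have many tables and foreign keys to walk
# in dependency order.
TENANT_ID = 42

prod = sqlite3.connect("prod.db")
prod.execute("ATTACH DATABASE 'restored_backup.db' AS staging")

# Remove whatever partial/corrupted data the tenant has in prod...
prod.execute("DELETE FROM issues WHERE tenant_id = ?", (TENANT_ID,))

# ...then copy that tenant's rows back from the restored backup.
prod.execute(
    "INSERT INTO issues SELECT * FROM staging.issues WHERE tenant_id = ?",
    (TENANT_ID,),
)
prod.commit()
prod.execute("DETACH DATABASE staging")
```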

Atlassian - MUST DO BETTER.

Reply


@1970-01-01 2 months

Interesting note: Atlassian stock (NASDAQ: TEAM) is up 4% as of noon today.

Reply


@mkl95 2 months

The fact that it's been so long and they still haven't revealed and explained the root cause of the outage is going to make it hard to regain trust in their buggy, slow tools. The bright side of the incident is that competitors who somewhat care about users have a unique opportunity to stand out.

Reply


@scurraorbis 2 months

Don't trust cloud providers with your core business functions. I'd go even further and say don't trust the cloud, period. I think the next big thing is going to be moving back on premise or private cloud as more businesses realize this.

Reply


@selimnairb 2 months

CTO should be fired.

Reply


@febeling 2 months

Reading this piece is kinda boring. As usual, the root cause is a design defect in their backup-restore functionality. And it's at a complexity level where any senior developer could have pointed out that it posed a fatal risk to the company.

My guess is many people inside knew about the problem, but corporate taboos made it impossible to discuss. I'd bet a fortune on this being the case.

Reply


@scottlamb 2 months

Gmail had a vaguely similar outage years ago. [1] tl;dr:

1. Different root cause. There was a bug in a refactoring of gmail's storage layer (iirc a missing asterisk caused a pointer to an important bool to be set to null, rather than setting the bool to false), which slipped through code review, automated testing, and early test servers dedicated to the team, so it got rolled out to some fraction of real users. Online data was lost/corrupted for 0.02% of users (a huge amount of email).

2. There were tape backups, but the tooling wasn't ready for a restore at scale. It was all hands on deck to get those accounts back to an acceptable state, and it took four days to get back to basically normal (iirc no lost mail, although some got bounced).

3. During the outage, some users could log in and see something frightening: an empty/incomplete mailbox, and no banner or anything telling them "we're fixing it".

4. Google communicated more openly, sooner, [2] which I think helped with customer trust. Wow, Atlassian really didn't say anything publicly for nine days?!?

Aside from the obvious "have backups and try hard to not need them", a big lesson is that you have to be prepared to do a mass restore, and you have to have good communication: not only traditional support and PR communication but also within the UI itself.

[1] https://static.googleusercontent.com/media/www.google.com/en...

[2] https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve...

Reply


@Aissen 2 months

The sad truth is that with 99.8% of customers unaffected, it was probably thought to be a minor issue. If those customers didn't have Gergely's ear we probably wouldn't have heard about it.

Reply


@nijave 2 months

Very ironic if you have a look at their values...

> Open company, no bullshit

> Don’t #@!% the customer

https://www.atlassian.com/company/values

Reply


@jgrahamc 2 months

Communicate directly and transparently

Yes. Always.

Reply


@politelemon 2 months

> it takes between 4 and 5 elapsed days to hand a site back to a customer.

Atlassian's SLA page says Premium Cloud Products get 99.9%.

That's about 43 minutes of downtime per month.

That works out to Atlassian not being allowed any more downtime for the next ~14 years. Are SLAs even real?

I'm being slightly facetious. From the page text it's just a threshold after which I think you're entitled to some money back for that month.
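The back-of-the-envelope math behind that ~14 years figure, assuming a 30-day month and roughly five days of downtime for an affected site:

```python
# 99.9% monthly SLA vs. a ~5-day outage (assumes a 30-day month).
allowed_downtime_per_month = 0.001 * 30 * 24 * 60   # 43.2 minutes of error budget
outage_minutes = 5 * 24 * 60                         # ~5 days to hand a site back
months_of_budget_used = outage_minutes / allowed_downtime_per_month
print(round(allowed_downtime_per_month, 1))          # 43.2
print(round(months_of_budget_used / 12, 1))          # ~13.9 years of budget consumed
```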

Reply


@hinkley 2 months

The longest Atlassian outage so far

Reply


@anshumankmr 2 months

I don't get it. JIRA is working for me.

Reply


@1970-01-01 2 months

Wouldn't you love to see the Atlassian internal JIRA epic for this outage?

Reply


@captaincaveman 2 months

A dumpster fire of a company that has terrible communication with customers outside of outages as well.

Reply


@snarkerson 2 months

> Most of them said they won’t leave the Atlassian stack, as long as they don’t lose data. This is because moving is complex and they don’t see a move would mitigate a risk of a cloud provider going down. However, all customers said they will invest in having a backup plan in case a SaaS they rely on goes down.

The real key lesson here. Your business is important to you. Not so much to the service provider.

Reply


@hougaard 2 months

Always judge companies on how they handle a crisis, not on how they do when everything runs smoothly.

Reply


@escot 2 months

When doing bulk deletes like this, what safeguards do you put in place, other than testing the script up/down in another environment, turning off app servers, etc. (which I'm guessing they did not do)?
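For illustration, a rough sketch (with a made-up `sites` table) of a few common guard rails: an explicit allow-list of target IDs, a hard cap on how many rows may be touched, and a dry-run default that only prints what would be deleted:

```python
import sqlite3
import sys

# Hypothetical guard rails around a bulk delete: dry-run by default, an
# explicit allow-list of target IDs, and a hard cap before the script aborts.
TARGET_SITE_IDS = {101, 102, 103}   # explicit allow-list, never "everything matching X"
MAX_ROWS = 10_000                   # sanity cap: abort if the match is suspiciously large
APPLY = "--apply" in sys.argv       # the destructive path must be opted into

conn = sqlite3.connect("app.db")
placeholders = ",".join("?" * len(TARGET_SITE_IDS))
rows = conn.execute(
    f"SELECT id, site_id FROM sites WHERE site_id IN ({placeholders})",
    tuple(TARGET_SITE_IDS),
).fetchall()

print(f"Matched {len(rows)} rows for deletion; first few: {rows[:20]}")
if len(rows) > MAX_ROWS:
    sys.exit(f"Refusing to continue: {len(rows)} rows exceeds cap of {MAX_ROWS}")

if APPLY:
    conn.execute(f"DELETE FROM sites WHERE site_id IN ({placeholders})", tuple(TARGET_SITE_IDS))
    conn.commit()
    print("Deleted.")
else:
    print("Dry run only; re-run with --apply to delete.")
```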

Reply

