Cloudflare is struggling with another outage - here's what to know - Matthew Prince’s worst nightmare came true

Cloudflare, one of the top web and internet services businesses, is having trouble again. In this storm, the Cloudflare Dashboard and its related application programming interfaces (APIs) have gone down. The silver lining is that these issues are not affecting the serving of cached files via the Cloudflare Content Delivery Network (CDN) or the Cloudflare edge security features. No, those went down last week.

On Oct. 30, Cloudflare rolled out a failed update to its globally distributed key-value store, Workers KV. The result was that all of Cloudflare's services were down for 37 minutes. Today's problem isn't nearly as serious, but it's been going on for several hours.
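For context, Workers KV is the eventually consistent key-value store that Workers scripts read and write at the edge, which is why a bad rollout there touches so much surface area. A minimal sketch of how a Worker uses it, with an assumed binding name and keys (nothing here is Cloudflare's actual config):

```typescript
// A minimal sketch of a Worker reading and writing Workers KV.
// Types (KVNamespace, Request, Response) come from @cloudflare/workers-types;
// the binding name CONFIG and the keys are made up for illustration.
export interface Env {
  CONFIG: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Reads hit a globally replicated, eventually consistent store.
    const banner = await env.CONFIG.get("site-banner");

    // Writes propagate asynchronously to Cloudflare's edge locations,
    // which is why a broken KV rollout can ripple across every region.
    await env.CONFIG.put("last-seen", new Date().toISOString());

    return new Response(banner ?? "no banner configured");
  },
};
```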

As of 7:15 PM Eastern time, Cloudflare reported that it was "seeing gradual improvement to affected services."

Still, it's bad enough. Cloudflare disclosed that the snags have affected a slew of products at the data plane and edge level. These include Logpush, WARP / Zero Trust device posture, Cloudflare dashboard, Cloudflare API, Stream API, Workers API, and Alert Notification System.

Other programs are still running, but you can't modify their settings. These are Magic Transit, Argo Smart Routing, Workers KV, WAF, Rate Limiting, Rules, WARP / Zero Trust Registration, Waiting Room, Load Balancing and Healthchecks, Cloudflare Pages, Zero Trust Gateway, DNS Authoritative and Secondary, Cloudflare Tunnel, Workers KV namespace operations, and Magic WAN.
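To make the control plane vs. data plane split concrete: changing any of those settings ultimately goes through Cloudflare's v4 REST API, which was among the things that fell over, while cached traffic at the edge kept flowing. A rough sketch of that kind of call (the token handling and logging are placeholders, not Cloudflare's tooling):

```typescript
// Sketch of a control-plane call: list zones via the Cloudflare v4 REST API.
// During the incident, calls like this failed or timed out, so settings for
// WAF, Load Balancing, etc. could not be changed; edge traffic didn't care.
async function listZones(apiToken: string): Promise<void> {
  const res = await fetch("https://api.cloudflare.com/client/v4/zones", {
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
  });

  if (!res.ok) {
    throw new Error(`Cloudflare API unavailable: HTTP ${res.status}`);
  }

  const body = await res.json();
  console.log("zones visible to this token:", body.result?.length ?? 0);
}
```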

Cloudflare failures are a big deal. As John Engates, Cloudflare's field CTO, recently tweeted, "Cloudflare processes about 26 million DNS queries every SECOND! Or 68 trillion/month. Plus, we blocked an average of 140 billion cyber threats daily in Q2'23."

The root cause of these problems is a data center power failure combined with a failure to shift services from the affected data centers to those still functioning.

Late in the day, Cloudflare gave ZDNET a fuller explanation of what happened:

We operate in multiple redundant data centers in Oregon that power Cloudflare's control plane (dashboard, logging, etc). There was a regional power issue that impacted multiple facilities in the region. The facilities failed over to generator power overnight. Then, this morning, there were multiple generator failures that took the facilities entirely offline. We have failed over to our disaster recovery facility and most of our services are restored. This data center outage impacted Cloudflare's dashboards and APIs, but it did not impact traffic flowing through our global network. We are working with our data center vendors to investigate the root cause of the regional power outage and generator failures. We expect to publish multiple blogs based on what we learn and can share those with you when they're live.
Cloudflare is still working to resolve this problem. But, since the problem was with data center power outages rather than its software, solving it may be outside its control. Hang in there, folks. Fixing this may take a while.

Trust me, Cloudflare really wants to fix this as soon as possible. Cloudflare's earnings call is today.

-

By the way, unrelated, if you check cloudflarestatus.com right now, A LOT is affected. A LOT. Ouch.
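If you'd rather not eyeball the page, cloudflarestatus.com appears to be a standard Atlassian Statuspage instance, and those expose a JSON components feed. A quick sketch (the URL path and field names assume the stock Statuspage v2 API, not an official Cloudflare endpoint):

```typescript
// List every Cloudflare status-page component that isn't "operational".
// Assumes the stock Statuspage v2 endpoint; not an official Cloudflare API.
interface StatusComponent {
  name: string;
  status: string; // "operational", "degraded_performance", "partial_outage", "major_outage"
}

async function listDegradedComponents(): Promise<void> {
  const res = await fetch("https://www.cloudflarestatus.com/api/v2/components.json");
  const data = (await res.json()) as { components: StatusComponent[] };

  const affected = data.components.filter((c) => c.status !== "operational");
  console.log(`${affected.length} components not operational:`);
  for (const c of affected) {
    console.log(`- ${c.name}: ${c.status}`);
  }
}

listDegradedComponents();
```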

The problem is that the further you split your redundancy, the less bandwidth you have between the datacenters, the more that bandwidth costs, and the higher the latency.

If they're close enough you just treat them as if they were the same datacenter for setup purposes.
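Back-of-the-envelope numbers for the latency half of that tradeoff (the fiber speed and distances below are illustrative, not anyone's real topology):

```typescript
// Light in fiber covers roughly 200 km per millisecond (~2/3 of c), so every
// kilometer of separation between "redundant" sites adds round-trip delay
// that synchronous replication has to absorb.
const FIBER_KM_PER_MS = 200;

function minRoundTripMs(distanceKm: number): number {
  return (2 * distanceKm) / FIBER_KM_PER_MS;
}

// Same metro vs. real geographic separation:
console.log(minRoundTripMs(30));   // ~0.3 ms - behaves like one datacenter
console.log(minRoundTripMs(1500)); // ~15 ms - every synchronous write pays this
```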

Sadly, proper wide area redundancy requires actual engineering competency. So it's out of reach of tiny companies like Cloudflare, Google and AWS.

It's almost time for the annual AWS US-East Black Friday outage anyway.
 
The root cause of these problems is a data center power failure combined with a failure to shift services from the affected data centers to those still functioning.
This outage completely fucked up DreamHost too. Apparently the datacenter (Flexential PDX02) lost utility power and shit hit the fan while switching to generators.

From a Flexential customer on the outages mailing list
We're in Flexential PDX02, I have guys on-site now.

Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet). Some utility power is back, so recovery is in progress for some portion of the site. I still have ~70 racks without power....but things are coming back to life slowly.
Cloudflare published a thing on their blog about this https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/

They're pissed that Flexential didn't give them a heads up that something was going on, else they reckon they would've started moving things away from the location immediately. Apparently the stated 10 minutes worth of UPS battery life was probably closer to 4 minutes and the on site techs were struggling with the generators as the ground fault which caused the utility failure also tripped the breakers on the generators.

Normally the generators are fired up automatically but because it required human intervention and they didn't have as much time as they should've, they completely lost power to the whole facility. I'm doubtful Cloudflare could've shifted their workloads even in the event of somebody notifying them the instant utility power was lost, 4 minutes is barely enough time to really do anything.
At 12:48 UTC, Flexential was able to get the generators restarted. Power returned to portions of the facility. In order to not overwhelm the system, when power is restored to a data center it is typically done gradually by powering back on one circuit at a time. Like the circuit breakers in a residential home, each customer is serviced by redundant breakers. When Flexential attempted to power back up Cloudflare's circuits, the circuit breakers were discovered to be faulty. We don't know if the breakers failed due to the ground fault or some other surge as a result of the incident, or if they'd been bad before and it was only discovered after they had been powered off.
There is a God and he hates Cloudflare :story:
 
Normally the generators are fired up automatically but because it required human intervention and they didn't have as much time as they should've, they completely lost power to the whole facility. I'm doubtful Cloudflare could've shifted their workloads even in the event of somebody notifying them the instant utility power was lost, 4 minutes is barely enough time to really do anything.
Again, I'm not an expert in this shit, but isn't this kinda shit supposed to be tested on a regular basis?

Yeah, I know. The answer is gross incompetence and laziness, I just hate how people who do fuck all get rewarded despite their shittiness.
 
Turns out only retaining unstable troon engineers leads to problems. Who on God’s green earth would want to work for and support such a disgusting company and person.
Don’t forget about all the monkey crush, zoosadists (animal rapists and torturers), terrorists like Isis and various other illegal content that Cloudflare protects but at least it’s not that horrible US Legal Websoge which critiques and laughs at regards on the Internet is a big no-no.
 
Again, I'm not an expert in this shit, but isn't this kinda shit supposed to be tested on a regular basis?
In all honesty - not really.

You can theorycraft stuff, you can make test environments, but very often you cannot test in a production environment. Flexential is just one of many colocations - think of it like an apartment complex for servers. In order to get some level of "this could have an impact on uptime" testing done - you would need all of the tenants to sign off on it (they will never do so).

If Cloudflare doesn't like it - they can host their own equipment in their own facility.
 
While we're on the topic of "redundancies", these generators... any of us in the US, and even outside of it, can go buy a generator right now that will power on and do a self-test every month to ensure the fucking thing works, and can even text you saying "Hey, I'm working."
Its in Oregon though, for all we know meth heads either stole the fuel for the generators, the generators or both. Kinda weird that the power going down at one of their many centers fucked them so bad. Guess that's what happens when you hire stinkditches to do your rollover code :story:
 
Its in Oregon though, for all we know meth heads either stole the fuel for the generators, the generators or both. Kinda weird that the power going down at one of their many centers fucked them so bad. Guess that's what happens when you hire stinkditches to do your rollover code :story:
More likely the copper or aluminium wires for the grounding. They seem to think "Ground Wires"==Safe to steal. Until that ground wire is actually carrying current like in a substation.
 
The post mortem they posted got on to hacker nudes and almost everyone was ripping them an extra ass for spending 75% of it blaming the data center; cause who gives a fuck why the DC went down, you’re supposed to survive that.

Multihomed redundancy shit isn't shit if you don't test it, and if you're scared to test it, it's doubly fucked.
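In practice, "testing it" can start as small as a scheduled probe that exercises the standby path instead of only the primary, so you learn it's broken before the primary goes dark. A minimal sketch, with made-up endpoint URLs and console output standing in for real alerting:

```typescript
// Probe both the primary and the disaster-recovery control-plane endpoints.
// URLs are placeholders; a real drill would also exercise actual failover,
// not just health checks.
const ENDPOINTS: Record<string, string> = {
  primary: "https://control.primary.example.com/healthz",
  disasterRecovery: "https://control.dr.example.com/healthz",
};

async function isHealthy(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
    return res.ok;
  } catch {
    return false;
  }
}

async function runDrill(): Promise<void> {
  for (const [name, url] of Object.entries(ENDPOINTS)) {
    if (!(await isHealthy(url))) {
      // In a real setup this would page someone.
      console.error(`DRILL FAILED: ${name} (${url}) is not answering`);
    } else {
      console.log(`${name} OK`);
    }
  }
}

runDrill();
```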
 
Ok I'll bite on Little Mattel Princess's post mortem.

The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted.
Are they all on the same utilities? Not just electrical as was the case here but water, wastewater, stormwater, etc. Might want to look into that.

The three data centers are independent of one another, each have multiple utility power feeds...
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time....
It would also have been possible for Flexential to run the facility only from the remaining utility feed...
At approximately 11:40 UTC, there was a ground fault on a PGE transformer at PDX-04. Unfortunately, in this case, the protective measure also shut down all of PDX-04’s generators.
As far as I can tell, because the explanation is cloudy (what do I expect from someone who probably never turned a wrench in his life), the independent feeds go to the same switchgear. I'm guessing, but when one feed lost power the generators automatically started up (I've worked at industrial sites and this is usually the case). Something happened to the second line that caused the GFIs to trip and the generators to shut down. Which is a plausible scenario that you should have identified or required mitigation for from Flexential. Or, if they did manually switch the generators on - once again, identify and mitigate.

This leads to the next point where you try to blame Flexential for the failure chain (theoretically the generators would've kicked on after the failure of the second utility feed rather than the first = no loss of power).

We have been unable to locate any record of Flexential informing us about the DSG program.
That's funny, because it's listed in one of their air permits - wow, so hard to find. I have to confirm the site; the three permitted sites Flexential has don't list whatever PDX-04 is. However, this permit is for a site with 10 emergency generators. And it begs the question: even if this isn't the PDX-04 site, if they have this for one of their data centers, do they have similar agreements for the others?


When Flexential attempted to power back up Cloudflare's circuits, the circuit breakers were discovered to be faulty.
Were they doing IR inspections of the breakers? Might want to look into that.

So in essence Mr. Mattel Princess - your entire auditing program as a customer of the data center appears to be insufficient. Maybe spend less time trying to police the internet and fix that instead.
 
I have to confirm the site; the three permitted sites Flexential has don't list whatever PDX-04 is.
Confusingly, Cloudflare referred to it as PDX-04, DreamHost said PDX-01, but people online are saying it was PDX-02. Either way, you can find the addresses for each site on their website. https://www.flexential.com/data-centers/or/portland
PDX-01 3935 NE Aloclek Place
PDX-02 5737 NE Huffman Street
PDX-03 5419 NE Starr Blvd
PDX-04 4915 NE Starr Blvd
So in essence Mr. Mattel Princess - your entire auditing program as a customer of the data center appears to be insufficient. Maybe spend less time trying to police the internet and fix that instead.
Too busy burying himself knee deep in tranny gash to do his due diligence. Apparently Flexential is known to be pretty shit. From Brad Chapman on the outages mailing list
Here's an insightful comment today from a redditor who claims to have worked at Flexential in the managed services side.

I looked him up on LinkedIn: he was in DevOps / SRE and worked there a total of 6 years.

Flexential sounds like a shit show, but he's also surprised that Cloudflare chose this place. Either they didn't research it, or Flexential was doing a terrific snow job on them.

-Brad
Well, I'm not surprised in the slightest.

I didn't work on the Colo side -- I was in Managed Services, but our entire platform lived on top of the Colo stuff, so I had plenty of interaction with those folks.

Flexential is a company that was created out of mergers/acquisitions between ViaWest and Peak10 (both big Colo companies, the former mostly West of the Mississippi and the latter mostly East), and INetU, a Managed Services company with its own datacenter footprint. After the mergers, the company culture went from one of the best I had ever been a part of (at INetU) to dismally bad. All of the good employees left and it seemed like they only hired numbnutses to replace them.

Management is ineffective at best (toxic at worst), and routinely makes bad decisions. Most of the good fortune the company has is due to either pure accidents that end up somehow leading to success (think: Mr. Magoo), or because the C-suite schmoozed some customer executive at a fancy steakhouse into giving them their business.

Their profits have been very bad (if existent) for years and they pinch pennies wherever possible. Their infrastructure testing practices have gotten worse and worse as the good infrastructure people left the company and their replacements didn't know any better.

So yeah, I'm not surprised at all that they fucked up this badly -- I'm actually more surprised that Cloudflare put their core of operations at a Flexential datacenter.
 