Cloudflare is struggling with another outage - here's what to know - Matthew Prince’s worst nightmare came true

Cloudflare, one of the top web and internet services businesses, is having trouble again. In this storm, the Cloudflare Dashboard and its related application programming interfaces (APIs) have gone down. The silver lining is that these issues are not affecting the serving of cached files via the Cloudflare Content Delivery Network (CDN) or the Cloudflare edge security features. No, those went down last week.

On Oct. 30, Cloudflare rolled out a failed update to its globally distributed key-value store, Workers KV. The result was that all of Cloudflare's services were down for 37 minutes. Today's problem isn't nearly as serious, but it's been going on for several hours.
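For context, Workers KV is the eventually consistent key-value store that Workers scripts read and write at the edge, which is why a bad rollout there touches so much surface area. A minimal sketch of how a Worker uses it, with an assumed binding name and keys (nothing here is Cloudflare's actual config):

```typescript
// A minimal sketch of a Worker reading and writing Workers KV.
// Types (KVNamespace, Request, Response) come from @cloudflare/workers-types;
// the binding name CONFIG and the keys are made up for illustration.
export interface Env {
  CONFIG: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Reads hit a globally replicated, eventually consistent store.
    const banner = await env.CONFIG.get("site-banner");

    // Writes propagate asynchronously to Cloudflare's edge locations,
    // which is why a broken KV rollout can ripple across every region.
    await env.CONFIG.put("last-seen", new Date().toISOString());

    return new Response(banner ?? "no banner configured");
  },
};
```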

As of 7:15 PM Eastern time, Cloudflare reported that it was "seeing gradual improvement to affected services."

Still, it's bad enough. Cloudflare disclosed that the snags have affected a slew of products at the data plane and edge level. These include Logpush, WARP / Zero Trust device posture, Cloudflare dashboard, Cloudflare API, Stream API, Workers API, and Alert Notification System.

Other programs are still running, but you can't modify their settings. These are Magic Transit, Argo Smart Routing, Workers KV, WAF, Rate Limiting, Rules, WARP / Zero Trust Registration, Waiting Room, Load Balancing and Healthchecks, Cloudflare Pages, Zero Trust Gateway, DNS Authoritative and Secondary, Cloudflare Tunnel, Workers KV namespace operations, and Magic WAN.
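To make the control plane vs. data plane split concrete: changing any of those settings ultimately goes through Cloudflare's v4 REST API, which was among the things that fell over, while cached traffic at the edge kept flowing. A rough sketch of that kind of call (the token handling and logging are placeholders, not Cloudflare's tooling):

```typescript
// Sketch of a control-plane call: list zones via the Cloudflare v4 REST API.
// During the incident, calls like this failed or timed out, so settings for
// WAF, Load Balancing, etc. could not be changed; edge traffic didn't care.
async function listZones(apiToken: string): Promise<void> {
  const res = await fetch("https://api.cloudflare.com/client/v4/zones", {
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
  });

  if (!res.ok) {
    throw new Error(`Cloudflare API unavailable: HTTP ${res.status}`);
  }

  const body = await res.json();
  console.log("zones visible to this token:", body.result?.length ?? 0);
}
```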

Cloudflare failures are a big deal. As John Engates, Cloudflare's field CTO, recently tweeted, "Cloudflare processes about 26 million DNS queries every SECOND! Or 68 trillion/month. Plus, we blocked an average of 140 billion cyber threats daily in Q2'23."

The root cause of these problems is a data center power failure combined with a failure to shift services from the affected data centers to those still functioning.

Late in the day, Cloudflare gave ZDNET a fuller explanation of what happened:

We operate in multiple redundant data centers in Oregon that power Cloudflare's control plane (dashboard, logging, etc). There was a regional power issue that impacted multiple facilities in the region. The facilities failed over to generator power overnight. Then, this morning, there were multiple generator failures that took the facilities entirely offline. We have failed over to our disaster recovery facility and most of our services are restored. This data center outage impacted Cloudflare's dashboards and APIs, but it did not impact traffic flowing through our global network. We are working with our data center vendors to investigate the root cause of the regional power outage and generator failures. We expect to publish multiple blogs based on what we learn and can share those with you when they're live.
Cloudflare is still working to resolve this problem. But, since the problem was with data center power outages rather than its software, solving it may be outside its control. Hang in there, folks. Fixing this may take a while.

Trust me, Cloudflare really wants to fix this as soon as possible. Cloudflare's earnings call is today.

-

By the way, unrelated, if you check cloudflarestatus.com right now, A LOT is affected. A LOT. Ouch.
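If you'd rather not eyeball the page, cloudflarestatus.com appears to be a standard Atlassian Statuspage instance, and those expose a JSON components feed. A quick sketch (the URL path and field names assume the stock Statuspage v2 API, not an official Cloudflare endpoint):

```typescript
// List every Cloudflare status-page component that isn't "operational".
// Assumes the stock Statuspage v2 endpoint; not an official Cloudflare API.
interface StatusComponent {
  name: string;
  status: string; // "operational", "degraded_performance", "partial_outage", "major_outage"
}

async function listDegradedComponents(): Promise<void> {
  const res = await fetch("https://www.cloudflarestatus.com/api/v2/components.json");
  const data = (await res.json()) as { components: StatusComponent[] };

  const affected = data.components.filter((c) => c.status !== "operational");
  console.log(`${affected.length} components not operational:`);
  for (const c of affected) {
    console.log(`- ${c.name}: ${c.status}`);
  }
}

listDegradedComponents();
```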

The problem is that the further you split your redundancy, the less bandwidth you have between the datacenters, the more that bandwidth costs, and the higher the latency.

If they're close enough you just treat them as if they were the same datacenter for setup purposes.
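Back-of-the-envelope numbers for the latency half of that tradeoff (the fiber speed and distances below are illustrative, not anyone's real topology):

```typescript
// Light in fiber covers roughly 200 km per millisecond (~2/3 of c), so every
// kilometer of separation between "redundant" sites adds round-trip delay
// that synchronous replication has to absorb.
const FIBER_KM_PER_MS = 200;

function minRoundTripMs(distanceKm: number): number {
  return (2 * distanceKm) / FIBER_KM_PER_MS;
}

// Same metro vs. real geographic separation:
console.log(minRoundTripMs(30));   // ~0.3 ms - behaves like one datacenter
console.log(minRoundTripMs(1500)); // ~15 ms - every synchronous write pays this
```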

Sadly, proper wide area redundancy requires actual engineering competency. So it's out of reach of tiny companies like Cloudflare, Google and AWS.

It's almost time for the annual AWS US-East Black Friday outage anyway.
 
The root cause of these problems is a data center power failure combined with a failure to shift services from the affected data centers to those still functioning.
This outage completely fucked up DreamHost too. Apparently the datacenter (Flexential PDX02) lost utility power and shit hit the fan while switching to generators.

From a Flexential customer on the outages mailing list
We're in Flexential PDX02, I have guys on-site now.

Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet). Some utility power is back, so recovery is in progress for some portion of the site. I still have ~70 racks without power....but things are coming back to life slowly.
Cloudflare published a thing on their blog about this https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/

They're pissed that Flexential didn't give them a heads up that something was going on, else they reckon they would've started moving things away from the location immediately. Apparently the stated 10 minutes worth of UPS battery life was probably closer to 4 minutes and the on site techs were struggling with the generators as the ground fault which caused the utility failure also tripped the breakers on the generators.

Normally the generators are fired up automatically but because it required human intervention and they didn't have as much time as they should've, they completely lost power to the whole facility. I'm doubtful Cloudflare could've shifted their workloads even in the event of somebody notifying them the instant utility power was lost, 4 minutes is barely enough time to really do anything.
At 12:48 UTC, Flexential was able to get the generators restarted. Power returned to portions of the facility. In order to not overwhelm the system, when power is restored to a data center it is typically done gradually by powering back on one circuit at a time. Like the circuit breakers in a residential home, each customer is serviced by redundant breakers. When Flexential attempted to power back up Cloudflare's circuits, the circuit breakers were discovered to be faulty. We don't know if the breakers failed due to the ground fault or some other surge as a result of the incident, or if they'd been bad before and it was only discovered after they had been powered off.
There is a God and he hates Cloudflare :story:
 
Normally the generators are fired up automatically but because it required human intervention and they didn't have as much time as they should've, they completely lost power to the whole facility. I'm doubtful Cloudflare could've shifted their workloads even in the event of somebody notifying them the instant utility power was lost, 4 minutes is barely enough time to really do anything.
Again, I'm not an expert in this shit, but isn't this kinda shit supposed to be tested on a regular basis?

Yeah, I know. The answer is gross incompetence and laziness, I just hate how people who do fuck all get rewarded despite their shittiness.
 
Turns out only retaining unstable troon engineers leads to problems. Who on God’s green earth would want to work for and support such a disgusting company and person.
Don’t forget about all the monkey crush, zoosadists (animal rapists and torturers), terrorists like Isis and various other illegal content that Cloudflare protects but at least it’s not that horrible US Legal Websoge which critiques and laughs at regards on the Internet is a big no-no.
 
Again, I'm not an expert in this shit, but isn't this kinda shit supposed to be tested on a regular basis?
In all honesty - not really.

You can theorycraft stuff, you can make test environments, but very often you cannot test in a production environment. Flexential is just one of many colocations - think of it like an apartment complex for servers. In order to get some level of "this could have an impact on uptime" testing done - you would need all of the tenants to sign off on it (they will never do so).

If Cloudflare doesn't like it - they can host their own equipment in their own facility.
 
While we're on the topic of "redundancies", these generators... any of us in the US, and even outside of it, can go buy a generator right now that will power on and do a self-test every month to ensure the fucking thing works, and can even text you saying "Hey, I'm working."
Its in Oregon though, for all we know meth heads either stole the fuel for the generators, the generators or both. Kinda weird that the power going down at one of their many centers fucked them so bad. Guess that's what happens when you hire stinkditches to do your rollover code :story:
 
Its in Oregon though, for all we know meth heads either stole the fuel for the generators, the generators or both. Kinda weird that the power going down at one of their many centers fucked them so bad. Guess that's what happens when you hire stinkditches to do your rollover code :story:
More likely the copper or aluminium wires for the grounding. They seem to think "Ground Wires"==Safe to steal. Until that ground wire is actually carrying current like in a substation.
 
The post mortem they posted got on to hacker nudes and almost everyone was ripping them an extra ass for spending 75% of it blaming the data center; cause who gives a fuck why the DC went down, you’re supposed to survive that.

Multihomed redundancy shit isn't shit if you don't test it, and if you're scared to test it, it's doubly fucked.
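In practice, "testing it" can start as small as a scheduled probe that exercises the standby path instead of only the primary, so you learn it's broken before the primary goes dark. A minimal sketch, with made-up endpoint URLs and console output standing in for real alerting:

```typescript
// Probe both the primary and the disaster-recovery control-plane endpoints.
// URLs are placeholders; a real drill would also exercise actual failover,
// not just health checks.
const ENDPOINTS: Record<string, string> = {
  primary: "https://control.primary.example.com/healthz",
  disasterRecovery: "https://control.dr.example.com/healthz",
};

async function isHealthy(url: string): Promise<boolean> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
    return res.ok;
  } catch {
    return false;
  }
}

async function runDrill(): Promise<void> {
  for (const [name, url] of Object.entries(ENDPOINTS)) {
    if (!(await isHealthy(url))) {
      // In a real setup this would page someone.
      console.error(`DRILL FAILED: ${name} (${url}) is not answering`);
    } else {
      console.log(`${name} OK`);
    }
  }
}

runDrill();
```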
 
Ok I'll bite on Little Mattel Princess's post mortem.

The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted.
Are they all on the same utilities? Not just electrical as was the case here but water, wastewater, stormwater, etc. Might want to look into that.

The three data centers are independent of one another, each have multiple utility power feeds...
It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time....
It would also have been possible for Flexential to run the facility only from the remaining utility feed...
At approximately 11:40 UTC, there was a ground fault on a PGE transformer at PDX-04. Unfortunately, in this case, the protective measure also shut down all of PDX-04’s generators.
As far as I can tell, because the explanation is cloudy (what do I expect from someone who probably never turned a wrench in his life), the independent feeds go to the same switchgear. I'm guessing, but when one feed lost power the generators automatically started up (I've worked at industrial sites and this is usually the case). Something happened to the second line that caused the GFIs to trip and the generators to shut down. Which is a plausible scenario that you should have identified or required mitigation for from Flexential. Or, if they did manually switch the generators on - once again, identify and mitigate.

This leads to the next point where you try to blame Flexential for the failure chain (theoretically the generators would've kicked on after the failure of the second utility feed rather than the first = no loss of power).

We have been unable to locate any record of Flexential informing us about the DSG program.
That's funny, because it's listed in one of their air permits - wow, so hard to find. I have to confirm the site; the three permitted sites Flexential has don't list whatever PDX-04 is. However, this permit is for a site with 10 emergency generators. And it begs the question: even if this isn't the PDX-04 site, if they have this for one of their data centers, do they have similar agreements for the others?


When Flexential attempted to power back up Cloudflare's circuits, the circuit breakers were discovered to be faulty.
Were they doing IR inspections of the breakers? Might want to look into that.

So in essence Mr. Mattel Princess - your entire auditing program as a customer of the data center appears to be insufficient. Maybe spend less time trying to police the internet and fix that instead.
 
I have to confirm the site; the three permitted sites Flexential has don't list whatever PDX-04 is.
Confusingly, Cloudflare referred to it as PDX-04, DreamHost said PDX-01, but people online are saying it was PDX-02. Either way, you can find the addresses for each site on their website. https://www.flexential.com/data-centers/or/portland
PDX-01 3935 NE Aloclek Place
PDX-02 5737 NE Huffman Street
PDX-03 5419 NE Starr Blvd
PDX-04 4915 NE Starr Blvd
So in essence Mr. Mattel Princess - your entire auditing program as a customer of the data center appears to be insufficient. Maybe spend less time trying to police the internet and fix that instead.
Too busy burying himself knee deep in tranny gash to do his due diligence. Apparently Flexential is known to be pretty shit. From Brad Chapman on the outages mailing list
Here's an insightful comment today from a redditor who claims to have worked at Flexential in the managed services side.

I looked him up on LinkedIn: he was in DevOps / SRE and worked there a total of 6 years.

Flexential sounds like a shit show, but he's also surprised that Cloudflare chose this place. Either they didn't research it, or Flexential was doing a terrific snow job on them.

-Brad
Well, I'm not surprised in the slightest.

I didn't work on the Colo side -- I was in Managed Services, but our entire platform lived on top of the Colo stuff, so I had plenty of interaction with those folks.

Flexential is a company that was created out of mergers/acquisitions between ViaWest and Peak10 (both big Colo companies, the former mostly West of the Mississippi and the latter mostly East), and INetU, a Managed Services company with its own datacenter footprint. After the mergers, the company culture went from one of the best I had ever been a part of (at INetU) to dismally bad. All of the good employees left and it seemed like they only hired numbnutses to replace them.

Management is ineffective at best (toxic at worst), and routinely makes bad decisions. Most of the good fortune the company has is due to either pure accidents that end up somehow leading to success (think: Mr. Magoo), or because the C-suite schmoozed some customer executive at a fancy steakhouse into giving them their business.

Their profits have been very bad (if existent) for years and they pinch pennies wherever possible. Their infrastructure testing practices have gotten worse and worse as the good infrastructure people left the company and their replacements didn't know any better.

So yeah, I'm not surprised at all that they fucked up this badly -- I'm actually more surprised that Cloudflare put their core of operations at a Flexential datacenter.
 