Disaster: Cloudflare has admitted that one of its engineers "stepped beyond the bounds of its policies" and throttled traffic to a customer's website – Website and API became unresponsive due to extensive throttling

UPDATED Cloudflare has admitted that one of its engineers stepped beyond the bounds of its policies and throttled traffic to a customer's website.

The internet-grooming outfit has 'fessed up to the incident and explained it started on February 2 when a network engineer "received an alert for a congesting interface" between an Equinix datacenter and a Cloudflare facility.

Cloudflare's post about the matter states that such alerts aren't unusual – but this one was due to a sudden and extreme spike of traffic and had occurred twice in successive days.

"The engineer in charge identified the customer's domain … as being responsible for this sudden spike of traffic between Cloudflare and their origin network, a storage provider," the post states. "Traffic from this customer went suddenly from an average of 1,500 requests per second, and a 0.5MB payload per request, to 3,000 requests per second (2x) and more than 12MB payload per request (25x)

As the spike created congestion on a physical interface, it impacted many Cloudflare customers and peers.

Cloudflare's automated remedies swung into action, but weren't sufficient to completely fix the problem.

An unidentified engineer "decided to apply a throttling mechanism to prevent the zone from pulling so much traffic from their origin."
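The post doesn't say what the throttling mechanism actually was; purely as an illustration, a common way to throttle a zone's origin pulls is a token bucket, where requests beyond a fixed rate are delayed rather than dropped (a minimal sketch with made-up parameters, not Cloudflare's tooling):

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: each request consumes a token; tokens refill at a fixed rate."""

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # sustained requests per second allowed through
        self.capacity = burst      # how much burst to tolerate before delaying
        self.tokens = burst
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block (i.e. delay the request) until a token is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # sleep until the next token accrues

# e.g. cap a noisy zone at 500 origin pulls per second with a burst allowance of 50
throttle = TokenBucket(rate_per_sec=500, burst=50)
```

Delaying requests this way relieves the congested link, but from the customer's side it shows up as slow responses and, eventually, timeouts – which matches the account below.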

A post to Hacker News that Cloudflare's post links to – and which The Register therefore assumes was posted by the throttled customer – states the throttle was applied without warning and caused the customer's site and API to become effectively unavailable due to slow responses leading to timeouts.

Cloudflare has issued a mea culpa for its decision to impose the throttle.

"Let's be very clear on this action: Cloudflare does not have an established process to throttle customers that consume large amounts of bandwidth, and does not intend to have one," wrote Cloudflare senior veep for production engineering Jeremy Hartman and veep for networking engineering Jérôme Fleury.

"This remediation was a mistake, it was not sanctioned, and we deeply regret it."

Cloudflare has promised to change its policies and procedures so this can't happen again – at least not without multiple execs signing off on it.

"To make sure a similar incident does not happen, we are establishing clear rules to mitigate issues like this one. Any action taken against a customer domain, paying or not, will require multiple levels of approval and clear communication to the customer," Hartman and Fleury state. "Our tooling will be improved to reflect this. We have many ways of traffic shaping in situations where a huge spike of traffic affects a link and could have applied a different mitigation in this instance."

The Hacker News post referenced above sparked a 300-plus comment conversation in which few authors have kind things to say about Cloudflare. Nor do various folks in some of the darker reaches of the web, where Cloudflare has often been accused of throttling traffic as a political act, given its track record of declining to serve sites that host hate speech.

Actually throttling a customer without warning will likely fuel theories that Cloudflare, like its Big Tech peers, is an activist organization that does not treat all types of speech fairly.

Hartman and Fleury promised that Cloudflare is re-writing its legalese to better explain what customers can expect. "We will follow up with a blog post dedicated to these changes later," the pair wrote.

The post does not mention what, if anything, happened to the engineer who applied the throttle. ®

Updated to add at 2350 UTC, February 9
Cloudflare contacted The Register with the following statement: "There were no punitive measures taken against anyone involved in this unfortunate incident. We have a blame-free culture at Cloudflare. People make mistakes. It's the responsibility of the organization to make sure that the damage from those mistakes is limited."

 
This is the first I've heard of this, so I had to look for another source to see what even went down.

Small SaaS banned by Cloudflare after 4 years of being paying customer

Something about crypto markets.

I find the second half interesting
but when I got approached by Cloudflare sales team I explicitly asked if I can still be on pay as you go/self server model and reply was: "Enterprise wise, that's up to you and you could likely get away with utilising self-serve as you go, but if you did choose to go enterprise (without R2) I might be able to have something approved in the xx/month range."
I would fully understand that I am required to upgrade, but why not sending me an email before shutting down my business completely? I even asked about such scenario on zoom meeting I had with their Sales and they said it will never happen - few weeks forward and here we are...anyways going back to replying to my customers emails regarding service outage.
Allegedly, he'd spoken to CF before and they said this would not happen. On a Zoom meeting, so presumably that was a human and not a chat bot. And he asked if he should pay more for his usage. They made it sound like it didn't matter instead of being honest and saying "upgrading would be wise".

Now, after the incident he indicates a willingness to pay for a higher usage tier, if only they'd help him maintain uptime by telling him when he needs to fix something.

If I take his account at face value, maybe it's a little the engineer's fault, but it's also a lot the sales team's fault for hooking him up with an inappropriate plan for his use case. I would totally understand that for a self-serve automated system, but if they were in a Zoom call actually speaking as humans, then I'd expect sales to either (depending on level of cynicism):
  1. Help the customer figure out what the right plan for them is, based on needs and wants, or
  2. Ignore what the customer needs and constantly try to upsell them
And somehow they managed to fail to even over-aggressively push the upsells.
 
So a rogue employee took advantage of a system that they don't have set up to attack a customer's web service, which is a "serious issue", then gets a pat on the back from a blame-free culture.


That'll learn him. Surely there aren't any other "rogue employees" weaponizing a horrifically powerful company repercussion-free.


One employee sure managed to do it pretty quickly, completely on his own

"Unfortunate"
Seems like a microcosm of corporate organisation. Nobody who makes the decisions is affected by the long-term success of the company, because they don't own it; the hedgies playing with your money do. The only consequences are reputation-based.
 
"There's 0 punishment, we don't blame him, anyone can do whatever they want to our customers we don't care"
It's one thing not to overreact to an honest mistake or a situation coming from inexperience/lack of adequate training and instead focus on making sure something doesn't happen again. That didn't happen here, though. While I might agree that throttling service is appropriate in certain scenarios, having it done by someone who lacks the authority to do it on their own appears to be the concern in the OP. Anyone else who oversteps their authority and fucks things up for a client would likely be taken to task at minimum and fired if the screw-up was serious/egregious enough.

However, this is unsurprising coming from CF; they speak out of both sides of their mouth ever since they reversed course and dropped the 'Farms.
 
I appreciate how they make us sound cooler than we actually are.
Speak for yourself, choom. This is me.
 
Frankly, what the customer thinks probably doesn't matter. I'm sure the TOS and the like give CF the ability to do things like this, but this quote is clearly trying to calm down/reassure the customer(s):


Why would there be punishment? The employee protected CF first and foremost during an active incident. It's not like the customer is going to go elsewhere in 2023; KF only did because Null was given zero choice in the matter.
This is not true. Cloudflare has lots of competitors if you are willing to pay. Their niche is the free tier they offer, which they use to get you hooked on their services without rewrites.
 
This reads like basic triage on the fly: if the attack was large enough to start affecting other customers' shit, then it only makes sense to throttle the offender down. Obviously this site has its issues with Cloudflare, but what this engineer did is just common-sense network admin stuff.
I assume Cloudflare is all-in on globohomo bullshit like ISO9000, so this should be something that has an SOP/documentation: essentially a plain-language worksheet outlining the authority of any technician granted a corresponding power in terms of privileges. I know any half-decent system will allow them to have actions trigger automated communications with a direct reference number, i.e. "Shit went down, here's what we did, call this number and give them this reference and they can tell you more." Even if some GodAdmin swoops in and stops the communication, that stoppage would be reflected in the logs. If you've constructed the procedures right, there isn't a need for a "blameless" culture. It's either above-board or gross misconduct, no room for a mistake. The latter obviously isn't good, but there's only so much you can do proactively.
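A minimal sketch of that kind of workflow, assuming a hypothetical internal tool (none of these names are Cloudflare's): the privileged action, the log entry, and the customer notification with a reference number all happen in one step, so suppressing the message afterwards still leaves a trail:

```python
import logging
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

log = logging.getLogger("privileged-actions")

@dataclass
class ActionRecord:
    reference: str   # the number the customer quotes to support to pull up the full record
    actor: str
    action: str
    target: str
    timestamp: str

def apply_privileged_action(actor: str, action: str, target: str, notify) -> ActionRecord:
    """Apply a privileged action: log it and notify the affected customer in the same step.

    `notify` is a caller-supplied sender (email, ticket, etc.); even if someone
    suppresses the outgoing message afterwards, the log line below already exists.
    """
    record = ActionRecord(
        reference=uuid.uuid4().hex[:12],
        actor=actor,
        action=action,
        target=target,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    log.warning("action=%s target=%s actor=%s ref=%s", action, target, actor, record.reference)
    notify(target, f"Action '{action}' was applied to your zone; reference {record.reference}.")
    return record
```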
This is the first I've heard of this, so I had to look for another source to see what even went down.

Small SaaS banned by Cloudflare after 4 years of being paying customer

Something about crypto markets.

I find the second half interesting

Allegedly, he'd spoken to CF before and they said this would not happen. On a Zoom meeting, so presumably that was a human and not a chat bot. And he asked if he should pay more for his usage. They made it sound like it didn't matter instead of being honest and saying "upgrading would be wise".

Now, after the incident he indicates a willingness to pay for a higher usage tier, if only they'd help him maintain uptime by telling him when he needs to fix something.

If I take his account at face value, maybe it's a little the engineer's fault, but it's also a lot the sales team's fault for hooking him up with an inappropriate plan for his use case. I would totally understand that for a self-serve automated system, but if they were in a Zoom call actually speaking as humans, then I'd expect sales to either (depending on level of cynicism):
  1. Help the customer figure out what the right plan for them is, based on needs and wants, or
  2. Ignore what the customer needs and constantly try to upsell them
And somehow they managed to fail to even over-aggressively push the upsells.
Death to all sales teams and scope-of-work writers. Also, archive fucking everything. "Not a problem" and "that won't happen" sound comforting, but lacking an MP3 to attach to an email of them saying that, it's really meaningless in the long run. It's no guaranteed victory and it cannot do the impossible, but it's usually a card that precludes outright losing. Even if it is ruled inadmissible in whatever litigation, that doesn't make it not public knowledge that it actually happened. It's the kind of evidence of an entirely unforced error that big stockholders can easily have queued up for the next conference call. Although I can say from experience that if it's a card you have to play, expect a Pyrrhic victory.
 
So a rogue employee took advantage of a system that they don't have set up to attack a customer's web service, which is a "serious issue", then gets a pat on the back from a blame-free culture.


That'll learn him. Surely there aren't any other "rogue employees" weaponizing a horrifically powerful company repercussion-free.


One employee sure managed to do it pretty quickly, completely on his own

"Unfortunate"
Nah, this whole article is corpo-speak for:

We did something naughty, but we thought everyone would be on our side. Now that there are real consequences, we can't let the company take a hit, so we're going to make a scapegoat and give them a cushy pillow to fall on.
 
Turns out when you engage in politically and activist-motivated actions, people stop giving you any benefit of the doubt. In a neutral organization known for not fucking around, nobody would bat an eye at this – it'd just be dismissed as the left hand not talking to the right, standard corporate fuckups. Now, every single hiccup in services is going to get a high level of public scrutiny, since folks are left wondering "was it a hiccup, or an attack?"

These people never realize how valuable the benefit of the doubt is, right up until it's gone. And it's not coming back.
 
Weasel words. They're happy the guy didn't let the infection spread, but they have to do PR over the fact that they let a customer eat shit.

Welcome to reality: do you think all of those websites promising the five nines actually achieve such a thing? What is going to happen is that CF will rewrite the contracts with a statement that they can isolate or degrade you if they feel it's required to protect the entire business and its customers – if it's not already in the contract, which it almost certainly would be.

Yeah, I think that's why I'm a bit confused by this article's framing. So they throttled bandwidth to protect their own network, because Cloudflare obviously is not invincible. I'm sure if you were a large government, you'd have the network resources to nuke Cloudflare off the internet.

Are there any other articles that cover this that seem a bit more neutral?
 
I could be wrong but this just oozes of damage control.

"Oh no no no, it wasnt our fault but it was the fault of a nameless rogue employee! We arent normally like that, believe us, plz!"
 
I could be wrong but this just oozes of damage control.

"Oh no no no, it wasnt our fault but it was the fault of a nameless rogue employee! We arent normally like that, believe us, plz!"
Wouldn't surprise me if it was actually true. Engineers tend to develop delusions of godhood once they end up in critical systems.
 