The Year of Endless Technical Problems

Without taking a look and poking around, which of course can't happen, this can only be guesswork. So my shot-in-the-dark is: I'd take a close look at name resolution, especially DNS, to see if something's fucky in that pipeline somewhere. I've resolved issues like this that ended up being:
  • One bad DNS resolver in a group (a missing host, or mismatched namespaces)
  • Incorrect or non-existent DNS search domains (e.g. a non-existent internal namespace like .local)
  • Reverse DNS lookups occurring when they shouldn't be, or being unable to resolve a reverse lookup due to DNS partitioning
Yes, these managed to cause issues that didn't obviously read as DNS-related, such as sluggish per-letter SSH feedback. The intermittency in those cases came from multiple resolver paths, with hosts flipping between them. I've also seen similar problems from misconfigured time (NTP) settings, where services were doing secure negotiations using differing perceptions of what the current time is.
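A rough sketch of how I'd start probing for that, assuming dig and chrony/timesyncd are on the box (the IPs and hostname here are placeholders):

  # Compare answers and latency from each configured resolver
  dig @192.0.2.1 some-internal-host +stats | grep 'Query time'
  dig @192.0.2.2 some-internal-host +stats | grep 'Query time'
  # Check which search domains and nameservers the host actually uses
  grep -E '^(search|nameserver)' /etc/resolv.conf
  # Test a reverse lookup for a host that's been slow to reach
  dig -x 192.0.2.50 +stats
  # Sanity-check clock offset while you're at it
  chronyc tracking 2>/dev/null || timedatectl timesync-status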

Again, since we're necessarily limited on information, this can only be guesswork. Knowing whether you connect over SSH by name or by IP would cut down the possibilities, same with how internal services reach each other. This is also where being a one-man shop is very tricky, since the issue might just be 'the way null has always set up <small thing> because it's how he learned to do it' that turns out to be causing a cascade of downstream weirdness.

Separately:
You haven't mentioned it, so I'm going to assume, hopefully incorrectly, that you are running on bare metal. If you do not already have a hypervisor host somewhere in this environment, with at least test/dev replicas of the production environment on it if not production itself, please take the time to build one. It doesn't need to be beefcake hardware; you could even run the hypervisor on main alongside the production workloads if that's the only place with spare compute/memory. Having this lets you experiment on, troubleshoot, clone, back up, cut over, and even download to run locally, entire environments. Plus, in this instance, if you see the same issues in a virtualized environment, you can rule out the underlying hardware. Keeping the physical and logical layers of your system separated is worthwhile, and with modern virtualization the overhead is genuinely minimal (around 1%).
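For what it's worth, getting a disposable replica with KVM/libvirt is roughly this much work (the guest names are made up; assumes virt-clone from the virt-manager tooling, with the source guest shut down first):

  # Clone an existing guest, letting libvirt pick storage paths for the copy
  virt-clone --original prod-web --name test-web --auto-clone
  # Start the clone (attach it to an isolated network first so it
  # can't collide with production)
  virsh start test-web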

Edit: Are you seeing any issues when connecting to the IPMI for the host, assuming you have access to that?
 
Am I the only one that doesn't have issues? The site might run a little slow in maybe 1 in 50 page loads. Pretty much every other time it's under 5 seconds. Are modern tech fags so goldfish-brained they can't deal with a 5-second load time? Is this the Great Filter?
 
Am I the only one that doesn't have issues? The site might run a little slow in maybe 1 in 50 page loads. Pretty much every other time it's under 5 seconds. Are modern tech fags so goldfish-brained they can't deal with a 5-second load time? Is this the Great Filter?
Consistent random 504s on mobile and desktop are the issue: attachment uploads failing, posts erroring and timing you out for 30 seconds before you can try again. The issues are real, and they aren't as simple as "load times".
 
Feds are mirroring the drives, we are comped, it's over bros. (Kidding)

If I'm reading correctly, are you saying disk write speeds are slow? Compared to what you'd expect? Wasn't there a thing with enterprise-grade NVMe drives shitting the bed?

Does lspci -nvv show the true link speeds? I'm certain you've done the usual TRIM stuff and firmware updates, so I won't insult your intelligence.

Are they running cool? I know NVMe drives tend to slow down when saturated with heat.

Check that the drives aren't from the batches of enterprise drives that flooded the second-hand market a few years ago because of a factory/manufacturing issue.
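Something like this would confirm both the link-speed and the thermal angle, assuming pciutils and nvme-cli are installed (the PCI address and device path are examples, substitute your own):

  # Compare the negotiated link speed/width against what the card supports
  sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'
  # Controller temperature plus any logged thermal-throttle counters
  sudo nvme smart-log /dev/nvme0 | grep -iE 'temperature|thermal'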

As for slow SSH: I've read loads of crap about power saving etc. When I was running my box, I disabled unneeded crap like PAM and anything else I didn't see as required, since I wasn't running the box outside of my LAN, nor was I tunneling into it from outside my LAN.

Useless input, as I'm sure you have way more tech-savvy people helping.

I'm unsure if running "ssh -vvv user@host" causes data leakage, but the verbose output may inform you.
 
Maybe reverse DNS lookups? sshd and inetd are often configured to do reverse DNS lookups by default.
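If it is that, the usual suspects live in sshd_config; worth toggling these and retesting (standard OpenSSH options, nothing exotic):

  # /etc/ssh/sshd_config -- skip reverse DNS on incoming connections
  UseDNS no
  # GSSAPI auth is another common source of per-connection DNS stalls
  GSSAPIAuthentication no

then restart sshd to pick the changes up.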
 
Since creating the OP, I've set the nice value of the FPM to -10.

@dumbledore suggested setting up pdns-recursor, and I am intrigued by his idea that using remote S3 and having to do constant NS lookups is the issue. It would definitely fit into that "nightmare undetectable issue" realm. I've also done that; I will do it on the other nodes.
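For anyone replicating the nice change: a systemd drop-in is the clean way to make it stick across restarts (I'm guessing at the unit name here; substitute whatever your FPM service is actually called):

  # sudo systemctl edit php-fpm, then add:
  [Service]
  Nice=-10
  # followed by: sudo systemctl restart php-fpm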
 
I disabled unneeded crap like PAM
Don't get Null hacked again.
If I'm reading correctly, are you saying disk write speeds are slow?
Along with BFQ and scx_cosmos, setting vm.dirty_ratio as low as you can tolerate can really help (counterintuitively). Yeah it's weird, but queuing is a bitch.

Again, like with the overcommit ratio:
Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.
(Vecr: that means that you need to set dirty_ratio without touching dirty_bytes after)
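Concretely, that looks like the below; the values are illustrative, tune to taste:

  # Lower the writeback thresholds; per the note above, writing the
  # *_ratio knobs makes the *_bytes counterparts read back as 0
  sudo sysctl vm.dirty_ratio=5
  sudo sysctl vm.dirty_background_ratio=2
  # Confirm the ratios won, i.e. the byte counters now read 0
  sysctl vm.dirty_bytes vm.dirty_background_bytes
  # Persist across reboots
  printf 'vm.dirty_ratio=5\nvm.dirty_background_ratio=2\n' | sudo tee /etc/sysctl.d/90-dirty.conf
  # BFQ, if you're also switching schedulers, is per device
  # (assuming the bfq module is available):
  echo bfq | sudo tee /sys/block/nvme0n1/queue/scheduler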
 
Don't get Null hacked again.

Not trying to do that; perhaps I'm confusing my lingo here. I meant I disabled password auth for SSH and only allowed SSH keys to authenticate.
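For anyone following along, that hardening is a couple of standard lines in sshd_config:

  # /etc/ssh/sshd_config -- keys only, no passwords
  PasswordAuthentication no
  PubkeyAuthentication yes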

I doubt Null is using any kind of shell autocompletion, so it won't be that (it can cause input lag), but doesn't SSH send each keystroke to the server, which then echoes it back to the client?
 
As someone who never got the proper training and education for certain tech shit I wanted to ages ago, due to numerous factors, I can still feel the fucking pain and aggravation. This year in general seems to be a year of rot and messed-up/unfulfilled goals for a lot of people.

I think the best thing to sum up the experience of this year is this emote: :ow:
 
Since creating the OP, I've set the nice value of the FPM to -10.

@dumbledore suggested setting up pdns-recursor, and I am intrigued by his idea that using remote S3 and having to do constant NS lookups is the issue. It would definitely fit into that "nightmare undetectable issue" realm. I've also done that; I will do it on the other nodes.
have you tried turning it off and back on
also, i know you're busy, but not having an update on why i couldn't load the site on telegram is annoying. it was nice to have some reassurance it was being worked on or something.
thanks for all you do
 
I'm the kind of guy who sets /etc/hosts DNS records for all my important stuff, but that isn't always practical with more complex setups. Nice for emergency "oh shit, DNS is completely down" situations.
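e.g. something like this, with placeholder addresses and names:

  # /etc/hosts -- static entries, consulted before DNS under the usual
  # "files dns" order in /etc/nsswitch.conf
  203.0.113.10   db1.internal
  203.0.113.11   s3.internal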
 
I just wanted to say that 99% of the time, the only reason I know the site is slow is because you say so. I don't know what other people's expectations re: site speed are, but it's almost always fast enough for me unless a DDoS is ongoing.
 
As much as this is annoying, it only really happens noticeably when sending posts.
KiwiFlare usually works flawlessly.
The site either 504s, loads halfway, or just halts completely, so something is interrupting the actual data stream server-side.
Hopefully it's not something as stupid as logs filling up and causing a service crash and restart.
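That one at least is quick to rule out from a shell (assumes systemd-journald; /var is where logs usually pile up):

  # Any filesystem out of space or inodes, especially /var?
  df -h /var && df -i /var
  # How much is the journal eating?
  journalctl --disk-usage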
 
Site load times, video buffering speed, thumbnail expanding, highlight navigation - all seem more consistent and quicker to load in general. Had the first image in this post take a couple of seconds to expand, but every one after that was instant.

Performance seems better, hope this post goes through first try :story:

Edit: no error on posting, it instantly posted, and clicking edit to add this was instant as well.

Edit 2: DM tray populated basically instantly and clicking a DM link loaded quickly too. Same with the notification tray.
 
Does this have to do with the site?
(attached image: 1757464255318.webp)
 
I've hosted the Kiwi Farms for 13 years, and at this point I have completely exhausted my own personal understanding of computers, the understanding of everyone around me, and also AI. I do not know why the site is slow. This machine should not experience multi-second latency doing anything.
have you tried turning the computer on and off again
 