Without taking a look and poking around, which of course can't happen, this can only be guesswork. So my shot-in-the-dark is: I'd take a close look at name resolution, especially DNS, to see if something's fucky in that pipeline somewhere. I've resolved issues like this that ended up being:
- One bad DNS resolver in a group (missing host records, or zone data out of sync with its peers)
- Incorrect or non-existent DNS search domains (e.g. a search list pointing at an internal namespace like .local that doesn't actually exist)
- Reverse DNS lookups occurring when they shouldn't be, or failing to resolve because of split-horizon/partitioned DNS
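A cheap way to probe that last bullet without any extra tooling is to round-trip each hostname through forward and then reverse resolution and see whether the answers agree. A minimal sketch using only Python's stdlib (the hostname here is a placeholder; swap in your real internal names):

```python
import socket

def forward_reverse_check(name):
    """Resolve a hostname, then reverse-resolve the address that came back.

    A mismatch, or a missing PTR record, hints that the forward and reverse
    zones have drifted apart -- the kind of thing that makes some connections
    hang or fail while others work fine.
    """
    addr = socket.gethostbyname(name)
    try:
        ptr_name = socket.gethostbyaddr(addr)[0]
    except socket.herror:
        ptr_name = None  # no PTR record: reverse lookups will fail or time out
    return addr, ptr_name

if __name__ == "__main__":
    # 'localhost' should always round-trip; replace with your own hosts.
    for host in ("localhost",):
        addr, ptr = forward_reverse_check(host)
        print(f"{host} -> {addr} -> PTR: {ptr}")
```

Run it against every name your services use to reach each other; any `None` or unexpected PTR answer is worth chasing.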
Again, since we're necessarily limited on information, this can only be guesswork. Knowing whether you connect over SSH by name or by IP would narrow the possibilities, as would knowing how internal services reach each other. This is also where being a one-man shop gets very tricky, since the issue might just be 'the way null has always set up <small thing>, because it's how he learned to do it' turning out to cause a cascade of downstream weirdness.
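On the SSH-by-name-or-IP question specifically: a classic tell for the reverse-lookup case is a login that hangs for several seconds before the password prompt, because sshd tries to reverse-resolve the connecting client's address. In OpenSSH that behaviour is governed by the `UseDNS` option (it has defaulted to off in recent releases, but distro packaging and old configs vary), so it's worth a quick check on the server:

```
# /etc/ssh/sshd_config -- skip reverse-DNS lookups on incoming connections
UseDNS no
```

If flipping that makes the hang disappear, you've confirmed the problem is on the PTR side of DNS rather than in SSH itself.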
Separately:
You haven't mentioned it, so I'm going to assume, hopefully incorrectly, that you are running on bare metal. If you do not already have a hypervisor host somewhere in this environment, with at least test/dev replicas of the production environment on it if not production itself, please take the time to build one. It doesn't need to be beefcake hardware; you could even run the hypervisor on main alongside the production workloads if that's the only place you have spare compute/memory. This lets you experiment, troubleshoot, clone, back up, and cut over entire environments, or even download one to run locally. Plus, in this instance, if you see the same issues in a virtualized environment, you can rule out the underlying hardware. Keeping the physical and logical layers of your system separated is worthwhile, and with modern virtualization the overhead is genuinely minimal (on the order of 1%).
Edit: Are you seeing any issues when connecting to the IPMI for the host, assuming you have access to that?

