GPU Inference:
Multi-gpu inference comes with performance penalties, gotchas, and limited software support. As a result you should first determine what inferencing you're doing and, if possible, get a single card with enough memory. This is basically the PC build guide question of "what do you use your computer for".
CUDA is king. Yes, intel and AMD can do inference, but expect performance penalties, gotchas, and limited software support. For serious inferencing that you want to just work, it still needs to be nvidia (sadly).
Old nvidia cards are a great value but are slower, have some gotchas, and have limited software support. Old intel cards don't exist, and old AMD cards are even less supported.
Mix and match any of these three things for exponentially more difficult and potentially unworkable results.
So what we want is: an nvidia 20-series card or later with enough memory to do what you want on one card.
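As napkin math for "enough memory on one card", here's a rough sketch. The numbers are illustrative, not a promise: real usage adds KV cache, activations, and framework overhead on top of the weights, so pad the estimate.

```python
# Rough VRAM napkin math for picking a single card.
# Illustrative only: real usage adds KV cache, activations, and framework
# overhead on top of the weights, hence the fudge factor.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB at a given quantization."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

def fits(params_billion: float, bits_per_weight: float, vram_gb: float,
         overhead: float = 1.2) -> bool:
    """True if the weights (plus a fudge factor) fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) * overhead <= vram_gb

# A 7B model at FP16 is ~14 GB of weights -> fine on a 24 GB card like a 3090.
print(fits(7, 16, 24))    # True
# A 70B model at 4-bit is ~35 GB of weights -> does not fit on one 24 GB card.
print(fits(70, 4, 24))    # False
```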
CPU (Inference):
CPU inference is mainly used to run really big models, because you can get a lot more memory. The actual CPU doesn't really matter except for the SIMD extensions it has. Most projects won't run without AVX2, and after that the only big speedup is AMX. Beyond this, basically only memory amount + bandwidth matter. For older stuff dual xeons are decent; for newer, threadripper is great.
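To see why bandwidth is the thing that matters: generating each token has to stream roughly every active weight through the CPU, so bandwidth divided by model size gives a crude ceiling on generation speed. A sketch with illustrative numbers:

```python
# Why memory bandwidth dominates CPU inference: each generated token reads
# (roughly) every active weight, so an upper bound on generation speed is
# bandwidth / model size. Numbers below are illustrative.

def max_tokens_per_sec(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Crude memory-bound ceiling on token generation speed."""
    return bandwidth_gb_s / active_weights_gb

# Dual-channel DDR4 desktop (~50 GB/s) vs a many-channel threadripper-class
# platform (~200 GB/s), both running a 70B model quantized to ~40 GB:
print(max_tokens_per_sec(50, 40))    # 1.25 tok/s: painful
print(max_tokens_per_sec(200, 40))   # 5.0 tok/s: usable
```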
There is a special case: offloading all the experts in an MoE model to system memory while keeping the prompt processing and remaining tensors on a single beefy gpu. This is how people run large quantizations of huge models locally at good speeds. Not really a consideration right now with ram prices as they are.
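For reference, this expert-offload setup is the kind of thing llama.cpp can do. The sketch below just builds the command line; the flag names are from memory and may differ by version (check `llama-server --help`), and the model path is made up.

```python
# Sketch of the MoE expert-offload setup described above, using llama.cpp as
# an example runtime. Flag names are from memory and may vary by version --
# verify against `llama-server --help`. The model path is hypothetical.

def moe_offload_cmd(model_path: str) -> list[str]:
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",        # request all layers on the GPU...
        "-ot", "exps=CPU",   # ...then override: expert tensors stay in system RAM
    ]

# Hypothetical model path, for illustration only:
cmd = moe_offload_cmd("/models/big-moe-q4.gguf")
print(" ".join(cmd))
```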
However, we are doing GPU stuff, so as long as it has AVX2 it basically doesn't matter.
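If you want to sanity-check AVX2 on a candidate box, on Linux you can just look at the flags in /proc/cpuinfo (other OSes need something like `sysctl` or a library such as py-cpuinfo):

```python
# Quick AVX2 check on Linux by parsing /proc/cpuinfo. Linux-specific; other
# OSes expose CPU features differently.

def flags_from_cpuinfo(text: str) -> set[str]:
    """Collect the CPU feature flags from /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    return flags

def has_avx2(cpuinfo_text: str) -> bool:
    return "avx2" in flags_from_cpuinfo(cpuinfo_text)

# On a real box: has_avx2(open("/proc/cpuinfo").read())
sample = "processor : 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(has_avx2(sample))  # True
```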
Encoding:
Intel is good for cheap AV1 (and h264) encode because they don't cut down their media engine on lower-spec cards or limit the number of transcodes like nvidia does. Critically, that means higher-spec cards are not better for encoding. AMD has trash quality and is not to be considered.
I remain convinced that
a couple of the cheapest available A/B series intel cards are the correct choice.
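For a concrete starting point, AV1 encode on those intel cards goes through Quick Sync in ffmpeg. This sketch only builds the command; it assumes an ffmpeg build with QSV enabled, the flag set is minimal (tune quality/rate control to taste), and the file names are placeholders.

```python
# Minimal sketch of an AV1 encode via Intel Quick Sync in ffmpeg. Assumes an
# ffmpeg build with QSV support; file names are placeholders, and the flag set
# is a starting point, not a tuned configuration.

def av1_qsv_cmd(infile: str, outfile: str, quality: int = 25) -> list[str]:
    return [
        "ffmpeg",
        "-i", infile,
        "-c:v", "av1_qsv",                 # Quick Sync AV1 encoder
        "-global_quality", str(quality),   # lower = better quality, bigger file
        "-c:a", "copy",                    # leave audio untouched
        outfile,
    ]

# You'd hand this to subprocess.run() on a machine with the card installed:
print(" ".join(av1_qsv_cmd("input.mkv", "output.mkv")))
```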
RAM:
The pricing on this is a disaster at the moment. I would get ~64gb on the cheap by leveraging a large number of DIMM slots and the fact that system memory speed isn't super important, with the intention to upgrade later after this nonsense ends. The link I gave is just one example. You might even go down to 8gb sticks.
Conclusion:
Server with a ton of PCIe slots, old CPU+RAM to save money, several A310s or A380s for video encode, and, absent more specific inference needs, several bang-for-buck-without-gotchas cards, aka 3090s.
Other thoughts:
Expandability:
Not all of the PCIe slots are populated, so there are options. The stated inference requirements at the moment are minuscule, so I think buying a 2060 or two would suffice, and that would free up the 3090s exclusively for really big mistakes, I mean projects. Further, the server supports 1TB of ddr4 at 2400. Once the ram bubble pops, getting that would be cheap and open up opportunities like the previously mentioned MoE models or whatever else.
PSU:
the supermicro had a proprietary psu pin
Most server motherboards have proprietary psu connections in my experience, to support multiple hot-swappable PSUs. As a matter of fact, most server boards are not even a standard form factor and fit only in the chassis they were designed for. I don't think avoiding this is feasible in a rack-mount situation.
3090 vs. B60 vs. whatever else
Why am I recommending 3090s after shitting on B60s? Two reasons. One, because of CUDA and software support and their comparable price, I would choose a 3090 over a B60 for any situation without a second thought. Two, these will not be used for video encoding. Because of that we're freed from any requirements except "best bang for buck for AI inference", which is a (couple) 3090(s). You could swap these for B60s (or any other card dedicated to inference), but if it's not an nvidia 20-series or later it will cause headaches.
Jank:
I'm not sure how jank you are willing to go with this. I think a consumer motherboard with the intel gpus for transcoding and a separate server for AI stuff would be entirely reasonable. Further, do you want this AI server to be all buttoned down, or are you okay with pcie risers and zip ties on a $20 amazon wire shelf? I'm going with one integrated server based on your previous purchase, and iirc you said you have a server rack.
your avatar is really creepy and gross
It's literally the unit portrait for the SCV in starcraft so I hadn't really noticed. Now that I'm looking at it I see what you mean. Why have you done this to me?