GPUs & CPUs & Enthusiast hardware: Questions, Discussion and fanboy slap-fights - Nvidia & AMD & Intel - Separate but Equal. Intel rides in the back of the bus.

Anyone know anything about this? Are there any GPUs out there that are built for double precision operations? It seems the AI field has embraced even smaller precision floating point math (~3ish decimals? I'm amazed that even works.), which works for them, but isn't really useable for certain physics simulations. Painting pictures on a screen rarely needs anything larger than 32 bit floats.
All of the consumer grade GPUs have intentionally gimped FP64. Looks like AMD gimps the consumer cards half as much:

RTX 3090:
FP32 (float) 35.58 TFLOPS
FP64 (double) 556.0 GFLOPS (1:64)

RX 7900 XTX:
FP32 (float) 61.39 TFLOPS
FP64 (double) 1.918 TFLOPS (1:32)

RTX 4090:
FP32 (float) 82.58 TFLOPS
FP64 (double) 1,290 GFLOPS (1:64)

The good old Radeon VII is beating all of those, and even it was gimped from a 1:2 professional card:

FP32 (float) 13.44 TFLOPS
FP64 (double) 3.360 TFLOPS (1:4)

But consult someone who actually knows what's up.
 
Even the Intel Arc A310 beats the 3090 at FP64:

FP32 (float) 2.688 TFLOPS
FP64 (double) 672.0 GFLOPS (1:4)

Also, what kind of AI can I run on one that's in an old Debian server that just uses it for transcoding?
 
An upscaler would be the natural thing to pair with your transcoder. I’ve run one on my workstation, since I don’t think there’s a Jellyfin plugin for this yet, and admittedly watching old anime at 4K is really cool.
 
In Intel’s defense, it wasn’t ENTIRELY their fault that their gaming GPUs bombed.

As I recall, much of the Arc effort was based in… Moscow. And when the SMO started, they suddenly had a real clusterfuck on their hands, and spent much of that year relocating staff out of Intel's Russian offices and replacing those who didn’t want to leave.

That was just one factor in why their cards arrived late but I reckon it was a big one.
 
Will an AI-based upscaler ever work well on an AMD card, though? Sure, asking someone like CDPR to go and start optimizing the game engine, models, etc. so they'd play nice with FSR is a lot to ask - but if the market share were there, this could be a possibility.
Why wouldn't it? There's no secret to tensor algebra that only Nvidia knows.


Anyone know anything about this?

Yes. Nvidia crippled FP64 on gaming GPUs after Titan Black ate into their business. You'll need at least a workstation GPU, or maybe look for a used Radeon VII or V100.

You shouldn't need doubles if you scale your problem appropriately. DM and maybe I can help.
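For anyone wondering what "scale your problem appropriately" tends to mean in practice (a generic sketch, nothing specific to your code): nondimensionalize so everything the solver touches sits near 1, which is where FP32 has its best relative resolution. For plain 1-D diffusion, for example:

\partial_t u = \nu\,\partial_{xx} u, \qquad x^{*} = x/L, \quad t^{*} = \nu t / L^{2}, \quad u^{*} = u/U_0 \;\Rightarrow\; \partial_{t^{*}} u^{*} = \partial_{x^{*}x^{*}} u^{*}

Everything is now O(1), so FP32's ~7 significant digits get spent on the physics rather than on large or tiny absolute magnitudes.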
 
Why wouldn't it? There's no secret to tensor algebra that only Nvidia knows.
Well it's mostly because of the Tensor-like cores, so the reasons are architectural. AMD doesn't have anything like it in the consumer grade, while Intel has the XMX.
 
Well it's mostly because of the Tensor-like cores, so the reasons are architectural. AMD doesn't have anything like it in the consumer grade, while Intel has the XMX.

RDNA3's SIMD units can do matrix operations.
 
Never said it can't work at all, but that it won't work well. AMD GPUs just are rasterization optimized, it is what it is.

A 3050 Ti Mobile can crank through frame-wide inferencing just fine with all of 5.3 FP16 TFLOP/s. This is 1/4 the power of, say a Radeon 6600 XT, which can do 21 FP16 TFLOP/s.

I'm not sure why you think AMD cards can't churn through tensor operations fast enough to inference on a frame buffer of a few million pixels. It's just not all that much arithmetic.
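Rough numbers, just to put a scale on it (the per-pixel cost here is a made-up ballpark, not a measured figure for any particular upscaler):

3840 \times 2160\ \text{px} \;\times\; 10^{3}\ \text{FLOP/px} \;\times\; 60\ \text{fps} \;\approx\; 5\times10^{11}\ \text{FLOP/s} = 0.5\ \text{TFLOP/s}

which is a couple of percent of a 6600 XT's 21 FP16 TFLOP/s.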
 
A 3050 Ti Mobile can crank through frame-wide inferencing just fine with all of 5.3 FP16 TFLOP/s. This is 1/4 the power of, say a Radeon 6600 XT, which can do 21 FP16 TFLOP/s.
Talking about apples and oranges...
I'm not sure why you think AMD cards can't churn through tensor operations fast enough to inference on a frame buffer of a few million pixels. It's just not all that much arithmetic.
Mixed-precision performance is important, and it's lackluster on AMD.
 
Talking about apples and oranges...

No, actually, they're both gaming GPUs that are capable of inferencing.

Mixed-precision performance is important, and it's lackluster on AMD.

The 6000 series is good enough to run XeSS with DP4a. And given that the latest 7000 series GPUs smoke 3050s in AI benchmarks, there's no reason they shouldn't be able to run something as lightweight as inferencing-based upscaling.
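For reference, DP4a is nothing exotic: a packed 4-wide int8 dot product accumulated into a 32-bit integer, which XeSS's fallback path leans on. Here's the CUDA spelling of the same operation (Nvidia's __dp4a intrinsic, Pascal/sm_61 or newer), just to show how small the primitive is; AMD and Intel expose the equivalent instruction on their side:

#include <cstdio>
#include <cuda_runtime.h>

// One DP4A: four int8 x int8 products summed into a 32-bit accumulator.
__global__ void dp4a_demo(int *out)
{
    // Pack four signed bytes into each 32-bit operand (low byte first):
    // a = {1, 2, 3, 4}, b = {5, 6, 7, 8}
    int a = (4 << 24) | (3 << 16) | (2 << 8) | 1;
    int b = (8 << 24) | (7 << 16) | (6 << 8) | 5;
    // 1*5 + 2*6 + 3*7 + 4*8 = 70, added to the accumulator (0 here)
    *out = __dp4a(a, b, 0);
}

int main()
{
    int *d = nullptr;
    cudaMallocManaged(&d, sizeof(int));
    dp4a_demo<<<1, 1>>>(d);
    cudaDeviceSynchronize();
    printf("dp4a: %d (expected 70)\n", *d);  // build: nvcc -arch=sm_61 dp4a.cu
    cudaFree(d);
    return 0;
}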
 

Exclusive: How Intel lost the Sony PlayStation business

By Max A. Cherney

 
One thing that I'm running into writing physics simulations on the GPU is that my consumer-grade GPUs (RTX 3090) kind of suck at double precision operations. (Double precision (~15 decimal places) operations are about 16 times slower than single precision (~7 decimal places) operations.) It's still faster than the 20-core CPU on a large enough grid (where the crossover appears to be ~10 million cells/elements/what-have-you), but much slower than I'd like.

Anyway, I'm going to have to figure out how to tune this thing to use single precision, but I'm not sure how yet. The increments are close to the precision limit, especially since my timesteps need to get smaller as my grid gets finer.

Anyone know anything about this? Are there any GPUs out there that are built for double precision operations? It seems the AI field has embraced even smaller precision floating point math (~3ish decimals? I'm amazed that even works.), which works for them, but isn't really useable for certain physics simulations. Painting pictures on a screen rarely needs anything larger than 32 bit floats.

(What would really be cool is a math coprocessor with native 128 bit float operations - then I could forget entirely that I'm not working with continuum numbers. Alas, not needed by the AI or graphics crowd. So if anyone builds one, I'm unlikely to be able to afford it.)
If you really wanted to, you could write your own custom routines for software-level 128-bit floating-point operations. The same thing has been done before for 8-bit, 4-bit, 3-bit, etc. floating-point operations in order to run AI models with quantized weights on hardware that lacks support for these data types.
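The usual software stand-in is "double-double": glue two f64s together as an unevaluated sum, which gets you roughly 31 significant digits (though still only a double's exponent range, so it's not true IEEE binary128). A minimal sketch of the core building block, error-free addition:

#include <cstdio>

// A value represented as the unevaluated sum hi + lo, with |lo| <= ulp(hi)/2.
struct dd { double hi, lo; };

// Knuth's TwoSum: hi + lo == a + b exactly, with hi = fl(a + b).
__host__ __device__ inline dd two_sum(double a, double b)
{
    double s   = a + b;
    double bb  = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return {s, err};
}

// Add two double-doubles (the quick "sloppy" variant, fine for a demo).
__host__ __device__ inline dd dd_add(dd x, dd y)
{
    dd s = two_sum(x.hi, y.hi);
    s.lo += x.lo + y.lo;
    return two_sum(s.hi, s.lo);  // renormalize
}

int main()  // compile as a .cu file with nvcc
{
    // 1 + 1e-20 vanishes in a plain double but survives in the pair.
    dd a = two_sum(1.0, 1e-20);
    printf("double-double: hi = %.17g, lo = %.17g\n", a.hi, a.lo);
    printf("plain double:  %.17g\n", 1.0 + 1e-20);
    return 0;
}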
 
If it's true that they are "nerfing" FP64 that hard, then it should be possible to write a bunch of fp64 software emulation that runs faster using fp32 operations. I was thinking about that yesterday.

It may be less of a pain to figure out how to scale (or otherwise stabilize) my problem. It tends to go unstable when the timesteps are larger than a certain von-Neumannish stability limit, but the limit would be uncomfortably close to the precision limit with fp32.
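To put numbers on "uncomfortably close" (assuming a generic explicit diffusion-type scheme, just for illustration):

\Delta t \;\lesssim\; \frac{\Delta x^{2}}{2\nu}, \qquad \varepsilon_{\mathrm{FP32}} = 2^{-23} \approx 1.2\times10^{-7}

Halving \Delta x cuts \Delta t by 4x, and once a single increment falls below roughly 10^-7 of the quantity it's added to (the elapsed time, or a slowly varying field value), FP32 addition starts rounding it away entirely.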
 
I know there are algorithms that let you use two doubles (2 f64s) as a single number; you could try doing the same thing with 2 f32s. But the tuning might be difficult: how much mantissa you want, etc.
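Same trick one level down, a "float-float" pair, which gets you roughly 48 bits of significand out of hardware that only runs FP32 at full speed. A minimal sketch with the quick-and-dirty add/multiply (a real implementation needs more careful renormalization and edge-case handling):

#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// A value represented as the unevaluated sum hi + lo of two floats.
struct ff { float hi, lo; };

// Error-free addition (Knuth TwoSum): hi + lo == a + b exactly.
__host__ __device__ inline ff two_sum(float a, float b)
{
    float s  = a + b;
    float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}

// Error-free multiplication via FMA: hi + lo == a * b exactly.
__host__ __device__ inline ff two_prod(float a, float b)
{
    float p = a * b;
    return {p, fmaf(a, b, -p)};
}

__host__ __device__ inline ff ff_add(ff x, ff y)
{
    ff s = two_sum(x.hi, y.hi);
    s.lo += x.lo + y.lo;
    return two_sum(s.hi, s.lo);          // renormalize
}

__host__ __device__ inline ff ff_mul(ff x, ff y)
{
    ff p = two_prod(x.hi, y.hi);
    p.lo += x.hi * y.lo + x.lo * y.hi;   // cross terms; lo*lo is negligible
    return two_sum(p.hi, p.lo);
}

// Sum 10 million copies of 0.1f: plain FP32 drifts visibly,
// while the float-float accumulator stays close to 1e7 * (double)0.1f.
__global__ void accumulate(double *out)
{
    ff acc = {0.0f, 0.0f};
    float plain = 0.0f;
    for (int i = 0; i < 10000000; ++i) {
        acc = ff_add(acc, ff{0.1f, 0.0f});
        plain += 0.1f;
    }
    out[0] = (double)acc.hi + (double)acc.lo;
    out[1] = (double)plain;
}

int main()
{
    double *d = nullptr;
    cudaMallocManaged(&d, 2 * sizeof(double));
    accumulate<<<1, 1>>>(d);
    cudaDeviceSynchronize();
    printf("float-float sum: %.3f   plain float sum: %.3f\n", d[0], d[1]);
    cudaFree(d);
    return 0;
}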
 
If it's true that they are "nerfing" FP64 that hard, then it should be possible to write a bunch of fp64 software emulation that runs faster using fp32 operations. I was thinking about that yesterday.

That's basically what the gaming cards are already doing, and why it's so slow.
 
Intel and AMD were the final two contenders in the bidding process for the contract.
Who were all the other contenders? I can see Nvidia being one as they make the Switch chips, but where did Broadcom come from? I wasn't aware that Broadcom had meaningful GPU experience.
 
Who were all the other contenders? I can see Nvidia being one as they make the Switch chips, but where did Broadcom come from? I wasn't aware that Broadcom had meaningful GPU experience.
There were reports that Sony was considering switching to Arm. Even AMD is possibly making an Arm-based APU, "Sound Wave". Haven't heard of Broadcom making any big moves, but I'm interested to see if a MediaTek + Nvidia combo slaps the shit out of Qualcomm's Snapdragon X Elite within a year or two.

No One Is Buying AMD Zen 5 CPUs, So What's Going On?

AMD AGESA 1.2.0.2 BIOS Improves Inter-Core Latency For Zen 5 “Ryzen 9000” CPUs, 58% Reduction & Major Performance Uplifts
With the new BIOS, the average cross-CCD latency drops by 58% to 75 ns, while intra-CCD latency stays the same at 18-20 ns.

Zen 5, which had an undeniable mess of a launch, is healing with this BIOS update.
 