Do you happen to know how Nvidia Ampere Teraflops are calculated?
Flops are always calculated in a similar fashion, unless someone starts using something other than FP32 math for it, which would generally be the wrong thing to do.
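For reference, the usual peak figure is just FP32 lanes times 2 (an FMA counts as two operations) times clock. A minimal sketch of that arithmetic follows; the function name and the example core counts and boost clocks are my own assumptions taken from public spec sheets, not something stated in this thread.

```python
def peak_fp32_tflops(fp32_lanes: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 rate, assuming every lane retires one FMA per clock."""
    flops_per_lane_per_clock = 2  # a fused multiply-add is counted as 2 flops
    gflops = fp32_lanes * flops_per_lane_per_clock * clock_ghz
    return gflops / 1000.0  # GFLOPS -> TFLOPS


# Example inputs (assumed spec-sheet values: FP32 lane count, boost clock in GHz)
turing_tflops = peak_fp32_tflops(4352, 1.545)  # RTX 2080 Ti class part
ampere_tflops = peak_fp32_tflops(8704, 1.710)  # RTX 3080 class part

print(f"Turing example: {turing_tflops:.2f} TFLOPS")
print(f"Ampere example: {ampere_tflops:.2f} TFLOPS")
```

Note that this is a paper number only: it says nothing about how often real code can keep every lane busy with FMAs.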
Are double rate fp32 calculations included in a similar fashion? I am reading 'ampere compute figures being inflated' left and right but couldn't find a definitive answer after doing a bit of research.
No, Ampere's FP32 rate is a "true" math rate in the sense that there are two SIMDs capable of doing FP32 math in an SM partition, while in Turing there was only one such SIMD.
This is different from RDNA3's VOPD, where the additional FP throughput comes from additional (repurposed?) ALUs inside the same SIMDs.
The difference comes down to the ability to extract FP instructions from the command stream and to the types of such FP instructions.
Ampere+ can basically run the same FP math on both SIMDs and this math can be from two independent warps/waves (out of those which are scheduled to run on the SM).
RDNA3's VOPD can run a limited subset of FP instructions at double rate, and they have to come from the same warp/wave which is running on the SIMD in the current clock.
This makes the opportunities for such double-speed issue very limited, so it rarely happens in practice.
The "inflated" nature of Ampere flops is different: they are easy to use, but in practice they are excessive for gaming code (which tends to hit other bottlenecks besides pure flops) and they have to be shared with other math types.
The latter is the main reason why the gains between Turing and Ampere were lower than you'd expect from comparing the flops alone: Turing did have half the flops rate, but it had the same INT rate, and INT math takes 1/4 to 1/3 of typical gaming code execution.
On Ampere+ this INT workload has to run on the SIMD which can also do FP work, which means that SIMD can't do FP while it's doing INT. So in practice the gain from the FP32 doubling is less than 2x, as the same SIMD must still run INT math, which takes quite a chunk of overall execution time.
If you run a purely FP workload Ampere+ easily hits its peak FP throughput figures. But games are never a purely FP workload.
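To put rough numbers on the INT sharing argument, here is a toy issue-slot model. It is my own simplification and not how the hardware scheduler actually works: it assumes Turing has one FP32 SIMD plus one INT32 SIMD per partition, Ampere has one FP32 SIMD plus one SIMD that does either FP32 or INT32, work splits perfectly between them, and nothing else (memory, latency, other units) is a bottleneck. All function names are hypothetical.

```python
def issue_time_turing(fp_work: float, int_work: float) -> float:
    """FP runs on one SIMD, INT on the other, so the longer stream sets the time."""
    return max(fp_work, int_work)


def issue_time_ampere(fp_work: float, int_work: float) -> float:
    """INT must run on the shared SIMD; FP can be balanced across both SIMDs."""
    return max(int_work, (fp_work + int_work) / 2)


def ampere_speedup(int_fraction: float) -> float:
    """Idealized Ampere-over-Turing speedup for a given INT share of the math."""
    fp_work = 1.0 - int_fraction
    return issue_time_turing(fp_work, int_fraction) / issue_time_ampere(fp_work, int_fraction)


for label, frac in [("pure FP", 0.0), ("1/4 INT", 0.25), ("1/3 INT", 1 / 3)]:
    print(f"{label}: {ampere_speedup(frac):.2f}x")
# pure FP: 2.00x
# 1/4 INT: 1.50x
# 1/3 INT: 1.33x
```

So even in this best-case model the 1/4 to 1/3 INT share mentioned above cuts the 2x paper gain to somewhere around 1.33x to 1.5x, and real games lose more to the other bottlenecks.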