Surprised this thread is so quiet.
Anyway, Kepler says the PS5 Pro is RDNA 3.5 with RDNA4 RT/WMMA/Data Prefetcher.
I've gathered most of the info (both old and new) known so far into one post, without using anything from the leaked PS5 Pro document. From this we can get a good idea of what the PS5 Pro is shaping up to be, and it looks like a much bigger upgrade than the PS4 Pro was.
RDNA3.5
AMD RDNA 3.5’s LLVM Changes
Scalar Floating Point Instructions
AMD’s GPUs have used a scalar unit to offload operations from the vector ALUs since the original GCN architecture launched in 2011. Scalar operations are typically used for addressing, control flow, and loading constants. AMD therefore only had an integer ALU in their scalar unit. RDNA 3.5 changes this by adding floating point operations to the scalar unit.
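To make the scalar FP change a bit more concrete, here's a toy C++ sketch (entirely mine, not from the article; the kernel and names are made up) of what "wave-uniform" floating point work looks like. The uniform part is the kind of math RDNA 3.5's scalar unit can now take off the vector ALUs:

```cpp
// Illustrative sketch only (hypothetical kernel, not actual ISA or compiler output).
// A value like "invWidth" below is the same for every lane in a wave ("uniform").
// Before RDNA 3.5, uniform *float* math like this had to run on the vector ALUs;
// with scalar FP instructions the compiler can keep it on the scalar unit and
// leave the vector ALUs free for per-lane work.
#include <cstdio>

constexpr int WAVE_SIZE = 32;

void simulated_wave(const float* input, float* output, int width) {
    // Wave-uniform FP computation: one result shared by all lanes.
    // This is the kind of operation RDNA 3.5's scalar FP unit could handle.
    float invWidth = 1.0f / static_cast<float>(width);

    // Per-lane (vector) work: each lane uses the uniform value.
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        output[lane] = input[lane] * invWidth;
    }
}

int main() {
    float in[WAVE_SIZE], out[WAVE_SIZE];
    for (int i = 0; i < WAVE_SIZE; ++i) in[i] = static_cast<float>(i);
    simulated_wave(in, out, 1920);
    printf("lane 5 -> %f\n", out[5]);
}
```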
RDNA4 RT
Leaker reveals new AMD RDNA4 Ray Tracing features, expected in upcoming PlayStation 5 Pro
RDNA4 WMMA
Examining AMD’s RDNA 4 Changes in LLVM
Better Tensors
AI hype is real these days. Machine learning involves a lot of matrix multiplies, and people have found that inference can be done with lower precision data types while maintaining acceptable accuracy. GPUs have jumped on the hype train with specialized matrix multiplication instructions. RDNA 3’s WMMA (Wave Matrix Multiply Accumulate) instructions use matrices stored in registers across a wave, much like Nvidia’s equivalent instructions.
RDNA 4 carries these instructions forward with improvements to efficiency, and adds instructions to support 8-bit floating point formats.
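The article doesn't give a worked example, so here's a plain scalar C++ reference (my own sketch, with made-up names) of the math a single WMMA instruction performs per wave, assuming the 16x16x16 tile shape the V_WMMA_F32_16X16X16_F16 flavor operates on, if I'm reading the ISA docs right. On hardware the tiles live in vector registers spread across the wave's lanes; this is just the arithmetic:

```cpp
// Scalar reference for what one WMMA instruction computes per wave:
// D = A * B + C on small fixed-size tiles (16x16x16 here).
// A and B would be FP16 on hardware; plain float keeps the sketch simple.
#include <array>
#include <cstdio>

constexpr int M = 16, N = 16, K = 16;

using Tile = std::array<float, M * N>;  // FP32 accumulator tile

Tile wmma_reference(const std::array<float, M * K>& a,
                    const std::array<float, K * N>& b,
                    const Tile& c) {
    Tile d = c;  // start from the accumulator
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                d[m * N + n] += a[m * K + k] * b[k * N + n];
    return d;
}

int main() {
    std::array<float, M * K> a{};
    std::array<float, K * N> b{};
    Tile c{};
    a.fill(1.0f); b.fill(2.0f);
    Tile d = wmma_reference(a, b, c);
    printf("d[0] = %f (expect %d)\n", d[0], 2 * K);
}
```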
Sparsity
RDNA 4 introduces new SWMMAC (Sparse Wave Matrix Multiply Accumulate) instructions to take advantage of sparsity.
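I haven't confirmed the exact sparsity pattern SWMMAC expects, but assuming it's a 2-out-of-4 structured scheme like Nvidia uses, the idea looks roughly like this toy C++ sketch (my own illustration, not AMD's actual format):

```cpp
// Toy sketch of structured sparsity, assuming a 2-out-of-4 scheme (unconfirmed
// for SWMMAC). Instead of storing every element of the sparse matrix, you store
// the two non-zero values per group of four plus small indices saying where
// they came from; the multiply then only does half the work on that side.
#include <cstdint>
#include <cstdio>

// One group of 4 elements compressed to 2 values + 2 indices.
struct SparseGroup {
    float   value[2];   // the non-zero values
    uint8_t index[2];   // their positions (0..3) within the original group
};

// Dot product of one compressed group against 4 dense elements.
float sparse_dot4(const SparseGroup& a, const float b[4]) {
    // Only 2 multiply-adds instead of 4: the zeros are skipped by construction.
    return a.value[0] * b[a.index[0]] + a.value[1] * b[a.index[1]];
}

int main() {
    SparseGroup a{{3.0f, 5.0f}, {1, 3}};      // dense form: {0, 3, 0, 5}
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    printf("%f\n", sparse_dot4(a, b));        // 3*20 + 5*40 = 260
}
```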
RDNA4 Data Prefetcher
Examining AMD’s RDNA 4 Changes in LLVM
Software Prefetch
GPU programs typically enjoy high instruction cache hit rates because they tend to have a smaller code footprint than CPU programs. However, GPU programs suffer more from instruction cache warmup time because they tend to execute for very short durations. RDNA 3 mitigates this by optionally prefetching up to 64 × 128-byte cachelines starting from a kernel’s entry point. RDNA 4 increases the possible initial prefetch distance to 256 × 128 bytes. Thus the code size covered by the initial prefetch goes from 8 KB to 32 KB.
As far as I know, prefetching only applies to the instruction side. There’s no data-side prefetcher, so RDNA 3 SIMDs rely purely on thread and instruction level parallelism to hide memory latency.
RDNA 4 adds new instructions that let software more flexibly direct prefetches, rather than just going in a straight line. For example, s_prefetch_inst could point instruction prefetch to the target of a probably taken branch. If my interpretation is correct, RDNA 4 could be better at handling large shader programs, with instruction prefetch used to reduce the impact of instruction cache misses.
On the data side, RDNA 4 appears to introduce software prefetch instructions as well.
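I won't guess at the exact GPU mnemonics for the data side, but the concept is the same as software prefetch on CPUs. Here's a quick C++ analogy of my own using GCC/Clang's __builtin_prefetch:

```cpp
// CPU-side analogy for what a software data prefetch does; RDNA 4's new
// prefetch instructions would be the GPU equivalent (different mnemonics,
// same idea). __builtin_prefetch is a GCC/Clang builtin: it asks the hardware
// to start pulling a cache line in early so the later load hits in cache.
#include <cstddef>
#include <cstdio>

float sum_with_prefetch(const float* data, size_t n) {
    constexpr size_t kAhead = 64;  // how far ahead to prefetch, in elements
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&data[i + kAhead]);  // hint only; doesn't change results
        sum += data[i];
    }
    return sum;
}

int main() {
    float buf[1024];
    for (int i = 0; i < 1024; ++i) buf[i] = 1.0f;
    printf("%f\n", sum_with_prefetch(buf, 1024));
}
```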
RDNA3.5 die shot. (Strix Point)
GPU setup:
1 Shader Engine
2 Shader Arrays
4 WGP per Shader Array
8 WGP / 16 CUs total
4 RB+ / 32 ROPs
NPU for reference.
RDNA3 CU details for reference.
Microbenchmarking AMD’s RDNA 3 Graphics Architecture
To scale compute throughput beyond just adding more WGPs, AMD implemented dual issue capability for a subset of common instructions.
WGP Compute Characteristics
Compared to RDNA 2, RDNA 3 obviously has a large advantage in compute throughput. After all, it has a higher WGP count. But potential increases in compute throughput go beyond that, because RDNA 3’s SIMDs gain a limited dual issue capability. Certain common operations can be packaged into a single VOPD (vector operation, dual) instruction in wave32 mode. In wave64 mode, the SIMD will naturally try to start executing a 64-wide wavefront over a single cycle, provided the instruction can be dual issued.
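As a rough illustration (my own sketch, not actual compiler output), this is the kind of loop body where the compiler could pack two independent FP32 ops into the X and Y halves of one VOPD instruction in wave32 mode; whether it actually does depends on operand and register constraints:

```cpp
// Toy shader-style loop body with two independent FP32 multiply-adds.
void blend(const float* a, const float* b, float* out0, float* out1,
           float s0, float s1, int n) {
    for (int i = 0; i < n; ++i) {
        // Two independent ops, no data dependence between them:
        out0[i] = a[i] * s0 + b[i];   // candidate for the "X" half of a VOPD
        out1[i] = b[i] * s1 + a[i];   // candidate for the "Y" half of a VOPD
    }
}
```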
Looking at the RDNA3 ISA documentation, there is only one VOPD instruction that can dual issue packed FP16 math, plus another that works with packed BF16 numbers. These are the only two VOPD instructions that can use packed math.
This means the headline 123 TF FP16 figure will only be reached in very limited scenarios, mainly AI and ML workloads, although games have started to use FP16 more often.
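For reference, here's my per-lane sketch of what that packed-FP16 math looks like, assuming those two VOPD ops are dot2-accumulate style (two FP16 pairs multiplied elementwise and summed into an FP32 accumulator); that's where the doubled FP16 rate comes from:

```cpp
// Per-lane sketch of a packed-FP16 dot2-accumulate (my assumption about what
// those two packed-math VOPD ops do). Two FP16 pairs are multiplied and summed
// into an FP32 accumulator, giving 2 multiplies + 2 adds per lane per half of
// the dual-issued instruction.
struct Half2 { float x, y; };  // stand-in for a packed pair of FP16 values

float dot2_acc(Half2 a, Half2 b, float acc) {
    return a.x * b.x + a.y * b.y + acc;
}
```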