That's like comparing a dragster with a McLaren P1 on the Nürburgring.
There's such a paradigm shift in GPU occupancy, cache, and memory handling between the 2012 AMD GCN architecture and Ampere that I'm not even sure where I would begin. Not to mention that the Jaguar cores were bandwidth-hungry compared to ARM processors, which are designed with mobile memory in mind to begin with.
AMD GCN's cache and memory handling was so bad that almost the entirety of the RDNA project was about fixing it. It had an anemic front end: the geometry engines and rasterisers couldn't spit out vertices and pixels fast enough to saturate the cores. Shit occupancy too, the CUs just couldn't stay occupied, full of stalls. It's like having a giant pool and filling it with a water hose. That's why the PS4 went overkill on bandwidth: the hose diameter and valve didn't get any bigger, but with that much pressure, any time the CUs aren't stalled they're sure to get data ASAP.
GCN could only issue an instruction to a given wavefront every 4 cycles (a SIMD16 unit takes 4 cycles to push a 64-wide wavefront through), while Kepler could issue an instruction every cycle.
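If the 4-cycle figure seems arbitrary, it just falls out of the wavefront and SIMD widths. A quick sketch of the arithmetic (my own numbers-check, nothing beyond the widths quoted above):

```cpp
#include <cstdio>

int main() {
    // Widths quoted above: 64-wide GCN wavefronts, 16-wide SIMD units.
    const int wave_width = 64;
    const int simd_width = 16;
    // A SIMD16 needs wave_width / simd_width cycles to push one instruction
    // through for a full wavefront, so a given wavefront only advances every 4 cycles.
    printf("GCN: new instruction per wavefront every %d cycles\n", wave_width / simd_width);
    printf("Kepler: new instruction per warp every 1 cycle\n");
    return 0;
}
```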
GCN's geometry pipeline also stalled on any context-switch instruction (which Vega tried to fix).
Even the infamous Vega, with its ridiculous bandwidth and memory bus width, had 4 geometry engines for 4096 cores. Tahiti, which the PS4 is based on, had 2 per 2048, an equivalent ratio.
To give an idea, Kepler basically laid the foundation for how the basic SM building blocks are divided, a structure that carried forward all the way to the modern day, and back then it had one PolyMorph engine (the geometry engine equivalent) per SMX, i.e. one per 192 CUDA cores. Then 1 per 128 CUDA cores in Pascal, etc. Nowhere near GCN's bonkers idea of trying to feed 1024 cores with 1.
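Spelled out as cores per geometry unit, using only the counts quoted above (rough sketch, host-side only):

```cpp
#include <cstdio>

int main() {
    // Core and geometry-unit counts quoted above (Kepler/Pascal per SM, GCN per chip).
    struct { const char* name; int cores; int geom; } gpus[] = {
        {"Kepler (1 PolyMorph per 192-core SMX)", 192, 1},
        {"Pascal (1 PolyMorph per 128-core SM)",  128, 1},
        {"Tahiti (2 geometry engines)",          2048, 2},
        {"Vega 64 (4 geometry engines)",         4096, 4},
    };
    for (const auto& g : gpus)
        printf("%-40s -> %4d cores per geometry unit\n", g.name, g.cores / g.geom);
    return 0;
}
```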
GCN was a compute monster; it handled large work sizes with long durations well (big pool), but very few game workloads fall into that category. Simple geometry wasn't saturating the geometry pipeline (leaving it idle), and it had simultaneous bit commands that created huge buffers, basically kneecapping parallelism. The larger GPU on the PS4 also meant the SE:CU ratio (shader engines vs compute units) would fill slower, preferring longer-running waves, which is again the antithesis of most gaming workloads.
RDNA's whole point was to clean up the consequences of years of trying to make GCN work.
A shitload happened between Kepler → Maxwell → Pascal → Volta → Turing → Ampere
Ampere especially was a paradigm shift in Nvidia architecture: concurrent raster/RT/ML, async compute to keep the GPU near full occupancy, asynchronous copies from global memory that reduce memory traffic and hide data-copy latency, etc. And that's without even going into each generation's individual improvements.
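To make the async-copy point concrete, here's a minimal CUDA sketch (my own toy example, not anything Nintendo- or game-specific) of staging data from global into shared memory with cooperative groups; on Ampere (sm_80+) this lowers to cp.async, which bypasses the register file and lets the copy overlap with other work:

```cpp
// Toy Ampere async-copy sketch. Assumes n is a multiple of blockDim.x for brevity.
// Build with: nvcc -arch=sm_86 async_copy.cu
#include <cstdio>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void scale_tile(const float* __restrict__ in, float* __restrict__ out)
{
    extern __shared__ float tile[];                    // blockDim.x floats of staging space
    cg::thread_block block = cg::this_thread_block();
    size_t base = static_cast<size_t>(blockIdx.x) * blockDim.x;

    // Async copy global -> shared; threads are free to do other work until cg::wait.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * blockDim.x);
    cg::wait(block);                                   // the tile has landed in shared memory

    out[base + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    const size_t n = 1 << 20;                          // multiple of the block size
    const int threads = 256;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) in[i] = float(i);

    scale_tile<<<n / threads, threads, threads * sizeof(float)>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[42] = %.1f (expect 84.0)\n", out[42]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```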
For Switch 2 bandwidth:
The T239 in the Switch 2 follows the whole Ampere lineup's usual ~25 GB/s per TFLOP. That leaves ~25 GB/s for the CPU, which is more than plenty for the ARM Cortex-A78 cores.
With the estimated TFLOPs from the T239 leaks:
Handheld: 1.7 TFlops * 25 + ~25 GB/s for CPU = 67.5 GB/s → DF estimated 68.26 GB/s
Docked: 3.1 TFlops * 25 + ~25 GB/s for CPU = 102.5 GB/s → DF estimated 102.4 GB/s
More examples of Ampere at ~25 GB/s per TFLOP:
3060 @ 12.74 TFlops for 360 GB/s → 28.3 GB/s/TFlops
3070 @ 20.31 TFlops for 448 GB/s → 22.1 GB/s/TFlops
3080 @ 29.77 TFlops for 760 GB/s → 25.5 GB/s/TFlops
3090 @ 35.58 TFlops for 936 GB/s → 26.3 GB/s/TFlops
It's being fed with bandwidth exactly according to modern Nvidia architectures' needs.
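If you want to redo the math yourself, here's the same rule of thumb in a few lines of host-side code (the ~25 GB/s per TFLOP ratio and the ~25 GB/s CPU reserve are the rough assumptions from above, not official specs):

```cpp
#include <cstdio>

int main() {
    // Rough assumptions: ~25 GB/s of bandwidth per TFLOP on Ampere,
    // plus ~25 GB/s reserved for the CPU cluster.
    const double gbps_per_tflop = 25.0;
    const double cpu_reserve    = 25.0;

    // T239 / Switch 2 estimates vs Digital Foundry's numbers
    printf("Handheld: %.1f GB/s (DF: 68.26)\n", 1.7 * gbps_per_tflop + cpu_reserve);
    printf("Docked:   %.1f GB/s (DF: 102.4)\n", 3.1 * gbps_per_tflop + cpu_reserve);

    // Desktop Ampere sanity check: GB/s per TFLOP
    struct { const char* name; double tflops, gbps; } cards[] = {
        {"RTX 3060", 12.74, 360.0},
        {"RTX 3070", 20.31, 448.0},
        {"RTX 3080", 29.77, 760.0},
        {"RTX 3090", 35.58, 936.0},
    };
    for (const auto& c : cards)
        printf("%s: %.1f GB/s per TFLOP\n", c.name, c.gbps / c.tflops);
    return 0;
}
```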