Switch 2 CPU bottleneck issues: Digital Foundry

It's not bandwidth limited, Pana. Look at the entire history of Ampere bandwidth I posted.
Ampere GPUs don't have to share bandwidth with 8 hungry CPU cores. This is going to be a big problem in those CPU-hungry open-world games (most multiplat games are open-world nowadays) like Cyberpunk, and we can already see the framerate is terrible on S2, similar to PS4.

There is a big memory contention problem when memory is shared between CPU and GPU. One does not simply add up the separate CPU and GPU bandwidth requirements like that. There is a high penalty involved. This is why those consoles have relatively healthy bandwidth (PS4 176GB/s, PS5 448GB/s, PS5 Pro 576GB/s). And Cerny added specific ML instructions to use the fastest GPU caches directly because main bandwidth is far from enough.

Basically, when the CPU in theory needs only 20GB/s, it actually needs maybe double that: 40GB/s, because 20GB/s (or more) will be lost in the process of sharing the bus with the GPU.
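
A back-of-the-envelope version of what I mean, in rough Python (the 2x contention penalty is just my assumption for illustration; the real overhead depends on the memory controller and access patterns):

# Effective bandwidth left for the GPU on a shared bus, assuming every GB/s
# the CPU actually pulls costs roughly double in bus time due to contention.
def gpu_effective_bandwidth(total_gbps, cpu_nominal_gbps, contention_penalty=2.0):
    return total_gbps - cpu_nominal_gbps * contention_penalty

print(gpu_effective_bandwidth(176, 20))  # PS4 with a 20GB/s CPU budget: ~136 GB/s left for the GPU
print(gpu_effective_bandwidth(176, 18))  # with the 18GB/s Jaguar cap: ~140 GB/s, not 176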

(attached image: PS4-GPU-Bandwidth-140-not-176.png)
 
Their constant comparison of this thing to a PS4/Xbone is getting irritating now. The area of the game they keep showcasing can't even run on last gen consoles.
They need to change their tune and be professional about these topics, especially the "then buy the original Switch" BS from Nintendo.
 
This thread is funny!

Nintendo fans: The Switch 2 can run X1/PS4 games even better than those consoles!

Also Nintendo fans: Why are you comparing the Switch 2 to X1/PS4??!?? It's a handheld, not a home console!
 
Ampere GPUs don't have to share bandwidth with 8 hungry CPU cores. This is going to be a big problem in those CPU-hungry open-world games (most multiplat games are open-world nowadays) like Cyberpunk, and we can already see the framerate is terrible on S2, similar to PS4.

There is a big memory contention problem when memory is shared between CPU and GPU. One does not simply add up the separate CPU and GPU bandwidth requirements like that. There is a high penalty involved. This is why those consoles have relatively healthy bandwidth (PS4 176GB/s, PS5 448GB/s, PS5 Pro 576GB/s). And Cerny added specific ML instructions to use the fastest GPU caches directly because main bandwidth is far from enough.

Basically, when the CPU in theory needs only 20GB/s, it actually needs maybe double that: 40GB/s, because 20GB/s (or more) will be lost in the process of sharing the bus with the GPU.

(attached image: PS4-GPU-Bandwidth-140-not-176.png)



ARM is a totally different ballpark, and that PS4 architecture is of course whack with bandwidth because it's GCN and plagued with the same architectural problems.

Should I list the number of products that used ARM CPUs with low-bandwidth LPDDR?

Even ARM's own documentation on the A78 puts it at 60GB/s peak for 8 cores, at 3GHz.

That's about 20GB/s at the ~1GHz Switch 2 is rumored to run. They have ~25GB/s to work with, and it's the A78C variant, which has one cluster rather than two, with a bigger L3 cache.

AGX Orin also works on the same principles.

Hell, Amazon's gigantic 96-core ARM chip with 536GB/s effectively has it at ~5.6GB/s per core.

ARM is made for this. Cutdown desktop CPUs are not 🤷‍♂️
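
The scaling I keep pointing at, spelled out in rough Python (assuming CPU bandwidth demand scales roughly linearly with clock, which is a simplification):

# ARM's A78 documentation: ~60 GB/s peak for an 8-core cluster at 3 GHz.
# Assume demand scales roughly linearly with clock speed (simplification).
peak_gbps, peak_clock_ghz, target_clock_ghz = 60, 3.0, 1.0
print(peak_gbps * target_clock_ghz / peak_clock_ghz)  # ~20 GB/s at the rumored ~1 GHz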
 
Just restating: the further the architecture in question strays from what's found in PC, the less reliable their predictions are, because their understanding is founded in PC benchmarking and not a deep understanding of system architecture and data pipelining.
 
Last edited:
Let's be real. The Switch 2 is going to have a bunch of third party games on it that run and look like crap, with low resolutions, low framerates, a mixture of both, and low quality textures in certain areas. A lot of people are going to be playing the Switch 2 hooked up to their TV, which means these flaws are much more noticeable, which in turn means you're not getting a quality experience compared to the competition. If you wanna ignore those flaws because you play it in handheld mode then fine, but anyone expecting this machine to perform miracles was fooling themselves.
These guys are just generating SEO for them for free.
 
The same way the OG Switch was in between the PS3/360 and PS4/Xbox One gens, Switch 2 is in between the PS4/Xbox One and PS5/Xbox Series gens.

What it lacks is a strong GPU, which is fully understandable given the power draw constraints (10W in handheld mode, 40W docked). What it excels in vs last gen is the SSD and RAM; the GPU is so weak/small that its RT capabilities can be skipped over. About DLSS: we will see at launch, but we all gotta remember DLSS isn't a magic trick, it takes GPU power to use DLSS too, GPU power that Switch 2 doesn't have much of, even docked.

The thing about AI upscaling is: the higher the native res it's upscaling from, the better the results are. That's the reason no one cares about DLSS when the final resolution is 720p and barely anyone cares about DLSS when the final resolution is 1080p. You really don't wanna upscale from some 540p or similarly tiny native resolution; obviously the end result will be better than native 540p, but far worse than native 1080p.


That's the weakest desktop GPU from the Ampere family of cards, 9 TFLOPS, so almost 3x Switch 2's GPU power (let's be generous and say 1.5x Switch 2's performance, because we can assume Switch 2 punches 30 to 50% above its weight since it's one closed platform, devs can code to the metal, etc). You can even see it has a 276 mm² die vs the roughly 200 mm² the whole Switch 2 APU has.

This vid of DA:V is a very nice spreadsheet of how a low-end Ampere GPU performs at native 1080p low/medium/high/ultra settings vs DLSS Quality on the same settings, in a very controlled run/benchmark.

Keep in mind Switch 2, even docked, is maybe close to 50% as powerful (even if we disregard pure specs and factor in coding-to-the-metal, single-model console optimisation), aka cut those fps numbers in half and that's probably what Switch 2 docked is capable of.
What's interesting is how variable the fps is even through such a small corridor sequence, and how big an impact DLSS has even going from native 1080p to DLSS Quality (720p native upscaled to 1080p), which tells us one thing: dynamic resolution is a must in most of those new demanding AAA games (at 1080p low with DLSS Quality we can see 65 to 91fps, so over 40% variance, even though mostly it hovers in the 75 to 85 range).
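
For the curious, the arithmetic behind that in rough Python (the 0.5 scaling factor is my own guesstimate, not a measurement):

# Observed fps range at 1080p low + DLSS Quality in that benchmark run.
lo_fps, hi_fps = 65, 91
print((hi_fps - lo_fps) / lo_fps)            # ~0.40, i.e. the 40%+ variance
print(lo_fps * 0.5, hi_fps * 0.5)            # ~32-45 fps if Switch 2 docked is ~half as fast
# DLSS Quality renders at 2/3 of the output resolution per axis.
print(int(1920 * 2 / 3), int(1080 * 2 / 3))  # 1280 x 720 internal for a 1080p target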
 
Last edited:
Yeah, they are doing this just for the drama and visuals at this point. Switch 2 is already showing games that don't even work on last gen consoles lol

I've just seen a comparison between Split Fiction on PS5 Pro and Switch 2 and it compares unbelievably well.

You can compare bare numbers, but it's completely useless since technology and software have evolved a lot since the PS4/X1 gen. Also, we are pretty much talking about launch games; imagine what devs did with Switch 1 hardware in later years compared to launch.
 
Their constant comparison of this thing to a PS4/Xbone is getting irritating now. The area of the game they keep showcasing can't even run on last gen consoles.
It's because they thought it would be a PS4 at most, and they're trying to double down on that.
Everybody can see this thing outperforms a PS4 and that it has newer tech and tools compared to the PS4.
It doesn't really matter; we saw them do the same thing with Xbox and PS5. They have trouble admitting they are wrong. Luckily we have eyes and can judge for ourselves.
 
What kind of CPU is this thing using?
8 ARM cores clocked barely over 1GHz. In layman's terms it's definitely above the Xbox One/X/PS4/Pro CPUs, but well below the PS5/Pro/Xbox Series CPUs, which are downclocked Zen 2 parts (8 cores/16 threads, still at least 3.5GHz).

We can even tell easily from the CP2077 ports: on last gen, even now after all the patches the game has had, it only runs at 20-30fps. On current gen, even Series S, it has a performance mode that targets 60fps (of course it's not super stable), and current gen consoles can run the more demanding Phantom Liberty expansion, unlike last gen.
The Switch 2 CP2077 port will have 30fps quality and 40fps performance modes. Even if we assume the fps is gonna dip in those modes and not hold a stable 30/40, that's still visibly above last gen consoles, and definitely below current gen consoles :)
That's a vid from 10 months ago, still patch 2.12 and the base game, which is less demanding than the expansion. Just look how nasty those visuals on PS4 are, and the game still dips below 25fps with the frametime getting seizures all the time during traversal in a low-speed vehicle, timestamped:
 
Last edited:
This thread is funny!

Nintendo fans: The Switch 2 can run X1/PS4 games even better than those consoles!

Also Nintendo fans: Why are you comparing the Switch 2 to X1/PS4??!?? It's a handheld, not a home console!

Uh? Who here has ever said the last line you wrote? I haven't seen a single case where someone said the handheld shouldn't be compared with X1/PS4. Maybe you're confusing this with the modern consoles.
 
This thread is funny!

Nintendo fans: The Switch 2 can run X1/PS4 games even better than those consoles!

Also Nintendo fans: Why are you comparing the Switch 2 to X1/PS4??!?? It's a handheld, not a home console!
Because it's better than both.
 
AMD GCN's cache and memory were so bad that almost the entirety of the RDNA project was about fixing them. It has an anemic front end; the geometry engines and rasterizers can't spit out vertices and pixels fast enough to saturate the cores. Shit occupancy: the CUs just can't stay occupied, full of stalls. It's like having a giant pool and filling it with a garden hose. That's why PS4 went overkill on bandwidth: while the hose diameter and valve didn't get bigger, there's so much pressure that any time the GPU isn't stalled, it's sure to get data ASAP.

GCN's cache and memory were fine for the time. The advantage that Nvidia had with Kepler was that it was the first GPU to introduce Delta Color Compression.
AMD did introduce Delta Color Compression with the Radeon 285. But this was at a time when Nvidia was nearing the release of Maxwell.
The big change came with Maxwell, which introduced tile-based rendering, something that AMD only introduced with RDNA.

Occupancy was only an issue with GCN on PC, because of two factors: DirectX 11, and the standard of the time being work waves of 32.
But on consoles, not only were the SDKs made for work waves of 64, they also made thorough use of async compute. So occupancy was very good on the PS4 and Xbone.
It took until Turing for Nvidia to make an async solution that was able to improve performance.

Geometry was an issue for AMD at the time, especially with tessellation enabled.
Consoles went around that by using low levels of tessellation, or just not using it.

But GCN had the advantage of having hardware schedulers, which helped reduce load on the CPU side.
Add the low level APIs on consoles, and it removed a lot of workload from the CPU.

But then there is the elephant in the room. There is almost a decade between the PS4 and GCN launch vs Ampere and the A78.
It would be really bad if a CPU and GPU made several years later were not able to surpass them.
 
Last edited:
That's like comparing a Dragster with a McLaren P1 on the Nurburgring.

There's such a paradigm shift in GPU occupancy, cache and memory handling between the 2012 AMD GCN architecture and Ampere that I'm not even sure where I would begin. Not to mention that Jaguar cores were bandwidth hungry compared to ARM processors, which are designed with mobile memory in mind to begin with.

AMD GCN's cache and memory were so bad that almost the entirety of the RDNA project was about fixing them. It has an anemic front end; the geometry engines and rasterizers can't spit out vertices and pixels fast enough to saturate the cores. Shit occupancy: the CUs just can't stay occupied, full of stalls. It's like having a giant pool and filling it with a garden hose. That's why PS4 went overkill on bandwidth: while the hose diameter and valve didn't get bigger, there's so much pressure that any time the GPU isn't stalled, it's sure to get data ASAP.

GCN could issue an instruction every 4 cycles (a 16-wide SIMD completes a wave64 over 4 cycles), while Kepler issued 1 instruction every cycle.
GCN had geometry pipeline stalls with any context switch instructions (which Vega tried to fix).

Even the infamous Vega, with its ridiculous bandwidth and memory bus width, had 4 geometry engines for 4096 cores. Tahiti, which PS4 is based on, is 2 per 2048, equivalent.
To give an idea, Kepler is basically the foundation for how the basic SM building blocks are partitioned and divided, carried forward all the way to modern days, and back then it had one PolyMorph engine (the geometry engine equivalent) per SMX of 192 CUDA cores, then 1 per 128 CUDA cores in Pascal, etc. Nowhere near GCN's bonkers idea of trying to feed 1024 cores with 1.

GCN was a compute monster; it handled large work sizes with long durations well (big pool), but very few game workloads fall into this category. Simple geometry was not saturating the GP (leaving it idle), and it had simultaneous bit commands that created huge buffers, basically kneecapping parallelism. The larger GPU on PS4 also meant that the SE:CU ratio (shader engines vs compute units) would fill slower, preferring longer running waves, which is again the antithesis of most gaming workloads.

RDNA's whole point was to revamp the consequences of years of trying to make GCN work.

A shitload happened between Kepler → Maxwell → Pascal → Volta → Turing → Ampere

Ampere especially was a paradigm shift in Nvidia architecture: concurrent raster/RT/ML, async compute to keep the GPU near full occupancy, asynchronous memory copy to reduce global memory traffic and hide data copy latency, etc. And that's without even going into each generation's improvements.

For Switch 2 bandwidth:

T239 on Switch 2 respects the entire Ampere lineup's usual ~25GB/s per TFLOP. Which leaves ~25GB/s remaining for the CPU, which is more than plenty for an ARM A78.

With the estimated TFLOPS from the T239 leaks:

Handheld: 1.7 TFLOPS * 25 + ~25GB/s for CPU = 67.5 GB/s → DF estimated 68.26 GB/s
Docked: 3.1 TFLOPS * 25 + ~25GB/s for CPU = 102.5 GB/s → DF estimated 102.4 GB/s

More examples of Ampere at ~25GB/s per TFLOP:

3060 @ 12.74 TFLOPS for 360 GB/s → 28.25 GB/s per TFLOP
3070 @ 20.31 TFLOPS for 448 GB/s → 22.1 GB/s per TFLOP
3080 @ 29.77 TFLOPS for 760 GB/s → 25.5 GB/s per TFLOP
3090 @ 35.58 TFLOPS for 936 GB/s → 26.3 GB/s per TFLOP

It's being fed with bandwidth exactly according to modern Nvidia architectures' needs.
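
If anyone wants to re-run the numbers, rough Python (the flat ~25GB/s CPU allowance is my own assumption; the rest are the public specs and leaked figures quoted above):

# GB/s per TFLOP across desktop Ampere, plus the Switch 2 estimate from the T239 leaks.
cards = {"3060": (12.74, 360), "3070": (20.31, 448), "3080": (29.77, 760), "3090": (35.58, 936)}
for name, (tflops, bw_gbps) in cards.items():
    print(name, round(bw_gbps / tflops, 1), "GB/s per TFLOP")

CPU_ALLOWANCE_GBPS = 25  # assumed flat budget for the A78 cluster
for mode, tflops in (("handheld", 1.7), ("docked", 3.1)):
    print(mode, tflops * 25 + CPU_ALLOWANCE_GBPS, "GB/s")  # 67.5 and 102.5 vs DF's 68.26 / 102.4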
You are missing a crucial point in your assessment. Switch 2 is a unified memory/SoC design, meaning it's prone to contention issues, which will inflate the whole bandwidth consumption disproportionately with CPU (bandwidth) usage. Thus, direct comparison to discrete PC parts is misleading. By the way, the "bandwidth hungry Jaguar CPU" of the PS4 was capped at 18 GB/s of bandwidth, look it up. I don't think your argument is quite valid; Switch 2 is fairly bandwidth constrained in my opinion (not to the degree of something like PS4 Pro though).
 
Last edited:
Glad to hear the system is PS4-level in portable mode, but docked mode is actually what I'm here for. Many XSS games are being surpassed, like SFVI, and games like Cyberpunk are way better on Switch 2 than Steam Deck, when docked.

It is important to remember that when people shit on Switch 2 graphics they are usually talking about portable mode and intentionally leaving out that it is portable mode they are discussing, because they hate the Switch 2, don't want you to buy it, and want to trick you into believing the handheld mode discussion is actually the docked discussion. Not everyone, mind you, but many, MANY people hate Nintendo and will be dishonest. Remember none of them have ever touched Switch 2 and they have hated Nintendo for years. Also remember they are currently playing and have popularized games like Assassin's Creed Shadows and paid full price for it. They think AC Shadows is better than Zelda. Ask them. Ask them and they will say AC Shadows is better than Zelda. That's how to know who to listen to. That is the litmus test. Remember docked mode is how we will be playing, and it is much more powerful in docked mode than anyone expected. We have several examples of it matching or surpassing XSS and Steam Deck in docked mode.
 
Last edited:
In handheld mode they are targeting a 1080p screen (overkill resolution for a handheld in my book, but 🤷‍♂️), so 1.7 TFLOPS on a newer NVIDIA architecture vs 1.84 TFLOPS on base PS4 spells good news, but you are then limited by bandwidth (68 GB/s on Switch 2 vs 176 GB/s, both having to share with the CPU too) as well as clockspeed (561 MHz vs 800 MHz).
It's 1.7/3 TF with VOPD (double-rate FP32) included on Switch 2/Ampere though, isn't it? Wouldn't this skew the comparison in favor of Switch 2? PS4 doesn't have such a feature. In my own assessment Switch 2 is closer to PS4 when docked rather than PS4 Pro, and closer to One S (a bit above it, still) as a handheld.
 
Last edited:
Glad to hear the system is PS4-level in portable mode, but docked mode is actually what I'm here for. Many XSS games are being surpassed, like SFVI, and games like Cyberpunk are way better on Switch 2 than Steam Deck, when docked.

It is important to remember that when people shit on Switch 2 graphics they are usually talking about portable mode and intentionally leaving out that it is portable mode they are discussing, because they hate the Switch 2, don't want you to buy it, and want to trick you into believing the handheld mode discussion is actually the docked discussion. Not everyone, mind you, but many, MANY people hate Nintendo and will be dishonest. Remember none of them have ever touched Switch 2 and they have hated Nintendo for years. Also remember they are currently playing and have popularized games like Assassin's Creed Shadows and paid full price for it. They think AC Shadows is better than Zelda. Ask them. Ask them and they will say AC Shadows is better than Zelda. That's how to know who to listen to. That is the litmus test. Remember docked mode is how we will be playing, and it is much more powerful in docked mode than anyone expected. We have several examples of it matching or surpassing XSS and Steam Deck in docked mode.

I'm still curious if it actually pulls the rumored 40W while docked; that would be quite the unheard-of jump for handhelds.

And yeah, plenty of people use the Deck docked, and not considering the docked performance of the Switch 2 is a bit biased.
 
Maybe I'm misunderstanding you but Switch 2 has 12GB of RAM, which is more than PS4 Pro not less.
Yes, it's 3 GB more than the PS4 Pro and 4 GB less than the Steam Deck. And while we're at the Steam Deck, its Zen 2 CPU should also be an interesting point of comparison to Switch 2's.
 
Last edited:
You are missing a crucial point in your assessment.



Switch 2 is a unified memory/SoC design, meaning it's prone to contention issues, which will inflate the whole bandwidth consumption disproportionately with CPU (bandwidth) usage.

ARM is a lot more isolated than anything you can pull out of your hat when it comes to cut-down desktop CPUs slapped into consoles. ARM's A78 IPC kills Jaguar, and it has a lonnnnng ass scheduler with a strong frontend that would make AMD's 2012 era of Bulldozer/Bobcat/Jaguar blush: some of the worst IPC in CPU history for a period of time. RISC-based to begin with, with smaller/fewer instructions. Lower IPC means it's more hungry for bandwidth, like I said in my post, compared to ARM.

Thus, direct comparison to discrete PC parts is misleading.

Sorry, was the PS4 discrete?

By the way, the "bandwidth hungry Jaguar CPU" of the PS4 was capped at 18 GB/s of bandwidth, look it up. I don't think your argument is quite valid; Switch 2 is fairly bandwidth constrained in my opinion (not to the degree of something like PS4 Pro though).

And Jaguar's very simple front end, combined with terrible IPC, kept stalling. What do you think the entire Ryzen project tried to fix?
FFS, Jaguar does not even have clustered multithreading to share execution resources between cores. There's no possible defense for this.

Your "opinion" is not in the realm of any ARM documentation, AGX Orin or Ampere bandwidth needs.

The ARM A78 is bandwidth capped at 60GB/s peak @ 3GHz in the whitepaper. The L3 cache cannot even take more. Do I have to math it out for you what it needs at 1GHz like T239 uses? These modern ARM generations have dynamic prefetching modes that switch to a conservative mode under saturated bandwidth contention and mitigate performance degradation, but even then T239 is exceeding what the L3 can even be fed at theoretical peak.

In my above calculations, even with 25GB/s per TFLOP for feeding the Ampere SMs, like the entire Ampere family uses, including AGX Orin with its ARM processors, you have 25GB/s of leeway for the CPU. Again, you cannot even logically feed that much to an ARM A78C cluster.

Again, Amazon's superchip, the 96-core ARM Graviton 4 @ 2.8GHz with 536GB/s of bandwidth, comes out to ~16GB/s for an equivalent 8 cores at 1GHz.
Somehow the richest company on earth choked their superchip. You got it...
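
Spelled out (same rough linear-scaling assumption as before):

# Scale Graviton's 536 GB/s down to an 8-core, 1 GHz equivalent.
total_gbps, cores, clock_ghz = 536, 96, 2.8
print(round(total_gbps * (8 / cores) * (1.0 / clock_ghz), 1))  # ~16 GB/s for 8 cores at 1 GHz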
 
The PS4, released in 2013, is still a pretty good console with some beautiful games. So having that in a portable is nice. Having 12GB of RAM instead of 8GB is going to help with ports.
 
The PS4, released in 2013, is still a pretty good console with some beautiful games. So having that in a portable is nice. Having 12GB of RAM instead of 8GB is going to help with ports.
We're mostly getting around-PS4 graphics with just better IQ and frame rates these days... well, not always, but that's about expected for any game due to diminishing returns.

Switch 2 being able to run similar-looking games but with more stable performance, better resolution and better graphical settings is expected, even if it's not reaching PS5 levels of raw power.

I'm expecting current-gen-only games to run on Switch 2 without many issues, and not at 600p with DLSS and 25 fps as some want to make believe.
 
GCN's cache and memory were fine for the time. The advantage that Nvidia had with Kepler was that it was the first GPU to introduce Delta Color Compression.
AMD did introduce Delta Color Compression with the Radeon 285. But this was at a time when Nvidia was nearing the release of Maxwell.
The big change came with Maxwell, which introduced tile-based rendering, something that AMD only introduced with RDNA.

"fine for the time" is yea.. of course and that is why its bandwidth was configured for such a chipset. My point is they're completely different architecture generations. Bringing bandwidth bigger numbers to feed an entirely different architecture as a win doesn't make sense.

Kepler's static scheduling carried into every Nvidia generation afterward, and the way the SM building blocks were partitioned and divided was also brought into the following generations, with iterative improvements of course.

Occupancy was only an issue with GCN on PC, because of two factors: DirectX 11, and the standard of the time being work waves of 32.
But on consoles, not only were the SDKs made for work waves of 64, they also made thorough use of async compute. So occupancy was very good on the PS4 and Xbone.
It took until Turing for Nvidia to make an async solution that was able to improve performance.

Geometry was an issue for AMD at the time, especially with tessellation enabled.
Consoles went around that by using low levels of tessellation, or just not using it.

AMD strangely focused on compute workloads in an era that focused a lot more on graphics workloads. While they raised bandwidth from Terascale to GCN, Nvidia lowered it from Fermi to Kepler. Complete opposite directions. The problem is not just work waves of 32, but that the architecture is made for large work sizes with long durations. Small work sizes, like vertex shaders handling geometry, suffer, and that's kind of important in a game. Kepler went with more fixed function hardware rather than compute.

Tahiti: 1 primitive/clock into the rasterizer → 16 pixels/clock = one 64-wide wave per 4 clocks → feeding 16 CUs → 1 wave per SIMD (4 per CU): 256 clocks
Max occupancy, 10 waves per SIMD: 2560 clocks

Kepler: 1 primitive/clock into the rasterizer → 8 pixels/clock = one 32-wide wave per 4 clocks → feeding 2 SMXs → 1 wave per SMSP (4 per SMX): 32 clocks
Max occupancy, 16 waves per SMSP: 512 clocks

Waves of 32 would halve GCN's numbers, I agree, but even with wave64 as above they are so different architecture-wise. 5 times the clocks for max occupancy. Insane. Graphics rendering of that era is full of small draw calls.
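
Those clock counts, spelled out in rough Python (a deliberately simplified model: one rasterizer emitting one pixel wave every 4 clocks, ignoring everything else in the pipeline):

# Clocks for a single rasterizer to hand every SIMD/SMSP a given number of waves,
# at one wave emitted every 4 clocks.
def clocks_to_fill(units, slots_per_unit, waves_per_slot, clocks_per_wave=4):
    return units * slots_per_unit * waves_per_slot * clocks_per_wave

print(clocks_to_fill(16, 4, 1))   # Tahiti-class GCN, 1 wave per SIMD: 256 clocks
print(clocks_to_fill(16, 4, 10))  # max occupancy (10 waves per SIMD): 2560 clocks
print(clocks_to_fill(2, 4, 1))    # Kepler, 1 wave per SMSP: 32 clocks
print(clocks_to_fill(2, 4, 16))   # max occupancy (16 waves per SMSP): 512 clocks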

It was a weird period on AMD's side; I cannot explain how both companies took completely tangent directions. AMD tried to improve the situation ASAP with Hawaii's 4 rasterizers per 16 CUs, but strangely went derp with the SE:CU ratio with Vega again, and then the opposite with RDNA. Raja Koduri's Vega magnum opus of an SE:CU ratio of 1:16 :messenger_tears_of_joy: :lollipop_anxious_sweat:

Graphics rendering is not really made for these compute monsters. It's limited in parallelism and filled with lots of small draw calls. Cyberpunk 2077 is a modern example of this: even though it does a LOT of compute shaders, it's the definition of small draw calls, constant geometry updates and little parallelism. If you have an article that details a game developer preferring longer running waves, I'm all ears.

High resolution did favor GCN though, thanks to its high parallelism (but again, for its time). They were too early with general-purpose CUs.

But GCN had the advantage of having hardware schedulers, which helped reduce load on the CPU side.
Add the low level APIs on consoles, and it removed a lot of workload from the CPU.

Yup. Imagine Ampere without the restraints of Microsoft OS bloat & DirectX, and how much the architecture changed in memory handling and occupancy.

But then there is the elephant in the room. There is almost a decade between the PS4 and GCN launch vs Ampere and the A78.
It would be really bad if a CPU and GPU made several years later were not able to surpass them.

And I would never debate that. It's exactly because people keep replying to me with stats such as higher bandwidth, or clocks, or whatever you want to name, trying to find an advantage for the PS4. Like, it's not happening, not against Ampere+ARM.

PS4 was not badly designed; it's a product of its time, including the need for high bandwidth, as that was the direction AMD took after Terascale.

AMD was always kind of the oddball in GPUs during that period. There's a lot from GCN that you see returning in modern times: we're going back to a focus on compute over the rasterizer, but through a complete reinvention of the pipeline. That doesn't mean GCN would perform well with modern tasks, because it was not made for, say, neural shaders or compute workloads that aren't rasterization, like ray tracing or mesh shaders. It's like AMD built the frame to receive a future supercar engine, but that engine would come years later and not really be compatible with the frame you built anymore. So while the industry shifted its focus to fast fixed function rasterization, they were going for compute; then when Nvidia went for compute (Turing), they went for rasterization (RDNA, I know, it keeps some stuff from GCN, but it's still a different philosophy), and now they are back to roughly where Nvidia is in its rendering focus.
 
Because it is far more capable than two consoles that run on shit architecture that was outdated back in 2013.

The only relevant comparisons are the Steam Deck, ROG Ally, Series S and the Switch 1 for remastered games.

Lol is it though?? Do you think the Switch 2 is far more capable than a PS4 running Horizon or God of War Ragnarok? lol
 
ARM is a lot more isolated than anything you can pull out of your hat when it comes to cut-down desktop CPUs slapped into consoles. ARM's A78 IPC kills Jaguar, and it has a lonnnnng ass scheduler with a strong frontend that would make AMD's 2012 era of Bulldozer/Bobcat/Jaguar blush: some of the worst IPC in CPU history for a period of time. RISC-based to begin with, with smaller/fewer instructions. Lower IPC means it's more hungry for bandwidth, like I said in my post, compared to ARM.

A good scheduler can mask access latency, but it doesn't make up for contention with data from the CPU. The Switch SoC will have one single memory controller and one pool of memory, and that means the controller will have to manage requests from both the CPU and GPU.
ARM does use smaller instructions, but not fewer. In fact it's more, because they are simpler than x86. x86 instruction fetches are longer and more complex, but fewer.
There is some confusion there, as the scheduler is a part of the frontend of a CPU.
And what do you mean by length?

And Jaguar's very simple front end, combined with terrible IPC, kept stalling. What do you think the entire Ryzen project tried to fix?
FFS, Jaguar does not even have clustered multithreading to share execution resources between cores. There's no possible defense for this.

Care to explain what that "clustered multithreading to share execution resources" is?
 
"fine for the time" is yea.. of course and that is why its bandwidth was configured for such a chipset. My point is they're completely different architecture generations. Bringing bandwidth bigger numbers to feed an entirely different architecture as a win doesn't make sense.

Kepler's static scheduling carried into every Nvidia generation afterward, and the way the SM building blocks were partitioned and divided was also brought into the following generations, with iterative improvements of course.

Kepler's static scheduling caused higher overhead on the CPU. It was a tradeoff, compared to the highly complex scheduler of Fermi.
Kepler saved power and die space, but at the cost of leaving more work for the CPU.
And it's something that Nvidia changed over time, giving the scheduler back more advanced features.
The scheduler on Ampere is way more advanced than what was in Kepler.

AMD strangely focused on compute workloads in an era that focused a lot more on graphics workloads. While they raised bandwidth from Terascale to GCN, Nvidia lowered it from Fermi to Kepler. Complete opposite directions. The problem is not just work waves of 32, but that the architecture is made for large work sizes with long durations. Small work sizes, like vertex shaders handling geometry, suffer, and that's kind of important in a game. Kepler went with more fixed function hardware rather than compute.

Tahiti: 1 primitive/clock into the rasterizer → 16 pixels/clock = one 64-wide wave per 4 clocks → feeding 16 CUs → 1 wave per SIMD (4 per CU): 256 clocks
Max occupancy, 10 waves per SIMD: 2560 clocks

Kepler: 1 primitive/clock into the rasterizer → 8 pixels/clock = one 32-wide wave per 4 clocks → feeding 2 SMXs → 1 wave per SMSP (4 per SMX): 32 clocks
Max occupancy, 16 waves per SMSP: 512 clocks

Waves of 32 would halve GCN's numbers, I agree, but even with wave64 as above they are so different architecture-wise. 5 times the clocks for max occupancy. Insane. Graphics rendering of that era is full of small draw calls.

It was a weird period on AMD's side; I cannot explain how both companies took completely tangent directions. AMD tried to improve the situation ASAP with Hawaii's 4 rasterizers per 16 CUs, but strangely went derp with the SE:CU ratio with Vega again, and then the opposite with RDNA. Raja Koduri's Vega magnum opus of an SE:CU ratio of 1:16 :messenger_tears_of_joy: :lollipop_anxious_sweat:

Graphics rendering is not really made for these compute monsters. It's limited in parallelism and filled with lots of small draw calls. Cyberpunk 2077 is a modern example of this: even though it does a LOT of compute shaders, it's the definition of small draw calls, constant geometry updates and little parallelism. If you have an article that details a game developer preferring longer running waves, I'm all ears.

High resolution did favor GCN though, thanks to its high parallelism (but again, for its time). They were too early with general-purpose CUs.

I don't think you understand what work waves, or warps, are. This is how shaders are grouped to be sent to the compute units. Not to the ROPs, nor the geometry engines.
The advantage of having a work wave of 64 is that it saves a bit of die space. But it also lowers shader occupancy. On the PS4, with async compute and low level APIs, that is not much of a problem.

Yup. Imagine Ampere without the restraints of Microsoft OS bloat & DirectX, and how much the architecture changed in memory handling and occupancy.

Every GPU and CPU would benefit from MS not having so much bloat in Windows.
But what I was talking about was not that, but the fact that DX11 didn't have async compute. So for a long time, this feature went unused in most PC games.

And I would never debate that. It's exactly because people keep replying to me with stats such as higher bandwidth, or clocks, or whatever you want to name, trying to find an advantage for the PS4. Like, it's not happening, not against Ampere+ARM.

PS4 was not badly designed; it's a product of its time, including the need for high bandwidth, as that was the direction AMD took after Terascale.

AMD was always kind of the oddball in GPUs during that period. There's a lot from GCN that you see returning in modern times: we're going back to a focus on compute over the rasterizer, but through a complete reinvention of the pipeline. That doesn't mean GCN would perform well with modern tasks, because it was not made for, say, neural shaders or compute workloads that aren't rasterization, like ray tracing or mesh shaders. It's like AMD built the frame to receive a future supercar engine, but that engine would come years later and not really be compatible with the frame you built anymore. So while the industry shifted its focus to fast fixed function rasterization, they were going for compute; then when Nvidia went for compute (Turing), they went for rasterization (RDNA, I know, it keeps some stuff from GCN, but it's still a different philosophy), and now they are back to roughly where Nvidia is in its rendering focus.

GCN was a great architecture that introduced a lot of things we take for granted now, such as low level APIs on PC and async compute.
But with DX11 being the default of that era, the best features of GCN went untapped.
 
Kepler's static scheduling caused higher overhead on the CPU. It was a tradeoff, compared to the highly complex scheduler of Fermi.
Kepler saved power and die space, but at the cost of leaving more work for the CPU.
And it's something that Nvidia changed over time, giving the scheduler back more advanced features.
The scheduler on Ampere is way more advanced than what was in Kepler.

Yes, it shifted over time. Maxwell continued it and I think Pascal went dynamic.

They had different priorities at the time. I think Nvidia was more in line with the immediate graphics rendering trends, while AMD had good ideas but at the wrong time.

I don't think you understand what work waves, or warps, are. This is how shaders are grouped to be sent to the compute units. Not to the ROPs, nor the geometry engines.
The advantage of having a work wave of 64 is that it saves a bit of die space. But it also lowers shader occupancy. On the PS4, with async compute and low level APIs, that is not much of a problem.

? It's the exact shader array.

(attached shader array diagram)


Are we talking about different things?

Every GPU and CPU would benefit from MS not having so much bloat in Windows.
But what I was talking about was not that, but the fact that DX11 didn't have async compute. So for a long time, this feature went unused in most PC games.

I was on an R9 280 for the longest time waiting on that famous DX12 async unlock. Outside of Ashes of the Singularity, which was not even a good game, it was pretty barren for a long time. The dev himself actually dropped a cold shower on a lot of the speculation going on back then.

(attached screenshot of the developer's comments)

Saying Async is a modest perf increase

Then the Wolfenstein 2 async patch with a modest 5%, etc. I mean, AMD definitely preferred DX12/Mantle over DX11, but I can't recall feeling like I made the absolute best choice with that card while I waited an eternity for these things to happen. I was waiting on some kind of "OMG! 50+% improvement!" and it never came.

By the time Pascal came, it all mattered very little.

GCN was a great architecture that introduced a lot of things we take for granted now, such as low level APIs on PC and async compute.
But with DX11 being the default of that era, the best features of GCN went untapped.

Oh yeah, for sure. AMD could actually have grabbed the gaming world by the balls, as they had the consoles in their pocket and they had like 1 engineer in 100 or so studios at their peak? They could have leveraged Mantle to make console ports heavily favor AMD, at least that's what I thought when I bought my R9, and... they fumbled.

GCN vs, say, a Pascal APU, I would be sweating, but vs Ampere I really don't see an argument that favors AMD here. Hence this whole discussion over the most obvious fucking statement of all time: it's a more modern architecture...
 
The Switch 2 has a hardware decompression block, which will help a lot in running current gen engines on it, compared to what is possible on PS4. The CPU does not have to work as hard for the same assets to be loaded.
 
Care to explain what that "clustered multithreading to share execution resources" is?

AMD's "hyper-threading

A good scheduler can mask access latency, but it doesn't make up for contention with data from the CPU. The Switch SoC will have one single memory controller and one pool of memory, and that means the controller will have to manage requests from both the CPU and GPU.
ARM does use smaller instructions, but not fewer. In fact it's more, because they are simpler than x86. x86 instruction fetches are longer and more complex, but fewer.
There is some confusion there, as the scheduler is a part of the frontend of a CPU.
And what do you mean by length?

Long scheduler? As in, ARM's machinery for instruction scheduling and execution is massive: a 160-entry ROB (re-order buffer). To give a comparison with a desktop CPU, you would have Skylake at 58 entries. ARM is completely overbuilt to avoid stalls.

Regardless of these micro details, we have the IPC of these processors anyway. The ARM A78, for its time (of course modern ARM is always better), was just a tad under Zen 3 IPC but a lot better than Zen 2. What saves modern consoles is their high clocks. Jaguar IPC is not even in the chat room. Like I said, it won't compete with the Series S in cases where the CPU is a problem, just by pure raw clocks, but it will never encounter a scenario where Jaguar, even the higher clocked one in the Pro, would outperform it.
 
It's still more comparable to the XB1 than the XBSS.

I don't see it based on the raw numbers. I think it's probably "halfway" between the two, "last gen plus", but with a smaller gulf than the "last gen plus" the Switch 1 had with respect to PS3/360.

Are you basing this off what we know about the raw specs? Because the CPU should be somewhere halfway in between the two. 8 cores of A78C is way better than 8 cores of Jaguar.

Switch 2 has more RAM than both the Xbox One and Series S. Memory bandwidth is more in line with the Xbox One, but that's about it.

GPU raw power, I dunno. A 3ish TFLOP Ampere-based GPU vs a 1.3ish TFLOP GCN 1.0 GPU? Series S is 4ish TFLOPs of RDNA2. Seems closer to Series S to me. Feature-wise, I favour Ampere too, even if you can't compare the CUDA cores to SPs 1:1.

I just don't see the "closer to Xbox One" claim based off raw specs. It's somewhere in between the two in terms of raw power, with more modern features, likely better ray tracing than RDNA2, and DLSS to help out in some cases. A way better feature set than GCN 1.0, I don't think you can argue that.
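
On paper TFLOPS alone (and as said, you can't compare CUDA cores to SPs 1:1, so take it as a rough yardstick):

# Where ~3 TFLOPS sits between XB1 (~1.3 TF GCN) and Series S (~4 TF RDNA2), on paper only.
xb1_tf, xss_tf, switch2_tf = 1.3, 4.0, 3.0
print(round((switch2_tf - xb1_tf) / (xss_tf - xb1_tf), 2))  # ~0.63, a bit past halfway, nearer Series S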

It's also got a much smaller power draw, which is what makes this debate about "Xbone/PS4 tier" super impressive imo.

It's obviously not as good as current gen, but it's also not just Xbox One/PS4 level either, which is where the incorrect comparisons come from. "Last gen plus", close to halfway to Series S (which itself is a fair bit weaker than the Series X and PS5).
 
Here's a question: what games will run on that 120Hz screen at 120Hz? It won't run most modern games at 60.
 
Let's all face it... while we all want DF to be totally wrong and "biased" about their predictions, they are most likely going to be more right than wrong about the PS4 comparisons. Not only do they have more tech knowledge than 98% of us, but more importantly they are industry insiders who have had many months of discussions with other industry people about the specs/rumored specs.

Even if they're "wrong" and it turns out to be better than PS4 handheld / PS4 Pro docked, they will be "closer to right than wrong".

DF's biggest negative quality is not being biased against one company or another like the insecure fanboys like to say; instead their M.O. is generally being soft on all the big companies so as to maintain good relationships and not rock the boat. I've been following them for who knows how long and I'm usually annoyed at how they always try to put a positive spin on just about everything. Except, ironically enough, Alex, who will actually get passionate about PC gaming and will call out devs for that. He's the only one there who has any backbone though. Jon, Oliver, and Tom are all beta cucks in dealing with console coverage imo lol. So Oliver not being afraid to call the Switch 2 a PS4 is actually commendable in a way, knowing it'll upset an army of angry Nintendo fanboys.
 