
AMD Ryzen Thread: Affordable Core Act

tuxfool

Banned
Indeed. With the 4-core CCX AMD found something of a one-size-fits-all. A single CCX will fare well in laptops thanks to 14LPP, and for servers the use of HBM will allow an unprecedented number of cores. Only the corner case of two CCXs in gaming usage on an unprepared OS makes it look worse than it is.

They're not using HBM in Naples. They're using a twin-die package where the dies may or may not be connected by a 2.5D interposer.

By the time they get to use HBM for CPUs/APUs they'll have corrected the CCX issues.
 

ethomaz

Banned
Looks like Intel is doing the same with their Xeon family when looking at their die shots... would be fun to see if they have the same kind of latency when jumping to another stack of cores.
Which Xeon?

6-core: http://images.anandtech.com/reviews/cpu/intel/SNBE/Core_I7_LGA_2011_Die.jpg (same die as the 8-core but with 2 cores disabled)
8-core: http://images.anandtech.com/doci/8426/HSW-E Die Mapping Hi-Res.jpg
10-core: http://hothardware.com/ContentImages/Article/2470/content/small_broadwell-e-die-map.jpg

These are Sandy Bridge-E (6-core), Haswell-E (8-core) and Broadwell-E (10-core) dies.

I believe they only use this kind of separated block layout on server CPUs with more than 12 cores (Haswell-EP).
 

Datschge

Member
They're not using HBM in Naples. They're using a twin-die package where the dies may or may not be connected by a 2.5D interposer.

By the time they get to use HBM for CPUs/APUs they'll have corrected the CCX issues.
Right, thanks. (Speculated about them eventually doing HBM, HBM2 etc. server parts and apparently just remembered my own crap.)

As for correcting the CCX "issues", I don't know what form that will take. The actual hardware issue right now is the inability to target the RAM frequency directly, though speculation is that BIOS manufacturers just haven't implemented the respective AGESA code yet (if that's even finished; AMD hasn't released their usual BKDG yet either). That may (or may not, if there turns out to be even finer control) affect the speed of the Infinity Fabric as well, which in turn should help reduce the impact inter-CCX traffic has on performance. We'll see by how much.

This shows a bimodal distribution, which indicates to us (as we're expecting a log-normal, or approximately normal distribution) that there are actually two overlapping distributions here, separated by an event either occurring or not occurring in a given frame, where that event delays the frame by a little over 1ms.
Very well thought out, Thraktor. What do you think about the bimodal distribution in the Crysis 3 frame time histogram using a 7700K? Looking closely one can see a second peak at [148, 152]:
kZEtNSi.png

Any guesses what could be the cause for that?
 

dr_rus

Member
I just don't get why they would design the Ryzen CPU like this (I'm not a CPU designer).

It's significantly cheaper. You design one CCX, which is essentially a CPU module with cache attached, then you use some universal interconnect to put this "core" and another one in a chip which also has the "uncore" stuff. You're done, you have an 8-core CPU (kinda) even though you've designed just one 4-core module.

For a proper 8-core CPU you'd have to design a completely different module than the one designed for the 4-core, and this is either a long or an expensive process, meaning you'd have to either launch your 8-core significantly behind the 4-core (or vice versa) or spend almost twice the money to build them both.

There's a reason why Intel introduces HEDT/server CPUs of the same architecture with more cores significantly later than the quad cores for notebooks and regular desktops: even they can't produce both designs at once. Although they do have a nice framework with the ring bus now, which allows them to push out CPUs with more cores faster than previously.

Looks like Intel is doing the same with their Xeon family when looking at their die shots... would be fun to see if they have the same kind of latency when jumping to another stack of cores.

All modern Intel CPUs since Nehalem have a common L3 cache, meaning that RAM access latencies are similar no matter which core accesses what data.

They're not using HBM in Naples. They're using a twin-die package where the dies may or may not be connected by a 2.5D interposer.

By the time they get to use HBM for CPUs/APUs they'll have corrected the CCX issues.

They'll most probably use HBM in a Zen APU, which will have only one CCX, so no snooping issues since there's no split L3.

Wrong. Linux is fully aware of Ryzen's topology (mostly since late last year already) and spreads the threads accordingly for best possible performance. Which is exactly why the situation in Windows is so completely ridiculous.

Sigh. It's not about how you spread the threads, it's about THE FACT that threads of some nature will ALWAYS need to access some data in a "far" L3 of the other CCX. This is a hardware problem; it can't be fixed with the OS or anything else, only worked around to some degree. Also, where can we see the results of Linux running the same software on Ryzen better than Windows? Even AMD has said already that there are no issues in how Windows 10 schedules work on Ryzen.
 
AMD Running out of Intel Sheckels, Renews Contract to Defame Own Products



Shocking news.



I call it a h/w issue because it is a h/w issue, one which Intel "fixed" back when it switched from Penryn to Nehalem and one which AMD will most certainly fix in future versions of the Zen architecture. There's no way around it otherwise; it will always affect Zen's performance.

Actually, it should be pretty easy to test the impact of this issue by disabling one CCX completely and comparing that to a CPU in a 2+2 configuration. So far I've seen only one benchmark of this (in PCGH's Ryzen review) and it has some interesting results.

It's interesting to see AMD dismiss the Windows 10 thread scheduler as an issue; haven't we seen it perform better in Windows 7? The way the CCXs are configured is definitely a 'h/w' issue, as the cores appear to have difficulty communicating with each other, but I find it odd that we appear to have seen better performance in some cases under Windows 7 compared to 10, yet AMD have dismissed this. Perhaps it's already been rectified?
 

ethomaz

Banned
You have 386 cases of 144-148 fps and 387 cases of 148-152.

What I mean is that due to rounding, a value like 148.5 could end up in the 148-152 bin instead of 144-148.

It is not a big deal because the difference is 1 fps either way.
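To illustrate the bin-edge point: whether a value near a boundary lands in one bin or the next depends on whether it's rounded to an integer before binning. A toy sketch in C++ (the 4-fps bin width is taken from the chart; the sample value is made up):

Code:
#include <cmath>
#include <cstdio>

// Toy illustration of bin-edge sensitivity: a frame rate just below a
// boundary (147.8 here) changes bins if it's rounded to an integer first.
int main() {
    double fps = 147.8;
    int bin_direct  = static_cast<int>(fps / 4.0) * 4;            // 147.8 -> bin [144, 148)
    int bin_rounded = static_cast<int>(std::round(fps)) / 4 * 4;  // rounds to 148 -> bin [148, 152)
    std::printf("direct: [%d, %d)  rounded first: [%d, %d)\n",
                bin_direct, bin_direct + 4, bin_rounded, bin_rounded + 4);
}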

It's interesting to see AMD dismiss the Windows 10 thread scheduler as an issue; haven't we seen it perform better in Windows 7? The way the CCXs are configured is definitely a 'h/w' issue, as the cores appear to have difficulty communicating with each other, but I find it odd that we appear to have seen better performance in some cases under Windows 7 compared to 10, yet AMD have dismissed this. Perhaps it's already been rectified?
The videos I saw show Win7 not using SMT at all... disabling SMT on Win10 seems to increase performance too.

I'd really like to see all four cases tested.
 
Code:
https://www.pcper.com/news/General-Tech/AMD-Running-out-Intel-Sheckels-Renews-Contract-Defame-Own-Products
Not to get sidetracked or lose the plot, but the text in the original version of this piece (credited to Jeremy Hellstrom) is an embarrassing excuse for journalism.

Seems it has since had some degree of professionalism injected into it though, in a thoroughly revised write-up now credited to Sebastian Peak. The URL appears to still be the same.

This launch has brought out a lot of ugly in people across the tech spectrum. Still, it's one thing for random people online to argue and exchange vitriol on GAF, Twitter, etc.
It's another for tech journalists/writers to be so obviously susceptible to the whims of random AMD/Intel/Nvidia diehards online.

Instead of writing under the guise of being sardonic, it could be a better idea to step away for a few moments and have a breather.



Not looking good for the "7700k killer"
Neither AMD nor any reasonable and informed person in this thread has ever used that phrase. It goes against logic and isn't particularly useful for this discussion.

At best, people could possibly argue overall price:performance on the 4c/8t Ryzens, but it's too early for that. While AMD's slides showed the R5 1600X to be several hundred MHz higher than what was speculated pre-launch, it remains to be seen what form the quad cores will come in. Even with wafers directly from Samsung or a Global Foundries re-spin, I'd be surprised to see the Ryzen quads have stock and OC clocks high enough to be called "killers." For one, they would likely need to be produced in greater quantity than any of their other non-APU desktop parts. We'll see.





·feist·;231957127 said:
Hardware Unboxed have also posted a response to AdoredTV's "Ryzen - The Tech Press Loses The Plot" video, comparing the 2500K vs FX8370: https://www.youtube.com/watch?v=76-8-4qcpPo
I'm kind of shocked to see the 2500K at stock speeds beating the 1700X in Deus Ex: Mankind Divided, as that game scales well beyond 4 cores, and it's one of the games that my 2500K (at 4.5GHz) really struggles with.
You may be equally surprised to see an i3 7350K (2c/4t) outperforming an i7 2700K (4c/8t), or an i5 6600K (4c/4t) outperforming an i7 5960X (8c/16t).

Multiple reviews have frametime and FPS numbers which don't align with what you mentioned about the 2500K. Extrapolate where needed.

https://techreport.com/review/31366/amd-ryzen-7-1800x-ryzen-7-1700x-and-ryzen-7-1700-cpus-reviewed/9

https://www.computerbase.de/2017-03/amd-ryzen-1800x-1700x-1700-test/4/

https://www.guru3d.com/articles-pages/amd-ryzen-7-1700x-review,20.html

http://www.techspot.com/review/1348-amd-ryzen-gaming-performance/page2.html
I think it may just be the way that you phrased this, but I'm not sure what point you're trying to make.
Your surprise was at the 4c/4t Sandy Bridge outperforming an 8c/16t Summit Ridge in a game that scales across cores. The links I posted show a different conclusion.
They also show lesser core/thread Intels matching or exceeding their higher end, multi-core Intel counterparts, in a game that scales across cores. Seems straightforward.
 

kotodama

Member

RYZEN-1700X-59.jpg


Clearly that SMT Off result for the 1700X at 200 fps is much better than the SMT On result for the 1700X at 200 fps, and infinitely better than the 1800X and 7700K, which are both also at an unplayable 200 fps. The winner is the 6900K with SMT off, which jumps past 4 processors, going from a slow 200 fps to a blazing 200 fps. In all seriousness, the real-world differences in the other gaming benches would seem hard to tell with the naked eye, as per their conclusion.
 
·feist·;232017933 said:
Code:
https://www.pcper.com/news/General-Tech/AMD-Running-out-Intel-Sheckels-Renews-Contract-Defame-Own-Products
Not to get sidetracked or lose the plot, but the text in the original version of this piece (credited to Jeremy Hellstrom) is an embarrassing excuse for journalism.

Seems it has since had some degree of professionalism injected into it though, in a thoroughly revised write-up now credited to Sebastian Peak. The URL appears to still be the same.

This launch has brought out a lot of ugly in people across the tech spectrum. Still, it's one thing for random people online to argue and exchange vitriol on GAF, Twitter, etc.
It's another for tech journalists/writers to be so obviously susceptible to the whims of random AMD/Intel/Nvidia diehards online.

Instead of writing under the guise of being sardonic, it could be a better idea to step away for a few moments and have a breather.



Neither AMD nor any reasonable and informed person in this thread has ever used that phrase. It goes against logic and isn't particularly useful for this discussion.

At best, people could possibly argue overall price:performance on the 4c/8t Ryzens, but it's too early for that. While AMD's slides showed the R5 1600X to be several hundred MHz higher than what was speculated pre-launch, it remains to be seen what form the quad cores will come in. Even with wafers directly from Samsung or a Global Foundries re-spin, I'd be surprised to see the Ryzen quads have stock and OC clocks high enough to be called "killers." For one, they would likely need to be produced in greater quantity than any of their other non-APU desktop parts. We'll see.





Your surprise was at the 4c/4t Sandy Bridge outperforming an 8c/16t Summit Ridge in a game that scales across cores. The links I posted show a different conclusion.
They also show lesser core/thread Intels matching or exceeding their higher end, multi-core Intel counterparts, in a game that scales across cores. Seems straightforward.

Some of the PCPer guys have been outright called liars and idiots, both in their comment section and on Twitter. They post on other tech forums as well, and there they have been told to take the article down, asked how much Intel paid them, and told their high-level AMD sources are made up. This thing got ugly just because people had built Zen up to be something it's not.

That frankly embarrassing headline is just the culmination of frustration and a play on what someone accused them of doing: someone asked how many Intel shekels they were paid to write the article.

Note that even now that AMD has come out and said they have no issue with the Win10 scheduler, people are still saying it's the Win10 scheduler...

I think some optimisations could be done, but we need to stop jumping to conclusions, bending reality to fit those conclusions, and then attacking those who don't agree.
 

dr_rus

Member
RYZEN-1700X-59.jpg


Clearly that SMT Off result for the 1700X at 200 fps is much better than the SMT On result for the 1700X at 200 fps, and infinitely better than the 1800X and 7700K, which are both also at an unplayable 200 fps. The winner is the 6900K with SMT off, which jumps past 4 processors, going from a slow 200 fps to a blazing 200 fps. In all seriousness, the real-world differences in the other gaming benches would seem hard to tell with the naked eye, as per their conclusion.

The most interesting part there is the comparison between Ryzen's SMT and BWE's HT tbh.
 

spyshagg

Should not be allowed to breed
There's clearly a 350% (!) added delay for each transaction between cores belonging to different CCXs.

ping-amd.png




And Windows is clearly spreading workloads between the 2 CCXs when there's room to spare on CCX 1 (the real issue).

smton2workers.png


Windows clearly sees a distinction between physical and logical threads, but there is clearly a path to optimizing code for Ryzen when the behavior above is being witnessed. Even if it's 1%, we'll take it.

Despite AMD's statements that nothing needs addressing (and some notable GAF users cherry-picking which statements to dissect and which to accept at face value), the data above makes denying that the scheduler could be optimized look willfully ill-intentioned.

As with many other tech issues, time will tell. So enjoy the ride and post intentionally biased opinions at your own peril. The internet remembers.
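For the curious, numbers like the core-to-core "ping" chart above can be reproduced with a simple cache-line ping-pong between two pinned threads. A minimal sketch, assuming Windows and a C++11 compiler; the logical processor numbers are placeholders, so check your own topology before reading anything into the results:

Code:
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two threads bounce a flag in a shared cache line; the round-trip time
// approximates core-to-core latency. Pin them to two cores in the same CCX,
// then to cores in different CCXs, and compare.
std::atomic<int> flag{0};
constexpr int ROUNDTRIPS = 1000000;

void pin_current_thread(DWORD_PTR mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

int main() {
    // Placeholder choices: logical CPU 0 vs. logical CPU 8 (assumed to sit
    // on the other CCX of an SMT-on 8-core part; verify on your system).
    const DWORD_PTR cpuA = 1ull << 0;
    const DWORD_PTR cpuB = 1ull << 8;

    std::thread responder([&] {
        pin_current_thread(cpuB);
        for (int i = 0; i < ROUNDTRIPS; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {} // wait for ping
            flag.store(0, std::memory_order_release);            // pong
        }
    });

    pin_current_thread(cpuA);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ROUNDTRIPS; ++i) {
        flag.store(1, std::memory_order_release);                // ping
        while (flag.load(std::memory_order_acquire) != 0) {}     // wait for pong
    }
    auto t1 = std::chrono::steady_clock::now();
    responder.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("avg round trip: %.1f ns\n", ns / ROUNDTRIPS);
}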
 

dr_rus

Member
And Windows is clearly spreading workloads between the 2 CCXs when there's room to spare on CCX 1 (the real issue).

smton2workers.png

Windows can't know what will help performance more in this case: the lack of L3 snooping across CCXs or four times the amount of L3 per active core. Either option can result in better performance, as the hardware.fr results on the previous page show. It's not clear-cut that this behavior of the Windows scheduler is wrong.
 

spyshagg

Should not be allowed to breed
Windows can't know what will help performance more in this case: the lack of L3 snooping across CCXs or four times the amount of L3 per active core. Either option can result in better performance, as the hardware.fr results on the previous page show. It's not clear-cut that this behavior of the Windows scheduler is wrong.

It's not clear-cut either way.
 

ethomaz

Banned
RYZEN-1700X-59.jpg


Clearly that SMT Off result for the 1700X at 200 fps is much better than the SMT On result for the 1700X at 200 fps, and infinitely better than the 1800X and 7700K, which are both also at an unplayable 200 fps. The winner is the 6900K with SMT off, which jumps past 4 processors, going from a slow 200 fps to a blazing 200 fps. In all seriousness, the real-world differences in the other gaming benches would seem hard to tell with the naked eye, as per their conclusion.
It's just ridiculous that they used a game locked at 200 fps to show the difference between SMT on/off.
 

ethomaz

Banned
There's clearly a 350% (!) added delay for each transaction between cores belonging to different CCXs.

ping-amd.png




And Windows is clearly spreading workloads between the 2 CCXs when there's room to spare on CCX 1 (the real issue).

smton2workers.png


Windows clearly sees a distinction between physical and logical threads, but there is clearly a path to optimizing code for Ryzen when the behavior above is being witnessed. Even if it's 1%, we'll take it.

Despite AMD's statements that nothing needs addressing (and some notable GAF users cherry-picking which statements to dissect and which to accept at face value), the data above makes denying that the scheduler could be optimized look willfully ill-intentioned.

As with many other tech issues, time will tell. So enjoy the ride and post intentionally biased opinions at your own peril. The internet remembers.
Windows can't know if putting threads on different CCXs will boost or decrease performance. Some apps will run better across CCXs and others will run better within the same CCX.

That is why AMD said there is nothing wrong with the Win10 scheduler.
 

spyshagg

Should not be allowed to breed
Hey, that's the benchmark I've been looking for since release.

A pretty stark difference; for workloads which feature low-latency communication between more than 4 cores, this will probably always present an issue.



The same test with the 5960X:

ping-intel.png



Ryzen has lower physical core-to-core latency (within a CCX), which may explain why its SP-to-MP scaling ratio beats Intel's in some benchmarks.
 

spyshagg

Should not be allowed to breed
Windows can't know if putting threads on different CCXs will boost or decrease performance. Some apps will run better across CCXs and others will run better within the same CCX.

That is why AMD said there is nothing wrong with the Win10 scheduler.

As has been said many times, it is indeed the same principle as NUMA, except NUMA refers to a split across RAM/cores/cache, and Ryzen's is only between cores/cache. Don't move threads between sockets unless you need to.

It may never need fixing, but when you target this design paradigm I suppose performance will increase somewhat, because those 350% latency hits are no joke and have to hurt, like Durante said.
 
All this information coming out about the CCXs makes me wonder what the physical layout of the 6-core and 4-core chips will be. I don't know whether mirrored CCXs are mandatory, so the 6-core chips could be either 4+2 or 3+3. In that situation, the 4+2 setup might actually be more beneficial for games, as you could park everything on one CCX and avoid the latency issues of crossing CCXs, and you wouldn't lose much by not having access to the other 2c/4t. I would imagine the 4-core would be either a 4+0 or a 2+2 setup. It would seem like a 4+0 setup would pretty much always be better than a 2+2, particularly for games, as you would constantly have to go across CCXs on a 2+2 setup. A 4+0 vs 2+2 quad-core could be the difference between Core i5 performance in games and Pentium performance in games.
 

ethomaz

Banned
As its been said many times, it is indeed the same principle as NUMA except NUMA refers to a split in RAM/Cores/cache, and ryzen only between Cores/cache. Dont move threads between sockets unless it needs to.

I may never need fixing but when you target this design paradigm I suppose performance will increase somewhat. Because those 350% latency hits are no joke and have to hurt like Durante said.
I understand that, but to the OS Ryzen is a single socket... is there anything inside Ryzen to tell Windows to treat it like two sockets?

Anyway, I don't think the OS treating it like two sockets is the solution, because it is a single socket after all... one bus to RAM/IO and not 2 like in a two-socket system... that could even create other issues with Ryzen.

Old Intel 1x1 chips were treated as a single socket by the OS.
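Incidentally, on the question of whether there's anything inside Ryzen to tell Windows about the split: the CPU does report its cache topology to the OS, and you can see the two separate L3 slices from user mode. A rough sketch against the Win32 API (error handling omitted; the two-slice output is what one would expect on a 2-CCX part):

Code:
#include <windows.h>
#include <cstdio>
#include <vector>

// Enumerate cache descriptors; on a 2-CCX Ryzen this should list two
// separate 8MB L3 caches, each shared by half of the logical processors.
int main() {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationCache, nullptr, &len); // query size
    std::vector<char> buf(len);
    auto* base = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data());
    GetLogicalProcessorInformationEx(RelationCache, base, &len);

    for (DWORD off = 0; off < len;) {
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data() + off);
        if (info->Relationship == RelationCache && info->Cache.Level == 3) {
            std::printf("L3: %lu KB, shared by logical CPU mask 0x%llx\n",
                        static_cast<unsigned long>(info->Cache.CacheSize / 1024),
                        static_cast<unsigned long long>(info->Cache.GroupMask.Mask));
        }
        off += info->Size;
    }
}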

All this information coming out about the CCXs makes me wonder what the physical layout of the 6-core and 4-core chips will be. I don't know whether mirrored CCXs are mandatory, so the 6-core chips could be either 4+2 or 3+3. In that situation, the 4+2 setup might actually be more beneficial for games, as you could park everything on one CCX and avoid the latency issues of crossing CCXs, and you wouldn't lose much by not having access to the other 2c/4t. I would imagine the 4-core would be either a 4+0 or a 2+2 setup. It would seem like a 4+0 setup would pretty much always be better than a 2+2, particularly for games, as you would constantly have to go across CCXs on a 2+2 setup. A 4+0 vs 2+2 quad-core could be the difference between Core i5 performance in games and Pentium performance in games.
3+3 and 4+0... the 6-core will have one core disabled per CCX, and the 4-core will use only one CCX (about half the size of the 8-core).
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Sigh. It's not about how you spread the threads, it's about THE FACT that threads of some nature will ALWAYS need to access some data in a "far" L3 of the other CCX. This is a hardware problem; it can't be fixed with the OS or anything else, only worked around to some degree.
How is that different from having 4 intel cores per NUMA socket in configurations of, say, this Xeon?

One difference is that cross-cluster communication will be faster than cross-socket communication
 

Durante

Member
Computerbase ran their game benchmark set on Windows 7, Windows 10, Windows 10 with HPET (high precision event timer) disabled (as per AMD guidelines), and Windows 10 in "High Performance" energy mode ("Höchstleistung"):
ryzen_wincwuex.png

https://www.computerbase.de/2017-03/ryzen-windows-7-benchmark-core-parking/

It seems like overall there is no large difference, though for a few individual benchmarks Win 7 and Win 10 in "high performance" mode show significantly improved results. Note that those two always track together, so it seems like the differences between Win 7 and 10 are really entirely attributable to more aggressive energy optimization and not scheduling changes. (And you can get the same performance as Win 7, or better, in Win 10 by enabling the "high performance" profile.)
 

ethomaz

Banned
Computerbase ran their game benchmark set on Windows 7, Windows 10, Windows 10 with HPET (high precision event timer) disabled (as per AMD guidelines), and Windows 10 in "High Performance" energy mode ("Höchstleistung"):
ryzen_wincwuex.png

https://www.computerbase.de/2017-03/ryzen-windows-7-benchmark-core-parking/

It seems like overall there is no large difference, though for a few individual benchmarks Win 7 and Win 10 in "high performance" mode show significantly improved results. Note that those two always track together, so it seems like the differences between Win 7 and 10 are really entirely attributable to more aggressive energy optimization and not scheduling changes. (And you can get the same performance as Win 7, or better, in Win 10 by enabling the "high performance" profile.)
Interesting... of the 14 games tested, only 2 (Project CARS and Battlefield 1) showed Win7 ahead of Win10.

That is interesting too.

Google Translate: If Windows 7 assigns the threads the same way as Windows 10 (with respect to physical cores and SMT), then it is conceivable that Windows 7 disables the logical "SMT threads" but leaves the physical cores active.
Win7 disables SMT, like I suspected from the videos showing it not using every second logical core.
 

Thraktor

Member
No, it doesn't affect only single-threaded applications, it affects all of them, and it does two things: 1) tries to load the fastest core to 100% all the time (I've run into this in WD2 recently actually, where with TBT3 enabled c1 of my 6850K is always at 100%, and with it disabled the load is about equal between all 12 threads); 2) affinitizes all heavy threads so that they won't jump cores in the process of execution:

m4pc.png

Fair enough, my reading of TB3 may have been misinformed, but if anything this reinforces my point, which is that excessive thread migration in the Windows 10 scheduler is harming performance on all CPUs (and particularly those with higher core counts). Part 2 above (affinitizing all heavy threads) isn't something that a CPU driver should ever have to do, it's absolutely the OS thread scheduler's responsibility to properly allocate threads to cores in a reasonably efficient manner. Hell, that's a thread scheduler's only responsibility.

Microsoft shouldn't just abrogate their job of writing a properly functioning task scheduler by forcing Intel and AMD to basically write their own.

PCPer tests show that this isn't the case; threads are assigned and migrated as on any other multicore CPU with SMT. The issue arises when a thread is migrated across CCXs, but this isn't an issue of the OS scheduler, as thread migration is pretty normal; it's an issue of the h/w architecture, and there are two options for dealing with it: A) don't let threads go to the second CCX, essentially turning Ryzen 7 into a quad core (some unrelated OS work can run on the second CCX in parallel, I guess), or B) program the s/w in such a way that it won't incur a (large) performance hit in case of such a migration.

I'm not claiming that the OS scheduler is treating Ryzen any differently than it is other chips, or that there's any issue with thread allocation (which is the subject of the PCPer article). What I'm pointing out is that the Windows 10 thread scheduler is migrating threads far too often for any processor. It's most visible on Ryzen, because of the dual-CCX topology, but it's an issue for Intel as well, otherwise they wouldn't have bothered developing software to override this particular aspect of the thread scheduler's behaviour. I'd be willing to bet that, with proper frame time data for an 8-core Intel CPU, I'd be able to find exactly the same issue present itself, and although the performance cost is likely to be quite a bit less than the ~1ms we're seeing on Ryzen, even a couple of hundred nanoseconds is a non-trivial performance issue when it comes to something as latency-sensitive as a game.

Again, this is a h/w issue and as such it should be "fixed" by the IHV's s/w, in this case by AMD's CPU driver. As I've shown above, it's completely possible to affinitize all work and thus stop the Windows scheduler from performing such migrations. If these migrations are the cause of the performance loss, then keeping threads running where they started should provide some benefit at least.

Just because it's possible for AMD to write software to override a problem with the thread scheduler doesn't mean that it's not a problem with the thread scheduler. Windows 10's scheduler can and absolutely should consider the performance cost of migrating threads, and the fact that it appears to be migrating threads with abandon at every opportunity suggests it isn't, which is harming performance of the OS on all hardware, not just Ryzen.
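For what it's worth, the "affinitize the heavy threads" workaround dr_rus describes is something an application can already do for itself. A hedged sketch of pinning a worker thread to the first CCX; the 0xFF mask assumes logical processors 0-7 make up CCX0 on an SMT-on 8-core part, which may not match a given system:

Code:
#include <windows.h>
#include <thread>

// Keep a heavy worker on one CCX so the scheduler can't bounce it across
// the L3 boundary. Mask assumes logical CPUs 0-7 = CCX0 (verify first!).
constexpr DWORD_PTR CCX0_MASK = 0xFF;

void heavy_work() { /* game/render work would go here */ }

int main() {
    std::thread worker([] {
        SetThreadAffinityMask(GetCurrentThread(), CCX0_MASK);
        heavy_work();
    });
    worker.join();
}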

Very well thought out, Thraktor. What do you think about the bimodal distribution in the Crysis 3 frame time histogram using a 7700K? Looking closely one can see a second peak at [148, 152]:
kZEtNSi.png

Any guesses what could be the cause for that?

As discussed above, the same thread migration bug should be there on Intel CPUs as well, but with much less of a performance impact on a quad-core CPU like the 7700K. It shouldn't really be visible at the granularity level of this histogram, though (as the performance difference would be smaller, the peaks would be so close together as to basically just merge), so it doesn't make sense as an explanation of the bimodality we're seeing here. With access to the actual frame time data I'd be able to investigate it more fully, in particular by using autocorrelation analysis to determine what cyclicality, if any, there is. It's entirely possible that it's much more difficult to pin down than the scheduler issue, though (and to perform a definitive analysis of the scheduler issue I'd really need full frame time data anyway; histograms aren't really enough for this kind of analysis).
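For anyone with raw frame time logs who wants to try the autocorrelation analysis described above, the plain normalized sample autocorrelation is only a few lines. A sketch; the sample data here is a made-up stand-in for a real capture:

Code:
#include <cstdio>
#include <vector>

// Sample autocorrelation of a frame-time series. r[k] near 0 means frames
// k apart are unrelated; a spike at some lag k suggests an event recurring
// every k frames (e.g. a periodic scheduler decision).
double autocorr(const std::vector<double>& x, int lag) {
    const int n = static_cast<int>(x.size());
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n - lag; ++i)
        num += (x[i] - mean) * (x[i + lag] - mean);
    for (int i = 0; i < n; ++i)
        den += (x[i] - mean) * (x[i] - mean);
    return num / den;
}

int main() {
    // Stand-in data; replace with a real frame-time capture (ms per frame).
    std::vector<double> frame_ms = {6.7, 6.8, 7.9, 6.7, 6.8, 7.8, 6.6, 6.9, 7.8};
    for (int k = 1; k <= 3; ++k)
        std::printf("lag %d: r = %.3f\n", k, autocorr(frame_ms, k));
}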
 

dr_rus

Member
How is that different from having 4 Intel cores per NUMA socket in configurations of, say, this Xeon?

One difference is that cross-cluster communication will be faster than cross-socket communication

The obvious difference is that one is a mainstream CPU for gamers and the like, and the other is an expensive dual-socket system targeted at server workloads, which usually aren't games. Ryzen is doing more or less fine in everything but games, so for server workloads a NUMA configuration is either expected or doesn't matter much.

Just because it's possible for AMD to write software to override a problem with the thread scheduler doesn't mean that it's not a problem with the thread scheduler. Windows 10's scheduler can and absolutely should consider the performance cost of migrating threads, and the fact that it appears to be migrating threads with abandon at every opportunity suggests it isn't, which is harming performance of the OS on all hardware, not just Ryzen.

It does, actually, as you don't fix a specific h/w part's issue in a general OS scheduler which has to work across a large number of CPU architectures. That's where the driver comes in, abstracting this issue away from the scheduler.
 

ethomaz

Banned
How is that different from having 4 intel cores per NUMA socket in configurations of, say, this Xeon?

One difference is that cross-cluster communication will be faster than cross-socket communication
There are a lot of differences.

This Xeon has 15MB L3 cache for each CPU... 30MB L3 cache for dual socket.
Ryzen has 16MB L3 cache total... 8MB L3 cache per CCX.

This Xeon has one bus of 9.6 GT/s for each CPU/socket.
Ryzen has only one bus for the two CCX.

These little things help... (of course the CCX interconnect bus is faster than the external bus of a dual socket, so CCX-to-CCX communication will be faster than socket-to-socket communication, but) -- not sure about that... see edit -- the external communication (IO, RAM, etc.) will be faster on the Xeon because there is a path for each CPU, while Ryzen has only one.

Edit - Looks like even the CCX interconnect is slower.
 

atbigelow

Member
I don't understand how people think Windows could "fix" the interconnect speed between the CCXs. You could write code to mitigate it, but you can't "fix" it without a new chip.
 

Datschge

Member
I don't understand how people think Windows could "fix" the interconnect speed between the CCXs. You could write code to mitigate it, but you can't "fix" it without a new chip.
The scheduler should make logical decisions to achieve the best possible performance. For this the scheduler needs to know the CPU topology: which paths are fast paths and which are bottlenecks. There is nothing new or special about Ryzen's topology. If a scheduler is prepared to handle multi-socket, multi-chip and multi-core systems, it's easy to make it behave in a way that doesn't tank performance on every new topology, even without adaptations. That the Windows scheduler knows little about the hardware topology, and as such is not prepared, is a design decision by Microsoft.

Linux's scheduler does just that and has specifically known the Ryzen topology since late last year, moving threads only when and where it makes sense for performance. Windows' scheduler, on the other hand, is plain dumb and is kept ignorant even of the core parking mechanism, which is itself broken as well.
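What the Linux kernel knows about the topology is visible from user space, and threads can be pinned manually there too. A sketch; the sysfs path is standard, but the exact CPU list a Ryzen reports is an assumption here, so check your own machine:

Code:
#include <sched.h>
#include <fstream>
#include <iostream>
#include <string>

// Inspect which logical CPUs share cpu0's L3 slice, then pin the calling
// thread to a subset. On a 2-CCX part the shared_cpu_list should cover
// only half the CPUs (the exact numbering varies by kernel).
int main() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
    std::string l3_siblings;
    std::getline(f, l3_siblings);
    std::cout << "cpu0 shares its L3 with: " << l3_siblings << "\n";

    // Pin this thread to CPUs 0-3, assumed here to be cores of one CCX.
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; ++cpu) CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        std::cerr << "sched_setaffinity failed\n";
}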

It's understandable that AMD publicly claims the Windows scheduler is working as expected, AMD relies on Microsoft to even be able to get a significant desktop PC market share for Ryzen. Even if Microsoft doesn't support AMD, for AMD that's still better than Microsoft actively working against them. And the Windows scheduler works as expected, expectedly bad. To different degrees for all topologies by both AMD and Intel.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
No. You make a dangerous assumption here. The CCX interconnect is 22GB/sec and QPI on Sandy Bridge and above is 38.4GB/sec.
And you're spreading misinformation here.

1. You're comparing hearsay BW vs paperspecs.
2. Of that 38.4GB/s QPI paperspecs, half of that goes in one direction - so it's 9.6GT/s * 2 bytes per direction, or 19.2GB/s per direction. So when a core needs something from another socket, that something travels at 19.2GB/s, ideally.
3. The paperspecs of those "22 GB/s" of IF BW that have been circulating the net are (wait for it) 38.4GB/s for a 1.2GHz bus (http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html - bottom of the page)
4. The major issue of NUMA is not BW, it's latency. Good luck achieving the latency of Ryzen's cross-cluster traffic over QPI connecting remote sockets!
 

atbigelow

Member
The scheduler should make logical decisions to achieve the best possible performance. For this the scheduler needs to know the CPU topology: which paths are fast paths and which are bottlenecks. There is nothing new or special about Ryzen's topology. If a scheduler is prepared to handle multi-socket, multi-chip and multi-core systems, it's easy to make it behave in a way that doesn't tank performance on every new topology, even without adaptations. That the Windows scheduler knows little about the hardware topology, and as such is not prepared, is a design decision by Microsoft.

Linux's scheduler does just that and has specifically known the Ryzen topology since late last year, moving threads only when and where it makes sense for performance. Windows' scheduler, on the other hand, is plain dumb and is kept ignorant even of the core parking mechanism, which is itself broken as well.

It's understandable that AMD publicly claims the Windows scheduler is working as expected, AMD relies on Microsoft to even be able to get a significant desktop PC market share for Ryzen. Even if Microsoft doesn't support AMD, for AMD that's still better than Microsoft actively working against them. And the Windows scheduler works as expected, expectedly bad.

Right, I agree with all this. It's still mitigating the issue. Which is what it should do, absolutely. But there will always be that performance penalty moving across clusters.

(As a note, I've always found threading and cores and clusters fascinating; this is all food for my soul)
 

nubbe

Member
There's clearly a 350% (!) added delay for each transaction between cores belonging to different CCXs.

ping-amd.png




And Windows is clearly spreading workloads between the 2 CCXs when there's room to spare on CCX 1 (the real issue).

smton2workers.png


Windows clearly sees a distinction between physical and logical threads, but there is clearly a path to optimizing code for Ryzen when the behavior above is being witnessed. Even if it's 1%, we'll take it.

Despite AMD's statements that nothing needs addressing (and some notable GAF users cherry-picking which statements to dissect and which to accept at face value), the data above makes denying that the scheduler could be optimized look willfully ill-intentioned.

As with many other tech issues, time will tell. So enjoy the ride and post intentionally biased opinions at your own peril. The internet remembers.

Games that are latency sensitive in order to achieve high framerates are going to suffer if they have threads on separate CCX clusters.
I'm kinda surprised the cluster communication is so slow and bandwidth-limited.

Hope they fix it with Ryzen2

Production software seems mostly fine, since threads usually do their own thing, but games tend to need to communicate with a primary thread that manages frame creation.

fnNTdO8m.png
 

Datschge

Member
Right, I agree with all this. It's still mitigating the issue. Which is what it should do, absolutely. But there will always be that performance penalty moving across clusters.
Sure there will be a penalty, but why even incur it then? The Windows scheduler forcibly spreads threads every 10-15ms for essentially no reason, ignoring all possible bottlenecks. Even without knowing anything about the topology, it could easily be made to learn that in a given case performance became significantly worse than usual and avoid doing that in the future. Instead it keeps doing it in a rapid-fire manner, like a headless chicken.

Look, a bottleneck is like a trap in the middle of the road. You can give people eyes so they never fall into it (the Linux scheduler). You can have a company try to tell everyone where the trap is so they can avoid it (AMD helping game developers optimize their code). Or you can be like Microsoft and let the scheduler run into the same trap repeatedly, again and again.
 

atbigelow

Member
Sure there will be a penalty, but why even incur it then? The Windows scheduler forcibly spreads threads every 10-15ms for essentially no reason, ignoring all possible bottlenecks. Even without knowing anything about the topology, it could easily be made to learn that in a given case performance became significantly worse than usual and avoid doing that in the future. Instead it keeps doing it in a rapid-fire manner, like a headless chicken.

Look, a bottleneck is like a trap in the middle of the road. You can give people eyes so they never fall into it (the Linux scheduler). You can have a company try to tell everyone where the trap is so they can avoid it (AMD helping game developers optimize their code). Or you can be like Microsoft and let the scheduler run into the same trap repeatedly, again and again.

Not all workloads are going to fit into the logical cores of a single cluster. Even if Linux's scheduler knows about the cluster transition performance hit, it would still make sense to move things over in certain cases.

I'm just saying that it's a multi-faceted issue; the OSs need to know about it as well as software engineers.
 
And you're spreading misinformation here.

1. You're comparing hearsay BW vs paperspecs.
2. Of that 38.4GB/s QPI paperspecs, half of that goes in one direction - so it's 9.6GT/s * 2 bytes per direction, or 19.2GB/s per direction. So when a core needs something from another socket, that something travels at 19.2GB/s, ideally.
3. The paperspecs of those "22 GB/s" of IF BW that have been circulating the net are (wait for it) 38.4GB/s for a 1.2GHz bus (http://www.hardware.fr/articles/956-23/retour-sous-systeme-memoire.html - bottom of the page)
4. The major issue of NUMA is not BW, it's latency. Good luck achieving the latency of Ryzen's cross-cluster traffic over QPI connecting remote sockets!

Well it's a good thing Intel put 12 cores on a dual ring bus then.
 

Datschge

Member
Not all workloads are going to fit into the logical cores of a single cluster. Even if Linux's scheduler knows about the cluster transition performance hit, it would still make sense to move things over in certain cases.

I'm just saying that it's a multi-faceted issue; the OSs need to know about it as well as software engineers.
Right, and that's exactly the job a scheduler has to do: solve the riddle of how to satisfy all demands without dragging down performance as soon as all fast paths are contended. The Windows scheduler doesn't even manage to fill the fast paths first; otherwise we would not have benchmarks where 4+0 cores outperform 4+4 cores at the same workload.
 

ethomaz

Banned
That is a pretty weird way to compare, because he disabled SMT in the 4+0 configuration and left it enabled in the 4+4.

You have four scenarios:

1) 4+0 SMT off
2) 4+0 SMT on
3) 4+4 SMT off
4) 4+4 SMT on

He just tested 1 vs 4.

There is no way to know which is affecting performance more: 4+4 or SMT.
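As an aside, all four scenarios can be approximated from software alone with process affinity masks, no BIOS toggles needed. A sketch; the bit numbering (SMT siblings on adjacent bits, CCX0 on bits 0-7) is an assumption to verify, and masking off siblings only approximates a true BIOS SMT-off, since the sibling threads still exist and share core resources:

Code:
#include <windows.h>

// Approximate the four test cases with affinity masks (bit = logical CPU).
// Assumed layout: bits 0-7 = CCX0, bits 8-15 = CCX1, SMT siblings adjacent.
int main() {
    const DWORD_PTR FOUR_PLUS_ZERO_SMT_OFF = 0x0055; // one thread per CCX0 core
    const DWORD_PTR FOUR_PLUS_ZERO_SMT_ON  = 0x00FF; // all of CCX0
    const DWORD_PTR FOUR_PLUS_FOUR_SMT_OFF = 0x5555; // one thread per core, both CCXs
    const DWORD_PTR FOUR_PLUS_FOUR_SMT_ON  = 0xFFFF; // everything

    // Apply one scenario to this process; a benchmark launched as a child
    // process would inherit the mask.
    SetProcessAffinityMask(GetCurrentProcess(), FOUR_PLUS_ZERO_SMT_ON);
}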

So it's doing 10 frames better with only 4 cores active in Deus Ex (96.5 vs 86.6), about 50 frames faster in Doom, 50-60 frames slower in Battlefield 1, somewhat slower in CS, and a lot slower in Tomb Raider. I'd be interested to see what 4c/8t vs 8c/16t looks like, but a 1500X could actually turn out pretty well from the looks of things.
Some of the difference could be just SMT on/off rather than the 4+0 configuration... the test is really useless for seeing what is impacting performance.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Well it's a good thing Intel put 12 cores on a dual ring bus then.
Funny that you mention that, since Cluster-on-Die is a technique for making NUMA-aware s/w and schedulers perform better on those dual-ring CPUs by partitioning the dual-ring bus into two on-die NUMA nodes.
 