Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

elapsed time: 0.368508 s

So, only a tiny difference. Does this really use the 256-bit registers? Or do you have another guess on why the speedup isn't bigger (is it actually not compute-bound anymore)?
This AVX code uses the 128-bit xmm registers, since the test is based on 4-way vectors; I'd have to redesign the test to make it suitable for 8-way vectors. AVX-128 still offers advantages for this test, though - autovectorized code is essentially identical to the manual intrinsics version, and AVX's 3-operand ops let each vector splat be a single instruction (rather than a mov + shuffle pair) without introducing an extra dependency. Non-brain-dead vector ISAs ahoy.
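To make the splat point concrete, here's a minimal sketch (my own illustration, not the actual test code) of the row-times-matrix kernel such a test would contain; compile it once with -msse3 and once with -mavx and compare the splats in the generated asm:

/* Hedged sketch, not the thread's test source: one row of a 4x4
 * matrix product. Each splat must keep r live for the next one, so:
 *   SSE: movaps tmp, r / shufps tmp, tmp, imm  (2 ops, extra dependency)
 *   AVX: vshufps tmp, r, r, imm                (1 op, r untouched)   */
#include <xmmintrin.h>

__m128 row_times_mat(__m128 r, const __m128 m[4])
{
    __m128 acc = _mm_mul_ps(_mm_shuffle_ps(r, r, 0x00), m[0]);            /* splat r[0] */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(r, r, 0x55), m[1])); /* splat r[1] */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(r, r, 0xaa), m[2])); /* splat r[2] */
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_shuffle_ps(r, r, 0xff), m[3])); /* splat r[3] */
    return acc;
}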

Please verify the updated test results to check that I didn't misread/mistype anything from the SB results.
 
This AVX code uses the 128-bit xmm registers, since the test is based on 4-way vectors; I'd have to redesign the test to make it suitable for 8-way vectors. AVX-128 still offers advantages for this test, though - autovectorized code is essentially identical to the manual intrinsics version, and AVX's 3-operand ops let each vector splat be a single instruction (rather than a mov + shuffle pair) without introducing an extra dependency. Non-brain-dead vector ISAs ahoy.

This explains a lot, thanks.

Please verify the updated test results to check that I didn't misread/mistype anything from the SB results.

The values are correct.


I guess we now need someone to run the code on a PS3?

Compiling this for the PPE shouldn't be a problem, but I'm not sure if it can be brought to an SPE without rewriting/adding some code.
 
I haven't had a chance to try again on the M620. I've been dealing with stupid people trying to interface with our webservice API.
 
I haven't had a chance to try again on the M620. I've been dealing with stupid people trying to interface with our webservice API.
No worries. No deadlines here ; )
 
Still getting the "end of file not at end of a line" error.

I copy/pasted your command line from post 488, so I don't think it's a typo.

Same problem with sse3_nehalem_xmm.txt.
 
Still getting the "end of file not at end of a line" error.

I copy/pasted your command line from post 488, so I don't think it's a typo.

Same problem with sse3_nehalem_xmm.txt.
That should be a warning though, not an error. If that's the only message you get, you should have a new testvest_intrinsics binary at the location where you run the command.
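For what it's worth, that message comes from the GNU assembler when the input file lacks a final newline. Appending one should silence it - e.g., assuming sse3_nehalem.txt is the listing you pass to gcc:

echo >> sse3_nehalem.txt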
 
This explains a lot, thanks.
Compiling this for the PPE shouldn't be a problem, but I'm not sure if it can be brought to an SPE without rewriting/adding some code.
Well honestly, the PPE is the more interesting one, since it helps us estimate Xbox 360 CPU performance.
 
Not quite (even without the enhanced SIMD), since the Jaguar cores will be clocked higher (1.6 - 2.0 GHz rumored). But yeah, the difference per core will probably not be all that big in the real world. And it's easier to get the maximum performance out of three cores than out of eight.

See Blu's initial results. I'm taking the difference in clockspeed into account.

You have been slightly misled. Extrapolating the results from the test shows an Espresso core should perform equally (on this test) to a 1.6GHz Bobcat, not to a 1.6GHz Jaguar. Jaguar should have a twice-as-wide SIMD fp unit compared to Bobcat, and that should show in the test. Now, how much better Jaguar will be compared to Bobcat remains to be seen.

Yet I was speaking of Bobcat, not of Jaguar. And there were posts in the "serious discussion" thread (I think) claiming the real-world performance gain for Jaguar was about 10% over Bobcat... which I also stated in my post. So I'm not sure what you mean by saying I have been misled.

To compare the Wii U CPU to (one core of) Bobcat (which is supposedly the +/-10% less beefy brother of the CPU inside the PS4/Durango)
 
No binary... just the two text files.
Which are those?

Yet I was speaking of Bobcat, not of Jaguar. And there were posts in the "serious discussion" thread (I think) claiming the real-world performance gain for Jaguar was about 10% over Bobcat... which I also stated in my post. So I'm not sure what you mean by saying I have been misled.
Ok perhaps I read into your post more than it actually said, but two things need to be emphasized here to avoid potential misunderstandings:

1. The test is largely fp/simd centric - it's not testing anything but the fp/simd pipeline, admittedly with a fairly rudimentary kind of workload which should be common in various game scenarios - 4x4 matrix concatenation (done as multiplication) is commonly found in the 3d world (see the sketch at the end of this post). A good portion of the test is also about how well the compiler can utilize that pipeline.
2. We indeed do not know how much better a Jaguar core would fare compared to a Bobcat core at fp/simd in general, and at this test in particular. Saying Jaguar will be slightly better is as justified as saying it will be twice as good (or more) - it's all hypothetical.

If there's one thing this test answers in a very straightforward manner, it is the question (often expressed as a definitive statement by forumites): 'Wouldn't Nintendo have made a clearly better choice, particularly with regard to simd performance, if they had chosen a mobile processor contemporary to the timeframe the Wii U was designed in, such as the Bobcat?' I think the answer is quite apparent.
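For readers who haven't seen it, a minimal scalar sketch of the kind of workload point 1 describes (my reconstruction, not blu's actual harness - the timing code and exact setup differ; the 60M repetition count is borrowed from a later post):

/* Hedged sketch of the workload: repeated 4x4 matrix concatenation,
 * done as multiplication. Not the actual test source. */
#include <stdio.h>

typedef struct { float m[4][4]; } mat4;

static mat4 mul(const mat4 *a, const mat4 *b)
{
    mat4 r;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k)
                s += a->m[i][k] * b->m[k][j];
            r.m[i][j] = s;
        }
    return r;
}

int main(void)
{
    mat4 acc  = {{{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}}};  /* identity */
    mat4 step = {{{0,1,0,0},{-1,0,0,0},{0,0,1,0},{0,0,0,1}}}; /* 90-degree rotation, stays bounded */
    for (int i = 0; i < 60000000; ++i)
        acc = mul(&acc, &step);
    printf("%f\n", acc.m[0][0]);  /* keep the result live */
    return 0;
}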
 
sse3_nehalem.txt and sse2_dothan.txt

I downloaded the dothan file to see if it would do anything, but obviously it doesn't and I never deleted it.
Ok, can you paste the output from running that gcc command line then? Also see what 'which as' and 'dpkg-query -L binutils' produce, as there's a slight chance you might be missing the binutils package or have an incomplete one.
 
Ok, can you paste the output from running that gcc command line then? Also see what 'which as' and 'dpkg-query -L binutils' produce, as there's a slight chance you might be missing the binutils package or have an incomplete one.

Cool, I'll get to that soon.
 
Exactly what are you testing? Are you running code on Espresso, or are you testing the CPU that the PS4/Durango CPUs are supposed to be similar to?
 
We have a 4-core Vishera at the office running Xubuntu 64 bit. I can do a run tomorrow.
I assume that's family 15h? If so, here's the asm listing (autovectorized; -march=bdver1 -mtune=bdver1; bdver2 is not supported by the 4.6.3 I have here). Code (in the matmul loop) is almost identical to the sandy bridge one (slightly different scheduling), so you can try both this and SB's asm for completeness. Thanks!

ed: ok, I just noticed Piledriver should feature FMA3. Is that in addition to FMA4? Regardless, I've been unable to make the autovectorizer in 4.6.3 use the FMA extension. Perhaps I should upgrade the compiler to 4.7.x for this one to work.
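For reference, hypothetical command lines (the file name is illustrative and the actual flags used in the thread may differ; GCC 4.7 is where -march=bdver2 support landed):

gcc -O3 -ffast-math -march=bdver1 -mtune=bdver1 testvect.c -o testvect   # Bulldozer: AVX + XOP + FMA4
gcc -O3 -ffast-math -march=bdver2 -mtune=bdver2 testvect.c -o testvect   # Piledriver: adds FMA3 (gcc 4.7+)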
 
Sorry, not a Piledriver after all. It's actually a Zambezi/OG Bulldozer (FX 4100), but family 15h is right nevertheless. Couldn't run tests yet, machine is loaded. Should go idle late tomorrow or Friday, hopefully.

It seems to have fma4 only, not fma3.


Some choice quotes from /proc/cpuinfo:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core arat cpb npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold

cpu family : 21
model : 1
model name : AMD FX(tm)-4100 Quad-Core Processor
stepping : 2
microcode : 0x6000629
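A quick way to check which FMA variants the kernel reports (on this FX-4100 it prints only fma4):

grep -wo -e fma -e fma4 /proc/cpuinfo | sort -u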
 
FX-4100 posted
elapsed time: 0.488872 s
... on its best run out of ten on ...
the asm listing (autovectorized; -march=bdver1 -mtune=bdver1; bdver2 is not supported by the 4.6.3 I have here).

Sandy bridge version, best run out of ten:
elapsed time: 0.502464 s

This should really be scripted to include a longer warmup period that is not counted towards the measured throughput. Dynamic clock switching probably already plays a significant role in the remaining variance at total execution times around half a second. Hmmm....
 
FX-4100 posted
elapsed time: 0.488872 s
... on its best run out of ten on ...

Sandy bridge version, best run out of ten:
elapsed time: 0.502464 s
Thanks. Not bad even without using the FMA extension. I'm really curious about the effect of the latter, though, so I'll try to factor that in.

This should really be scripted to include a longer warmup period that is not counted towards the measured throughput. Dynamic clock switching probably already plays a significant role in the remaining variance at total execution times around half a second. Hmmm....
You have a point in general, but I don't think the absence of a pre-warm for the clock boost would skew the results just yet - not at the present timing magnitudes, anyway. It's easily testable, though: just bump the number of repetitions in the test N-fold (line 988 in the bdver1 listing) and see if the timing increases proportionally. Doing the opposite test (i.e. decreasing the number of repetitions 8-fold) on the Bobcat here produces results with less than a percent of variance (0.19%) from the linear scaling of the original 60M repetitions for ~4s on the same CPU.
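If someone does want to script a pre-warm, a minimal sketch (my own; workload() below is a trivial stand-in for the real matmul loop, and the repetition counts are arbitrary):

/* Hedged sketch of the pre-warm idea - not part of the actual test.
 * Build with: gcc -O2 prewarm.c -o prewarm -lrt (older glibc). */
#include <stdio.h>
#include <time.h>

static void workload(long reps)
{
    volatile float x = 1.0f;
    for (long i = 0; i < reps; ++i)
        x = x * 1.000001f + 0.000001f;  /* keep the fp pipe busy */
}

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    workload(10 * 1000 * 1000);    /* untimed warm-up: let the clock governor boost */
    double t0 = now();
    workload(60 * 1000 * 1000);    /* the measured part */
    printf("elapsed time: %f s\n", now() - t0);
    return 0;
}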
 
Thanks. Not bad even without using the FMA extension. I'm really curious about the effect of the latter, though, so I'll try to factor that in.


You have a point in general, but I don't think the absence of a pre-warm for the clock boost would skew the results just yet - not at the present timing magnitudes, anyway. It's easily testable, though: just bump the number of repetitions in the test N-fold (line 988 in the bdver1 listing) and see if the timing increases proportionally. Doing the opposite test (i.e. decreasing the number of repetitions 8-fold) on the Bobcat here produces results with less than a percent of variance (0.19%) from the linear scaling of the original 60M repetitions for ~4s on the same CPU.

What exactly does all of this lead to?
 
What exactly does all of this lead to?
We don't have a Jaguar on the list yet, if that's what you're referring to. But purely from reading the brochure, a Jaguar should have AVX (which helps a bit with this test, even in its 128-bit form) but should not have FMA (which is arguably the single most significant x86_64 ISA extension relevant to this test, and something the PPC has had natively since the dawn of the ISA).
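To illustrate what FMA buys on this kind of kernel, a hedged variant of the earlier row-times-matrix sketch (my illustration, not the test itself; _mm_fmadd_ps is the FMA3 spelling, FMA4 parts use _mm_macc_ps):

/* Each mul+add pair collapses into one fused op - what PPC's
 * fmadds has done natively all along. Compile with e.g. -mfma. */
#include <immintrin.h>

__m128 row_times_mat_fma(__m128 r, const __m128 m[4])
{
    __m128 acc = _mm_mul_ps(_mm_shuffle_ps(r, r, 0x00), m[0]);
    acc = _mm_fmadd_ps(_mm_shuffle_ps(r, r, 0x55), m[1], acc); /* a*b + c */
    acc = _mm_fmadd_ps(_mm_shuffle_ps(r, r, 0xaa), m[2], acc);
    acc = _mm_fmadd_ps(_mm_shuffle_ps(r, r, 0xff), m[3], acc);
    return acc;
}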
 
We don't have a Jaguar on the list yet, if that's what you're referring to. But purely from reading the brochure, a Jaguar should have AVX (which helps a bit with this test, even in its 128-bit form) but should not have FMA (which is arguably the single most significant x86_64 ISA extension relevant to this test, and something the PPC has had natively since the dawn of the ISA).
So we would expect Jaguar to do better than Bobcat, but not significantly so due to the possible lack of FMA, correct?
 
So we would expect Jaguar to do better than Bobcat, but not significantly so due to the possible lack of FMA, correct?
Well, the comparison to Bobcat should mostly be affected by the twice-wider fp SIMD (64 -> 128 bit); everything else should come secondary.
 
Not sure if this has been posted

http://fail0verflow.com/blog/2013/espresso.html


The future of console homebrew (and a shot of Espresso)

Spend $100 and you can get an Ouya, which beats the Wii U’s CPU and doesn’t have too shabby graphics at one third the cost.

Yes, a 1.6GHz quad-core Cortex-A9 with NEON from ~2010 beats a 1.2GHz tri-core PowerPC 750 with paired singles from ~1997 or 2001 (depending on whether you count the PS or not). The PPC750 is a nice core and has lasted long (and beats crap like the Cell PPU and the 360's cores clock-per-clock on integer workloads), but sorry, contemporary mobile architectures have caught up, and the lack of modern SIMD is significant. Performance varies by workload, but I'm willing to bet that they're similar at integer workloads and the Cortex-A9 definitely has more SIMD oomph thanks to NEON.

"PPC" doesn't have a software programmable clock, because PowerPC is an architecture, not a chip. Some PowerPC-compliant chips, like the PPC750FX, have two programmable PLLs and the ability to switch between them to implement clock switching. Some, like the PPC750CL (and the Broadway) only have one PLL and its configuration can only be changed at hard reset time (externally). Initially, we thought/hoped that the Espresso would borrow the 750FX's dual PLLs, but that turned out not to be the case. The CPU die shot also does not show two symmetrical PLLs. The HID1 bits that control the PLLs in the 750FX are not present in the Espresso.

The Starbuck sets the bus frequency (the Latte's SYSPLL) and configures Espresso's multiplier (via configuration pins) and it's stuck there. The bus clock is ~248MHz (almost the same as the Wii, which used a ~243MHz system bus), but the processor bus is wider. In Wii U mode, the multiplier is 5, while in Wii mode it is 3 (and with SYSPLL set to ~243MHz instead of ~248MHz).

The code that sets the Espresso clock multiplier pins in the Cafe2Wii compatibility bootstrap (C2W_SetEspressoPLLConfigGPIO, part of the Wii U mode Starbuck code) sets both a GPIO labeled "ESP10Workaround" and an unknown bit in another register. Unfortunately, toggling the former myself (which is possible in Wii mode) doesn't seem to cause the Espresso to come up with 5x multiplier, and the latter is locked and cannot be changed in Wii mode as far as I can tell (the locking mechanism is unknown, but it might just lock itself when that bit is flipped).

Also, sorry, but a 750 at 2-2.5GHz is insane. You don't get to just stick an old microarchitecture into a newer process and triple the clockspeed. 729MHz to 1.24GHz is the kind of sensible clockspeed bump that we expected. I can guarantee that the Espresso isn't going to switch to a higher clockspeed while in Wii U mode; you guys need to just accept the fact that it runs at 1.24GHz already.
 
There's one thing (oh well, more than one actually) that I still don't understand about this article: why do those guys refer to the WiiU GPU as a "standard, and somewhat outdated Radeon core"?

Are people in the GPU thread missing something that fail0verflow know/understand better or what?
 

While I'm sure this guy knows what he's talking about, it is annoying that they still claim to KNOW the function of parts of the chips that are unknown.

Unfortunately, toggling the former myself (which is possible in Wii mode) doesn't seem to cause the Espresso to come up with 5x multiplier, and the latter is locked and cannot be changed in Wii mode as far as I can tell (the locking mechanism is unknown, but it might just lock itself when that bit is flipped).

Also, sorry, but a 750 at 2-2.5GHz is insane. You don't get to just stick an old microarchitecture into a newer process and triple the clockspeed. 729MHz to 1.24GHz is the kind of sensible clockspeed bump that we expected. I can guarantee that the Espresso isn't going to switch to a higher clockspeed while in Wii U mode; you guys need to just accept the fact that it runs at 1.24GHz already.
 
While I'm sure this guy knows what he's talking about, it is annoying that they still claim to KNOW the function of parts of the chips that are unknown.

While I'm sure they know more than most of us, it does seem like they're trying to say they know more than they actually do.
 
While I'm sure this guy knows what he's talking about, it is annoying that they still claim to KNOW the function of parts of the chips that are unknown.

What I don't 100% understand is why the CPU would run at 1.24GHz in the first place (in Wii mode). Why wouldn't it just run at 729MHz? How are they so 100% positive this is the correct clock speed for Wii U mode when they haven't broken into Wii U mode yet? Unless I missed something. It's been ages since that 1.24GHz report first came out and my memory is hazy.

I don't see why 2GHz is so ridiculous. It's a drastically smaller process than Broadway, probably with more modern/refined cooling techniques.
 
What I don't 100% understand is why the CPU would run at 1.24GHz in the first place (in Wii mode). Why wouldn't it just run at 729MHz? How are they so 100% positive this is the correct clock speed for Wii U mode when they haven't broken into Wii U mode yet? Unless I missed something. It's been ages since that 1.24GHz report first came out and my memory is hazy.

I don't see why 2GHz is so ridiculous. It's a drastically smaller process than Broadway, probably with more modern/refined cooling techniques.

Broadway (well, the 750CL) could be clocked up to 1.1GHz, so 2GHz for Espresso doesn't seem ridiculous at all. Not saying it is/can/will be, just that the comment was rather condescending of them.
 
What I don't 100% understand is why the CPU would run at 1.24GHz in the first place (in Wii mode). Why wouldn't it just run at 729MHz? How are they so 100% positive this is the correct clock speed for Wii U mode when they haven't broken into Wii U mode yet? Unless I missed something. It's been ages since that 1.24GHz report first came out and my memory is hazy.

I don't see why 2GHz is so ridiculous. It's a drastically smaller process than Broadway, probably with more modern/refined cooling techniques.
It does run at 729MHz in Wii mode. 243MHz * 3. In native mode, it's 248MHz * 5. The problem is that, while it's possible to enable the additional cores and cache in Wii mode, it's not possible to change the frequency or multiplier.
 
It does run at 729MHz in Wii mode. 243MHz * 3. In native mode, it's 248MHz * 5. The problem is that, while it's possible to enable the additional cores and cache in Wii mode, it's not possible to change the frequency or multiplier.

Is it not possible, though, that since their hacks went through Wii mode, they might have incorrect details for things such as core frequency?
 
Is it not possible, though, that since their hacks went through Wii mode, they might have incorrect details for things such as core frequency?
They have a native mode hack as well, but neither the time nor the motivation to polish it up for release. The CPU stuff is a test. It's useless for pirates, but could be very useful for homebrew developers. If there's enough interest from people who're not into piracy, they'll probably release the real thing.
 
Broadway (well, the 750CL) could be clocked up to 1.1GHz, so 2GHz for Espresso doesn't seem ridiculous at all. Not saying it is/can/will be, just that the comment was rather condescending of them.

Maybe the heat output stopped N from clocking it at 2GHz, because they wanted the small-sized box.
 
What I don't 100% understand is why the CPU would run at 1.24GHz in the first place (in Wii mode). Why wouldn't it just run at 729MHz? How are they so 100% positive this is the correct clock speed for Wii U mode when they haven't broken into Wii U mode yet? Unless I missed something. It's been ages since that 1.24GHz report first came out and my memory is hazy.

I don't see why 2GHz is so ridiculous. It's a drastically smaller process than Broadway, probably with more modern/refined cooling techniques.
It runs at 729MHz in Wii mode... what they say is that it's impossible to go higher than that, or higher than 1.24GHz even in Wii U mode.

About the A9 being better than the Espresso, I highly doubt it except in some synthetic tests. In theoretical integer performance it may be true, and in some small tests it can also be true, but what sets the Espresso apart from the A9 is its huge caches; considering that both the Ouya and mobile phones use slow RAM, that 2+0.5+0.5 MB L2 cache configuration will make a huge difference in real-world performance.
Those Cortex-A9s don't even have on-die L2 cache, and the L2 cache is nearly half of the Espresso die. That is a HUGE difference that the A9 can in no way compensate for. Even in SIMD code I doubt that the A9 could compete against the Espresso in real-world situations, where you always tend to have much more data than can fit in the L1 caches.
 
I don't see why 2GHz is so ridiculous. It's a drastically smaller process than Broadway, probably with more modern/refined cooling techniques.

It's not that much about cooling but about pipeline length. It only has 4 or 5 stages. To put it simply, the time a signal needs to propagate through a stage limits the clock rate. It might be possible to overclock it to 2 GHz if the voltage is raised far enough (and heavy cooling is applied to dissipate the heat) and if you're not interested in the longevity of the hardware, but that's clearly not what any developer, and especially not Nintendo, would want. 1.24 GHz fits very well with what one would expect from the advantages of the smaller manufacturing process.
 
Those Cortex-A9s don't even have on-die L2 cache, and the L2 cache is nearly half of the Espresso die. That is a HUGE difference that the A9 can in no way compensate for. Even in SIMD code I doubt that the A9 could compete against the Espresso in real-world situations, where you always tend to have much more data than can fit in the L1 caches.
The Cortex A9 supports up to 4 MB of L2 cache. The configuration in Tegra 3, which was discussed here, has 1 MB.
 
It does run at 729MHz in Wii mode. 243MHz * 3. In native mode, it's 248MHz * 5. The problem is that, while it's possible to enable the additional cores and cache in Wii mode, it's not possible to change the frequency or multiplier.
So why was the bus speed slightly boosted from 243 to 248MHz? Isn't that an odd number?
 
The Cortex A9 supports up to 4 MB of L2 cache. The configuration in Tegra 3, which was discussed here, has 1 MB.
Yes, but I said ON-DIE L2 cache. What the Cortex A9 supports is connecting a second chip of cache through a dedicated bus & memory controller it has for that purpose.
Having the L2 cache on a different die means a "lot" more cycles to access it (the electric signal has to travel from the main die to the daughter one), and even if it is a mere difference of 10 cycles per access, at the end of the day this will harm performance enough for it to be trounced by the Espresso in nearly every sense of the word.
It's not strange that nearly half of the die area in the Espresso is dedicated to caches. That, and its ultra-short pipeline, gives it a decisive edge over the A9, no matter how you look at it.

Maybe an A9 with 4MB of L2 cache AND in a SIMD-centric situation could perform better, but not having on-die L2 cache is a HUGE drawback for sure.

Of course, on a mobile phone you won't have that extra chip with cache, and the one in Tegra 3 is still only 1MB shared between 4 cores, so 256KB per core; that's half the amount of L2 cache the "tiny" cores have, and 8 times less cache than the Espresso's main core.

But what's most important is that a given design is done to achieve certain results. In the case of an ultra-customized design like the one in the WiiU, these huge caches may have an impact bigger than just increasing the CPU's performance.

I mean, maybe Espresso's performance doesn't benefit as much from one core with 2MB of L2 cache as it would from 4 cores with 512KB each.
But since what matters is the overall performance of the system, it could be that those 2MB of L2 cache serve not only to increase the performance of that particular core but also to hugely reduce the accesses to main memory, thus increasing the effective main RAM performance (RAM performance drops from its theoretical peak the more seeks you do in it).

All in all, it's pretty obvious to me that a vanilla A9 without an L2 memory chip won't come even close to Espresso's performance, and even with one, that wouldn't make much of a difference due to the huge disadvantage of having your cache on a separate die (unless you have the huge 8MB one - then it could compensate a bit).
But what's most important is that in the WiiU design, the one that makes the most sense is Espresso. Even clearly superior CPUs wouldn't be nearly as efficient in that design as the Espresso is, and that's all that matters here, because you won't find an Espresso CPU in a laptop; they're only found in the WiiU.
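For what it's worth, the working-set argument is testable. A minimal pointer-chase sketch (my own illustration, not measurements from either platform) lets you sweep the working set past the L1/L2 sizes and watch the average load latency jump:

/* Hedged sketch: chase a pointer through a single-cycle random
 * permutation (Sattolo's shuffle) so the walk can't collapse into
 * a small hot loop and the prefetcher can't guess the next load. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    size_t n = argc > 1 ? (size_t)atol(argv[1]) : (1 << 20);  /* elements; 8 MB at 8 B each */
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 1;
    for (size_t i = 0; i < n; ++i) next[i] = i;
    for (size_t i = n - 1; i > 0; --i) {          /* Sattolo: one long cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    size_t idx = 0, steps = 100 * 1000 * 1000;
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t s = 0; s < steps; ++s) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("%zu elements: %.2f ns per load (idx=%zu)\n", n, ns / steps, idx);
    return 0;
}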
 
Could somebody refresh my memory....

I thought the CPU (and GPU for that matter) clock rates were extracted while the Wii U was running in Wii mode. They had to do additional hackery to unlock the other two cores, if I recall correctly, but have not achieved any of that in Wii U mode as of yet. fail0verflow mentioned being unable to set/change the multiplier...

So how is everyone so cocksure that 1.24GHz is the real clock speed? Was there no chance it was running underclocked in Wii mode? Maybe there was DEFINITIVE evidence that the 1.24 number MUST be correct, but my memory is hazy..

Thx.
 