Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

It's not that easy. You still have much more cache memory on Espresso to bump its performance, and that's not even talking about the GPU.
How are you planning to emulate the 32+3MB of eDRAM it has?

I figured the additional cache would pose some kind of barrier, but like I said, I'm a layman in this field. For the moment I'm not so much concerned with emulating the entire Wii U console as with the concept of scaling a single-core emulator to a multi-core emulator.
 
Everything that's not floating-point related. The more branches you have in your code, the greater the difference.
It also has much more cache memory and much lower latency to the RAM pools, so it's impossible to determine exactly how Espresso performs in comparison to Xenon unless we get some direct examples from the scene.

For you to have a reference, Broadway performed (in general purpose code) only about 20% less than a Xenon core, without multi-threading.
Considering that Xenon shares its 1MB of L2 cache among all of its cores, the performance per core wouldn't scale linearly when using the other cores for your code.
The difference would be HUGE in favour of Espresso.
20%? Considering that the raw power scaling of Espresso is 6 times Broadway, I'd say that puts Espresso a little bit ahead of the last gen CPU.

Seems that Espresso is a lot more capable than people are willing to give it credit for.

No one ever answered, though: do the CPUs in the Nintendo consoles have features like MMX, SSE or Altivec?


It's not that easy. You still have much more cache memory on Espresso to bump its performance, and that's not even talking about the GPU.
How are you planning to emulate the 32+3MB of eDRAM it has?

I would hope people would try to emulate the PS3 and 360 first, seeing as those games will not be playable on their next-gen counterparts.

The only thing emulating the Wii U would lead to is a larger drop in third-party games and quality, as a result of publishers avoiding the profit loss caused by pirates.
 
But is there anybody who could add some thoughts on this?
Basically what I'm wondering is if it's possible to estimate the amount of work it would require to go from running 3 simultaneous emulations of Broadway at ~170% speed to running an Espresso emulator at something close to real time?

The biggest issue (besides making it theoretically work while having no access to detailed technical documentation) is that the PPC CPU emulation is already a huge bottleneck in Dolphin as is, so 3x ~170% of that needs much beefier machines to even be able to check the emulation quality without taking even more (compatibility-breaking) shortcuts. Then a major (if not the core) problem in emulation is the accurate emulation of timing between the different parts. The complexity of that increases a thousandfold with multiple cores as well as with different levels of fast caches/memories. So working Wii U and Xbox 360 emulation will need a lot more time (first we need to have feasible PC power at easy disposal), while PS3 emulation is even further off.
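To make the timing point a bit more concrete, here is a minimal, hypothetical C++ sketch (this is not how Dolphin or any real emulator is structured) of lockstep scheduling between emulated cores. The finer the time slices and the more cores there are, the more work goes into just keeping them in sync:

Code:
// Hypothetical sketch only - illustrates why inter-core timing accuracy is expensive.
#include <cstdint>
#include <cstdio>
#include <vector>

struct EmulatedCore {
    uint64_t cycles = 0;
    void run(uint64_t budget) { cycles += budget; }   // stand-in for interpreting/JITting 'budget' cycles
};

// Advance all cores in lockstep. Smaller slices = more accurate timing between cores,
// but proportionally more scheduling overhead per emulated frame.
void run_frame(std::vector<EmulatedCore>& cores, uint64_t frame_cycles, uint64_t slice) {
    for (uint64_t t = 0; t < frame_cycles; t += slice)
        for (auto& core : cores)
            core.run(slice);
}

int main() {
    std::vector<EmulatedCore> cores(3);        // e.g. three emulated CPU cores
    run_frame(cores, 20000000, 64);            // roughly one 60fps frame at ~1.2GHz, in 64-cycle slices
    std::printf("each core advanced %llu cycles\n", (unsigned long long)cores[0].cycles);
    return 0;
}

A single-core emulator gets to skip almost all of that bookkeeping, which is part of why the jump from one emulated core to three (plus the extra caches and eDRAM) is a lot more than 3x the work.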
 
20%. Considering that the raw power scaling of Espresso is 6 times Broadway, I'd say that puts Espresso a little bit ahead of the last gen CPU.
How did you arrive at this number? If we take the statement "one Xenon core is 20% faster than Broadway at GP" as a base, it's 1243 / 729 / 1.2 = 1.4208 (Espresso's ~1243MHz clock over Broadway's 729MHz, adjusted for the 20% deficit), or IOW a single Espresso core should be 42% faster than a single Xenon core at GP.

No one ever answered, though: do the CPUs in the Nintendo consoles have features like MMX, SSE or Altivec?
Did you actually search? 750CL/Broadway and Espresso all feature 2-way SIMD known as paired singles.
 
That in itself would be incredibly troubling, but you don't have player counts scaling down in other games that I'm aware of.
It might not be too troubling. If this is the case, it could be just a firmware update away from being improved (not the game code, but the system-level network performance). Also consider that netcode can make a huge difference; it's possible that the netcode just wasn't meshing well with the networking on the Wii U (whereas it wasn't a problem on other platforms).
 
How did you arrive at this number? If we take the statement "one Xenon core is 20% faster than Broadway at GP" as a base, it's 1243 / 729 / 1.2 = 1.4208, or IOW a single Espresso core should be 42% faster than a single Xenon core at GP.
???

Did you actually search? 750CL/Broadway and Espresso all feature 2-way SIMD known as paired singles.

I'm not familiar with that, and how would you know that Espresso uses the exact same feature with no enhancements or alterations? Was there any leaked documentation for Espresso like there was for Broadway?
 
I was referring to the bolded number in your post I quoted. Didn't you post that in response to freezamite's last post?

I'm not familiar with that, and how would you know that Espresso uses the exact same feature with no enhancements or alterations? Was there any leaked documentation for Espresso like there was for Broadway?
Espresso surely has paired singles. Whether it has some SIMD beyond that is not known, but the chances are low.
 
20%? Considering that the raw power scaling of Espresso is 6 times Broadway, I'd say that puts Espresso a little bit ahead of the last gen CPU.
What do you mean by raw power scaling?
No one ever answered, though: do the CPUs in the Nintendo consoles have features like MMX, SSE or Altivec?
No.

They feature 50 SIMD instructions and the paired singles, which seem to be an "Altivec alternative backwards beta/lite" implementation, albeit aimed at 3D graphics acceleration.

This is because regular G3s only had a 32-bit FPU and no SIMD instructions.

Altivec is a 4-way FPU+SIMD instruction set.
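To picture the difference, here's a rough, purely illustrative sketch in plain C++ (these are not real Gekko/Espresso paired-single intrinsics or actual Altivec code) of a 2-way paired-single multiply-add next to a 4-way Altivec-style one:

Code:
// Illustration only: 2-way 'paired singles' vs 4-way Altivec-style SIMD, modelled in scalar C++.
#include <cstdio>

struct PairedSingle { float ps0, ps1; };     // one 64-bit FPR holding two 32-bit singles

// 2-way multiply-add, the shape of Gekko/Broadway/Espresso paired-single ops
PairedSingle ps_madd(PairedSingle a, PairedSingle b, PairedSingle c) {
    return { a.ps0 * b.ps0 + c.ps0, a.ps1 * b.ps1 + c.ps1 };
}

struct Vec4 { float v[4]; };                 // a 128-bit Altivec register holds four singles

// 4-way multiply-add, the shape of Altivec's vmaddfp
Vec4 vmaddfp(Vec4 a, Vec4 b, Vec4 c) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i] + c.v[i];
    return r;
}

int main() {
    PairedSingle r = ps_madd({1, 2}, {3, 4}, {5, 6});
    std::printf("%g %g\n", r.ps0, r.ps1);    // prints: 8 14
    return 0;
}

On the real hardware each of these is a single instruction; the point is just the 2-wide versus 4-wide throughput per instruction.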
 
This is because regular G3s only had a 32-bit FPU and no SIMD instructions.
Most PPCs have had 64-bit FPUs since the 603 days. Actually, that's one of the reasons Gekko could put those wide FPRs to good use.

Apropos, to whoever was curious about how Gekko would have fared hypothetically in FPU/SIMD against the OG Xbox's Coppermine - I've added a Pentium M to the test. One can scale to the respective clocks from the quoted results.
 
Most PPCs have had 64-bit FPUs since the 603 days. Actually, that's one of the reasons Gekko could put those wide FPRs to good use.

Apropos, to whoever was curious about how Gekko would have fared hypothetically in FPU/SIMD against the OG Xbox's Coppermine - I've added a Pentium M to the test. One can scale to the respective clocks from the quoted results.

Thanks for that update!

When normalized, the Dothan (a CPU architecture used in notebooks roughly 9+ years ago) performs on par with the Bobcat!

That's absolutely insane, who would've thought that it could go toe to toe with a much more modern CPU like the Bobcat! Intel was really sitting on some pretty heady stuff even back then with the Pentium M. Those folks in Haifa, Israel really knew their stuff and were pretty much responsible for Intel regaining the throne from AMD with the Core 2.

Would it be too much trouble to ask for a Core 2 (Conroe or Penryn) to be added into the charts? Still waiting on the Sandy Bridge or Ivy Bridge samples too :D

Thanks again.
 
I was referring to the bolded number in your post I quoted. Didn't you post that in response to freezamite's last post?

No, I was quoting that from his post. Didn't you read what I was responding to?

There is a misinterpretation going on here: the analysis that Freezamite mentioned stated that Broadway's performance with GP code is 20% less than Xenon's, core to core. Krizzx was just emphasizing the 20% part.
 
There is a misinterpretation going on here: the analysis that Freezamite mentioned stated that Broadway's performance with GP code is 20% less than Xenon's, core to core. Krizzx was just emphasizing the 20% part.
Ok, I just saw his edit. My bad.
 
When normalized, the Dothan (a CPU architecture used in notebooks roughly 9+ years ago) performs on par with the Bobcat!

That's absolutely insane, who would've thought that it could go toe to toe with a much more modern CPU like the Bobcat!

More modern doesn't mean more performance per clock. Bobcat was designed for even lower power consumption combined with an integrated GPU. That didn't leave much room for better IPC, whereas the Pentium M was a relatively straightforward advancement of the Pentium 3. Intel's Atom usually performs worse, btw. ;)

Anyway, very interesting benchmarks! Thanks. Broadway performs a little better than I had expected.
 
Glad to hear the test is found interesting by fellow gaffers.

I'll be adding more CPUs as soon as I can get (1) sufficient time on those, and (2) a linux and a proper compiler on them. I don't really want to do largely-divergent toolchains tests - keeping compiler things under control is sufficiently complex as it is. I have access to a C2D, but it's running the wrong OS & cc; I'm waiting to get the chance to test Sandy Bridge, and definitely Atom (surprising how uncommon they are in my environment).
 
Glad to hear the test is found interesting by fellow gaffers.

I'll be adding more CPUs as soon as I can get (1) sufficient time on those, and (2) a linux and a proper compiler on them. I don't really want to do largely-divergent toolchains tests - keeping compiler things under control is sufficiently complex as it is. I have access to a C2D, but it's running the wrong OS & cc; I'm waiting to get the chance to test Sandy Bridge, and definitely Atom (surprising how uncommon they are in my environment).

Good to hear. I'm still rocking a C2D laptop circa 2008-09 for everyday use (surfing, very light gaming, video and music consumption) and I'm interested to know how it fares against the Broadway when normalized in that test.
 
blu said:
How did you arrive at this number? If we take the statement "one Xenon core is 20% faster than Broadway at GP" as a base, it's 1243 / 729 / 1.2 = 1.4208, or IOW a single Espresso core should be 42% faster than a single Xenon core at GP.
From here:
http://neosource.1emu.net/forums/index.php?topic=2003.15

And here:
http://www.neogaf.com/forum/showpost.php?p=44966628&postcount=646

Don't know if it's the same person, but it seems reliable enough to me.
Take into account that since the L2 cache is shared in Xenon, when you code for only one core you have the whole 1MB of L2 available to that particular core.
Espresso, on the other hand, has double Broadway's L2 cache for cores 0 and 2 (512KB each), and 8 times more for core 1 (2MB).

So in a multi-core scenario, where the L2 in the Xbox 360 is shared among all the cores/threads, the L2 available per core will surely be far less than the whole 1MB available in that particular case where a Xenon core is only 20% faster than Broadway in GP code.

Since caches are much larger in Espresso and latency to main memory is much lower (faster DDR3 accesses + the MCM design, and that's not even counting data stored in the 32MB eDRAM pool, which would have far lower latency than the big DDR3 pool), even a 2x estimate would seem conservative to me.
Add to that the fact that the Xbox 360 lacked the DSP that the Wii U has, so in some games a whole Xenon core had to be used for processing sound, and the real difference increases even more in favour of Espresso.
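As a rough illustration of the cache argument, here's a small, generic C++ working-set sweep (nothing Espresso- or Xenon-specific; the sizes and iteration counts are arbitrary). Once the working set stops fitting in L2, every pass starts paying the much higher main-memory latency, which is exactly where a bigger per-core L2 and lower RAM latency help:

Code:
// Generic working-set sweep - keeps total work constant so the timings are comparable.
// Results are machine dependent and purely illustrative.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t total_accesses = 256u * 1024 * 1024;     // same amount of work for every size
    const size_t sizes_kb[] = {128, 256, 512, 2048, 8192};
    for (size_t kb : sizes_kb) {
        const size_t n = kb * 1024 / sizeof(int);
        std::vector<int> data(n, 1);
        long long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t pass = 0; pass < total_accesses / n; ++pass)
            for (int x : data) sum += x;                   // stream through the working set
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("%5zu KiB working set: %7.1f ms (sum %lld)\n", kb, ms, sum);
    }
    return 0;
}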
 
From here:
http://neosource.1emu.net/forums/index.php?topic=2003.15

And here:
http://www.neogaf.com/forum/showpost.php?p=44966628&postcount=646

Don't know if it's the same person, but it seems reliable enough to me.
Take into account that since the L2 cache is shared in Xenon, when you code for only one core you have the whole 1MB of L2 available to that particular core.
Espresso, on the other hand, has double Broadway's L2 cache for cores 0 and 2 (512KB each), and 8 times more for core 1 (2MB).

So in a multi-core scenario, where the L2 in the Xbox 360 is shared among all the cores/threads, the L2 available per core will surely be far less than the whole 1MB available in that particular case where a Xenon core is only 20% faster than Broadway in GP code.

Since caches are much larger in Espresso and latency to main memory is much lower (faster DDR3 accesses + the MCM design, and that's not even counting data stored in the 32MB eDRAM pool, which would have far lower latency than the big DDR3 pool), even a 2x estimate would seem conservative to me.
Add to that the fact that the Xbox 360 lacked the DSP that the Wii U has, so in some games a whole Xenon core had to be used for processing sound, and the real difference increases even more in favour of Espresso.
I'm aware of the source of the 20% in your post. I was asking where krizzx's 20% was coming from, as his pre-edit post allowed a misinterpretation (which I fell for, apparently) in the sense of '20% is what an Espresso core should do over a Xenon core at GP'.
 
I'm aware of the source of the 20% in your post. I was asking where krizzx's 20% was coming from, as his pre-edit post allowed a misinterpretation (which I fell for, apparently) in the sense of '20% is what an Espresso core should do over a Xenon core at GP'.

Pre-edit? The only thing I changed was that I added a "?" in place of a "." which I mistyped. You're stating it like I completely changed what I wrote to something else. It's still the same post. Just reading the quoted material I was responding to would have made it clear what I was speaking of, question mark or not.
 
Great to see that the list is growing and will continue to grow, Blu, excellent work.

Any chance of adding more mobile CPUs like the A15 and A5, or even just an A9?
 
Pre-edit? The only thing I changed was that I added a "?" in place of a period which I forgot. You're stating it like I completely changed what I wrote. Just reading the quoted material I was responding to would have made it clear what I was speaking of.
If I had seen the question mark from the start I wouldn't have misinterpreted your post. The question mark would have been a sufficient indication you're referring to the same number the other poster had already given a context to.

Great to see that the list is growing and will continue to grow, Blu, excellent work.

Any chance of adding more mobile CPUs like the A15 and A5, or even just an A9?
Thanks, Donnie. I'm adding to it as I come across specimens. Funny fact: the Dothan actually got on the list by accident - I was asking around for an Atom netbook to test on, and a friend offered his. Turned out the netbook had a Dothan ; )
 
Glad to hear the test is found interesting by fellow gaffers.

I'll be adding more CPUs as soon as I can get (1) sufficient time on those, and (2) a linux and a proper compiler on them. I don't really want to do largely-divergent toolchains tests - keeping compiler things under control is sufficiently complex as it is. I have access to a C2D, but it's running the wrong OS & cc; I'm waiting to get the chance to test Sandy Bridge, and definitely Atom (surprising how uncommon they are in my environment).

I have a C2D running Mint 12 if that helps...
 
I have a C2D running Mint 12 if that helps...
Yep, that should do - thanks. I'll build a core2-tuned version tonight and post the asm listing along with a gcc command line to produce the binary. You'll have the pleasure of doing the bunch of runs ; ) Just make sure the test machine is not in some power-save mode where it could throttle the CPU.
 
I have a C2D running Mint 12 if that helps...

Yep, that should do - thanks. I'll build a core2-tuned version tonight and post the asm listing along with a gcc command line to produce the binary. You'll have the pleasure of doing the bunch of runs ; ) Just make sure the test machine is not in some power-save mode where it could throttle the CPU.

Wewt, thanks guys! Earendil, make sure to show them C2D chips in a good light lol. Maximum performance all the way!
 
The biggest issue (besides making it theoretically work while having no access to detailed technical documentation) is that the PPC CPU emulation is already a huge bottleneck in Dolphin as is, so 3x ~170% of that needs much beefier machines to even be able to check the emulation quality without taking even more (compatibility-breaking) shortcuts. Then a major (if not the core) problem in emulation is the accurate emulation of timing between the different parts. The complexity of that increases a thousandfold with multiple cores as well as with different levels of fast caches/memories. So working Wii U and Xbox 360 emulation will need a lot more time (first we need to have feasible PC power at easy disposal), while PS3 emulation is even further off.

Ah, ok. So if I understand correctly, you're basically saying (among other more technical points) that emulation complexity doesn't scale linearly relative to the number of cores being emulated, and may in fact even scale exponentially with the number of cores.

Thanks!
 
Out of curiosity, do we know how much more performance the X360 CPU has vs. the Wii CPU?
If we do, couldn't we come up with a worst-case scenario? I.e. if the Wii U's CPU didn't have any improvements compared to the Wii CPU, then it'll have X% of the X360 CPU's performance.
 
Out of curiosity, do we know how much more performance the X360 CPU has vs. the Wii CPU?
If we do, couldn't we come up with a worst-case scenario? I.e. if the Wii U's CPU didn't have any improvements compared to the Wii CPU, then it'll have X% of the X360 CPU's performance.

Well, that's somewhat what Blu's tests were all about in the beginning, I believe: to compare the Wii CPU to (one core of) Bobcat (which is supposedly the +/-10% less beefy brother of the CPU inside PS4/Durango), see how it compares for most common game code, and extrapolate to 3 cores and a higher clock for the Wii U. Turns out that for actual game code, the Wii U CPU should be no slouch. I think the outcome was that one Wii U core would be able to keep up with one PS4/Durango core... Of course, the Wii U CPU only has 3 cores and the other two have 8(?).

Also keep in mind that over 15% of the 360's CPU was used for sound. The Wii U has a DSP to take care of the sound.
 
Well well, what do you know - turns out GCC 4.6.3 produces identical asm code from the test app whether compiling for Core2 or for Core i7 (gcc options -march=core2 -mtune=core2 versus -march=corei7 -mtune=corei7). So Nehalem's assembly code as linked in the test results will do for the C2D test just fine. Now, re building the binary, once you have downloaded the assembly listing (I suggest testing both the autovectorized as well as the manually emitted version, if time permits), do:

Code:
gcc -o testvect_intrinsic -x assembler path/to/downloaded/asm/listing -lrt -lstdc++

That produces the binary 'testvect_intrinsic' in the current directory. From there on, launch as:

Code:
echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1" | ./testvect_intrinsic

What the above does is send to the test its expected parameters (so you don't have to type them each time) - arg matrix A (16 elements), arg matrix B (16 elements) and a magic number 1 (used for fooling the compiler into thinking we're doing millions of unique matrix multiplications, whereas in fact we do one repeatedly). The particular parameters above are a matrix containing the numbers 1 through 16 (row by row), and an identity matrix. Their multiplicative result should be a matrix containing the numbers 1 through 16 (yeah, I know - I have no imagination); if the product printed out is not the expected one then something went terribly wrong ™.

Do the above run a good number of times (say, a dozen or two), write down the best time (which, if the machine was originally at rest, should not really vary much). Post back : )
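For anyone curious what the test is actually doing, here's a plain C++ sketch of the same idea (this is not blu's code, just an illustrative reimplementation of the setup he describes): read matrices A and B from stdin, multiply them over and over while feeding the result back in so the compiler can't fold the loop away, and time it. With the identity matrix as B, the final product should still be the numbers 1 through 16.

Code:
// Illustrative reimplementation only - not the actual testvect_intrinsic source.
#include <chrono>
#include <cstdio>
#include <cstring>

typedef float Mat4[4][4];

static void mul(float a[4][4], float b[4][4], float out[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = 0.f;
            for (int k = 0; k < 4; ++k) s += a[i][k] * b[k][j];
            out[i][j] = s;
        }
}

int main() {
    Mat4 a, b, r;
    for (int i = 0; i < 16; ++i) std::scanf("%f", &a[i / 4][i % 4]);   // matrix A
    for (int i = 0; i < 16; ++i) std::scanf("%f", &b[i / 4][i % 4]);   // matrix B

    auto t0 = std::chrono::steady_clock::now();
    for (int n = 0; n < 10000000; ++n) {
        mul(a, b, r);
        std::memcpy(a, r, sizeof a);        // chain the result so the loop can't be optimized out
    }
    double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    std::printf("%.6f s, last row: %g %g %g %g\n", s, r[3][0], r[3][1], r[3][2], r[3][3]);
    return 0;
}

You can pipe it the exact same echo line as above; the trailing 1 is simply ignored by this sketch.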

Well, that's somewhat what Blu's tests were all about in the beginning, I believe: to compare the Wii CPU to (one core of) Bobcat (which is supposedly the +/-10% less beefy brother of the CPU inside PS4/Durango), see how it compares for most common game code, and extrapolate to 3 cores and a higher clock for the Wii U. Turns out that for actual game code, the Wii U CPU should be no slouch. I think the outcome was that one Wii U core would be able to keep up with one PS4/Durango core... Of course, the Wii U CPU only has 3 cores and the other two have 8(?).
You have been slightly misled. Extrapolating the results from the test shows an Espresso core should perform equally (on the test) to a 1.6GHz Bobcat, not to a 1.6GHz Jaguar. Jaguar should have a twice-as-wide SIMD FP unit compared to Bobcat, and that should show in the test. Now, how much better Jaguar will be compared to Bobcat remains to be seen.
 
Well, that's somewhat what Blu's tests were all about in the beginning, I believe: to compare the Wii CPU to (one core of) Bobcat (which is supposedly the +/-10% less beefy brother of the CPU inside PS4/Durango), see how it compares for most common game code, and extrapolate to 3 cores and a higher clock for the Wii U.

I think it might be a bit far-fetched to say that this benchmark is really representative of game code, but it's a lot better than nothing, of course, and interesting nonetheless.

[...] (which is supposedly the +/-10% less beefy brother of the CPU inside PS4/Durango)[...]

Actually, AMD even promises 15% higher IPC (and a 10% higher clock). Additionally, the FPUs have been widened from 64-bit to 128-bit, allowing for 4-way SSE instructions in one pass instead of two.

I think the outcome was that one WiiU core, would be able to keep up with one PS4/Durango core...

Not quite (even without the enhanced SIMD), since the Jaguar cores will be clocked higher (1.6 - 2.0 GHz rumored). But yeah, the difference per core will probably not be all too big in the real world. And it's easier to get the maximum performance out of three cores than out of eight.
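To illustrate the "one pass instead of two" point, here's a tiny, generic SSE example (standard intrinsics, not anything taken from AMD's documentation): a single 128-bit _mm_mul_ps multiplies four packed singles at once, whereas a 64-bit FP unit has to split that same operation into two 2-wide halves internally.

Code:
// Generic SSE intrinsics example - one 128-bit (4-way) packed single-precision multiply.
#include <xmmintrin.h>
#include <cstdio>

int main() {
    __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);   // four packed singles: {1, 2, 3, 4}
    __m128 b = _mm_set_ps(8.f, 7.f, 6.f, 5.f);   // {5, 6, 7, 8}
    __m128 c = _mm_mul_ps(a, b);                  // one 128-bit multiply: {5, 12, 21, 32}

    float out[4];
    _mm_storeu_ps(out, c);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}

On Bobcat the hardware cracks this into two 64-bit halves; on Jaguar's 128-bit units it goes through in one pass, which is where the per-clock FP advantage mentioned above comes from.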
 
Well well, what do you know - turns out GCC 4.6.3 produces identical asm code from the test app whether compiling for Core2 or for Core i7 (gcc options -march=core2 -mtune=core2 versus -march=corei7 -mtune=corei7). So Nehalem's assembly code as linked in the test results will do for the C2D test just fine. Now, re building the binary, once you have downloaded the assembly listing (I suggest testing both the autovectorized as well as the manually emitted version, if time permits), do:

Code:
gcc -o testvect_intrinsic -x assembler path/to/downloaded/asm/listing -lrt -lstdc++

That produces the binary 'testvect_intrinsic' in the current directory. From there on, launch as:

Code:
echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1" | ./testvect_intrinsic

What the above does is send to the test its expected parameters (so you don't have to type them each time) - arg matrix A (16 elements), arg matrix B (16 elements) and a magic number 1 (used for fooling the compiler into thinking we're doing millions of unique matrix multiplications, whereas in fact we do one repeatedly). The particular parameters above are a matrix containing the numbers 1 through 16 (row by row), and an identity matrix. Their multiplicative result should be a matrix containing the numbers 1 through 16 (yeah, I know - I have no imagination); if the product printed out is not the expected one then something went terribly wrong ™.

Do the above run a good number of times (say, a dozen or two), write down the best time (which, if the machine was originally at rest, should not really vary much). Post back : )

I did some runs of this on my Sandy Bridge @ 3.4 GHz (Core i5 2500K, nominally 3.3 GHz) and Ubuntu 10.10.

Results:
0.386927 s (autovectorized)
0.500083 s (manual intrinsics)

..which translates to 1547,708 and 2000,332 on the normalized ranking (correcting my initial figures of 1276,8591 and 1650,2739). This is of course without taking advantage of AVX.
 
Well well, what do you know - turns out GCC 4.6.3 produces identical asm code from the test app whether compiling for Core2 or for Core i7 (gcc options -march=core2 -mtune=core2 versus -march=corei7 -mtune=corei7). So Nehalem's assembly code as linked in the test results will do for the C2D test just fine. Now, re building the binary, once you have downloaded the assembly listing (I suggest testing both the autovectorized as well as the manually emitted version, if time permits), do:

Code:
gcc -o testvect_intrinsic -x assembler path/to/downloaded/asm/listing -lrt -lstdc++

That produces the binary 'testvect_intrinsic' in the current directory. From there on, launch as:

Code:
echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1" | ./testvect_intrinsic

What the above does is send to the test its expected parameters (so you don't have to type them each time) - arg matrix A (16 elements), arg matrix B (16 elements) and a magic number 1 (used for fooling the compiler into thinking we're doing millions of unique matrix multiplications, whereas in fact we do one repeatedly). The particular parameters above are a matrix containing the numbers 1 through 16 (row by row), and an identity matrix. Their multiplicative result should be a matrix containing the numbers 1 through 16 (yeah, I know - I have no imagination); if the product printed out is not the expected one then something went terribly wrong ™.

Do the above run a good number of times (say, a dozen or two), write down the best time (which, if the machine was originally at rest, should not really vary much). Post back : )


You have been slightly misled. Extrapolating the results from the test shows an Espresso core should perform equally (on the test) to a 1.6GHz Bobcat, not to a 1.6GHz Jaguar. Jaguar should have a twice-as-wide SIMD FP unit compared to Bobcat, and that should show in the test. Now, how much better Jaguar will be compared to Bobcat remains to be seen.

Please forgive me, but where's the download? I know you linked something earlier in the thread, but for the life of me I can't find it.

Thanks
 
I did some runs of this on my Sandy Bridge @ 3.4 GHz (Core i5 2500K, nominally 3.3 GHz) and Ubuntu 10.10.

Results:
0.386927 s (autovectorized)
0.500083 s (manual intrinsics)

..which translates to 1745,708 and 2000,332 on the normalized ranking (correcting my initial figures of 1276,8591 and 1650,2739). This is of course without taking advantage of AVX.
Thank you, lightchris. I'll add those to the list. And yes, one has to be very careful about the clock speeds - all new CPUs have some form or another of clock boost. One has to know what that boost is and when that happens.
 
Thank you, lightchris. I'll add those to the list. And yes, one has to be very careful about the clock speeds - all new CPUs have some form or another of clock boost. One has to know what that boost is and when that happens.

Yup. I'm aware of the turbo boost; it's just that cpuinfo_cur_freq seems to disregard the boost, while the tool I normally use to observe clock speeds under Windows displays the actual current clock.

..and I just noticed I had another typo in the normalized values. The performance increase over Nehalem is at a solid 15%. Ivy Bridge will probably be another 5-10% faster.
 
So I downloaded it, and when I try to compile it, I get a bunch of "bad register name" errors. I copy-pasted the pastebin code into a new document called NehalemAutoVectorized.txt and tried to compile it with a sudo. What am I missing (I'm a Linux n00b)?
 
So I downloaded it, and when I try to compile it, I get a bunch of "bad register name" errors. I copy-pasted the pastebin code into a new document called NehalemAutoVectorized.txt and tried to compile it with a sudo. What am I missing (I'm a Linux n00b)?
First of all make sure you are dealing with the Nehalem code (A: autovectorized; B: manual emit), as some of the other listings are not even x86. Then use the 'download' link at the top of the page instead of copy-pasting the code. If you still cannot build it, paste the error output here and we'll figure it out.
 
I downloaded it instead of pasting into a new document. The name was sse3-nehalem.txt and I get the following error:

sse3_nehalem.txt:1079: Error: bad register name `%xmm13'
sse3_nehalem.txt:1081: Error: bad register name `%xmm11'
sse3_nehalem.txt:1082: Error: bad register name `%xmm14'
sse3_nehalem.txt:1083: Error: bad register name `%xmm13'
sse3_nehalem.txt:1084: Error: bad register name `%xmm11'
sse3_nehalem.txt:1085: Error: bad register name `%xmm12'
sse3_nehalem.txt:1086: Error: bad register name `%xmm12'
sse3_nehalem.txt:1091: Error: bad register name `%rax'
sse3_nehalem.txt:1093: Error: bad register name `%rbx'
sse3_nehalem.txt:1095: Error: bad register name `%rbp'
sse3_nehalem.txt:1097: Error: bad register name `%rip)'
sse3_nehalem.txt:1098: Error: bad register name `%rax'
sse3_nehalem.txt:1101: Error: bad register name `%rax'
sse3_nehalem.txt:1103: Error: bad register name `%rax'
sse3_nehalem.txt:1108: Error: bad register name `%rsp'
sse3_nehalem.txt:1112: Error: bad register name `%rbx'
sse3_nehalem.txt:1114: Error: bad register name `%rbp'
sse3_nehalem.txt:1119: Error: bad register name `%rbp'
sse3_nehalem.txt:1121: Error: bad register name `%rbx'
sse3_nehalem.txt:1122: Error: bad register name `%rbp'
sse3_nehalem.txt:1123: Error: bad register name `%rbx'

There's more lines to the error, but they are all pretty much the same.

Here's the command line:

Code:
sudo gcc -o testvect_intrinsic -x assembler sse3_nehalem.txt -lrt -lstdo++

I think I see why. I have a 32bit version of linux. For some reason when I tried to install the 64bit version last year, it failed. So I had to install the 32bit version.
 
I downloaded it instead of pasting into a new document. The name was sse3-nehalem.txt and I get the following error:



There's more lines to the error, but they are all pretty much the same.

Here's the command line:

Code:
sudo gcc -o testvect_intrinsic -x assembler sse3_nehalem.txt -lrt -lstdo++

I think I see why. I have a 32bit version of linux. For some reason when I tried to install the 64bit version last year, it failed. So I had to install the 32bit version.
Indeed. That Nehalem code is 64bit (x86_64). I can build it for 32bit (the Dothan code for instance is 32bit) but that will likely affect the results negatively (increase register pressure, particularly in the manual emit case). Ok, we need a 64bit OS for the C2D. Thank you for the effort nevertheless!
 
Indeed. That Nehalem code is 64bit (x86_64). I can build it for 32bit (the Dothan code for instance is 32bit) but that will likely affect the results negatively (increase register pressure, particularly in the manual emit case). Ok, we need a 64bit OS for the C2D. Thank you for the effort nevertheless!

If it weren't my wife's computer, I'd happily reformat it and try a 64bit OS again. Maybe I'll be able to do it some time in the future.

EDIT: What performance impact would it have if I were able to boot from a USB key?
 
If it weren't my wife's computer, I'd happily reformat it and try a 64bit OS again. Maybe I'll be able to do it some time in the future.

EDIT: What performance impact would it have if I were able to boot from a USB key?
It should be perfectly fine.

Lightchris, let's do the Sandy Bridge properly. Here's the AVX version. Give it a try when/if you can, willya?
 
For some reason, it thinks that the Core2Duo I have is a 32bit processor and will not install the 64bit version. It insists it's an i686 and not an x86_64. It's probably because this laptop has been through hell.

EDIT:

I'm booted into 64bit Linux on the M620 and when I try to compile the Nehalem code, I get:

Warning: end of file not at end of line; newline inserted
/usr/bin/ld: cannot find -lstdo++
collect2: error: ld returned 1 exit status

I downloaded the file, instead of copy/pasting it. Am I using the wrong assembly listing?
 
For some reason, it thinks that the Core2Duo I have is a 32bit processor and will not install the 64bit version. It insists it's an i686 and not an x86_64. It's probably because this laptop has been through hell.

EDIT:

I'm booted into 64bit Linux on the M620 and when I try to compile the Nehalem code, I get:



I downloaded the file, instead of copy/pasting it. Am I using the wrong assembly listing?
That's because I made a typo on the command line I posted :/ It's not -lstdo++, it's -lstdc++

ed: Wait, I didn't, it was your typo ; )
 
It should be perfectly fine.

Lightchris, let's do the Sandy Bridge properly. Here's the AVX version. Give it a try when/if you can, willya?

0.368508 s

So, only a tiny difference. Does this really use the 256-bit registers? Or do you have another guess on why the speedup isn't bigger (is it actually not compute bound anymore)?
 