Well well, what do you know - turns out GCC 4.6.3 produces identical asm code from the test app whether compiling for Core2 or for Core i7 (gcc options -march=core2 -mtune=core2 versus -march=corei7 -mtune=corei7). So the Nehalem assembly code linked in the test results will do just fine for the C2D test. Now, re building the binary: once you have downloaded the assembly listing (I suggest testing both the autovectorized and the manually emitted versions, if time permits), do:
Code:
gcc -o testvect_intrinsic -x assembler path/to/downloaded/asm/listing -lrt -lstdc++
That produces the binary 'testvect_intrinsic' in the current directory. From there on, launch as:
Code:
echo "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1" | ./testvect_intrinsic
What the above does is send the test its expected parameters (so you don't have to type them each time) - arg matrix A (16 elements), arg matrix B (16 elements) and a magic number 1 (used to fool the compiler into thinking we're doing millions of unique matrix multiplications, whereas in fact we do one repeatedly). The particular parameters above are a matrix containing the numbers 1 through 16 (row by row) and an identity matrix. Their product should be a matrix containing the numbers 1 through 16 (yeah, I know - I have no imagination); if the product printed out is not the expected one, then something went terribly wrong.
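If you want to sanity-check the expected product by hand, here is a minimal sketch in Python. It is a stand-in for the arithmetic only, not the test app's own code; the helper name matmul4 and the flat row-major layout are my assumptions for illustration. Since B is the identity, the result equals A under either layout convention.

```python
# Stand-in check of the expected result; NOT the test app's own code.
# matmul4 and the flat row-major layout are assumptions for illustration.
def matmul4(a, b):
    """Multiply two 4x4 matrices given as flat, row-major lists of 16 numbers."""
    return [
        sum(a[row * 4 + k] * b[k * 4 + col] for k in range(4))
        for row in range(4)
        for col in range(4)
    ]

# The same 33 values that the echo command pipes into the test.
values = [float(tok) for tok in (
    "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 "
    "1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1"
).split()]

# 16 elements for A, 16 for B, then the magic number.
a, b, magic = values[:16], values[16:32], values[32]
product = matmul4(a, b)

# Print the product one row per line.
for row in range(4):
    print(product[row * 4:row * 4 + 4])
```

The printed rows should match matrix A, i.e. the numbers 1 through 16 row by row.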
Repeat the above run a good number of times (say, a dozen or two) and write down the best time (which, if the machine was otherwise idle, should not really vary much). Post back :)
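If you would rather not launch the runs by hand, a rough sketch of a best-of-N loop in Python follows. The helper name best_wall_time is hypothetical, and it measures wall-clock time for the whole process (startup and input parsing included), so treat it only as a cross-check; the figure to write down is still the time the test itself reports.

```python
import subprocess
import time

# Hypothetical helper: measures whole-process wall-clock time, which is NOT
# the same figure the test prints; use it only as a rough cross-check.
def best_wall_time(cmd, stdin_text, runs=12):
    """Run cmd `runs` times, piping stdin_text in each time; return the fastest run in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command exits with a non-zero status.
        subprocess.run(cmd, input=stdin_text, text=True,
                       capture_output=True, check=True)
        best = min(best, time.perf_counter() - start)
    return best

# Example call (assumes the binary built above sits in the current directory):
# params = ("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 "
#           "1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1\n")
# print(best_wall_time(["./testvect_intrinsic"], params, runs=24))
```

Taking the minimum rather than the average is deliberate: on a machine at rest, the fastest run is the one least disturbed by other activity.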
You have been slightly misled. Extrapolating from the test results, an Espresso core should perform on par (on the test) with a 1.6GHz Bobcat, not with a 1.6GHz Jaguar. Jaguar should have a SIMD fp unit twice as wide as Bobcat's, and that should show in the test. Just how much better Jaguar will be than Bobcat remains to be seen.