GPUs vs Cell
Blogged under Cell by Barry Minor on Wednesday 30 November 2005 at 7:39 pm
Recently I came across a link on www.gpgpu.org that I found
interesting. It described a method of ray-tracing quaternion Julia
fractals using the floating point power in graphics processing units
(GPUs). The author of the GPU code, Keenan Crane, stated that "This
kind of algorithm is pretty much ideal for the GPU - extremely high
arithmetic intensity and almost zero bandwidth usage". I thought it
would be interesting to port this Nvidia Cg code to the Cell processor,
using the public SDK, and see how it performs, given that it was called
ideal for a GPU. First we translated the Cg code line for line into C
with SPE intrinsics, keeping all of the Cg code structures and data
types. Then we wrote a Cg framework for Cell to execute this shader,
including a backend image compression and network delivery
layer for the finished images. To our surprise (well, not really), we
found that, using only 7 SPEs for rendering, a 3.2 GHz Cell chip could
outrun an Nvidia 7800 GT OC card at this task by about 30%. We
reserved one SPE for the image compression and delivery task.
Furthermore, the way Cg structures its SIMD computation is inefficient,
as it causes a large percentage of the code to execute in scalar mode.
This is due to the way it lays out vector data: array of structures
(AOS) rather than structure of arrays (SOA). Converting this Cg shader
from AOS to SOA form made SIMD utilization much higher, which resulted
in Cell outperforming the Nvidia 7800 by a factor of 5 to 6 using only
7 SPEs for rendering. Given that the Nvidia
7800 GT is listed as having 313 GFLOPS of computational power and seven
3.2 GHz SPEs have only 179.2 GFLOPS (7 SPEs x 3.2 GHz x 8
single-precision operations per cycle), this seems impossible. But then
again, maybe we should start reading more white papers and less
marketing hype.
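
To make the AOS versus SOA point concrete, here is a minimal sketch in
plain C. It is not the shader or the SPE code from this port. It assumes
a hypothetical squared-norm kernel over quaternions and a SIMD width of
four floats, and the names QuatAOS, QuatSOA, norm2_aos, aos_to_soa and
norm2_soa are invented for illustration. In AOS form one register holds
the x, y, z, w of a single quaternion, so the final sum runs across the
lanes of that register. In SOA form each register holds the same
component of four different quaternions, so every multiply and add can
keep all four lanes busy.

```c
#include <assert.h>
#include <stddef.h>

#define N 4  /* assumed SIMD width: four floats per register */

/* AOS: one struct (one register's worth) per quaternion. */
typedef struct { float x, y, z, w; } QuatAOS;

/* SOA: one array per component, each spanning N quaternions. */
typedef struct { float x[N], y[N], z[N], w[N]; } QuatSOA;

/* AOS squared norm of one quaternion. The four products fit a single
   4-wide multiply, but the three additions combine lanes of the same
   register, which is effectively scalar work. */
static float norm2_aos(const QuatAOS *q)
{
    return q->x * q->x + q->y * q->y + q->z * q->z + q->w * q->w;
}

/* Transpose N quaternions from AOS layout into SOA layout. */
static void aos_to_soa(const QuatAOS in[N], QuatSOA *out)
{
    assert(in != NULL && out != NULL);
    for (int i = 0; i < N; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}

/* SOA squared norm of N quaternions at once. The index i plays the
   role of a SIMD lane: each multiply and each add below corresponds to
   one 4-wide operation in which every lane does useful work. */
static void norm2_soa(const QuatSOA *q, float out[N])
{
    for (int i = 0; i < N; i++) {
        out[i] = q->x[i] * q->x[i] + q->y[i] * q->y[i]
               + q->z[i] * q->z[i] + q->w[i] * q->w[i];
    }
}
```

Both versions compute the same values. What changes is how the work
maps onto 4-wide registers: calling aos_to_soa once on a batch of four
quaternions and then norm2_soa replaces four calls to norm2_aos, with no
cross-lane sums left in the inner computation.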