doncale said:
The Emotion Engine CPU has a calculation/transform limit of 66 million polygons (or vertices) per second. The Graphics Synthesizer rasterizer has a draw/display limit of 75 million polygons per second.
As Panajev has mentioned, the Emotion Engine CPU can actually be made to transform over 100 million polygons (or vertices) per second if it is doing nothing else (using both VUs and the FPU, I think).
No, using only the two VUs.
Using the FPU of the RISC core you could technically do another 15 MVertices/s (I never tried it myself with the FPU; I assumed 20 cycles per vertex for a simple perspective transform).
Edit:
Multiplying a matrix by a vector with the FPU: for each row you could do one FMUL and three FMADDs, for a total of 4 FMULs and 12 FMADDs for the whole matrix. Pipelined (the throughput is 1 cycle for FMADD and FMUL, i.e. you can issue the same instruction again the next cycle, but the latency is longer than that IIRC), we should be able to do that in about 18 cycles. We can do all the FMULs first, starting from the last row and going up, then start from the last row again and do the FMADDs, one FMADD per row per pass, to reduce data-dependency stalls.
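To make that instruction ordering concrete, here is a rough sketch in Python. It only models the order of operations and the operation counts, not real FPU timing, and Python does not fuse the multiply-add. The function name `transform_vertex`, the row-major layout, and the counters are my own assumptions for illustration.

```python
def transform_vertex(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector.

    Mirrors the ordering described above: one multiply per row first,
    then three passes of one multiply-add per row, last row first.
    Returns the result plus the multiply and multiply-add counts.
    """
    out = [0.0] * 4
    fmul_count = 0
    fmadd_count = 0

    # Pass 1: one FMUL per row, starting from the last row and going up.
    for row in (3, 2, 1, 0):
        out[row] = m[row][0] * v[0]
        fmul_count += 1

    # Passes 2-4: one FMADD per row per pass, again last row first, so
    # back-to-back instructions write to different accumulators.
    for col in (1, 2, 3):
        for row in (3, 2, 1, 0):
            out[row] = out[row] + m[row][col] * v[col]
            fmadd_count += 1

    return out, fmul_count, fmadd_count


# Hypothetical example: a pure translation by (5, 6, 7).
translate = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 6.0],
    [0.0, 0.0, 1.0, 7.0],
    [0.0, 0.0, 0.0, 1.0],
]
result, fmuls, fmadds = transform_vertex(translate, [1.0, 2.0, 3.0, 1.0])
print(result, fmuls, fmadds)  # [6.0, 8.0, 10.0, 1.0] 4 12
```

Interleaving the rows this way means each accumulator is touched only once every four instructions, which is the point of the ordering.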
We still need to take care of the division by W (FDIV's speed is one of the limiters of VU0, which does not have the EFU).
We can divide 1 by W (one FDIV instruction, latency is normally 7 cycles IIRC) and then multiply the vector we just obtained by the result (a broadcast FMUL on the VUs, four separate FMULs on the FPU).
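The reciprocal-then-multiply step can be sketched like this, again in Python purely as an illustration. The name `perspective_divide` is hypothetical, and the four multiplies match the count used for the scalar FPU case.

```python
def perspective_divide(clip):
    """Divide a homogeneous vertex by W using one division and four multiplies."""
    x, y, z, w = clip
    inv_w = 1.0 / w  # the single FDIV
    # Four separate multiplies, as on the scalar FPU
    # (the VUs would do this as one broadcast multiply).
    return [x * inv_w, y * inv_w, z * inv_w, w * inv_w]


projected = perspective_divide([6.0, 8.0, 10.0, 2.0])
print(projected)  # [3.0, 4.0, 5.0, 1.0]
```

Doing one divide and four multiplies instead of four divides is what keeps the slow FDIV down to a single issue per vertex.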
With the way we did the matrix multiplication, W (in homogeneous coordinates) is ready to be forwarded right when we have finished the matrix multiplication (not a cycle after, IIRC), so we can issue the FDIV instruction and start doing the FMULs right after that. I am not 100% sure about the 5900i's FPU in the Emotion Engine, but in a lot of pipelined FPUs the result can be forwarded to another instruction before it is written back to the register.
We can launch the FDIV instruction right after we issue the FMADD instruction that computes W in the matrix multiplication; it will stall before execution, but it will already be in the pipeline.
7 cycles for the divide and 4 cycles for the four FMULs (we are processing a single vertex, not multiple vertices at the same time, which would speed up the transform, so mine is not an optimal case).
Total ~= 18 + 11 ~= 29 cycles.
So the actual number for the FPU is closer to 10 MVertices/s.
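For anyone who wants to check the arithmetic, here is a small sketch of the cycle budget. It assumes a clock rounded to 300 MHz, which is what the 20 cycles -> 15 MVertices/s estimate implies; the constant and function names are mine.

```python
CLOCK_HZ = 300_000_000  # assumption: core clock rounded to 300 MHz


def vertices_per_second(cycles_per_vertex, clock_hz=CLOCK_HZ):
    """Peak vertex rate if every vertex costs a fixed number of cycles."""
    return clock_hz / cycles_per_vertex


matrix_cycles = 18  # 4 FMULs + 12 FMADDs, pipelined
fdiv_cycles = 7     # 1 / W
fmul_cycles = 4     # scale x, y, z, w by 1 / W
total_cycles = matrix_cycles + fdiv_cycles + fmul_cycles

print(total_cycles)                                      # 29
print(vertices_per_second(20) / 1e6)                     # 15.0
print(round(vertices_per_second(total_cycles) / 1e6, 1)) # 10.3
```

These are peak figures for one vertex at a time with no loads, stores, or loop overhead counted, so treat them as an upper bound.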