Hey folks. I see my moniker has popped up in these discussions, so I suppose it's time to step in and clarify. Firstly, yes, I do still believe that a 160:8:8 configuration makes the most sense of what we're looking at. Allow me to give you my reasoning in as succinct a way as possible. And remember, while I do follow technology and whatnot, I am not a hardware engineer or a game programmer, so I am largely building off the work of others. Regardless, I've spent a fair amount of my leisure time comparing the die shot, reading the analysis of others, and trying to test different theories in an unbiased fashion.
In the case of the SPU count, this early post over on Beyond3D was one of the first to make me raise an eyebrow and question how the shader blocks could possibly hold more than 20 shaders each. Emphasis is mine.
Gipsel said:
I agree. But I must also say that the GPU and the layout of the SIMDs looks a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as base.
But there is more. I think function noticed already the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying than each one holds simply twice the amount of data (8kB instead of 4kB) and everything is fine. It's not like the SRAM stuff takes significantly less space on the WiiU die than it takes on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block would be indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (but obviously twice the size if we are going with the 15% increased density). So either they are dual ported (then the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed to think again about that crazy sounding idea of a 55nm die (with then only 160SPs of course).
http://beyond3d.com/showpost.php?p=1702908&postcount=4495
Since Gipsel posted this, it was concluded that the SRAM in the SPU blocks is not dual ported. Also, it seems like each SRAM block holds 4kB and not 8kB. I arrived at this by comparing the SRAM blocks to the smaller ones on the bottom of Latte identified by Marcan (check the OP for that image) as 2kB. The SRAM blocks used as GPRs for the shaders are exactly twice as long as those 2kB blocks. Other than that, they appear identical, so a differing density seems highly unlikely (unlike the SRAM used in the 1MB pool of texture cache in the upper left of the chip, which appears to be denser; with such a large amount needed, that's unsurprising).
Thus, it appears that each shader block can only hold 20 SPUs; that is unless Nintendo have actually cut register space to the shaders (the exact opposite of what Matt once reported, although that information seems a bit shaky since it didn't make much sense to people familiar w/ the ISA). As to why they are the size they are, we can really only guess, but there are a few factors which may come into play:
a) We've assumed perfect scaling from the 55nm RV770, which is usually not the case.
b) Renesas' 40nm process may be less dense than TSMC's (which is known for being incredibly dense). They may lose some density in making the process eDRAM friendly.
c) There may be extra logic in the shader blocks that runs the shim layer (the compatibility layer that performs translation), as Marcan described it. The 8-bit CPU he mentioned is specifically for converting the Wii video output to the format now used by Radeons. There is other logic on there to handle TEV instruction translation. I don't know exactly where it is, but it could very well be right there in the shader blocks.
d) Other small tweaks could have been implemented that make the shaders somewhat larger. DirectX11 SPUs have some additional logic in there to support the new features of the API, so perhaps Nintendo added something analogous for whatever features above DirectX10.1 they decided to include.
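For anyone who wants to check the register bank arithmetic behind the 20-SPs-per-block figure, here is a rough Python sketch. It only restates numbers from the posts above (4 banks per VLIW group, 40 SPs = 8 VLIW groups, 16 banks of 4kB per block on Latte, 32 banks of 4kB on Brazos). The function names are made up for illustration, and the count of 8 shader blocks is my assumption, chosen because it is what 160 total SPs would imply at 20 per block. Treat it as a back-of-envelope check, not a statement about how the hardware is actually wired.

```python
# Back-of-envelope check of the register bank argument.
# Figures come from the posts above; names are hypothetical.

LANES_PER_VLIW_GROUP = 5   # 40 SPs = 8 VLIW groups, so 5 SPs per group
BANKS_PER_VLIW_GROUP = 4   # individually addressed banks needed each cycle
SHADER_BLOCKS = 8          # ASSUMPTION: implied by 160 SPs at 20 per block


def sps_supported(banks: int) -> int:
    """SPs one block can feed if every register bank is single ported."""
    groups = banks // BANKS_PER_VLIW_GROUP
    return groups * LANES_PER_VLIW_GROUP


def register_kb_per_sp(banks: int, kb_per_bank: int, sps: int) -> float:
    """Register file capacity available to each SP, in kB."""
    return banks * kb_per_bank / sps


latte_sps_per_block = sps_supported(16)    # 16 banks seen per Latte block
brazos_sps_per_block = sps_supported(32)   # 32 banks per Brazos SIMD

brazos_kb = register_kb_per_sp(32, 4, 40)
latte_kb_if_20 = register_kb_per_sp(16, 4, 20)
latte_kb_if_40 = register_kb_per_sp(16, 4, 40)

print("Latte SPs per block:", latte_sps_per_block)
print("Latte total SPs:", latte_sps_per_block * SHADER_BLOCKS)
print("Brazos SPs per block:", brazos_sps_per_block)
print("kB of registers per SP (Brazos):", brazos_kb)
print("kB of registers per SP (Latte, 20 SPs):", latte_kb_if_20)
print("kB of registers per SP (Latte, 40 SPs):", latte_kb_if_40)
```

With 20 SPs per block the register space per SP comes out the same as on Brazos, while 40 SPs per block would halve it, which is the "cut register space" scenario mentioned above.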
Edit: Also, 55nm is not completely ruled out, but it does seem unlikely. I spoke to Jim Morrison myself on this, and he stated that the differences are very small and require some precise measurements to detect. For what it is worth, I did some measurements on MEM1 and it seems to fall in line with the reported cell size of Renesas' 40nm eDRAM.