WiiU "Latte" GPU Die Photo - GPU Feature Set And Power Analysis

Hello Mr Fourth Storm, I appreciate your efforts! And for the record, I don't have any great experience of hardware engineering either. But I do now work in an evil globo-mega-corp which has a lot of those folks. I managed to get a couple of them to look at the die shots and have a chat about my questions. Unfortunately they confirmed that it really is impossible to tell very much at all from just those shots. They did, however, have some general points on the interpretation. The primary one being that with a hand layout, all bets are off with respect to comparison against other dies and even within the same layout! i.e. every hand layout is really a mix of auto (majority) and manual, with the "same" logic varying in density/structure within the same die depending on positioning, heat, clock speed and the importance of those factors. So if you'll forgive my selective quoting...
Hello! I certainly welcome other reasonable takes on the Wii U's innards, even if they come from evil globo-mega-corp employees! So, thanks for running my post by those folks. While I agree that we must be careful when drawing any hard conclusions due to the high amount of variation in chip layouts, I don't agree that it's a useless endeavor. No, we can't say we know everything for sure, but we can make some damn good guesses. For instance, it's pretty much certain that what we've identified as shader blocks are, in fact, just that. One could say that we don't know for sure that e=mc^2 because our whole thought paradigm could be skewed. Just because there's a slight possibility of scientists being completely wrong doesn't mean that it should be given equal ground to everything else that seems to point to that equation being accurate. That's an extreme example, and I am nowhere near as certain about Latte's innards as I am about mass-energy equivalence, but the point remains. We have to look at this holistically. Game performance, TDP, GPR configuration in the shaders, and the number of TMUs (more on them later) all point to a 160 shader machine.

This is getting ranty, but bear with me here. From everything we have seen, it is pretty evident that Nintendo approached Wii U exactly as they did Wii, up to and including case design. They delivered a low TDP machine built around a single hook in the controller. They thought Nintendoland would be their Wii Sports and that the thing would sell like hotcakes. Third parties would have a somewhat easier time porting their games because of the unified shader architecture and dual analog controls. That's pretty much where they stopped. The hardware itself is well designed but low end. All the PR speak (Reggie blaring 1080p at every opportunity and claiming BLOPS2 ran better on Wii U) is transparently just that. Meanwhile, we have devs who are probably sitting around dissing the system over beers just like some on this board. People are people. The "not as many shaders" quote, the Metro Last Light guys deriding the CPU, the FB2/3 debacle - these are real reactions from people. We can't just dismiss them like those devs have some sort of hidden agenda. Then, of course, corporate gets word (perhaps a phone call from Reggie, haha) and the comments are quickly walked back. "Oh no, Wii U is a very capable system" or some similar vagary. And I'm not saying Wii U isn't capable. The shaders are definitely beyond Xenos/RSX level, it's got more RAM, a better cache setup, and some nice fast eDRAM. But it seems pretty clear that Nintendo were running their benchmarks aiming for ~PS360 performance (not necessarily looking to match those architectures component for component), and when they got there, they said, "Good enough!"

Forgive me for not replying to your other points. It seems we are pretty much in agreement on them. I don't know what a doubly dense eDRAM would look like (the 1 MB pool appears slightly darker, but it's so close my eyes may be playing tricks on me), but I still don't think it's likely in the end. I am not about to put up my conclusions on Wikipedia like it's straight-up fact, but I'm pretty confident in them at this point. I'll eat my socks if those S blocks aren't L1 cache.

Finally, I have followed this pretty closely, but even I can't keep track of every rumor that drops. I don't know where people got the idea that "fixed silicon" was confirmed or that a "custom shader" is somehow likely or necessary in any way (even if their tools suck, that's not intentional; their aim was to build a system both familiar and easy to program for). I do know the leaked feature/spec sheets describe a pretty standard R700. Marcan also believes the chip to be pretty standard, and he probably knows more than any of us at this point. Poor tools or not, I have a hard time believing that Criterion would have difficulty getting NFS up to par on a 320 shader machine. The final result, after all their back and forth with Nintendo and all the work of their world class programmers, is a version that barely edges out the 360 version months later. That's pretty telling in my eyes. I don't believe Nintendo designed a lopsided system where the CPU or the DDR3 is bogging down an extraordinary GPU. It's much easier to say that all components are a good fit for one another, and indeed, that's exactly what we have heard.
 
If the Wii U is a 160 shader machine, then to outperform Xenos its shaders would need to be 200% more efficient. So the question then is: is VLIW5 or VLIW4 200% more efficient than what is in the 360? Because we have multiplatform titles performing better (no, not all of them, but some are), and it's more than just more memory.

I would also think there would be a LOT more complaining about the GPU if it had fewer shaders than the 360 and was really sitting at 160. So far, though, we've had a few devs complaining about the CPU, but no one, not a single one I've seen, has complained about the GPU.
 
If the Wii U is a 160 shader machine, then to outperform Xenos its shaders would need to be 200% more efficient. So the question then is: is VLIW5 or VLIW4 200% more efficient than what is in the 360? Because we have multiplatform titles performing better (no, not all of them, but some are), and it's more than just more memory.

I would also think there would be a LOT more complaining about the GPU if it had fewer shaders than the 360 and was really sitting at 160. So far, though, we've had a few devs complaining about the CPU, but no one, not a single one I've seen, has complained about the GPU.
It should also be noted that we heard few complaints about the GPU even when it was underclocked at 400MHz. That implies that it may have already been in range of or above the 360's GPU in performance before the dev tools improved and the GPU was boosted by almost 40%.
 
If the Wii U is a 160 shader machine, then to outperform Xenos its shaders would need to be 200% more efficient. So the question then is: is VLIW5 or VLIW4 200% more efficient than what is in the 360? Because we have multiplatform titles performing better (no, not all of them, but some are), and it's more than just more memory.

I would also think there would be a LOT more complaining about the GPU if it had fewer shaders than the 360 and was really sitting at 160. So far, though, we've had a few devs complaining about the CPU, but no one, not a single one I've seen, has complained about the GPU.
How did you get 200%?

By my math it would need to be around 45% more efficient.

176 GFLOPS + 45% = 255

That's just one small factor. It beats the X360 GPU by 100% in other factors. The biggest improvement seems to be from the extra RAM and bandwidth.

Also I do not see why they would complain about the strongest part of the system.
 
How did you get 200%?

By my math it would need to be around 45% more efficient.

176 GFLOPS + 45% = 255

That's just one small factor. It beats the X360 GPU by 100% in other factors. The biggest improvement seems to be from the extra RAM and bandwidth.

Also I do not see why they would complain about the strongest part of the system.

I don't think the math works that way with physical components. We are looking at a physical hardware limit. Trying to overcome something that physically isn't there at all requires more than just a relative bump in efficiency, especially where coding is concerned.

It's like saying it would only require a bump to 2.4 GHz for a single-core processor to match up to a 1.2 GHz dual-core processor. I'm sure it doesn't quite work that way.
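
For reference, here is the peak-throughput arithmetic behind both framings as a minimal Python sketch. It assumes the commonly cited 2 FLOPs per ALU per clock and the reported 550MHz (Latte) and 500MHz (Xenos) clocks; the function and figures are illustrative only, not confirmed specs.

def peak_gflops(alus, mhz, flops_per_alu_per_clock=2):
    # Peak throughput = ALUs x FLOPs per ALU per clock x clock rate
    return alus * flops_per_alu_per_clock * mhz / 1000.0

latte_160 = peak_gflops(160, 550)  # 176.0 GFLOPS (the "176 GFLOPS" figure above)
xenos = peak_gflops(240, 500)      # 240.0 GFLOPS

# Per-FLOP framing: each theoretical Latte FLOP must do ~36% more useful work to match
print(xenos / latte_160)           # ~1.36
# Per-shader framing (ignoring clocks), which is where "twice as efficient" comes from
print(240 / 160)                   # 1.5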
 
So I've been in touch with our old friend bgassassin since he left us a while back and he's passed along a few thoughts he's been having regarding 'Latte'.


bg said:
So here is how I see the GPU, to the best of my current ability, and this is most likely my final analysis. I do not like the term "secret sauce", but I do feel that the primary "special" characteristic of Latte is a dual graphics engine. I believe all parts of Latte should be identifiable with other GPUs (possibly even Flipper/Hollywood) as far as functionality goes, though that does not rule out modifications to that functionality. In other words, I feel comfortable eliminating the idea that there is a block doing something we would not normally expect - well, except for one, and I will explain what I think it could (also) be doing. I was obviously not able to identify every block, but I wanted to at least try to label the more important ones and as many as possible. I also think there is some consolidation in certain blocks, primarily caches. Going in semi-order of the pipeline based on Thraktor's annotations:



Blocks O/R (Command Processor and Ultra-Threaded Dispatch Processor): I lean mostly towards O being the CP and R being the UTDP. While Cayman has two UTDPs, Latte does not have the same amount of ALUs to be fed. I believe in this case Latte's UTDP has been modified due to this. I was informed that there is an "8-bit CPU" in Latte for Wii BC, but I would propose that the CP was modified for that and is what is showing up as that "CPU".

Blocks U/W (Rasterizer and Hierarchical Z): With these two I am pretty much settled on W being the Rasterizers and U the Hierarchical Zs. That view is based strictly on position, though.


Blocks Q/S/T (Geometry Assembler, Vertex Assembler, Tessellator): I have a tough time deciding which of these blocks is which, though I lean heavily towards S being the Tessellators, with T being the second option. So as it stands, group identification is about the best I can do for now.



Block V (Vertex Fetch Index and/or "Geometry Buffer"): This is the one that might be doing something not normally done in GPUs. It seems to be a rather large amount of SRAM. I couldn't find anything on how large the VFI normally is, but I'd like to think it's not as much as what we see in Latte, judging by other die shots, even though there is a similar one in Llano. So the idea of the "Geometry Buffer" comes from this section of Anand's review of Cayman:

http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/6

As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially thrice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access RAM is slow and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.

Block V in Latte would be Nintendo's attempt at addressing this issue to avoid using the DDR3 memory in this scenario.



Block P (Consolidated Cache): I believe this block contains the Instruction Cache, Constant Cache, and Local Data Shares for the SIMDs.



Block N (SIMDs): The easiest to identify. We have four SIMDs. However, for as long as I can remember, I have felt that Nintendo would not use a VLIW architecture, just like Xenos did not. There seem to be a few inefficiencies with it. Also, Iwata has gone out of his way at least twice to mention that the GPU has a different architecture, and it would seem strange to me to point that out if it were VLIW-based, considering how long AMD/ATi has had it in their GPUs. In turn I expect this would make attempts to figure out the ALU count by comparing the SIMD sizes and registers to other VLIW-based GPUs tough. I believe the early discrepancies when measuring the SIMDs are due to that. The question for me is: did Nintendo go with a 4-ALU wide or 5-ALU wide architecture? If Nintendo did drop VLIW as I expect, it will be interesting to see what they came up with, as that would suggest something of better performance, if from nothing more than efficiency gains. Because of this I feel 256 is the minimum amount of ALUs in Latte. Returning to Iwata, I know it is his job to talk up Wii U and its performance, but to claim that Wii U was only at half of its potential if Latte has 160 ALUs would be absurd on his part.
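
As a quick sanity check on these counts, here is a toy Python enumeration of how the candidate totals arise, assuming the four SIMD blocks visible on the die; the per-SIMD unit count and the VLIW width are exactly the unknowns being debated, so treat this as arithmetic bookkeeping rather than a claim about the silicon.

SIMD_BLOCKS = 4  # the four SIMD blocks identified on the die

for units_per_simd in (8, 16):    # stream processing units per SIMD block (assumed candidates)
    for vliw_width in (4, 5):     # VLIW4 vs VLIW5 lanes per unit
        alus = SIMD_BLOCKS * units_per_simd * vliw_width
        print(units_per_simd, "units x VLIW" + str(vliw_width), "=", alus, "ALUs")

# Output: 128, 160 (the low estimate), 256 and 320 (the counts discussed above)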



Block J (TMUs): To me the second easiest to identify. I believe at least part of the reason J1 is larger is that the Data Request Bus is located in J1. Alternatively (or additionally), it could be that this block was modified for Wii BC.



Blocks I (Shader Export): Block I is, IMO, one of the most consistently seen blocks in AMD/ATi die shots. Its position seems relative in some way to the TMUs. Even in the RV770 die shot, at the top above the column of TMUs is a block that resembles I. I've been torn between this block being either the Global Data Share or the Shader Export, but after a short talk with blu and what I decided on for Block D, I settled on this one being the SE.



Block B (ROPs): Xenos shows us that we don’t have to look for two identical blocks for the GPU to have 8 ROPs. I believe this block contains the ROPs due at least in part to BC. Flipper’s “ROP” was located by the eFB. Nintendo would most likely want to keep the ROPs close to the eDRAM (that includes the 32MB portion) because of the potential usage as a framebuffer.



1MB SRAM and 2MB eDRAM Blocks: I do not see Nintendo dedicating this much die space to components that will go unused in Wii U mode. My view is that the SRAM block is used as texture cache and the eDRAM replaces the L2 cache normally seen in GPUs.


Going back to that talk with blu, the Global Data Share could also be located here.


Miscellaneous



Blocks D/E/F (Video): I do not have much to say about this, though I was bouncing around with what I thought D might be. That said, I do see these as related to UVD and the display controller.



Block Y (Starbuck): There is also not much to say as it looks just like the ARM in Hollywood.



The rest is clear conjecture on my part.



Blocks C/G/L (DSP and Northbridge): I am saying this based almost strictly on the shape of the blocks. I believe one of them is the DSP and one is the NB. I expect that Nintendo kept a NB instead of using memory controllers.




Block X (SouthBridge): This one is out there, but the location seems ideal to me as a SB as I do not see a chip on the board labeled as a SB. However I do not know if GC/Wii had a SB as well.



While I primarily lean toward there being no straight-up Hollywood parts, ignoring the 1MB of SRAM and 2MB of eDRAM, I cannot completely rule that out either.



Blocks A, H, K and M were the ones I could not come up with anything to even speculate about.


I have also included some partially annotated die shots of Brazos and Llano. Based on Thraktor's annotation, I tried to identify blocks in them similar to the Latte blocks that were not the SIMDs or TMUs. I do think Brazos sees consolidation as well, if not more, like I proposed with Latte.

Brazos


Llano


I think that is about it. We know the API GX2 can be viewed as OpenGL 3.3+, since hardware tessellation and compute shaders were not natively supported in OpenGL until 4.x.
My brain can relax some now.
 
3. It's an out-of-order CPU that processes multiple instructions per cycle.

Just to be clear, it fetches 4 instructions per cycle, but that is *not* the same as the IPC/instructions per clock rate. That's just the fetch. And in processors those are always "up to" numbers. Also, it dispatches up to two non-branch instructions per cycle. I'm sleepy now and not going to find a source for this, but I think the very most efficient x86 cores with 2 or more instruction fetch per cycle averaged just under 1 instruction per clock. So instructions fetch =/= instructions per clock the way most people talk about it, being my point.

I linked this in our conversation in the GPU thread (which is where you probably got that) but since it was brought up here, this explains the PPC 750 design very well.

http://arstechnica.com/features/2004/10/ppc-2/
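
To illustrate the fetch-width vs IPC point, here is a crude toy model in Python; the dispatch width of 2 follows the post above, but the dependency fractions are illustrative numbers, not measured Espresso figures.

def sustained_ipc(dispatch_width, dependent_fraction):
    # Dependent instructions serialize to one per cycle; the rest pack up to
    # dispatch_width per cycle. Very rough: ignores memory stalls and branches.
    cycles_per_instr = dependent_fraction + (1.0 - dependent_fraction) / dispatch_width
    return 1.0 / cycles_per_instr

print(sustained_ipc(2, 0.0))   # 2.0   - ideal, fully independent code
print(sustained_ipc(2, 0.6))   # 1.25  - a typical-ish dependency mix
print(sustained_ipc(2, 0.9))   # ~1.05 - heavily dependent code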
 
Just to be clear, it fetches 4 instructions per cycle, but that is *not* the same as the IPC/instructions per clock rate. That's just the fetch. And in processors those are always "up to" numbers. Also, it dispatches up to two non-branch instructions per cycle. I'm sleepy now and not going to find a source for this, but I think the very most efficient x86 cores with 2 or more instruction fetch per cycle averaged just under 1 instruction per clock. So instructions fetch =/= instructions per clock the way most people talk about it, being my point.

I linked this in our conversation in the GPU thread (which is where you probably got that) but since it was brought up here, this explains the PPC 750 design very well.

http://arstechnica.com/features/2004/10/ppc-2/

I'm well aware of this. I've been following the Espresso thread from the start. I was just keeping the explanation simple and to the point. The not-so-absolute nature of the multiple-instruction figures is why I did not include the actual numbers.

I've also looked over that page a few times since you posted it in the Espresso thread. It only really explains the basic PPC 750 design, not the enhanced 2006+ versions of it.
 
Again maybe I'm looking at this wrong BUT

You have 160 shaders vs 240 shaders. That's 50% more shaders, so every Wii U shader would need to do the same work as 1.5 360 shaders just to match the same output. To do better than it in any meaningful way, every Wii U shader would need to do the same amount of work as two 360 shaders. That's a doubling of efficiency. Maybe my use of 200% was the wrong way to state it, but they'd still need to be twice as efficient to best it. We know it's not coming from a clock speed increase, since the Wii U's GPU is only clocked 10% higher.

You would need to have shaders that are significantly more efficient to drop 1/3 of your shader count with a 10% clock boost and still have a GPU that can pull off more effects.

And yeah I do think if the Wii U's GPU was weaker than the 360s you'd have heard about it. The only thing developers have complained about is the CPU, and that's from every dev that's had something negative to say about the hardware. It always goes back to the CPU, but none, not a single one has said anything about the GPU.

So again I ask, how improved is VLIW5 over what's in the 360? I think it's a damn valid point if we're going to discuss a 160 shader possibility.
 
Again maybe I'm looking at this wrong BUT

You have 160 shaders vs 240 shaders. That's 50% more shaders, so every Wii U shader would need to do the same work as 1.5 360 shaders just to match the same output. To do better than it in any meaningful way, every Wii U shader would need to do the same amount of work as two 360 shaders. That's a doubling of efficiency. Maybe my use of 200% was the wrong way to state it, but they'd still need to be twice as efficient to best it. We know it's not coming from a clock speed increase, since the Wii U's GPU is only clocked 10% higher.

You would need to have shaders that are significantly more efficient to drop 1/3 of your shader count with a 10% clock boost and still have a GPU that can pull off more effects.

And yeah I do think if the Wii U's GPU was weaker than the 360s you'd have heard about it. The only thing developers have complained about is the CPU, and that's from every dev that's had something negative to say about the hardware. It always goes back to the CPU, but none, not a single one has said anything about the GPU.

So again I ask, how improved is VLIW5 over what's in the 360? I think it's a damn valid point if we're going to discuss a 160 shader possibility.

I'm not qualified to explain the intricacies of VLIW5 vs the Vec4+scalar config of Xenos, but I can volunteer a few simple explanations as to how a 160 shader Latte could get the results we see in ports. For one, shaders are very important, but they're not everything that goes into a visual. It's quite reasonable to say that third party cross-platform games are not exploiting all 240 shaders of Xenos to the fullest. Not every game looks like Gears of War. And what would happen if you try to do more than the shaders can handle anyway? Slower framerate - which we've seen in places. However, also keep in mind that Latte is hooked up to some high bandwidth/low latency eDRAM, which is read/write capable, so it's probably saving a bunch of clock cycles just by that alone.

So it could be the GPU hasn't come under direct fire, because it's actually performing above expectations given the numbers on paper. However, we actually have heard some comments about the GPU that don't paint it in a great light. That one Kotaku article likened its performance to DirectX9 (an odd way of putting it, but the point can be extracted that it's pretty much in line with current gen) and also lherre said way back that the GPU lacked horsepower despite being decent in terms of features.

The truth is, we are not going to hear many specific criticisms at all, because devs are under NDAs. There was the "not as many shaders...not as capable" comment by the anonymous dev that was written off as bs, but perhaps prematurely. The Metro devs probably spoke out because they made that choice that they are not interested in Wii U development. They didn't care about burning that bridge.
 
Using a simple benchmark we can see the 160-shader card outperform the 7900 GT in the PS3. A direct comparison to the X360 is a lot harder since it's not comparable to any other card.

Radeon HD 6450 (40nm, 160:8:4): 293
Nvidia 7900 GT: 238


http://www.videocardbenchmark.net/gpu.php?gpu=GeForce+7900+GT/GTO&id=1253
http://www.videocardbenchmark.net/gpu.php?gpu=Radeon+HD+6450&id=267

So 160 would not lead to a lower-performing part... in fact it outperforms it by 25%. A little extra performance over the PS360 does seem to match what we are seeing.

Moving to 320 ALUs, the 4650 benchmarks at 354, a 55% increase over the PS3.
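
As a rough aside (this comes up again a few posts down), naively clock-scaling those scores to Latte's 550MHz looks like this; treat it as a sketch only, since benchmark scores rarely scale linearly with core clock and it is unclear which HD 6450 clock the 293 score was measured at.

HD6450_SCORE = 293
GF7900GT_SCORE = 238
LATTE_CLOCK = 550  # MHz

for base_clock in (625, 750):  # both HD 6450 clocks mentioned in this thread
    scaled = HD6450_SCORE * LATTE_CLOCK / base_clock
    verdict = "above" if scaled > GF7900GT_SCORE else "below"
    print(base_clock, "MHz base ->", round(scaled), "(", verdict, "the 7900 GT's 238 )")

# 625 MHz base -> ~258 (above 238); 750 MHz base -> ~215 (below 238)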
 
Why have people suddenly started shooting so low? It's like we moved from exploring possibilities openly to finding a way to write off the GPU as weaker while still explaining it showing enhancements in some games. The "it was maxed out at launch and the flaws in the ports were because it couldn't handle them" mentality seems to be prevailing.

Exactly why has there been such a shift from 320 shaders to 160? The overall consensus from the beginning of the analysis has pretty much just been thrown out the window. It seems odd to me. Maybe the hackers can run code through the GPU to test it. It boggles the mind what could bring about such a sharp flip in belief.

There must have been some new set of info to discredit the initial guesses that I missed. Speaking of initial guesses, I haven't seen Thraktor post in here in a while. I wonder what his stance on the current analysis is now.
 
I'm not qualified to explain the intricacies of VLIW5 vs the Vec4+scalar config of Xenos, but I can volunteer a few simple explanations as to how a 160 shader Latte could get the results we see in ports. For one, shaders are very important, but they're not everything that goes into a visual. It's quite reasonable to say that third party cross-platform games are not exploiting all 240 shaders of Xenos to the fullest. Not every game looks like Gears of War. And what would happen if you try to do more than the shaders can handle anyway? Slower framerate - which we've seen in places. However, also keep in mind that Latte is hooked up to some high bandwidth/low latency eDRAM, which is read/write capable, so it's probably saving a bunch of clock cycles just by that alone.

From what I understand, and this is an amateur attempt at an explanation, Xenos never achieved near its full potential due to the way its eDRAM was tied directly to its ROPs, and how that necessitated a lot of reads/writes out to main memory. If the memory architecture of the Wii U is leveraged properly, I think they can avoid a lot of that mess and get much closer to maximum utilization from it. There really isn't any reason it should be underperforming current-gen consoles. On the whole it has a better memory design than the 360 and PS3, and has a superior GPU. The CPU might be the most disappointing aspect, but blu's tests seem to indicate it should punch above its wattage.

I'm also extremely pessimistic about whether Nintendo was able to have the tools and support ready to adequately exploit the performance that was already there. Developers being left in the dark about hardware design and given inadequate tools and documentation is enough to explain the climate around Wii U development at the moment. Now that the tools and support are there, Nintendo is talking to mostly empty crowds. They have a lot of bridges on fire and have to work on rebuilding them. How they do that I have no idea.
 
Using a simple benchmark we can see the 160-shader card outperform the 7900 GT in the PS3. A direct comparison to the X360 is a lot harder since it's not comparable to any other card.

Radeon HD 6450 (40nm, 160:8:4): 293
Nvidia 7900 GT: 238


http://www.videocardbenchmark.net/gpu.php?gpu=GeForce+7900+GT/GTO&id=1253
http://www.videocardbenchmark.net/gpu.php?gpu=Radeon+HD+6450&id=267

So 160 would not lead to a lower-performing part... in fact it outperforms it by 25%. A little extra performance over the PS360 does seem to match what we are seeing.

Moving to 320 ALUs, the 4650 benchmarks at 354, a 55% increase over the PS3.

I don't think it is as simple as that. For one, isn't the PS3's GPU itself a bit weaker than the 360's, needing CELL to keep up? If we are going by these numbers, the Wii U GPU probably should not beat the customized Xenos.

I'm not qualified to explain the intricacies of VLIW5 vs the Vec4+scalar config of Xenos, but I can volunteer a few simple explanations as to how a 160 shader Latte could get the results we see in ports. For one, shaders are very important, but they're not everything that goes into a visual. It's quite reasonable to say that third party cross-platform games are not exploiting all 240 shaders of Xenos to the fullest. Not every game looks like Gears of War. And what would happen if you try to do more than the shaders can handle anyway? Slower framerate - which we've seen in places. However, also keep in mind that Latte is hooked up to some high bandwidth/low latency eDRAM, which is read/write capable, so it's probably saving a bunch of clock cycles just by that alone.

So it could be the GPU hasn't come under direct fire, because it's actually performing above expectations given the numbers on paper. However, we actually have heard some comments about the GPU that don't paint it in a great light. That one Kotaku article likened its performance to DirectX9 (an odd way of putting it, but the point can be extracted that it's pretty much in line with current gen) and also lherre said way back that the GPU lacked horsepower despite being decent in terms of features.

The truth is, we are not going to hear many specific criticisms at all, because devs are under NDAs. There was the "not as many shaders...not as capable" comment by the anonymous dev that was written off as bs, but perhaps prematurely. The Metro devs probably spoke out because they made that choice that they are not interested in Wii U development. They didn't care about burning that bridge.

If we are going by how powerful Latte was before the clockspeed went from 400MHz to 550MHz, it would make sense for it to lack raw power in some things. These GPUs' tri-setup is directly tied to clockspeed, so a 400MHz Latte probably pushed ~20% fewer polygons than Xenos. I also think we are downplaying the effect of developers having sub-par tools for launch titles, the PS3/360 being on their 6th/7th generation of games, and the fact that PC developers like Frozenbyte would and did push the system harder.
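
To put numbers on the triangle-setup point, a quick sketch assuming the usual one triangle per clock for this class of GPU (an assumption, not a confirmed Latte spec):

XENOS_CLOCK = 500   # MHz -> ~500 Mtris/s peak at 1 tri/clock
LATTE_EARLY = 400   # MHz devkit clock -> ~400 Mtris/s peak
LATTE_FINAL = 550   # MHz retail clock -> ~550 Mtris/s peak

print(1 - LATTE_EARLY / XENOS_CLOCK)   # 0.2 -> ~20% fewer peak triangles early on
print(LATTE_FINAL / XENOS_CLOCK - 1)   # 0.1 -> ~10% more after the clock bump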
 
I'm not qualified to explain the intricacies of VLIW5 vs the Vec4+scalar config of Xenos, but I can volunteer a few simple explanations as to how a 160 shader Latte could get the results we see in ports. For one, shaders are very important, but they're not everything that goes into a visual. It's quite reasonable to say that third party cross-platform games are not exploiting all 240 shaders of Xenos to the fullest. Not every game looks like Gears of War. And what would happen if you try to do more than the shaders can handle anyway? Slower framerate - which we've seen in places. However, also keep in mind that Latte is hooked up to some high bandwidth/low latency eDRAM, which is read/write capable, so it's probably saving a bunch of clock cycles just by that alone.

So it could be the GPU hasn't come under direct fire, because it's actually performing above expectations given the numbers on paper. However, we actually have heard some comments about the GPU that don't paint it in a great light. That one Kotaku article likened its performance to DirectX9 (an odd way of putting it, but the point can be extracted that it's pretty much in line with current gen) and also lherre said way back that the GPU lacked horsepower despite being decent in terms of features.

The truth is, we are not going to hear many specific criticisms at all, because devs are under NDAs. There was the "not as many shaders...not as capable" comment by the anonymous dev that was written off as bs, but perhaps prematurely. The Metro devs probably spoke out because they made that choice that they are not interested in Wii U development. They didn't care about burning that bridge.

Sorry, but ports were FINISHED on incomplete hardware. What we have seen from stuff like X would suggest that it is actually achieving a bit more than Xenos, and that kind of efficiency just isn't there in VLIW5... in fact, if it is VLIW5, at least some of the time only 4 of every 5 shaders could possibly fire, and you'd have scheduling issues that Xenos doesn't actually have. So you'd have 240 shaders vs only 128 shaders (a majority of the time); it's virtually impossible for it to be 160 and stay R700.

The reality is your assessment hinges on the idea that ports were pushing 125% of what the shaders are capable of, and that is just impossible.

I think it is possible Nintendo went with a VLIW4 setup, or they did what I suggested above and went with custom thread-level parallelism instead of the instruction-level parallelism we see from VLIW parts.

In this case I think BG is correct and they went with 32 or 40 ALUs per SPU, reaching 256 or 320 ALUs. Trinity has 256-ALU parts, and honestly I think those parts could easily be compared to what we see in these games, especially if we clock the CPU down to 1.8GHz-2GHz and limit the system RAM to 2GB DDR3. Of course this isn't a science, but it is a lot better than assuming that launch ports were able to extract 100% of the Wii U's GPU power while developers didn't use 240 of Xenos' shaders. I mean, that is what you are saying in the above statement, and it still doesn't take into account VLIW5's efficiency problems.

I don't think it is as simple as that. For one, isn't the PS3's GPU itself a bit weaker than the 360's, needing CELL to keep up? If we are going by these numbers, the Wii U GPU probably should not beat the customized Xenos.

Yeah, from USC's own test here, it would mean GPU7 could at best match Xenos. This is something developers have been contradicting, saying Wii U's GPU is fine; EA called it a stopgap in 2011 (unfinished hardware) and Vigil called it a superior GPU to Xenos. This sort of estimation puts Wii U at or below Xenos.

Oh wow... USC-fan, did you even look at the HD 6450? It's clocked at 750MHz; clocking it at 550MHz would actually have it score BELOW the 7900 GT in that benchmark... How ridiculous does 160 ALUs seem now?!? The 7900 GT is also clocked at 450MHz, not the 500MHz that you'd find in the PS3. I think the reality is that this is pretty close to impossible for Wii U using VLIW5 (which the HD 6450 uses, iirc).
 
I don't think it is as simple as that. For one, isn't the PS3's GPU itself a bit weaker than the 360's, needing CELL to keep up? If we are going by these numbers, the Wii U GPU probably should not beat the customized Xenos.



If we are going by how powerful Latte was before the clockspeed went from 400MHz to 550MHz, it would make sense for it to lack raw power in some things. These GPUs' tri-setup is directly tied to clockspeed, so a 400MHz Latte probably pushed ~20% fewer polygons than Xenos.

What made the X360 GPU better was the unified memory and eDRAM.
 
Power is still far superior. If it weren't for Microsoft, x86 would be long dead.

Way to talk out your arse.

Modern x86 CPUs like Sandy Bridge, Ivy Bridge etc. keep up with POWER7 quite well!
They are a fuck ton faster than anything non-POWER7!

And let's not forget the simple fact that PPC =/= modern versions of POWER!

I would not go that far. The A9's in Tegra3 don't have NEON. NV realized their mistake and fixed it in Tegra4 but Tegra3 will remain forever as 'the castrated A9s'.

It is Tegra 2 that lacked Neon, not 3!
 
Sorry, but ports were FINISHED on incomplete hardware. What we have seen from stuff like X would suggest that it is actually achieving a bit more than Xenos, and that kind of efficiency just isn't there in VLIW5... in fact, if it is VLIW5, at least some of the time only 4 of every 5 shaders could possibly fire, and you'd have scheduling issues that Xenos doesn't actually have. So you'd have 240 shaders vs only 128 shaders (a majority of the time); it's virtually impossible for it to be 160 and stay R700.

The reality is your assessment hinges on the idea that ports were pushing 125% of what the shaders are capable of, and that is just impossible.

I think it is possible Nintendo went with a VLIW4 setup, or they did what I suggested above and went with custom thread-level parallelism instead of the instruction-level parallelism we see from VLIW parts.

In this case I think BG is correct and they went with 32 or 40 ALUs per SPU, reaching 256 or 320 ALUs. Trinity has 256-ALU parts, and honestly I think those parts could easily be compared to what we see in these games, especially if we clock the CPU down to 1.8GHz-2GHz and limit the system RAM to 2GB DDR3. Of course this isn't a science, but it is a lot better than assuming that launch ports were able to extract 100% of the Wii U's GPU power while developers didn't use 240 of Xenos' shaders. I mean, that is what you are saying in the above statement, and it still doesn't take into account VLIW5's efficiency problems.



Yeah, from USC's own test here, it would mean GPU7 could at best match Xenos. This is something developers have been contradicting, saying Wii U's GPU is fine; EA called it a stopgap in 2011 (unfinished hardware) and Vigil called it a superior GPU to Xenos. This sort of estimation puts Wii U at or below Xenos.

Oh wow... USC-fan, did you even look at the HD 6450? It's clocked at 750MHz; clocking it at 550MHz would actually have it score BELOW the 7900 GT in that benchmark... How ridiculous does 160 ALUs seem now?!? The 7900 GT is also clocked at 450MHz, not the 500MHz that you'd find in the PS3. I think the reality is that this is pretty close to impossible for Wii U using VLIW5 (which the HD 6450 uses, iirc).
The 6450 was clocked at 625MHz, and there is another version clocked at 750MHz.
Also, the 7900 GT is clocked at 500MHz.

http://www.newegg.com/Product/Product.aspx?Item=N82E16814130281
 
What made the X360 GPU better was the unified memory and eDRAM.

No... the shaders were more capable. In fact, another point against 160 ALUs for the Wii U is that in Wii U ports they are using more shader effects, taxing the ALUs harder than Xenos (stuff like Trine 2 and Deus Ex), meaning they are doing more than Xenos' ALUs. Achieving 2x the efficiency on top of that is pretty much impossible, since Xenos' efficiency by design should be ~65-70%, VLIW5 R700 is at ~75-80%, while GCN gets into the 90s; Wii U's would have to be ~125% efficient for this to work. This is all at the same clocks, so Wii U gains a small bonus from the higher clock, but it still falls far short of what would be needed.
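
Taking the utilization ranges in that post at face value (they are the poster's estimates, not measured figures), the effective-throughput arithmetic works out roughly like this:

def effective_gflops(alus, mhz, utilization, flops_per_alu=2):
    # Peak throughput scaled by an assumed achievable-utilization factor
    return alus * flops_per_alu * mhz / 1000.0 * utilization

xenos = [effective_gflops(240, 500, u) for u in (0.65, 0.70)]      # ~156-168 GFLOPS
latte_160 = [effective_gflops(160, 550, u) for u in (0.75, 0.80)]  # ~132-141 GFLOPS

print(xenos, latte_160)
# Even comparing the low Xenos estimate (156) to the high Latte estimate (141),
# a 160-ALU Latte still comes up short on this model, which is the point above.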
 
No... the shaders were more capable. In fact, another point against 160 ALUs for the Wii U is that in Wii U ports they are using more shader effects, taxing the ALUs harder than Xenos (stuff like Trine 2 and Deus Ex), meaning they are doing more than Xenos' ALUs. Achieving 2x the efficiency on top of that is pretty much impossible, since Xenos' efficiency by design should be ~65-70%, VLIW5 R700 is at ~75-80%, while GCN gets into the 90s; Wii U's would have to be ~125% efficient for this to work. This is all at the same clocks, so Wii U gains a small bonus from the higher clock, but it still falls far short of what would be needed.

Do you have anything to back up what you are saying? It seems you are just making up numbers and statements to prove your point.

Do we know shader performance was the reason for the improved ports? Or was it the added RAM and bandwidth? Or just the fact they had more development time...

Or a factor of everything above.

About the benchmark: it was just to show that 160 ALUs would outperform the PS3's GPU. It should be noted the Wii U also has a core clock advantage. Ruling out 160 based on performance doesn't hold water. Also, this doesn't factor in the advanced shader effects of a DX10.1 card over DX9.

Also, that card is 160:8:4 vs the more powerful 160:8:8 configuration that has been talked about as the Wii U GPU.
 
Do you have anything to back up what you are saying? It seems you are just making up numbers and statements to prove your point.

Do we know shader performance was the reason for the improved ports? Or was it the added RAM and bandwidth? Or just the fact they had more development time...

VLIW5's efficiency numbers are pretty well known... the 5th shader is almost never used, and no shader architecture ever achieves 100% efficiency. As for Xenos, that figure came from prior analyses of the architecture; R700 performed above it, but it wasn't anything extreme.

We know that there are more shader effects going on, thus more shader work is being done. It's just a logical conclusion; it seems pretty much impossible to avoid the fact that doing more shader work is going to require the shaders to... do more work.

The 6450 was clocked at 625MHz, and there is another version clocked at 750MHz.
Also, the 7900 GT is clocked at 500MHz.

http://www.newegg.com/Product/Product.aspx?Item=N82E16814130281

http://www.cnet.com/graphics-cards/nvidia-geforce-7900-gt/4505-8902_7-31768915.html :
You'll find the GeForce 7900 GT available only with 256MB of memory. Some partners sell it overclocked for a small premium, usually no more than $50. At its stock speeds, the GeForce 7900 GT features a 450MHz core clock and a 1,320MHz memory clock. Not only are those speeds an increase over the GeForce 7800 GT's (400MHz core, 1,000MHz memory)

However, you are right about the HD 6450; it is still a 75MHz overclock here. The 7900 that is linked on that benchmark through Amazon, though, is this card: http://www.amazon.com/dp/B000IGAMVS/?tag=neogaf0e-20 which is model VCG7900SXPB and can be found on Newegg here: http://www.newegg.com/Product/Product.aspx?Item=N82E16814133186.

You are talking about a 125MHz clock difference between those two cards; seeing that performance shouldn't surprise ANYONE.
 
Isn't RSX on par with 7600GT and not 7900GT?

wikipedia:
550 MHz on 90 nm process (shrunk to 65 nm in 2008[4] and to 40 nm in 2010[5])
Based on G71 Chip in turn based on the 7800 but with cut down features like lower memory bandwidth and only as many ROPs as the lower end 7600.
 
Isn't RSX on par with 7600GT and not 7900GT?

wikipedia:

That puts it more on par with the 7800, since memory bandwidth is custom in the PS3 and the clock is 100MHz higher than the 7800 as well... putting it at a debatable performance with 7900. (ROPs should also be custom)
 
That puts it more on par with the 7800, since memory bandwidth is custom in the PS3 and the clock is 100MHz higher than the 7800 as well... putting it at a debatable performance with 7900. (ROPs should also be custom)

The 7600GT had a higher clock than the 7800GT:
560MHz vs 400MHz
 
That puts it more on par with the 7800, since memory bandwidth is custom in the PS3 and the clock is 100MHz higher than the 7800 as well... putting it at a debatable performance with 7900. (ROPs should also be custom)

No, the RSX does have very low bandwidth.
It has only 22GB/s of bandwidth vs 32GB/s for the slowest G70-based card and 42GB/s for the slowest G71-based card (which is what the RSX is based on).

And nothing points to the ROPs being any different.
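
For anyone following the bandwidth numbers, they fall out of bus width times data rate; the 1.4 GT/s figure below is an illustrative round number for GDDR3 of that era, not an exact card spec.

def bandwidth_gb_per_s(bus_width_bits, data_rate_gt_per_s):
    # bytes per transfer (bus width / 8) times transfers per second
    return bus_width_bits / 8 * data_rate_gt_per_s

print(bandwidth_gb_per_s(128, 1.4))  # 22.4 -> roughly the ~22 GB/s quoted for RSX
print(bandwidth_gb_per_s(256, 1.4))  # 44.8 -> the same memory on a full 256-bit bus,
                                     #         near the ~42 GB/s G71 figure above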
 
Way to talk out your arse.

Modern x86 CPUs like Sandy Bridge, Ivy Bridge etc. keep up with POWER7 quite well!
They are a fuck ton faster than anything non-POWER7!

And let's not forget the simple fact that PPC =/= modern versions of POWER!
It doesn't matter how fast any given x86 is, the architecture still sucks balls.
 
PS3's RSX was a generation behind 360's Xenos. Xenos has a unified shader architecture and twice the triangle-setup rate of RSX. You can look all of the details up in Beyond3d's forum.
Yes, it was the first unified shader card.

While there were many advantages, the one that made the most difference was unified memory and eDRAM.
The 7600GT had a higher clock than the 7800GT:
560MHz vs 400MHz

That may be a closer match, but it has been years since I looked at the RSX. Thanks.

If we go with the 7600 GT's benchmark score, it's 190 vs 293 for the AMD 6450.
For the 7800GT it's 195.

That proves my point even more....
 
Isn't RSX on par with 7600GT and not 7900GT?

wikipedia:

RSX is basically a 7800GT with half the memory bandwidth, because the bus width was chopped in half. It's also the least interesting part of the PS3's architecture, since the Cell processor has bandwidth to burn to main memory, and so many PS3 games work around the relative weakness of the RSX by shifting post-processing of rendered frames to the SPUs on Cell.

The big reason why MSAA is largely not used on PS3 after the initial launch period is that there's not enough memory bandwidth from RSX to do that and have a decent framerate. See Heavenly Sword for an early example of forcing MSAA and killing performance. FXAA is nice because it's easy on memory bandwidth, and more complex techniques exist such as what FFXIII and Uncharted use to get the same effect as MSAA from post-processing on Cell.
 
The 7600GT had a higher clock than the 7800GT:
560MHz vs 400MHz

The PS3's GPU is based on the 7800, not the 7600. Clock speed is higher on an HD 7770 than a 7870, but you won't see people debating which is the stronger card. If the PS3 uses a 7800-class chip with a 100MHz overclock (over the 400MHz you mentioned for the 7800), then it is debatable whether it performs close to the 7900 or not.

No, the RSX does have very low bandwidth.
It has only 22GB/s of bandwidth vs 32GB/s for the slowest G70-based card and 42GB/s for the slowest G71-based card (which is what the RSX is based on).

And nothing points to the ROPs being any different.

The bandwidth and ROPs would matter more at higher resolutions, and that is probably why we see sub-HD games on PS3 (in some cases at lower resolutions than the 360 version of the same game).

I also didn't come up with the benchmark listed, and it shows that a 7900GS can keep up with an HD 6450 at a similar clock. The problem with USC-fan's benchmark is that the clock speeds show a 125MHz difference, and a 75MHz advantage over the Wii U's supposed 160-ALU GPU. Xenos was superior to RSX anyway, and this clearly shows how irresponsible a 160-ALU estimate for the Wii U is, especially when the best evidence for it is rushed launch ports finished on work-in-progress DEV KITS with poor dev tools.

Yes, it was the first unified shader card.

While there were many advantages, the one that made the most difference was unified memory and eDRAM.

That may be a closer match, but it has been years since I looked at the RSX. Thanks.

If we go with the 7600 GT's benchmark score, it's 190 vs 293 for the AMD 6450.


That proves my point even more....

We would be poor-minded to do so, since the PS3's GPU is 7800-based and has a 100MHz overclock on that chip. Again I point to the HD 7770 vs the HD 7870, or even the HD 7850 with its 860MHz clock vs the 1GHz clock of the 7770.
 
No, it clearly does not!
How does a fast chip make the architecture any more sane? x86 is an ancient, register starved fallacy with tons of extensions bolted on top over the last decades to get decent performance out of those things. A lot of time and money is wasted on working around its limitations.
 
How does a fast chip make the architecture any more sane? x86 is an ancient, register starved fallacy with tons of extensions bolted on top over the last decades to get decent performance out of those things. A lot of time and money is wasted on working around its limitations.

Ever heard of the saying "the proof is in the pudding"?
 
How does a fast chip make the architecture any more sane? x86 is an ancient, register starved fallacy with tons of extensions bolted on top over the last decades to get decent performance out of those things. A lot of time and money is wasted on working around its limitations.

Sheer inertia prevents x86 from ever being replaced. Even Intel themselves couldn't replace x86. It's better to deal with it than complain about it now.
 
Sheer inertia prevents x86 from ever being replaced. Even Intel themselves couldn't replace x86. It's better to deal with it than complain about it now.

I don't think that what you said is contested. I think it's just that it's absurd for us to pretend that it's an efficient design at a similar clock speed as compared to the custom PPC/Power 7 hybrid Wii U CPU.
 
This pudding doesn't contain the proof you're looking for.

Then why are all the other processors based off different ISAs not massively faster?

I don't think that what you said is contested. I think it's just that it's absurd for us to pretend that it's an efficient design at a similar clock speed as compared to the custom PPC/Power 7 hybrid Wii U CPU.

The WiiU CPU is not POWER7!
IBM said so themselves!
 
Hybrid. As in takes some from both. That has been confirmed. Even so, hardly the point. Take Power 7 out of the picture completely, and it's still true.
 
Hybrid. As in takes some from both. That has been confirmed. Even so, hardly the point. Take Power 7 out of the picture completely, and it's still true.

No, nothing has been confirmed and what appears to be the WiiU CPU would get crushed by a SB/IB/Haswell running at the same clock speed.
 
I don't think that what you said is contested. I think it's just that it's absurd for us to pretend that it's an efficient design at a similar clock speed as compared to the custom PPC/Power 7 hybrid Wii U CPU.
No, it really isn't. What's absurd is pretending x86 is a core architecture. It's not. Jaguar, a recent, modern x86 implementation, is a core architecture. And it's more efficient than Wii U's CPU.
 
No, it really isn't. What's absurd is pretending x86 is a core architecture. It's not. Jaguar, a recent, modern x86 implementation, is a core architecture. And it's more efficient than Wii U's CPU.

The only thing better about Jaguar is its vector unit. It loses in just about everything else; that is why it has to have so many cores just for gaming. Vector utilization is highly expected to change this generation as well, since GPGPU is being explored more heavily.

Wii U's CPU uses less power at a lower clock and should match up well with the Jaguar cores found in PS4 in a core-to-core matchup, except where vector-heavy code is concerned. This is largely because its pipeline is extremely short, 3 to 4 times shorter than Jaguar's IIRC.
 
Sheer inertia prevents x86 from ever being replaced. Even Intel themselves couldn't replace x86. It's better to deal with it than complain about it now.
I don't remember complaining about it. All I said is that it makes no sense to switch from PPC to x86 unless you absolutely have to.


No, it really isn't. What's absurd is pretending x86 is a core architecture. It's not. Jaguar, a recent, modern x86 implementation, is a core architecture. And it's more efficient than Wii U's CPU.
Jaguar is a microarchitecture, the architecture (or ISA if you prefer) is still x86 - or more precisely amd64, which is one of the aforementioned workarounds trying to "fix" x86. And I don't see how Jaguar is more efficient. It's more powerful when it comes to SIMD, but that has more to do with the fact that Jaguar actually has a real SIMD unit. Unlike Espresso.
 
And I don't see how Jaguar is more efficient. It's more powerful when it comes to SIMD, but that has more to do with the fact that Jaguar actually has a real SIMD unit. Unlike Espresso.

I'd be surprised if Jaguar didn't have more complex out-of-order capabilities or better branch prediction, for example. While I agree that it may also have disadvantages, there is a reason why Sony and Microsoft picked Jaguar. Like Nintendo, they could have also gone to IBM and worked with them on a CPU similar to Espresso, just with more cores and a slightly higher clock speed. Nintendo had an additional reason to do just this: they wanted to keep backwards compatibility. The reason Sony and Microsoft turned away from PPC and favored x86 instead has to be that Jaguar is the more efficient choice for their consoles.
 
I'd be surprised if Jaguar didn't have more complex out-of-order capabilities or better branch prediction, for example. While I agree that it may also have disadvantages, there is a reason why Sony and Microsoft picked Jaguar. Like Nintendo, they could have also gone to IBM and worked with them on a CPU similar to Espresso, just with more cores and a slightly higher clock speed. Nintendo had an additional reason to do just this: they wanted to keep backwards compatibility. The reason Sony and Microsoft turned away from PPC and favored x86 instead has to be that Jaguar is the more efficient choice for their consoles.

Or they went with Jaguar as it was a cheaper match for their GPUs.
 
I'd be surprised if Jaguar didn't have more complex out-of-order capabilities or better branch prediction, for example. While I agree that it may also have disadvantages, there is a reason why Sony and Microsoft picked Jaguar. Like Nintendo, they could have also gone to IBM and worked with them on a CPU similar to Espresso, just with more cores and a slightly higher clock speed. Nintendo had an additional reason to do just this: they wanted to keep backwards compatibility. The reason Sony and Microsoft turned away from PPC and favored x86 instead has to be that Jaguar is the more efficient choice for their consoles.

Price and developer convenience was a huge factor in the choice I'd reckon.

It really seems like Nintendo is the only console manufacturer left with a completely custom design this gen, which bites them in the ass right now.
 
True, I forgot that for a moment. If they wanted an APU from a single source and not just an MCM, AMD was the only option.

I suppose it would've been difficult (for MS) going with an IBM CPU, as it may have seemed the best choice for BC, but as far as I understand the PPE is even more of a technical dead end than the 750 was.
 
Price and developer convenience was a huge factor in the choice I'd reckon.

I don't see why the ISA should have an impact on that. 99.99% of the time you don't see assembler code anyway. And on all current consoles devs are used to PPC, not x86.

It really seems like Nintendo is the only console manufacturer left with a completely custom design this gen, which bites them in the ass right now.

The PS4 APU seems to be fairly custom, too. Of course not in the same way earlier Sony consoles were, but you can say the same about Wii U.

edit:

I suppose it would've been difficult (for MS) going with an IBM CPU, as it may have seemed the best choice for BC, but as far as I understand the PPE is even more of a technical dead end than the 750 was.

They could've also walked the 750 path. That wouldn't have kept BC, but if it were more efficient than Jaguar, they could've done it nonetheless.
 
I'd be surprised if Jaguar didn't have more complex out-of-order capabilities or better branch prediction, for example. While I agree that it may also have disadvantages, there is a reason why Sony and Microsoft picked Jaguar. Like Nintendo, they could have also gone to IBM and worked with them on a CPU similar to Espresso, just with more cores and a slightly higher clock speed. Nintendo had an additional reason to do just this: they wanted to keep backwards compatibility. The reason Sony and Microsoft turned away from PPC and favored x86 instead has to be that Jaguar is the more efficient choice for their consoles.
Nintendo could have kept just one PPC for BC and used it as IO processor in native mode if they wanted to switch to x86.

Also, don't forget that Jaguar has a pipeline more than four times as deep as Espresso. Better OoO capabilities and branch prediction are not so much features as they are necessities.
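
To illustrate why, here is a rough Python model in which every mispredicted branch costs roughly the pipeline's depth in cycles; the depths, branch fraction and mispredict rate are illustrative assumptions, loosely following the "more than four times as deep" comparison above.

def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, pipeline_depth):
    # Each mispredicted branch flushes the pipe, costing ~pipeline_depth cycles.
    return base_cpi + branch_fraction * mispredict_rate * pipeline_depth

shallow = cycles_per_instruction(1.0, 0.2, 0.10, 4)    # a short, Espresso-like pipe
deep    = cycles_per_instruction(1.0, 0.2, 0.10, 17)   # a deeper, Jaguar-like pipe

print(shallow, deep)   # 1.08 vs 1.34 -> the deeper pipe needs a much better
                       # predictor just to break even on branchy code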
 