WiiU "Latte" GPU Die Photo - GPU Feature Set And Power Analysis

This is what we are debating, no? That a powerful console will have third party support regardless; that it's a fact, right?
Well, then that shouldn't be the center of the debate because it's not a fact.

Technical hurdles don't rule out the presence of third party games; there are also other aspects (e.g. financial, human resources) to be considered. But they do affect them. What sounds plausible is that the GameCube would hypothetically have received fewer third party games if its hardware had been inferior to the PS2, the lead platform for which most games in that era were developed.
Another point is that technical hurdles are to be seen more broadly than just power/performance. Hurdles can be caused by unfriendly devtools, the storage medium, the availability of a HDD, and the online infrastructure as well. That explains why the Xbox got more third party support than the GC.
 
PS2 was a powerful console. Gamecube was a powerful console. The two don't cancel out.

How many consoles weaker than the PS2 were supported? Not many, if any at all.

PS3/360 are in the same boat. Supported for their power. Wii could not handle that power. Wii U can.

PS4/720 will bring about all new power. Wii U will not be powerful enough. Thus, we run back to the problem of power.

As for cross gen (I assume you mean PS3/PS4?), the Wii U can only get PS3/360 ports. It won't see any PS4/720 ones.

No matter what you say, there is a baseline in all of this: it was Nintendo, and only Nintendo, who decided not to match the platform of power that developers supported.
The Wii U is basically in between graphical generations, in that its raw power is not too far from current gen, but it is closer to next gen in GPU features and architecture (and the generations are generally a lot closer to each other than they traditionally have been). This opens up some interesting options for developers... if they care enough, but the system's biggest issue is not its power at this time. We will see if Nintendo turns that around.

You have everything else on the Wii U with much less geometry. And I wouldn't expect Platinum to make big progress with tech on the Wii U, given all their previous releases.



Darksiders 2 does.
Darksiders 2 also does not have v-sync, and did not fix the bugs that were discovered on the other consoles. The team probably got away with these things due to THQ being close to death.

but but they had darksiders 2 running on wii u and the gamepad in only 2 weeks !!!!

Well, as you can see with this example, it's not as simple to port or develop as some people here sometimes believe or imply.

Yep. Sad thing is that it was that team that oversimplified Wii U development to begin with.

I'll take the wait and see approach with this (although with tessellation, this is going to be blurred).
Yes. The Wii U may be able to do that too, so I'm curious how that will go.
 
I'm pretty sure everything we know about the Wii U points to feature parity with the PS4 and Xbox durangorbinfinity. Those two consoles won't be capable of a single effect that the Wii U could not do. The only thing they will do is the same effects on a larger scale, meaning that if a dev wanted, they could easily downport it.
 
I'm pretty sure everything we know about the Wii U points to feature parity with the PS4 and Xbox durangorbinfinity. Those two consoles won't be capable of a single effect that the Wii U could not do. The only thing they will do is the same effects on a larger scale, meaning that if a dev wanted, they could easily downport it.

That won't work, because next gen all single-player games are going to be on the scale of MMOs, and all MMOs are going to be Rift-supported near-reality experiences. That'll be a lot of work to scale down.
 
The debate about 160 ALUs and 320 ALUs has come up again, so since Fourth Storm and others have started to think it has 160 ALUs, have they come up with an explanation for why they would be ~70% bigger than R700 ALUs?

Seeing as how ~30-35 R700 series ALUs would fit without change in the SPUs shown on our die shots, isn't it more reasonable to expect some R700 logic to have been removed, or a newer process allowing for tighter packing of physical logic? Because if you look at the die shot, it isn't pretty; things have been deformed to fit into areas of the chip to save on room.

Math is a great tool for understanding things, but it usually requires all the variables gathered into an equation for that to lead to success. I highly doubt that these ALUs are so much bigger than R700 ALUs unless they are not R700 ALUs at all but rather R800; Matt has said this isn't the case, and he is a trusted source, so 160 ALUs is likely wrong. It is probably tighter packing, combined with the new lower-powered 40nm process (which didn't exist in 2009) and some unnecessary logic being removed; obviously PC GPUs do things that game console GPUs do not need to do.

This is a very weird GPU, but 160 ALUs has never made sense, even when I first declared it at the beginning of this thread. One thing I can say for sure, though, is that this is a very low-powered piece of tech with embedded design written all over it, not to mention the MCM bringing the power down further, as well as design advancements and the lower-powered 40nm process.

The size of the SPUs would actually indicate VLIW4 (32 ALUs in each SPU), but no one wants to even humor the suggestion because of the memory units supplied with the SPUs. For size, though, 30 VLIW5 R700 ALUs, or 32 VLIW4 R900 ALUs, or 35 tightly packed VLIW5 R700 ALUs, or 40 tightly packed R700 ALUs with some logic removed, all fit the bill better than 20 R700 ALUs (per SPU).

If it were 20 ALUs, they just couldn't be R700; they are far too big. Calling them that would be completely dishonest IMO: they would have to fill that extra logic space with rocks, or change the R700 design so drastically you couldn't call it R700 any longer. R800 had larger ALUs and is VLIW5, but would have the issue of being DX11 compatible, and Matt would have to either be lying to us or simply have outdated information or misinformation.

I think on this topic all of the above options listed in the 5th paragraph are more likely than the idea posed in the 6th. I still would like to hear from Fourth Storm how he approached the problem.
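To make the size argument concrete, here is a rough sketch (in Python) of the kind of area math being argued over. Every number in it is an illustrative placeholder, not a real measurement from the die shots, and the 55nm-to-40nm shrink factor is an assumption:

```python
# Rough area check: how many R700-style ALUs could fit in one of
# Latte's shader (SPU) blocks? All figures are illustrative
# placeholders, not real die-shot measurements.

def alus_that_fit(block_area_mm2, alu_area_55nm_mm2,
                  scale_55_to_40=0.62, packing_efficiency=1.0):
    """Estimate how many ALUs fit in a block of the given area.

    scale_55_to_40: assumed area shrink going from 55nm to 40nm
    (ideal scaling would be (40/55)^2 ~= 0.53; real designs shrink less).
    packing_efficiency: > 1.0 models a tighter hand-packed layout.
    """
    alu_area_40nm_mm2 = alu_area_55nm_mm2 * scale_55_to_40
    return int(block_area_mm2 * packing_efficiency / alu_area_40nm_mm2)

# Hypothetical figures: a 1.0 mm^2 block and a 0.045 mm^2 RV770 ALU.
print(alus_that_fit(1.0, 0.045))                           # straight shrink
print(alus_that_fit(1.0, 0.045, packing_efficiency=1.15))  # denser layout
```

With those placeholder numbers the straight shrink lands in the mid 30s and a tighter layout around 40, which is the shape of the argument above; real conclusions would need real measurements.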
 
That won't work, because next gen all single-player games are going to be on the scale of MMOs, and all MMOs are going to be Rift-supported near-reality experiences. That'll be a lot of work to scale down.
Dude, off topic, but a Rift first-person MMO could possibly be one of the best ideas I have heard in ages. I'm getting Sword Art Online vibes here.
 
Couldn't they technically shove the GPU into the next portable, after more die shrinks and such?

For months now I've had this exact theory: the DS4 would just be the Wii U's hardware shrunk down to a 10nm chip (likely all on the same die). It is only using 30 watts when not using the drive right now, so it would be down to 2 or 3 watts by then. (This would be possible in 2016.)
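A minimal sketch of the shrink arithmetic behind that 30 W to 2-3 W guess, assuming roughly half the power per full process node; that rule of thumb is an optimistic assumption, not a guarantee:

```python
# Speculative back-of-envelope math only: project the Wii U's ~30 W
# draw through successive die shrinks, assuming ~50% power reduction
# per full node (optimistic; leakage and I/O don't scale this well).

def shrink_power(watts, full_nodes, reduction_per_node=0.5):
    return watts * reduction_per_node ** full_nodes

# 40nm -> 28 -> 20 -> 14 -> 10 counted as four full-node steps.
print(round(shrink_power(30, 4), 1))  # ~1.9 W under these assumptions
```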

It makes a great deal of sense for Nintendo going forward, too, because you can scale up and evolve the Wii U's architecture and add more cores and a higher CPU frequency for the Wii U's successor.

All their VC work and account/OS work done on the Wii U could move forward. It could also be adapted to phones shortly afterwards, if that is the direction they go with a third platform (Iwata has mentioned wanting to make more than just a handheld and a console going forward).

The main benefit, though, would be producing games for more than one platform at a time. You could have a team work on, say, Mario U's sequel for the U towards the end of the console's life, finish the game, have the developers tweak it like Galaxy 2, and release it on the 3DS's successor without issue. It would also work for the Wii U's successor, using the resources from one team to produce multiple titles over the course of 6 months or a year of extra development rather than starting over from scratch.

Of course you could even produce the same game for both platforms, something 3rd parties might target, or certain Nintendo games like Smash series.
 
I agree that this seems to be the way they're headed. I don't know how 3rd parties would deal with that though, with the rest of the world on x86. Probably not well.
 
The debate about 160 ALUs and 320 ALUs has come up again, so since Fourth Storm and others have started to think it has 160 ALUs, have they come up with an explanation for why they would be ~70% bigger than R700 ALUs?

Seeing as how ~30-35 R700 series ALUs would fit without change in the SPUs shown on our die shots, isn't it more reasonable to expect some R700 logic to have been removed, or a newer process allowing for tighter packing of physical logic? Because if you look at the die shot, it isn't pretty; things have been deformed to fit into areas of the chip to save on room.

Math is a great tool for understanding things, but it usually requires all the variables gathered into an equation for that to lead to success. I highly doubt that these ALUs are so much bigger than R700 ALUs unless they are not R700 ALUs at all but rather R800; Matt has said this isn't the case, and he is a trusted source, so 160 ALUs is likely wrong. It is probably tighter packing, combined with the new lower-powered 40nm process (which didn't exist in 2009) and some unnecessary logic being removed; obviously PC GPUs do things that game console GPUs do not need to do.

This is a very weird GPU, but 160 ALUs has never made sense, even when I first declared it at the beginning of this thread. One thing I can say for sure, though, is that this is a very low-powered piece of tech with embedded design written all over it, not to mention the MCM bringing the power down further, as well as design advancements and the lower-powered 40nm process.

The size of the SPUs would actually indicate VLIW4 (32 ALUs in each SPU), but no one wants to even humor the suggestion because of the memory units supplied with the SPUs. For size, though, 30 VLIW5 R700 ALUs, or 32 VLIW4 R900 ALUs, or 35 tightly packed VLIW5 R700 ALUs, or 40 tightly packed R700 ALUs with some logic removed, all fit the bill better than 20 R700 ALUs (per SPU).

If it were 20 ALUs, they just couldn't be R700; they are far too big. Calling them that would be completely dishonest IMO: they would have to fill that extra logic space with rocks, or change the R700 design so drastically you couldn't call it R700 any longer. R800 had larger ALUs and is VLIW5, but would have the issue of being DX11 compatible, and Matt would have to either be lying to us or simply have outdated information or misinformation.

I think on this topic all of the above options listed in the 5th paragraph are more likely than the idea posed in the 6th. I still would like to hear from Fourth Storm how he approached the problem.

Very good post. Could it even be possible, by trimming unnecessary logic and packing more tightly due to the improved process, to actually have more than 40 ALUs in each?
 
I agree that this seems to be the way they're headed. I don't know how 3rd parties would deal with that though, with the rest of the world on x86. Probably not well.

The rest of the world would actually largely be on ARM in just a few years. The Samsung Galaxy S3 sold 30 million units in a month last October, IIRC. That is something console makers just can't compete with, and once these phones reach parity with the 360 (probably just as 360 development dies down in 2 or 3 years), another technology will allow phones to invade the living room: Wireless HD. Phones could start coming with wireless HDMI sticks that let the phone automatically connect to the home TV and play a session of Call of Duty with a favorite Bluetooth controller. (Some TVs will already have Wireless HD tech built into the set, allowing phones to connect directly without extra hardware.)

The reality is that x86, ARM and PPC all have mature compilers already, allowing code to be easily recompiled for the different CPU architectures. This shouldn't be a barrier at all.

Very good post. Could it even be possible, by trimming unnecessary logic and packing more tightly due to the improved process, to actually have more than 40 ALUs in each?

I'm not an expert on how much you could trim R700, but since it is VLIW5 you'd need the count divisible by 5, so I wouldn't expect it.
 
The debate about 160 ALUs and 320 ALUs has come up again, so since Fourth Storm and others have started to think it has 160 ALUs, have they come up with an explanation for why they would be ~70% bigger than R700 ALUs?

Seeing as how ~30-35 R700 series ALUs would fit without change in the SPUs shown on our die shots, isn't it more reasonable to expect some R700 logic to have been removed, or a newer process allowing for tighter packing of physical logic? Because if you look at the die shot, it isn't pretty; things have been deformed to fit into areas of the chip to save on room.

Math is a great tool for understanding things, but it usually requires all the variables gathered into an equation for that to lead to success. I highly doubt that these ALUs are so much bigger than R700 ALUs unless they are not R700 ALUs at all but rather R800; Matt has said this isn't the case, and he is a trusted source, so 160 ALUs is likely wrong. It is probably tighter packing, combined with the new lower-powered 40nm process (which didn't exist in 2009) and some unnecessary logic being removed; obviously PC GPUs do things that game console GPUs do not need to do.

This is a very weird GPU, but 160 ALUs has never made sense, even when I first declared it at the beginning of this thread. One thing I can say for sure, though, is that this is a very low-powered piece of tech with embedded design written all over it, not to mention the MCM bringing the power down further, as well as design advancements and the lower-powered 40nm process.

The size of the SPUs would actually indicate VLIW4 (32 ALUs in each SPU), but no one wants to even humor the suggestion because of the memory units supplied with the SPUs. For size, though, 30 VLIW5 R700 ALUs, or 32 VLIW4 R900 ALUs, or 35 tightly packed VLIW5 R700 ALUs, or 40 tightly packed R700 ALUs with some logic removed, all fit the bill better than 20 R700 ALUs (per SPU).

If it were 20 ALUs, they just couldn't be R700; they are far too big. Calling them that would be completely dishonest IMO: they would have to fill that extra logic space with rocks, or change the R700 design so drastically you couldn't call it R700 any longer. R800 had larger ALUs and is VLIW5, but would have the issue of being DX11 compatible, and Matt would have to either be lying to us or simply have outdated information or misinformation.

I think on this topic all of the above options listed in the 5th paragraph are more likely than the idea posed in the 6th. I still would like to hear from Fourth Storm how he approached the problem.
Yes, I thought just like you that 30 fit perfectly, but... it was shot down on beyond3d.

"Yes. The SPs come in groups of 5, I call them VLIW groups. 20 SPs are 4 groups, 40 SPs are 8 groups, both a power of 2. 30 SPs would require 6 groups per block, which is quite unlikely considering the power of two number of SRAM banks for the registers (which generally scale with the number of groups)."

http://beyond3d.com/showthread.php?t=60501&page=195

40 was also shot down on beyond3d based on its size. So that leaves only 20.
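The quoted power-of-two argument is easy to check mechanically. A small sketch, assuming only the two constraints stated in the quote (SPs come in groups of 5, and the group count per block should be a power of two):

```python
# Which per-block SP counts satisfy both constraints from the quote:
# SPs come in VLIW groups of 5, and the register SRAM banks favor a
# power-of-two number of groups per block.

def plausible_vliw5_counts(max_sps=64):
    counts = []
    for sps in range(5, max_sps + 1, 5):      # multiples of 5 only
        groups = sps // 5
        if groups & (groups - 1) == 0:        # power-of-two group count
            counts.append(sps)
    return counts

print(plausible_vliw5_counts())  # [5, 10, 20, 40]
```

Under those constraints 30 indeed falls out, leaving 20 or 40 as the realistic per-block candidates, which is exactly the fork the thread is stuck on.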
 
Yes, I thought just like you that 30 fit perfectly, but... it was shot down on beyond3d.

"Yes. The SPs come in groups of 5, I call them VLIW groups. 20 SPs are 4 groups, 40 SPs are 8 groups, both a power of 2. 30 SPs would require 6 groups per block, which is quite unlikely considering the power of two number of SRAM banks for the registers (which generally scale with the number of groups)."

http://beyond3d.com/showthread.php?t=60501&page=195

Generally yes; however, 32 ALUs with 2 disabled or not present could be done. The real issue, though, is that 20 doesn't fit unless they are much larger than R700.
 
Yes, I thought just like you that 30 fit perfectly, but... it was shot down on beyond3d.

"Yes. The SPs come in groups of 5, I call them VLIW groups. 20 SPs are 4 groups, 40 SPs are 8 groups, both a power of 2. 30 SPs would require 6 groups per block, which is quite unlikely considering the power of two number of SRAM banks for the registers (which generally scale with the number of groups)."

http://beyond3d.com/showthread.php?t=60501&page=195

40 was also shot down on beyond3d based on its size. So that leaves only 20.

Using the size argument against 40 is short-sighted IMO; using zombie's figures it's not a huge difference, and we do know it's on a very rarely used process that can be used for increased densities.
 
Someone more knowledgeable might be able to answer this, but is it possible that they took R700 and evolved it into something other than VLIW5? I've speculated in the past that they could even be organizing the 4 main ALUs in VLIW5 and grouping them in series of 4, making 2 groupings in each SPU, which is similar to a VLIW4/GCN mix, meaning 32 ALUs in each SPU. We need to remember this is a custom chip, and designs could get a bit bizarre.
 
Really the only thing that fits is 20 per block @ 55nm.....

I don't think we can ever find out the true number, since most people stopped looking months ago at beyond3d and here.

I still think the 55nm process was not fully ruled out. If it was 55nm, the GPU parts would fit perfectly....
 
I'm pretty sure everything we know about the Wii U points to feature parity with the PS4 and Xbox durangorbinfinity. Those two consoles won't be capable of a single effect that the Wii U could not do. The only thing they will do is the same effects on a larger scale, meaning that if a dev wanted, they could easily downport it.

You can downport anything to a point where the result is so far removed from the original that it's unrecognizable. The PS4/XBOXi will provide 4-10 times the performance of the Wii U; at that point it's called butchering, not porting.
 
Hey folks. I see my moniker has popped up in these discussions, so I suppose it's time to step in and clarify. Firstly, yes, I do still believe that a 160:8:8 configuration makes the most sense of what we're looking at. Allow me to give you my reasoning in as succinct a way as possible. And remember, while I do follow technology and whatnot, I am not a hardware engineer or a game programmer, so I am largely building off the work of others. Regardless, I've spent a fair amount of my leisure time comparing the die shot, reading the analyses of others, and trying to test different theories in an unbiased fashion.

In the case of the SPU count, this early post over on beyond3D was one of the first to make me raise an eyebrow and question how the shader blocks could possibly hold more than 20 shaders each. Emphasis is mine.

Gipsel said:
I agree. But I must also say that the GPU and the layout of the SIMDs looks a bit strange. The size of the SIMD blocks would be consistent with a ~15% higher density layout than one sees in Brazos. Not completely impossible given the maturity of 40nm, AMD's experience with it, and the low clock target, especially if it uses an older iteration of the VLIW architecture (DX10.1 R700 generation instead of DX11 R800 generation) as base.
But there is more. I think function noticed already the halved number of register banks in the SIMDs compared to other implementations of the VLIW architecture. I glossed over that by saying that each one holds simply twice the amount of data (8kB instead of 4kB) and everything is fine. It's not like the SRAM stuff takes significantly less space on the WiiU die than it takes on Brazos (it's roughly in line with the assumed generally higher density).
But thinking about it, each VLIW group needs parallel access to a certain number (four) of individually addressed register banks each cycle. The easiest way to implement this is to use physically separate banks. That saves the hassle of implementing multiported SRAM (but is also the source of some register read port restrictions of the VLIW architectures). Anyway, if each visible SIMD block would be indeed 40 SPs (8 VLIW groups), there should be 32 register banks (as there are on Brazos as well as Llano and Trinity [btw., Trinity's layout of the register files of the half SIMD blocks looks really close to the register files of GCN's blocks containing two vALUs]). But there are only 16 (but obviously twice the size if we are going with the 15% increased density). So either they are dual ported (then the increased density over Brazos is even more amazing) or something really fishy is going on. Before the Chipworks guy said the GPU die is 40nm TSMC (they should be able to tell), I would have proposed to think again about that crazy sounding idea of a 55nm die (with then only 160SPs of course).

http://beyond3d.com/showpost.php?p=1702908&postcount=4495

Since Gipsel posted this, it was concluded that the SRAM in the SPU blocks is not dual ported. Also, it seems like each SRAM block holds 4kB and not 8kB. I arrived at this by comparing the SRAM blocks to the smaller ones on the bottom of Latte identified by Marcan (check the OP for that image) as 2 kB. The SRAM used as GPRs for the shaders are exactly twice as long as those 2kB blocks. Other than that, they appear identical, so a differing density seems highly unlikely (unlike the SRAM used in the 1MB pool of texture cache in the upper left of the chip - that appears to be more dense and with such a large amount necessary, it's unsurprising).

Thus, it appears that each shader block can only hold 20 SPUs; that is unless Nintendo have actually cut register space to the shaders (the exact opposite of what Matt once reported, although that information seems a bit shaky since it didn't make much sense to people familiar w/ the ISA). As to why they are the size they are, we can really only guess, but there are a few factors which may come into play:

a) We've assumed perfect scaling from the 55nm RV770, which is usually not the case.
b) Renesas' 40nm process may be less dense than TSMC's (which is known for being incredibly dense). They may lose some density in making the process eDRAM friendly.
c) There may be extra logic in the shader blocks that runs the shim layer (the compatibility layer that performs translation), as Marcan described it. The 8-bit CPU he mentioned is specifically for converting the Wii video output to the format now used by Radeons. There is other logic on there to handle TEV instruction translation. I don't know exactly where it is, but it could very well be right there in the shader blocks.
d) Other small tweaks could have been implemented that make the shaders somewhat larger. DirectX 11 SPUs have some additional logic in there to support the new features of the API, so perhaps Nintendo added something analogous for whatever features above DirectX 10.1 they decided to include.

Edit: Also, 55nm is not completely ruled out, but it does seem unlikely. I spoke to Jim Morrison myself on this, and he stated that the differences are very small and require some precise measurements to detect. For what it is worth, I did some measurements on MEM1 and it seems to fall in line with the reported cell size of Renesas' 40nm eDRAM.
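The register-bank reasoning in the quoted Gipsel post can be sketched the same way. This assumes, as the quote does, that each VLIW group reads from 4 individually addressed register banks per cycle and holds 5 SPs; the 16-bank figure is the one reported from the die shot:

```python
# Derive the per-block SP count from the visible register-bank count,
# following the quoted beyond3d reasoning.

BANKS_PER_VLIW_GROUP = 4   # each group reads 4 separate banks per cycle
SPS_PER_VLIW_GROUP = 5     # VLIW5

def sps_from_banks(visible_banks, dual_ported=False):
    # Dual-ported SRAM would let each physical bank serve two groups.
    effective_banks = visible_banks * (2 if dual_ported else 1)
    groups = effective_banks // BANKS_PER_VLIW_GROUP
    return groups * SPS_PER_VLIW_GROUP

print(sps_from_banks(16))                    # single-ported: 20 SPs/block
print(sps_from_banks(16, dual_ported=True))  # dual-ported: 40 SPs/block
```

With 8 shader blocks, those two cases are exactly the 160 vs 320 ALU totals; concluding the SRAM is not dual ported is what pushes the estimate to 160.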
 
Really the only thing that fits is 20 per block @ 55nm.....

I don't think we can ever find out the true number, since most people stopped looking months ago at beyond3d and here.

I still think the 55nm process was not fully ruled out. If it was 55nm, the GPU parts would fit perfectly....

Seriously, no. There is no way Nintendo would've designed a system so much around low power usage and gone with 55nm.
 
Really the only thing that fits is 20 per block @ 55nm.....

I don't think we can ever find out the true number, since most people stopped looking months ago at beyond3d and here.

I still think the 55nm process was not fully ruled out. If it was 55nm, the GPU parts would fit perfectly....

Is it the only thing that conceivably fits, or is it just what you're hoping for?

As was said, those would be way bigger than stock R700. And 55nm just seems wrong considering Nintendo's priorities with the system.

EDIT: beaten
 
Is it the only thing that conceivably fits, or is it just what you're hoping for?

As was said, those would be way bigger than stock R700. And 55nm just seems wrong considering Nintendo's priorities with the system.

EDIT: beaten

Well given his history of posts...
 
You can downport anything to a point where the result is so far removed from the original that it's unrecognizable. The PS4/XBOXi will provide 4-10 times the performance of the Wii U; at that point it's called butchering, not porting.
The single most important differential is going to be the RAM volume. Everything else is manageable.
 
Thanks for the reply. 160 ALUs is possible; I guess it really comes down to the SRAM unit sizes being non-dual-ported (I didn't know that had been reasonably concluded), and them being 4kB would fit too. Not sure how likely 55nm is; the 32MB of embedded cache would be very dense, impossibly so AFAIK. One of the reasons we figured it was a low-powered 40nm process, IIRC, is that 32MB is impossible at that density. The shader units are very odd; it is also possible that the 40nm process used here is simply more dense than TSMC's. Too bad there are no rogue AMD employees around to just give us the truth.
 
The single most important differential is going to be the RAM volume. Everything else is manageable.

What if there are serious amounts of gameplay-relevant physics or AI involved?
Probably not relevant for many types of games, but I wouldn't rule that out completely.
 
Thanks for the reply. 160 ALUs is possible; I guess it really comes down to the SRAM unit sizes being non-dual-ported (I didn't know that had been reasonably concluded), and them being 4kB would fit too. Not sure how likely 55nm is; the 32MB of embedded cache would be very dense, impossibly so AFAIK. One of the reasons we figured it was a low-powered 40nm process, IIRC, is that 32MB is impossible at that density. The shader units are very odd; it is also possible that the 40nm process used here is simply more dense than TSMC's. Too bad there are no rogue AMD employees around to just give us the truth.

It's also possible that we're dealing with a VLIW4 chip. I don't have a keen enough understanding to look at a chip and tell the difference, but AMD went to VLIW4 for its APUs primarily for more energy efficiency and to save transistors, both of which seem to align with Nintendo's design goals.

Edit: AMD does have a few APUs that are in a 256:16:8 configuration.
 
What if there are serious amounts of gameplay-relevant physics or AI involved?
Probably not relevant for many types of games, but I wouldn't rule that out completely.
I was speaking generally. Otherwise, of course, there might be titles which make use of all their available resources to such a degree that it would not make sense to downport them.
 
The more I look at the Wii U, the more sure I become that Nintendo was building a handheld for 2017 rather than a console for next gen. I don't think it will have a bearing on the Wii U's sales, because thanks to the improvements to GPUs over the last 4 years it will easily output beyond Xenos even at 160 ALUs, given all of Xenos' bottlenecks (the only thing that wasn't bottlenecked in the system was the ROPs, IMO).

Wii U games should have better lighting, textures, larger environments, etc., while Nintendo will be able to put the chip in a smaller package in 4 years and resell it as a new handheld.

It's also possible that we're dealing with a VLIW4 chip. I don't have a keen enough understanding to look at a chip and tell the difference, but AMD went to VLIW4 for its APUs primarily for more energy efficiency and to save transistors, both of which seem to align with Nintendo's design goals.

Edit: AMD does have a few APUs that are in a 256:16:8 configuration.

I don't really understand why Nintendo would go with VLIW5, especially with size being a concern. IIRC VLIW4 ALUs come in groups of 16, so you'd have 32 ALUs for each block. I don't know how the memory would be affected by this, though; I skipped that generation of AMD cards, so my understanding of them is just basic.
 
Thanks for the reply. 160 ALUs is possible; I guess it really comes down to the SRAM unit sizes being non-dual-ported (I didn't know that had been reasonably concluded), and them being 4kB would fit too. Not sure how likely 55nm is; the 32MB of embedded cache would be very dense, impossibly so AFAIK. One of the reasons we figured it was a low-powered 40nm process, IIRC, is that 32MB is impossible at that density. The shader units are very odd; it is also possible that the 40nm process used here is simply more dense than TSMC's. Too bad there are no rogue AMD employees around to just give us the truth.

It's also possible that we're dealing with a VLIW4 chip. I don't have a keen enough understanding to look at a chip and tell the difference, but AMD went to VLIW4 for its APUs primarily for more energy efficiency and to save transistors, both of which seem to align with Nintendo's design goals.

Edit: AMD does have a few APUs that are in a 256:16:8 configuration.

The register space in the shader blocks is indicative of exactly 20 SPUs each. Each shader block has 64kB SRAM by my measurements. I don't know about VLIW4, but I doubt they decreased register space in those cards. There should simply be more SRAM banks in those blocks if they contain anything over 20 SPUs.

Oh, and there is some dual ported SRAM on Latte to compare against. It's the stuff w/ the reddish tint.
 
The more I look at the Wii U, the more sure I become that Nintendo was building a handheld for 2017 rather than a console for next gen. I don't think it will have a bearing on the Wii U's sales, because thanks to the improvements to GPUs over the last 4 years it will easily output beyond Xenos even at 160 ALUs, given all of Xenos' bottlenecks (the only thing that wasn't bottlenecked in the system was the ROPs, IMO).

Wii U games should have better lighting, textures, larger environments, etc., while Nintendo will be able to put the chip in a smaller package in 4 years and resell it as a new handheld.



I don't really understand why Nintendo would go with VLIW5, especially with size being a concern. IIRC VLIW4 ALUs come in groups of 16, so you'd have 32 ALUs for each block. I don't know how the memory would be affected by this, though; I skipped that generation of AMD cards, so my understanding of them is just basic.

If they only have to focus on software for one console they would probably manage a decent output even with no major third party support.
 
The register space in the shader blocks is indicative of exactly 20 SPUs each. Each shader block has 64kB SRAM by my measurements. I don't know about VLIW4, but I doubt they decreased register space in those cards. There should simply be more SRAM banks in those blocks if they contain anything over 20 SPUs.

Oh, and there is some dual ported SRAM on Latte to compare against. It's the stuff w/ the reddish tint.
Also, 20 SPs sounds lazy and unlike Nintendo when you consider how custom their hardware has been.
VLIW4 is a slightly modified VLIW5: in VLIW5 there were 4 units for simple instructions (x, y, z, w) and a fifth one (t) for the complicated ones. In VLIW4 they dropped the t unit and upgraded the remaining x, y, z, w units to handle those complicated transcendental instructions, leaving 4 equivalent units bundled in a VLIW.

GCN was designed from scratch and is roughly like an x86 processor with 2048-bit SSE support.
- You can program the scalar ALU like a classic x86 CPU: you can change the program counter, for example. In VLIW the only program flow elements were an IF/ELSE block, a LOOP block, and an EXIT instruction. On GCN you can have subroutines, for example, so much bigger programs can fit in the small 32KB instruction cache.
- The scalar ALU works in parallel with the 64-element vector ALU. It is possible to write a loop that wastes only 1 cycle on loop management code. On VLIW the loop overhead can be as much as 10-40 cycles.
- No complicated register access. On VLIW it was very complicated to feed 3*4 input parameters, as they were read from 3x 16-byte parts of the register bank.
- 50% smaller instruction encoding (there are 32-bit instructions too, not just 64-bit ones). That's why the instruction cache was reduced from 48KB to 32KB. And less cache means more space for additional computing units.

- As others said earlier: absolutely no need for code vectorization. Every 16-wide SIMD unit processes 4*16 workitems (1 wavefront) in a pipeline with 4 stages. That is 2x more than in VLIW, and that's why GCN needs 2x more minimum workitems than VLIW. For a Tahiti that's a minimum of 8192 workitems.
- If you like to program in asm, GCN is much simpler to program than VLIW. Back then I didn't have enough courage for the extremely complicated VLIW asm, but this new language is even simpler than AMD_IL. It's very well designed; I can't say a bad thing about it.
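The 8192-workitem figure quoted above can be reproduced directly from the GCN layout described: one 64-item wavefront per SIMD, four SIMDs per compute unit, and 32 compute units on Tahiti:

```python
# Reproducing the "minimum of 8192 workitems for Tahiti" figure.
# Each GCN compute unit has four 16-wide SIMDs; a SIMD pipelines one
# 64-item wavefront over 4 cycles, so keeping every SIMD busy needs
# at least one wavefront per SIMD.
WAVEFRONT = 4 * 16        # 64 workitems per wavefront
simds_per_cu = 4
tahiti_cus = 32           # Radeon HD 7970 ("Tahiti")

min_workitems = tahiti_cus * simds_per_cu * WAVEFRONT
print(min_workitems)  # 8192
```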

If they went this route it could make sense... is TEV going to be easily compatible with VLIW5? Because those shaders can't all do the same tasks, while VLIW4 and GCN are better suited for this task, IMO. Anyway, if this is the case it would solve the size and cache size of the SPUs. I just think they are too big to be 20 ALUs, especially now that we know TEV logic is being rerouted by an 8-bit CPU; that's something you'd likely not keep inside the SPU.

If they only have to focus on software for one console they would probably manage a decent output even with no major third party support.
Exactly my thinking. They would have 2 platforms, but since the code and resources would be compatible it would nearly double their output, not to mention that they could launch hardware without droughts, since future hardware could be upscaled/evolved forms of Wii U.
 
The more I look at Wii U, the more sure I become that Nintendo was building a handheld for 2017 rather than a console for next gen. I don't think it will have a bearing on Wii U's sales, because with the improvements to GPUs over the last 4 years it will easily output beyond Xenos even at 160 ALUs, thanks to all of Xenos' bottlenecks (the only thing that wasn't bottlenecked in the system was the ROPs, IMO).

Wii U games should have better lighting, textures, larger environments, etc., while Nintendo will be able to put it in a smaller package in 4 years and resell it as a new handheld.
I always thought Nintendo had been investing heavily in miniaturization from the start, seeing how the trend seems to be swinging more towards mobile devices that can connect to everything than stationary setups. That said, Wii U is a rather costly endeavour for them and it's showing (most of their weaknesses). The question is whether they can solve these issues.
 
The register space in the shader blocks is indicative of exactly 20 SPUs each. Each shader block has 64kB SRAM by my measurements. I don't know about VLIW4, but I doubt they decreased register space in those cards. There should simply be more SRAM banks in those blocks if they contain anything over 20 SPUs.

Oh, and there is some dual ported SRAM on Latte to compare against. It's the stuff w/ the reddish tint.

I was under the impression that VLIW4 and GCN grouped SRAM in 64kB register files. I know that's the case for GCN compute units, as I read the AMD GCN white paper. VLIW4 clusters are also about 10% smaller than VLIW5 clusters on the same process.

I'm not saying I know anything new. I'm just not sure all of our assumptions are correct. I could also be very wrong.
 
I was under the impression that VLIW4 and GCN grouped SRAM in 64kB register files. I know that's the case for GCN compute units, as I read the AMD GCN white paper. VLIW4 clusters are also about 10% smaller than VLIW5 clusters on the same process.

I'm not saying I know anything new. I'm just not sure all of our assumptions are correct. I could also be very wrong.

No, you are right about GCN. I think there are just a lot of people stuck on trying to shoehorn this onto an old AMD design when there are better, more energy-efficient designs available. (Not to say Fourth Storm is doing this; he is trying his best to draw logical conclusions from the die shots.)

My quote above is from realhet @ http://devgurus.amd.com/message/1296839
 
You have everything else on the Wii U with much less geometry. And I wouldn't expect Platinum to make big progress with tech on Wii U, given all their previous releases.



Darksiders 2 does.

Darksiders 2 was riddled with glitches, issues, and flaws, and on both a technical and graphical level was doing nowhere near as much as any other port released on the Wii U.

That is an example of a bad, unfinished port and nothing more. It would be like saying the PC was weaker than the Gamecube because the RE4 port was missing so many things from it.

The debate about 160 ALUs and 320 ALUs has come up again, so since Fourth Storm and others have started to think it has 160 ALUs, have they come up with an explanation for why they would be ~70% bigger than R700 ALUs?

Seeing as how ~30-35 R700-series ALUs would fit unchanged in the SPUs shown on our die shots, isn't it more reasonable to expect some R700 logic to have been removed, or a newer process allowing for tighter packing of physical logic? Because if you look at the die shot it isn't pretty; things have been deformed to fit into areas of the chip to save room.

Math is a great tool for understanding things, but it usually requires all the variables to be gathered into an equation for it to succeed, and I highly doubt that these ALUs are so much bigger than R700 ALUs unless they are not R700 ALUs at all but rather R800. Matt has said this isn't the case, though, and he is a trusted source, so 160 ALUs is likely wrong. It is probably tighter packing, combined with the newer low-power 40nm process (which didn't exist in 2009) and some unnecessary logic being removed; obviously PC GPUs do things that game console GPUs do not need to do.

This is a very weird GPU, but 160 ALUs has never made sense, even when I first suggested it at the beginning of this thread. One thing I can say for sure is that this is a very low-power piece of tech with embedded design written all over it, not to mention the MCM bringing the power down further, as well as design advancements and the low-power 40nm process.

The size of the SPUs would actually indicate VLIW4 (32 ALUs in each SPU), but no one wants to even humor the suggestion because of the memory units supplied with the SPUs. For size though, 30 VLIW5 R700 ALUs, or 32 VLIW4 R900 ALUs, or 35 tightly packed VLIW5 R700 ALUs, or 40 tightly packed R700 ALUs with some logic removed all fit the bill better than 20 R700 ALUs (per SPU).
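For reference, the totals implied by each per-SPU hypothesis, assuming the die shot shows 8 shader blocks (the block count behind the 20-per-block = 160 ALU reading; treat it as an assumption here):

```python
# Total ALU count implied by each per-block hypothesis, assuming
# 8 shader blocks on the die (an assumption, per the 20 x 8 = 160
# reading elsewhere in the thread).
BLOCKS = 8
for per_block in (20, 30, 32, 35, 40):
    print(per_block, "per block ->", per_block * BLOCKS, "ALUs total")
# 20 -> 160, 30 -> 240, 32 -> 256, 35 -> 280, 40 -> 320
```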

If it were 20 ALUs, they just couldn't be R700; they are far too big. Calling them that would be completely dishonest, IMO; they would have to fill that extra logic space with rocks, or change the R700 design so drastically you couldn't call it R700 any longer. R800 had larger ALUs and is VLIW5, but would have the issue of being DX11 compatible, and Matt would have to either be lying to us or simply have outdated information or be misinformed.

I think all of the options listed in the 5th paragraph above are more likely than the idea posed in the 6th. I still would like to hear from Fourth Storm how he approached the problem.

This is precisely why I lean heavily towards it being derived from the HD 5550. Everything just seemed to match up perfectly. The only thing keeping it from being a match is people's insistence that it is based on the R7XX instead of the R8XX. Given that it is so heavily customized, it could very well be based on both. People just seem to have this aversion to upping anything beyond the lowest logical estimate with Nintendo hardware.

http://www.neogaf.com/forum/showthread.php?p=49751859&highlight=320#post49751859
http://www.neogaf.com/forum/showpost.php?p=47955116&postcount=2487
 
This is precisely why I lean heavily towards it being derived from the HD 5550. Everything just seemed to match up perfectly. The only thing keeping it from being a match is people's insistence that it is based on the R7XX instead of the R8XX.

http://www.neogaf.com/forum/showthread.php?p=49751859&highlight=320#post49751859
http://www.neogaf.com/forum/showpost.php?p=47955116&postcount=2487
Actually, I'm thinking it is highly custom and shares similarities with GCN. The SPUs are too big to house only 20 ALUs apiece, but VLIW5 requires more memory on each unit, whereas GCN reduced memory sizes. VLIW5 also isn't a good match for TEV, since the shaders are not all equal and would limit any sort of brute-force approach needed to match TEV.

It is likely 32 ALUs per SPU with a custom design borrowing from GCN (already planned in 2009), giving 256 ALUs with performance around 282 GFLOPs.
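That ~282 GFLOPs figure follows from the standard peak-throughput formula, FLOPS = ALUs × 2 (multiply-add) × clock, using the reported 550 MHz Latte GPU clock. For comparison, the other ALU counts debated here:

```python
# Peak GFLOPs = ALUs x 2 (one fused multiply-add per cycle) x clock.
# 550 MHz is the reported Wii U GPU clock.
CLOCK_GHZ = 0.550
for alus in (160, 256, 320):
    print(alus, "ALUs ->", round(alus * 2 * CLOCK_GHZ, 1), "GFLOPs")
# 160 -> 176.0, 256 -> 281.6, 320 -> 352.0
```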
 
From what I understand, it would be impossible to clock this CPU to 3.25 GHz because of how short the pipeline is.

More so than that, it requires you to believe Nintendo intentionally released a product knowing that it was horribly gimped, then had developers invest thousands of dollars developing for this gimped version of the platform, and then planned to unleash its full potential 6-8 months after release.

It defies credulity, both in terms of business sense and technological limitations.
 
From what I understand, it would be impossible to clock this CPU to 3.25 GHz because of how short the pipeline is.

It would also make it extremely fast, something that would compete with any top-end CPU (not FLOPs-wise, though it would obviously do a lot better than it currently does).

Anyway, it isn't realistic; you'd probably need to shrink the die to something like 7nm to get that sort of clock out of it, which won't be possible until the end of the decade.
 
Please, I am only asking this to be damn sure, but the rumor of a CPU and GPU clock bump has been debunked, correct?

I think it is highly unlikely. It's kind of sad that my Samsung Galaxy S2 has the same clock speed as the Wii U CPU... maybe even higher.


However, it would have been better if Espresso had more than 3 cores...
 
It really isn't as sad as you think it is.


I know nothing about hardware stuff, but I noticed that the 2 front USB ports are connected to the GPU/CPU, but the ones on the back aren't. Would it be possible for Nintendo to release a small "USB CPU, RAM or GPU" upgrade that used both front USB ports at the same time (for maximum bandwidth)? Small enough to even close the front flap, so it would be hidden.
 
I know nothing about hardware stuff, but I noticed that the 2 front USB ports are connected to the GPU/CPU, but the ones on the back aren't. Would it be possible for Nintendo to release a small "USB CPU, RAM or GPU" upgrade that used both front USB ports at the same time (for maximum bandwidth)? Small enough to even close the front flap, so it would be hidden.

No. USB was not built for such things. Far too slow.
 
I think it is highly unlikely. It's kind of sad that my Samsung Galaxy S2 has the same clock speed as the Wii U CPU... maybe even higher.


However, it would have been better if Espresso had more than 3 cores...

Clock speed hasn't been the main indicator of power/performance in a decade. If it were, then the PS4/Xbox3 CPUs would be weaker than the 360/PS3 CPUs.

I find it ironic how people compare clocks when talking about the 360/PS3 vs. the Wii U (preaching that the low clock makes it weaker), but not when comparing them to their next-gen counterparts.

Please, I am only asking this to be damn sure, but the rumor of a CPU and GPU clock bump has been debunked, correct?

No, it hasn't. It depends on how you look at it as well. A lot of people here only focus on the 3 GHz bump claim, but completely refrain from addressing there being a lower bump.

What we do know is that numerous people have testified to an increase in performance in various games on the Wii U since the update. It could very well just be a small bump from 1.2 to 1.3 or 1.6 GHz, which would be feasible. This should be taken to the CPU thread though: http://www.neogaf.com/forum/showthread.php?t=513471 This thread is about the GPU.
 
Power consumption is also something that points at 160 ALUs. It would match the sub-15-watt figure for the GPU.

We don't have anything at all that says they moved from the R700 base. Saying they moved to GCN or whatever just muddies the water.
 