WiiU "Latte" GPU Die Photo - GPU Feature Set And Power Analysis

Jett Rocket's occlusion got defeated by lostinblue showing it was baked. krizzx then said global illumination is an example of it and is used in Galaxy and Metroid Prime 3 (no evidence, but I don't know what it is or how obvious/well known it would be), so I guess that needs discussion/refuting/recategorising.
They don't use global illumination.

Various reasons not to, but the principle of global illumination is accuracy: you're going from good old light sources (the 8 light sources I mentioned last page) into something else, something that reflects and refracts in an accurate way. That's far past Wii capabilities, and it's far past Wii U capabilities in a usable way too.

Those Wii games' lighting didn't focus on accuracy; in fact the rim shading in Mario Galaxy is inaccurate, and that's fine, because it wasn't a realistic game and it looked good. But it's not global illumination, and it puzzles me why anyone would claim it is.

UE4 was touting "real-time global illumination" via voxels as a next-gen feature, and they were pissed because the specs for PS4 and XBone were under what they expected, making real-time global illumination something that might just not happen. The notion of Mario Galaxy or Metroid Prime using it is a fanfic.

"Real" Ambient Occlusion (not SSAO) is a subcategory of Global Illumination yet they're not the same thing, you can have Global Illumination and not even go there, you just can't have "real" Ambient Occlusion sans it.


And I'll say this: I'm not the biggest expert in this thread either. I'm clearly overstepping my boundaries on this one and thus I might be making some errors, as my knowledge of this is superficial. (I render things in 3D, so I have to know what the hell global illumination is versus direct lights, but I don't know the inner workings for real-time; I do know it is intensive and that we're working around it with voxels, a technique I'll admit to not having spent much time studying yet. Also, concepts are not always 1=1 between "offline" rendering tech and real-time implementations.) But still: no such thing as "global illumination" in Wii games.
Yeah, I'm trying to mediate somewhat, but I know I'm only contributing to the brawl since I have no technical knowledge =/ Think I'ma stop posting again and watch the calamities from afar.
Nothing wrong with your participation, mind you; I have no problem trying to explain it for you since this was brought up. I just resent the fact that stuff like this keeps being brought up, because it makes us go in loops, disproving with facts what someone thought probable via "it looked like it was", and they won't even admit to being wrong.

That's really my issue here, without starting a fight all over again.
 
I"m aware. I was just saying that because people make to much fuss out of memory and storage space these days.

RAM will not be the deciding factor in whether or not a game gets ported. Not unless we are dealing with the laziest devs on Earth.

Though, wasn't that 64kb demo made to run on pretty archaic hardware? I remember seeing that thing years and years ago.

I think you're conflating the download size of a game and the amount of space it takes in RAM. Or you're incorrectly lumping them together at least - they're totally different things that have very little to do with each other.
 
Actually, this was not confirmed.

Multiple display support (that was in the press release) doesn't have to imply Eyefinity (which is just being able to hook up 3-6 displays). You can still output to more than one monitor on older GPU cards.

Regardless though, I don't think it means much considering Wii U only needs to output to one screen (and was never explicitly stated to do more).

Edit: Was referring to the GamePad. But lo and behold, Nintendo was only pushing for one display (Off-TV Play), so I'm still kinda right!
Depends on whether you take a watch.impress article as fact or not.

They had lots of good technical articles in the past, albeit in Japanese. But yes, I can see your point: it's not something Nintendo or ATi are claiming on public spec sheets (perhaps not even in dev documentation), and that opens the hypothesis of it not being there. But we have to assume some things, and I assume Eyefinity is likely in because it would come in handy for a console that has 2 viewports at all times and supports a third. Of course, not having it doesn't render whatever it's doing impossible, so the fact it's doing it doesn't necessarily prove anything other than that it's convenient and was mentioned in some pre-release, tech-driven article.

I think it's more likely than Shader Model 5 or OpenGL 4.x (things I don't completely rule out being in, as Latte is not an R700 and AMD's succeeding parts based on the same architecture went further), but perhaps claiming it has it was an error on my part, as it's something that requires this type of disclaimer along with it.

Either way, I stand by what I said; of course, I'd like a confirmation/second source based on different information to be 100% sure, though.
 
lostinblue, you can get proper AO perfectly naturally in path tracing.
 
Shin'en have attested to the huge speed boost that the eDRAM allows.

Of course it's a huge speed boost, over not using the eDRAM and sticking to the DDR3 alone. The quote has no context, and the developer has not worked on other 8th-gen systems. You linked an article saying Sony considered 1TB/s eDRAM and said that would be one high-performing option, although harder to program for, and used it as proof that the Wii U's eDRAM offers such a benefit? I thought we determined that the pinout for the eDRAM means it's either ~70 or ~130GB/s, and more likely the former. This isn't going to be bridging the gap with GDDR5 any time soon.

Anywho, I'm glad that guy is banned. Literally dozens of pages of him arguing with everyone who brought up reality, always backtracking, never admitting he was wrong, turning things around and making it seem like everyone else was the problem when really it was him. Good riddance. That alone will change the tone around here for the better; it got better a few pages back when he was being quiet, then he started up again.
 
Of course it's a huge speed boost, over not using the eDRAM and sticking to the DDR3 alone. The quote has no context, and the developer has not worked on other 8th-gen systems. You linked an article saying Sony considered 1TB/s eDRAM and said that would be one high-performing option, although harder to program for, and used it as proof that the Wii U's eDRAM offers such a benefit? I thought we determined that the pinout for the eDRAM means it's either ~70 or ~130GB/s, and more likely the former. This isn't going to be bridging the gap with GDDR5 any time soon.

Anywho, I'm glad that guy is banned. Literally dozens of pages of him arguing with everyone who brought up reality, always backtracking, never admitting he was wrong, turning things around and making it seem like everyone else was the problem when really it was him. Good riddance. That alone will change the tone around here for the better; it got better a few pages back when he was being quiet, then he started up again.

Although they're the two best guesses, I don't think we had anything to suggest the lower one was more likely, did we?
 
Although they're the two best guesses, I don't think we had anything to suggest the lower one was more likely, did we?

I can't remember and can't be bothered to look it up at the moment, to be honest; I could be wrong. But I seem to remember 70GB/s being the main theory, while the 130GB/s one was a bit of a sideliner?

Although, does it seem likely it would be as fast as or faster than the Xbox One eDRAM if everything else is so much slower?
 
I can't remember and can't be bothered to look it up at the moment, to be honest; I could be wrong.

Although, does it seem likely it would be the same speed as the Xbox One eDRAM if everything else is so much slower?

Without having any evidence, that's a pretty poor reason to base an assumption on. Nintendo could easily be using higher bandwidth there to overcome shortcomings elsewhere; just because most specs are below the XBone doesn't mean every spec has to be (or is).
 
Without having any evidence, that's a pretty poor reason to base an assumption on. Nintendo could easily be using higher bandwidth there to overcome shortcomings elsewhere; just because most specs are below the XBone doesn't mean every spec has to be (or is).

Agreed, and like I edited in, I just seem to remember 70GB/s being the main theory based on the bus width or something, with 130GB/s being the less likely one. I can't remember why those two numbers were the possibilities.
 
Without having any evidence, that's a pretty poor reason to base an assumption on. Nintendo could easily be using higher bandwidth there to overcome shortcomings elsewhere; just because most specs are below the XBone doesn't mean every spec has to be (or is).

The extra bandwidth is not much use for a GPU that cannot use it, and if you accept that it's based on the R700 then there's a good chance it can't effectively use a large amount of bandwidth, as it is more likely to be bottlenecked in other areas.
 
The extra bandwidth is not much use for a GPU that cannot use it, and if you accept that it's based on the R700 then there's a good chance it can't effectively use a large amount of bandwidth, as it is more likely to be bottlenecked in other areas.

Although it's said to be based on the R700, it doesn't look much like one, does it..? Unless I've missed something, wasn't the general consensus that the die shot is closer to Brazos..?
 
Although it's said to be based on the R700, it doesn't look much like one, does it..? Unless I've missed something, wasn't the general consensus that the die shot is closer to Brazos..?
Quoting myself.

'Derived from' doesn't rule out customisations. The likelihood that the core tech of things such as the shader cores would be developed from scratch, or even radically customised, is low, however. Evidence points to a baseline of R700: those chips or something like them were in dev kits, the name popped up repeatedly in rumours and documentation, and even now, post-launch, there is talk of DirectX 10.1 equivalency. I know you'll respond by bringing up the instances in which DirectX 11 was mentioned, but the point is that an R700-derived chip in a closed-box environment could easily use an API that exposes parts of the hardware (e.g. tessellation and compute shaders) that were DirectX 11 exclusive on PC. If it were derived from a fully DirectX 11 compliant chip to begin with, why would 10.1 be mentioned at all? As for Brazos providing the most component matches, I'll quote Fourth Storm:
What do you think most of the analysis from bgassassin, blu, Thraktor, z0m, myself, etc. was grounded on if not comparative analysis with other dies? The Brazos die is how I have reached many of the conclusions I hold on Latte. You are drawing the wrong conclusion from the similarities, however. The things which Latte and Brazos have in common are commonalities shared with all modern GPUs. There's nothing we've identified on Brazos and Latte that would be lacking, for example, in the R700 series. Where Brazos has helped us is that its die photo is much sharper than the die photo going around of the R700.
 
I wish we could get more dev entries about the Wii U version of Project Cars. There was some light shed, but not much. It was still better than this current debate (Wii U vs. PS4... just how many times?).
 
lostinblue, you can get proper AO perfectly naturally in path tracing.
I knew I was overstepping my boundaries on that one, yes; looking it up, it makes sense to go that way rather than the more expensive path I suggested. I really need to dig a little more into those methods' inner workings. Thanks for the heads up.

Still too expensive for Wii though. :)
 
I knew I was overstepping my boundaries on that one, yes; looking it up, it makes sense to go that way rather than the more expensive path I suggested. I really need to dig a little more into those methods' inner workings. Thanks for the heads up.

Still too expensive for Wii though. :)
Naturally.

It's still 1998 tech at its core. I've always been amazed at what could be achieved on it, though. Factor 5 did some truly impressive things with it: achieving honest-to-goodness normal mapping on a GPU with no real Dot3 support, using the CPU or the T&L unit (not sure which, actually) for per-pixel reads. Their light scattering approximation was impressive too.

Then Rare using the insanely high 1T-SRAM bandwidth to layer the models and environments with concentric shells to achieve fur and grass shading.

This is the era of hardware that I have an aptitude for. Emotion Engines and Graphics Synthesizers, TEVs, T&Ls and Gekkos, NV2As and hijacked stock P3s.
 
OT, but geez, I can't get over how good the fire in Pikmin looks.

How did they get it to look like that? Just texture work?
 
Got a video of that for better analysis?

The very bright spots with the surrounding glow look like an emissive material or bloom, which is done through a shader or texture. Some of the fire also looks to be blended in with the depth of field (Gaussian blur?). The floaty bits might be a texture with alpha applied to it.
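For what it's worth, here's a rough sketch in Python of how a bloom pass like that typically works (this is just a generic illustration with made-up parameter names, not how Pikmin 3 actually does it): keep only the pixels above a brightness threshold, blur them, and add them back over the frame.

import numpy as np
from scipy.ndimage import gaussian_filter

def bloom(frame, threshold=0.8, sigma=4.0, strength=0.6):
    # frame: HxWx3 float array with values in [0, 1]
    # keep only the pixels brighter than the threshold (the "emissive" bits)
    luminance = frame.mean(axis=2, keepdims=True)
    bright = np.where(luminance > threshold, frame, 0.0)
    # blur the bright pass so the glow bleeds outwards (a Gaussian, much like a cheap DoF)
    glow = gaussian_filter(bright, sigma=(sigma, sigma, 0))
    # additively blend the glow back over the original frame
    return np.clip(frame + strength * glow, 0.0, 1.0)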
 
Has anyone tried to guess the transistor count of the GPU? Don't we know the transistor count per cell of eDRAM, so we could extrapolate what 32MB would entail?
 
Got a video of that for better analysis?

The very bright spots with the surrounding glow look like an emissive material or bloom, which is done through a shader or texture. Some of the fire also looks to be blended in with the depth of field (Gaussian blur?). The floaty bits might be a texture with alpha applied to it.
I'm on my phone right now, will provide a video later.
 
The extra bandwidth is not much use for a GPU that cannot use it, and if you accept that it's based on the R700 then there's a good chance it can't effectively use a large amount of bandwidth, as it is more likely to be bottlenecked in other areas.
What????? Again with that "it can't use xxx part because it doesn't have enough power"? The minimal bandwidth requirement for a FLOP is 96 bits, or 12 bytes. 12 bytes * 176 * 1000 (kilo) * 1000 (mega) * 1000 (giga) / 1024 (memory kilo) / 1024 (memory mega) / 1024 (memory giga) / 1024 (memory tera) = 1.92 TB/s
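Just to make that arithmetic explicit, here it is as a quick Python check (it only reproduces the worst-case assumption above, i.e. 12 bytes fetched from memory per FLOP with no caching at all):

gflops = 176                        # lowest estimate being discussed for Latte
bytes_per_flop = 12                 # two 32-bit inputs + one 32-bit result = 96 bits
bytes_per_second = gflops * 1e9 * bytes_per_flop
print(bytes_per_second / 1024**4)   # ~1.92 TB/s using 1024-based units, matching the figure above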
Of course there are the internal caches, but even then, to say that having 32MB of fast memory would be useless, even going with the lowest possible raw performance for the GPU (in my opinion, and considering its die size, I see it closer to 352 GFLOPS than to 176 GFLOPS; 176 would mean either 55nm, or a 40nm process done so badly that in 2012, and at a lower clock speed, it has less density than TSMC's 2008 process), is like those claims about unusable tessellators and unusable DX11 features.

I mean, according to some of you, nearly everything about this GPU would be unusable. Not the DX11 features (that's USC-fan), not the tessellator, not the memory if it has more than 70 GB/s... what the hell is that XD
It's like some of you want this console to be below current gen or, if that's not credible at all, as close to it as possible at any cost. You're trying very hard to negate anything that could put this console above that.

Look, I'm not saying there is a whole TB/s of bandwidth there, but while Nintendo doesn't push its hardware in terms of raw specs, it's also the company that invests the most in relative bandwidth. The Wii U CPU, for example, and that's one of the few things we know for sure, has a core with a cache four times bigger than what the PS4 and the Xbox One have.

If there's one thing Nintendo prioritizes, it's the memory architecture. Look, I'm not saying it has this or that for sure, but claiming it's impossible with arguments like "the GPU wouldn't use it" doesn't make any sense.
 
What????? Again with that "it can't use xxx part because it doesn't have the power"? The minimal bandwidth requirement for a FLOP is 96 bits, or 12 bytes. 12 bytes * 176 * 1000 (kilo) * 1000 (mega) * 1000 (giga) / 1024 (memory kilo) / 1024 (memory mega) / 1024 (memory giga) / 1024 (memory tera) = 1.92 TB/s

Where is this from? It sounds like crap to me.
 
Where is this from? It sounds like crap to me.
Where is this from? A FLOP is a floating point operation. It takes two 32-bit floats and returns another 32-bit float as a result, so that's at least 96 bits of bandwidth for every FLOP (and of course, this is the minimum needed, because then there are instructions like "write the data to memory" that also take extra internal bandwidth).
It sounds like crap to you, you say? Well, it's at least a bit less crappy than your claims about "parts of the GPU being useless" just because you say so.
 
The extra bandwidth is not much use for a GPU that cannot use it, and if you accept that it's based on the R700 then there's a good chance it can't effectively use a large amount of bandwidth, as it is more likely to be bottlenecked in other areas.
That's complete and utter nonsense.


Has anyone tried to guess the transistor count of the GPU? Don't we know the transistor count per cell of eDRAM, so we could extrapolate what 32MB would entail?
It's easier to use the SRAM:
1MB = 1,048,576 bytes = 8,388,608 bits = 50,331,648 transistors
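Spelled out in Python, assuming the standard 6-transistor SRAM cell:

bytes_per_mb = 1024 * 1024          # 1,048,576 bytes
bits = bytes_per_mb * 8             # 8,388,608 bits
transistors = bits * 6              # 6T SRAM cell
print(transistors)                  # 50,331,648 transistors per MB of SRAM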
 
That's complete and utter nonsense.



It's easier to use the SRAM:
1MB = 1,048,576 bytes = 8,388,608 bits = 50,331,648 transistors

It depends on what you mean, but if people seriously think that 1TB/s of bandwidth will elevate the Wii U in new ways, go for it; it won't do much, though, because the chip is far too slow in other areas to use that much bandwidth.

Where is this from? A FLOP is a floating point operation. It takes two 32-bit floats and returns another 32-bit float as a result, so that's at least 96 bits of bandwidth for every FLOP (and of course, this is the minimum needed, because then there are instructions like "write the data to memory" that also take extra internal bandwidth).
It sounds like crap to you, you say? Well, it's at least a bit less crappy than your claims about "parts of the GPU being useless" just because you say so.

Not all FLOPs work the way you think; some, like MADDs, which are what most GPUs are built around, don't take two operands and return one.

http://en.wikipedia.org/wiki/Multiply–accumulate_operation
 
Not all FLOPs work the way you think; some, like MADDs, which are what most GPUs are built around, don't take two operands and return one.

http://en.wikipedia.org/wiki/Multiply–accumulate_operation
In this case it's even more expensive. In a MADD you have values A, B and C in the registers. You need to read A, B and C and then at least write the new value of A, and that's assuming the intermediate value of BxC, which is then added to A, doesn't have to be written back in order to do the add part of the operation.
Even without counting the BxC intermediate value, you have at least 96 bits to read and 32 bits to write, which would be 128 bits of bandwidth per FLOP. That's of course if the MADD isn't counted as 2 FLOPs.

Yeah, I remember that SRAM is 6 transistors per cell (bit); isn't eDRAM potentially 1? Anyone know?
DRAM is 1 transistor and 1 capacitor per cell as far as I know, plus the extra logic to refresh the chip or, in the case of the pseudo-static DRAM used on the Wii U, the buffers that hide its refresh process. This is why it's not as easy to calculate as if we were dealing with eSRAM.

What we know for sure is that there are at least 8 transistors per byte. 8 * 32 * 1024 * 1024 = 268,435,456, so let's say roughly 268.5 million transistors.
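Same back-of-envelope in Python (transistors only, so the capacitors and the refresh/buffer logic mentioned above are left out):

megabytes = 32
bits = megabytes * 1024 * 1024 * 8  # 268,435,456 bits
transistors = bits * 1              # 1 transistor per eDRAM bit
print(transistors)                  # 268,435,456, i.e. ~268.5 million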
 
In this case it's even more expensive. In a MADD you have values A, B and C in the registers. You need to read A, B and C and then at least write the new value of A, and that's assuming the intermediate value of BxC, which is then added to A, doesn't have to be written back in order to do the add part of the operation.
Even without counting the BxC intermediate value, you have at least 96 bits to read and 32 bits to write, which would be 128 bits of bandwidth per FLOP. That's of course if the MADD isn't counted as 2 FLOPs.

I still think your maths is off; the L1 cache doesn't even provide 1TB/s of bandwidth (or anything scaled in such a way) in GCN, and I don't think AMD is stupid enough to make that mistake.
 
I still think your maths is off; the L1 cache doesn't even provide 1TB/s of bandwidth (or anything scaled in such a way) in GCN, and I don't think AMD is stupid enough to make that mistake.
That's because there are the registers, which are the ones that take most of the impact.
That being said, those registers are of course really tiny, as are the internal caches.
A 32MB pool at 1TB/s would of course help Latte a lot, and I'm not saying by any means that this is the internal bandwidth of that pool of memory.
It may be a bit higher than the eSRAM pool on the Xbox One, because this is where Nintendo puts all of its strength, but I doubt it's much higher than that.
Of course, a 176-352 GFLOP GPU doesn't need the same bandwidth as a 1.2 TFLOP GPU (but then the comparison can't be made that directly, because there are other factors, like the fillrate, that won't be that far apart from one GPU to the other).
 
A 32MB pool at 1TB/s would of course help Latte a lot, and I'm not saying by any means that this is the internal bandwidth of that pool of memory.
It may be a bit higher than the eSRAM pool on the Xbox One, because this is where Nintendo puts all of its strength, but I doubt it's much higher than that.
I think the issue in this thread has been posters like krizzx saying (or in his case, implying) just that, based on often unrelated pieces of information, and making beyond-best-case inferences from them. Just in the last few pages he linked Cerny's technical speech, where the TB/s speed was cited, as proof that the Wii U's eDRAM performs far beyond the estimates discussed here, which is what got this whole discussion started.

I'd also like to clarify that not everyone who disagrees about the power of the WiiU thinks it's some underpowered lemon. I think it's well engineered for its apparent goals of power at high efficiency; I only disagree with the extent of how powerful it is.
 
I'd also like to clarify that not everyone who disagrees about the power of the WiiU thinks it's some underpowered lemon. I think it's well engineered for its apparent goals of power at high efficiency; I only disagree with the extent of how powerful it is.
I know that, and that's why I'm speaking only about the few who are doing the same thing krizzx did, just in the opposite direction (mostly USC-fan and KidBeta).
 
Van Owen does that too, although I haven't seen him post on here for a while. It's almost as if Nintendo has killed their families and raped their dogs lol
 
What's most impressive about Pikmin 3 is the diffuse mapping on the fruit, imo. Makes the fruit look VERY realistic. Last gen, the PS3, 360 and Wii had pretty much two settings for diffuse mapping, on and off, which basically made things too shiny. The fruit in Pikmin 3 has varying degrees of shininess, even on the same models if they have stalks.
 
What's most impressive about Pikmin 3 is the diffuse mapping on the fruit, imo. Makes the fruit look VERY realistic. Last gen, the PS3, 360 and Wii had pretty much two settings for diffuse mapping, on and off, which basically made things too shiny. The fruit in Pikmin 3 has varying degrees of shininess, even on the same models if they have stalks.
Isn't a diffuse map the actual texture? I believe you're thinking of specular mapping.

I'd need to see some evidence of what you're saying regarding last-gen systems, though, as I know Skyrim uses gradient specular mapping (meaning the map doesn't simply use black/white but shades of grey to determine shine strength; unsure of the correct term off the top of my head).
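For reference, a greyscale specular map is basically just a per-texel scale applied to the specular term, something along these lines (a minimal Blinn-Phong-style sketch in Python with hypothetical names, not any particular game's shader):

def specular_term(n_dot_h, shininess, spec_map_value):
    # spec_map_value is the greyscale sample from the specular map (0..1);
    # instead of specular simply being on or off, it scales the highlight strength
    return spec_map_value * max(n_dot_h, 0.0) ** shininess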
 
I know that, and that's why I'm speaking only about the few who are doing the same thing krizzx did, just in the opposite direction (mostly USC-fan and KidBeta).

Haha, no. krizzx single-handedly destroyed this thread; now that he is gone, maybe a real conversation can happen again.
 
Agreed, and like I edited in, I just seem to remember 70GB/s being the main theory based on the bus width or something.

I don't recall exactly the reasoning others came up with, but 2 render back ends (8 ROPs) @ 550MHz, writing + reading 32bpp (colour + depth), comes out to needing 70.4GB/s.
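For anyone who wants to check that figure, the arithmetic goes like this (assuming 8 ROPs at 550MHz, each reading and writing both a 32-bit colour and a 32-bit depth value per pixel):

rops = 8
clock_hz = 550e6
bytes_per_pixel = 4 + 4             # 32-bit colour + 32-bit depth
accesses = 2                        # one read + one write per value
bandwidth = rops * clock_hz * bytes_per_pixel * accesses
print(bandwidth / 1e9)              # 70.4 GB/s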

The I/O for the 32MB eDRAM should correspond to those vertical yellow strips alongside the memory arrays (horizontal strips on the smaller array above).
 
In this case it's even more expensive. In a MADD you have values A, B and C in the registers. You need to read A, B and C and then at least write the new value of A, and that's assuming the intermediate value of BxC, which is then added to A, doesn't have to be written back in order to do the add part of the operation.
Even without counting the BxC intermediate value, you have at least 96 bits to read and 32 bits to write, which would be 128 bits of bandwidth per FLOP. That's of course if the MADD isn't counted as 2 FLOPs.
Things like multiply-accumulates are counted as multiple FLOPs, though. A 3-part MAC (a*b+c*d+e*f) can be considered as much as 5 FLOPs, as it consists of 3 floating-point multiplies and 2 floating-point adds (if you imagine constructing it out of 2-input ops). Or to put it another way, if you did a 3-element MAC on a very simple floating-point unit, you'd have to use up five operations to do it.
Stuff like MACs aren't exactly obscure instructions, either. For instance, they're basically the thing that typical DSPs do, since things like FIR and IIR filters are entirely constructed from multiply-accumulates.

The bigger issue is that I'm not entirely sure in what situation you think it would make sense to have bandwidth equal to your FP data use rate. There's a ton of serialized data usage and input re-use in many algorithms, and even within some computational hardware itself (as can happen with those MACs we were just mentioning). Even if you were able to construct some sort of insane design where every ALU had register-esque access to a large memory pool, you'd hardly ever come close to drawing an amount of "unique"-ish (i.e. isn't already cycled perfectly well by registers or caches in current designs) data that's comparable to the design peak bandwidth for the main memory. Sure, if you got rid of the aforementioned caches and registers and forced your processors to always draw data to and from said main memory pool, you'd see extremely high main memory bandwidth use. But a pretty good chunk of that bandwidth usage is already sufficiently covered by small internal memory pools* in actual designs; all you've done is shoved it to some other place, and in a way which is terribly impractical in the real world.
What I'm saying is, having a main memory bandwidth of 1/100th your FP data use rate does not mean that your FP hardware is only operating at 1% utilization. You could easily come up with algorithms/processors where a processor is seeing around 100% FP utilization in that situation.

*Or, in situations where the FP operations are themselves serialized in hardware, you wouldn't even necessarily need memory units to hold onto intermediate values; they could just get passed from the output transistors of one chunk of FP logic to the input transistors of the next. A 3-stage MAC could very well only involve six reads and one write for five floating-point ops.
If you wanted to, you could take this to an extreme extent. For instance, you could design a block of hardware to carry out an operation like (a+b)*(b+c)*(c+a) which, depending on how it's counted, uses more FLOPs than memory accesses.
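A toy illustration of that last point in Python: counting (a+b)*(b+c)*(c+a) naively gives five floating-point ops against only three reads and one write.

def fused_example(a, b, c):
    # three reads (a, b, c) and one write (the return value),
    # but five floating-point operations in between:
    s1 = a + b          # add #1
    s2 = b + c          # add #2
    s3 = c + a          # add #3
    return s1 * s2 * s3 # multiplies #4 and #5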

What's most impressive about Pikmin 3 is the diffuse mapping on the fruit, imo. Makes the fruit look VERY realistic. Last gen, the PS3, 360 and Wii had pretty much two settings for diffuse mapping, on and off, which basically made things too shiny.
I would say that the shaders in PS360 are a lot more versatile than you think they are, except that I'm not even sure what you're talking about when you say "diffuse mapping." Where "diffuse map" does get used, it's usually used to refer to an object's main colour texture, which is clearly not what you're thinking of.

Suffice it to say, though, even the sixth-gen consoles gave extremely healthy levels of control over how texture-based effects were blended.
 
Oops, my bad, I should have said specular then!

Haven't played Skyrim, but there are plenty of PS3 and 360 games which have atrocious specular mapping (where things appear too shiny)... although I guess that's down to bad art rather than a problem with their respective GPUs.

Regarding the eDRAM bandwidth and the lower estimate, would that be unrealistic now that Shin'en have confirmed that both the GPU and CPU can access that 32MB..? It would have to be higher than 70GB/s to prevent bottlenecks, wouldn't it..?
 
Oops, my bad, should have been specular then!

Haven't played Skyrim, but there are plenty of PS3 and 360 games which have atrocious specular mapping (where things appear too shiny)... although I guess that's down to bad art rather than a problem with their respective GPUs.

Regarding the eDRAM bandwidth and the lower estimate, would that be unrealistic now that Shin'en have confirmed that both the GPU and CPU can access that 32MB..? It would have to be higher than 70GB/s to prevent bottlenecks, wouldn't it..?

Hmmmm, while I'm not technical, that does make sense, so perhaps it is 130 after all.
 
Haven't played Skyrim, but there are plenty of PS3 and 360 games which have atrocious specular mapping (where things appear too shiny)... although I guess that's down to bad art rather than a problem with their respective GPUs.
Well, we don't necessarily have to throw out all technical discussion. There are a lot of different approaches to specular reflections out there, some more costly than others. It's not just graphical asset design and the twisting of a single intensity knob.

Some reflection models benefit more from the combination of HDR and bloom to represent intensities higher than white level, some allow lookups for different reflection shapes based on different light and material types, sometimes specular maps are used to achieve better material diversity over a surface than what you'd get without them, some approaches to sampling normal maps might give better-looking results than others...
 
I don't recall exactly the reasoning others came up with, but 2 render back ends (8 ROPs) @ 550MHz, writing + reading 32bpp (colour + depth), comes out to needing 70.4GB/s.
That was pretty much the reasoning.
 
Things like multiply-accumulates are counted as multiple FLOPs, though. A 3-part MAC (a*b+c*d+e*f) can be considered as much as 5 FLOPs, as it consists of 3 floating-point multiplies and 2 floating-point adds (if you imagine constructing it out of 2-input ops). Or to put it another way, if you did a 3-element MAC on a very simple floating-point unit, you'd have to use up five operations to do it.
Stuff like MACs aren't exactly obscure instructions, either. For instance, they're basically the thing that typical DSPs do, since things like FIR and IIR filters are entirely constructed from multiply-accumulates.

The bigger issue is that I'm not entirely sure in what situation you think it would make sense to have bandwidth equal to your FP data use rate. There's a ton of serialized data usage and input re-use in many algorithms, and even within some computational hardware itself (as can happen with those MACs we were just mentioning). Even if you were able to construct some sort of insane design where every ALU had register-esque access to a large memory pool, you'd hardly ever come close to drawing an amount of "unique"-ish (i.e. isn't already cycled perfectly well by registers or caches in current designs) data that's comparable to the design peak bandwidth for the main memory. Sure, if you got rid of the aforementioned caches and registers and forced your processors to always draw data to and from said main memory pool, you'd see extremely high main memory bandwidth use. But a pretty good chunk of that bandwidth usage is already sufficiently covered by small internal memory pools* in actual designs; all you've done is shoved it to some other place, and in a way which is terribly impractical in the real world.
What I'm saying is, having a main memory bandwidth of 1/100th your FP data use rate does not mean that your FP hardware is only operating at 1% utilization. You could easily come up with algorithms/processors where a processor is seeing around 100% FP utilization in that situation.

*Or, in situations where the FP operations are themselves serialized in hardware, you wouldn't even necessarily need memory units to hold onto intermediate values; they could just get passed from the output transistors of one chunk of FP logic to the input transistors of the next. A 3-stage MAC could very well only involve six reads and one write for five floating-point ops.
If you wanted to, you could take this to an extreme extent. For instance, you could design a block of hardware to carry out an operation like (a+b)*(b+c)*(c+a) which, depending on how it's counted, uses more FLOPs than memory accesses.
And that's why I said that most of that bandwidth requirement would be absorbed by the registers of each SPU and by the caches it also has.
Thanks for confirming that those operations count as multiple FLOPs; then the required bandwidth goes down for sure, and if there's dedicated logic for it then it may be even less demanding than that (which is why I didn't count the intermediate operation result as 64 extra bandwidth bits for the write/read).

But still, it's just a few MB that have to sustain all that (L1 + L2 in terms of memory space, because the memory in the registers will be duplicated), and every time those memories get filled you have to go to the main memory.
Having an intermediate pool of memory at a great speed is of course a help. My point wasn't that without that pool of memory the GPU wouldn't be usable, but that there's enough room for any console GPU to take advantage of it.
Furthermore, FLOPs aren't the only thing consuming bandwidth; there are tons of other operations, like texturing or tessellation, that consume bandwidth too.

I insist that my point is that the claim "if the eDRAM had more than 70GB/s of total bandwidth the GPU wouldn't be capable of using it" is false. There is a lot of room between the internal caches and that 70GB/s figure for the eDRAM to have a direct impact on bandwidth.
If the claim had been "between a 2TB/s eDRAM and a 4TB/s eDRAM the difference on the Wii U GPU would be negligible due to how low powered it is", then I would've said "well, that's a good point". But limiting it to a 70GB/s figure...
 