Wii U CPU |Espresso| Die Photo - Courtesy of Chipworks

Gah! What a tease, indeed. I'd love to see more details.

Come on Marcan, proxy yourself into a different country and post it anonymously somewhere :P

I'm still firmly in between "horrible, weak" and being convinced it's not; the "nice graphics" we've seen it throw up seem mostly GPU-side (i.e. Mario 3D World with lots of lighting but nothing seemingly CPU intensive). I do think however it may have been better for them to just use Jaguar with the rest of them and drop BC.
 
As far as I knew, one of the improvements between Gekko and Broadway was that Broadway's L2 cache was 8-way set associative.

If it was 2-way set associative like Gekko's, then Espresso would be able to emulate it perfectly, and my statement, although still true in the sense that 4-way set associativity is not nearly enough nowadays if you want good cache utilization, would be completely wrong.

Eh, if you're talking a cache size of 2MB or less, the difference in miss rate between 4-way and 8-way shouldn't be enormous. The latest i5's and i7's are "only" 8-way as well, iirc.
 
I do think however it may have been better for them to just use Jaguar with the rest of them and drop BC.


Jaguar wasn't ready at the time. While other x86 solutions still might have been better if BC wasn't a concern I think that Espresso is decent enough to somewhat compete with PS360. And that has probably been the main goal. A stronger CPU without a stronger GPU wouldn't be all that useful anyway.
 
Jaguar wasn't ready at the time. While other x86 solutions still might have been better if BC wasn't a concern I think that Espresso is decent enough to somewhat compete with PS360. And that has probably been the main goal. A stronger CPU without a stronger GPU wouldn't be all that useful anyway.

Doh, of course you're right. Perhaps Bobcat then; it wouldn't give up that much of the performance Jaguar offers, and it would make the 8th gen consoles that much easier to port between. And maybe use 32nm rather than the old 45nm fab, and drop the eDRAM for fast system memory instead, so all that GPU die area spent on eDRAM could go to actual GPU hardware, like the XBO vs PS4 GPU die area split... But we can dream about what its specs should have been all day. I still want to know more about what they actually are. Hope that Espresso manual leaks soon.
 
Does anybody have an idea how long the deal between IBM and Nintendo from back in 1999 lasts? Might this be the last CPU we see from them? I don't think the deal would be more than 15 years, but maybe they made a deal for 4-5 console generations or something like that.
 
Does anybody have an idea how long the deal between IBM and Nintendo from back in 1999 lasts? Might this be the last CPU we see from them? I don't think the deal would be more than 15 years, but maybe they made a deal for 4-5 console generations or something like that.
Regardless of their contract, I wouldn't be surprised if the next Nintendo console used an eight core Espresso.


Most of us have known that Wii U has a "horrible, slow CPU" for a while.
Coders worth their salt disagree, so maybe we should just ignore that asinine statement by some random dude who might or might not have looked in the manual for three minutes in the first place:

"We didn't think this would be possible - the little non-SIMD CPU that could!" "...great CPU core!"
http://www.radgametools.com/bnkhist.htm
 
We *know* the core is a bog standard 750 (with a few patches for the new L2 subsys/multicore). There is no point arguing further.

We do know that Espresso is a full relayout with a modern process, so it won't look anything like the old 750.

I really need to stop getting into Twitter arguments about console hardware. These people won't take "I've seen the code" as an argument. :/

https://twitter.com/marcan42

We kind of knew this all along, but I guess most of us still expected some differences apart from accommodations for the cache/multicore; it's a "bog standard" 750 according to marcan.
 
What's impressive about this? That the CPU can play back video?

Keep fighting the good fight. -_-

https://twitter.com/marcan42

We kind of knew this all along, but I guess most of us still expected some differences apart from accommodations for the cache/multicore; it's a "bog standard" 750 according to marcan.

Don't the first two quotes contradict to a degree? If your understanding of them warrants it, could you elaborate a bit (even though they're obviously not your quotes originally)?
 
Don't the first two quotes contradict to a degree? If your understanding of them warrants it, could you elaborate a bit (even though they're obviously not your quotes originally)?

Relayout doesn't by itself mean anything in the CPU core is new; it's just optimized for a new fabrication process, and perhaps for something like smaller size, higher clock speeds than the original, etc.

It's happened with GPUs, I forget which generation, but simply re-laying out the transistors once allowed AMD to hit higher clocks on the same process. It doesn't add anything new to the architecture, nor does it add instructions per clock; it's the same architecture after all, just tuned for whatever aspect matters. It could be as simple as the changes needed to port it over to the (new for the 750) fab.

To put it into better words: think of the architecture as the "macro" overview of the whole processor and the layout as the "micro" within all the macro structures. Changing the micro layout doesn't fundamentally change what the processor is designed to do; it may just help it stay stable at higher clocks or stuff like that.

I think Marcan was just trying to address the "it looks different than ancient 750s therefore it's new" crowd.
 
Interesting. It says specifically that it has enhancements to improve floating point performance and data transfer capability.

That alone means it is much more than 3 Broadways sandwiched together.

Also an increase to 4-way cache associativity over the 2-way in Broadway. This CPU is looking a lot nicer now.
 
As far as I knew, one of the improvements between Gekko and Broadway was that Broadway's L2 cache was 8-way set associative.

If it was 2-way set associative like Gekko's, then Espresso would be able to emulate it perfectly, and my statement, although still true in the sense that 4-way set associativity is not nearly enough nowadays if you want good cache utilization, would be completely wrong.

Broadway definitely used the same 2 way associative L2, 8 way would have been massive overkill considering the size of the cache and the CPU itself.

Associativity is always a trade off: you want to minimise cache misses, but you also don't want to bog your CPU down unnecessarily. For instance, going to 4-way may reduce cache misses significantly over 2-way, but depending on the cache and CPU core, moving to 8-way may not reduce misses much more, while it will have the CPU spending more time searching lots of cache entries. So there's a sweet spot depending on the cache and CPU core characteristics.
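To put rough numbers on that trade-off, here's a minimal sketch in C. The 512KB size and 32-byte lines are assumptions for illustration, not confirmed Espresso cache geometry; the point is just that, for a fixed capacity, more ways means fewer sets and more tags compared per lookup:

#include <stdio.h>

/* Illustrative only: how associativity changes the set count of a
   fixed-size cache. 512KB and 32-byte lines are assumptions, not
   confirmed Espresso parameters. */
int main(void) {
    const unsigned cache_bytes = 512 * 1024;
    const unsigned line_bytes  = 32;
    const unsigned lines       = cache_bytes / line_bytes;  /* 16384 lines */

    for (unsigned ways = 2; ways <= 8; ways *= 2) {
        unsigned sets = lines / ways;
        printf("%u-way: %5u sets, %u tags compared per lookup\n",
               ways, sets, ways);
    }
    /* Capacity is identical in every case; higher associativity just lets
       an address live in more places (fewer conflict misses) at the cost
       of comparing more tags per access. */
    return 0;
}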
 
Interesting. It says specifically that it has enhancement to improve floating point performance and data transfer capability.

That alone means it is much more than 3 Broadways sandwiched together.

So did the Wii and Gamecube versions of the 750. As Marcan has stated, the unique FPU is exactly the same as Broadway's.

Where are you getting this from; what is this "it" that is showing you new information? The fact that Nintendo and IBM made a new FPU for the Gamecube was already widely known; it was carried over to the Wii, and now to the Wii U.

And in fact, the PowerPC 750CL later used the exact same core as the gamecube.
 
Van Owen said:
I'm being serious, I really don't know what Bink playback is supposed to tell us.
..And yet it doesn't stop you from commenting on it. How Van Owen of you.

That a console released in late 2012 can play 1080p video at 30hz? I have a $99 roku that can do that.
See? Clueless to the bone.

The Bobcat I'm typing this on cannot play back 1080p video at 30hz (its GPU OTOH doesn't break a sweat at that). Apparently you must have a very unique Roku.

/off-topic

Apologies for my earlier comment - I had entirely forgotten Broadway's L2 was 2-way. Latte is a notable step up for any seasoned Broadway coder.
 
Actually I've been thinking about what would happen if nintendo decides to stay with Power chips for the next machine (if they make one). Could the recent Power 8 form the basis of a new chip while still providing BC?
 
Actually I've been thinking about what would happen if nintendo decides to stay with Power chips for the next machine (if they make one). Could the recent Power 8 form the basis of a new chip while still providing BC?

Just like everyone should have realized from the "based on Power7" claims at the start of Espresso's life, it would have to be so fundamentally different from Power8 as to barely be related to it at all. Power8 is a very big core, completely optimized for high clocks and high performance at the expense of power draw, both things modern Nintendo does not do.

Their next machine is too far out to speculate much on, but I could see them finally dropping Power next time. Heck, Espresso would probably be so small on 14nm they could add it in as a compatibility chip.

Or an alternate theory based on how small Espresso and even Latte would be on 14nm would be that the Wii U becomes a mobile product, able to connect to your TV like the Nvidia shield, while the stationary console is updated to a Wii Balls Out U which uses a similar CPU/GPU just as the name implies, balls out.
 
Actually I've been thinking about what would happen if nintendo decides to stay with Power chips for the next machine (if they make one). Could the recent Power 8 form the basis of a new chip while still providing BC?
BC aside, to this day ppc have certain advantages in the embedded market where performance/watt matters much more and software dev cycles are generally longer than in the desktop world (so developers can learn to make the most out of the hw). That said, AArch64 has been positioning itself as the architecture for the embedded world of the next decade. Not necessarily because it's so much better than ppc32/64, but because IBM (the biggest player behind ppc) totally dropped the ball on the ppc side so hard. Another potent embedded design was bound to fill that void and AArch64 does fit the bill (while Intel only wish they did that with Atom). Of course, Nintendo have the monetary power to order their own CPUs (just like Gekko's entire lineage was custom made for Nintendo by IBM) till thy kingdom come. Whether they'll want that is another matter. A custom(-ized) CPU design is one thing, the needed sw infrastructure (compilers, debuggers, profilers) - another.
 
I'm being serious, I really don't know what Bink playback is supposed to tell us. That a console released in late 2012 can play 1080p video at 30hz? I have a $99 roku that can do that.
Yet it almost certainly couldn't decode Bink 2 video at 1080p. Because Bink runs on the CPU, not on dedicated decoder logic (which is what the Roku uses, and something the Wii U has as well - in both cases, the dedicated logic is limited to h.264 and VC-1). On top of that, Bink 2 is almost 100% SIMD, which makes the whole thing even more impressive considering neither Broadway nor Espresso have "real" SIMD units in the first place.
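To make the SIMD point concrete, here's a toy kernel of the kind codecs hammer constantly (a saturating add over a row of bytes). This is not Bink's actual code, just a sketch of the work-per-iteration gap: a CPU with 128-bit integer SIMD can do 16 of these adds per instruction, while Espresso's paired-single FPU only covers 2-wide float math, so integer-heavy loops like this stay scalar:

#include <stdint.h>

/* Toy example, not Bink's real inner loop: one pixel per iteration.
   With SSE2/NEON-class SIMD this whole body collapses to a single
   16-lane saturating-add instruction per 16 pixels. */
static void add_saturate_row(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        int sum = a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}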

At the end of the day, what's important is that the "slow and horrible" statement comes from an anonymous source inside 4A Games, a studio with zero shipped titles on Nintendo hardware, and they admitted that they never even started the Metro port despite its early announcement. RAD, on the other hand, has developed low level stuff on all platforms for many, many years. They are much more credible, and they call it a "great CPU core". Taken at face value, what RAD says carries far more weight - but ultimately both statements are worthless because they're both missing context and are not comparative.
 
BC aside, to this day ppc have certain advantages in the embedded market where performance/watt matters much more and software dev cycles are generally longer than in the desktop world (so developers can learn to make the most out of the hw). That said, AArch64 has been positioning itself as the architecture for the embedded world of the next decade. Not necessarily because it's so much better than ppc32/64, but because IBM (the biggest player behind ppc) totally dropped the ball on the ppc side so hard. Another potent embedded design was bound to fill that void and AArch64 does fit the bill (while Intel only wish they did that with Atom). Of course, Nintendo have the monetary power to order their own CPUs (just like Gekko's entire lineage was custom made for Nintendo by IBM) till thy kingdom come. Whether they'll want that is another matter. A custom(-ized) CPU design is one thing, the needed sw infrastructure (compilers, debuggers, profilers) - another.

ARM tech could very well be in the next chip, as Nintendo do like using their chips and have done for a long time. However, there's also the possibility of Nintendo using some form of the "Express" Power 7 chips (or maybe a future Express Power 8), as they were designed for low power consumption.
One interesting avenue could be from Intel. They must be slightly worried by the aggressive stance AMD has taken with regards to consoles. Intel aren't a company that likes others having a piece of the pie. Maybe they will approach nintendo with a good deal.
 
Is this Bink 2 thing available for testing? I gave up after looking for 17 seconds.

What does run it well? ie anything after a midrange Penryn Core 2, or so on?

Actually, is the Wii U running it at 1080p even *that* impressive, seeing this? Jaguar cores aren't exactly high perf either, and it's only using 4 of their 8 cores, so only an advantage of one core over the Wii U to pull off flawless 4K.
Since it's so fast, Bink 2 can play 4K video (3840x2160) on PCs, Sony PS4 and Xbox One - using 4 cores, it can decode a 4K frame in 5 ms on PC, and 10 ms on PS4 or Xbox! Crazy fast!



One interesting avenue could be from Intel. They must be slightly worried by the aggressive stance AMD has taken with regards to consoles. Intel aren't a company that likes others having a piece of the pie. Maybe they will approach nintendo with a good deal.


I think one needs to remember why Intel was booted from consoles in the first place, as was Nvidia. Intel doesn't like touching low margin spaces, and consoles are rather low margin for chip makers. Intel also likes dominating and having complete control of their hardware, which, as with Nvidia, is why their console partner (MS) dumped them: MS could not shrink and modify the chips as needed and was instead bound to Intel's/Nvidia's schedule.
 
Don't think anyone's saying it's impressive. It's just an example of a developer being surprised by how good the CPU's performance is compared to how they thought it would be when they looked at its specs. Since the whole "horrible CPU" thing came from a developer that never actually developed on the system, just looked at the specs, it's quite a good example too.

Would be interesting to know how many ms it takes Espresso to decode a 1080p frame. Also, I thought Xbox One CPU cores were faster than PS4 CPU cores?

EDIT: Looking at their site I found this interesting:

Version 2.2e/1.992e (09-09-2013)
Added Wii-U support for Bink 2 - play 30 Hz 1080p or 60 Hz 720p video! We didn't think this would be possible - the little non-SIMD CPU that could!

2.3a/1.993a (11-15-2013)

Big new version: new file format, new features, new platforms!
Bink now allows you to use 3-way, 4-way and 8-way multicore slicing at compression time. This means you can create videos that scale across more cores. So now, with 8-way slicing, you can get 4K video decompression down to 4ms on a PC! Also, with 3-way slicing, Xbox 360 and Wii-U get much faster too.
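For anyone wondering what that "slicing" buys: each frame is encoded as independently decodable bands, so the decoder can hand one band to each core. A rough sketch of the idea in C (slice_t and decode_slice are made-up stand-ins, not RAD's API):

#include <pthread.h>

/* Hypothetical types/functions for illustration only. The point is that
   independently encoded slices have no cross-slice dependencies, so they
   map 1:1 onto worker threads (e.g. 3 slices on Wii U, 4 on PS4/XB1). */
typedef struct { const unsigned char *bits; unsigned char *pixels; } slice_t;

static void decode_slice(slice_t *s) { (void)s; /* stand-in for the decoder */ }

static void *worker(void *arg) { decode_slice((slice_t *)arg); return NULL; }

void decode_frame(slice_t *slices, int nslices)
{
    pthread_t tid[8];
    if (nslices > 8) nslices = 8;
    for (int i = 0; i < nslices; i++)
        pthread_create(&tid[i], NULL, worker, &slices[i]);
    for (int i = 0; i < nslices; i++)
        pthread_join(tid[i], NULL);
}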
 
Would be interesting to know how many ms it takes Espresso to decode a 1080p frame, also I thought Xbox One CPU cores were faster than PS4 CPU cores?

PS4 core clock is still not officially disclosed. But if it is 1.6GHz, the XBO clock speed bump still puts it only single digit percentage points ahead, not enough to significantly alter playback if we're talking about 10ms of processing per frame for each.
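For reference: if the PS4 really is at 1.6GHz against the Xbox One's announced 1.75GHz, that's 1.75 / 1.6 ≈ 1.09, so roughly a 9% per-core clock advantage - about 1ms on a ~10ms frame, which is why it shouldn't visibly change playback.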

So what does it actually take to run 1080p video on this codec? What processor family starts running it well? If Jaguar can play even 4k content on it, I'm guessing the bar isn't all that high. First gen Core ULV, perhaps?
 
Disappointing that the first bit of interesting news about the CPU fizzled out after so little. The GPU gets discussed to death still.

No more reaction to the "bog standard" comment? I always took "based on the PPC750" to mean something like how the Core 2 was based on the Pentium 3, not that it would be nearly the exact same core (just relaid out for the new fab, with some additions for more cache/multicore) at higher clocks. The FPU seems to be the same one they developed for the Gamecube version, too (krizzx still hasn't responded to that, where was this "new" thing coming from?).

Which version of reality had all of you anticipated?

Gah, I wish that manual would leak.
 
No idea what Bink requires, there isn't much info released on it that I can find. Can't even see for sure what resolution 360 or PS3 support.

Disappointing that the first bit of interesting news about the CPU fizzled out after so little. The GPU gets discussed to death still.

No more reaction to the "bog standard" comment? I always took "based on the PPC750" to mean something like how the Core 2 was based on the Pentium 3, not that it would be nearly the exact same core (just relaid out for the new fab, with some additions for more cache/multicore) at higher clocks. The FPU seems to be the same one they developed for the Gamecube version, too (krizzx still hasn't responded to that, where was this "new" thing coming from?).

Which version of reality had all of you anticipated?

Gah, I wish that manual would leak.

I think the wording is a bit misleading since Gekko isn't even a bog standard PPC750, never mind Broadway. But I think what he's getting at is that Espresso is three Broadway cores, each clocked 71% faster, with new logic for multi-core support and an improved cache. No doubt he's mostly bang on if he's seen the full developer manual (there might be some small changes he doesn't know about, but that's probably it). Performance-wise we're probably looking at something around 6x as powerful as Broadway.
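For anyone who wants the arithmetic behind that estimate: Broadway runs at 729MHz and Espresso at roughly 1.24GHz, so 1243 / 729 ≈ 1.7 per core, and 3 × 1.7 ≈ 5.1x on core count and clock alone; the larger, more associative L2 plausibly pushes effective throughput toward the ~6x figure, though that last step is a judgement call rather than a measurement.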

Incidentally, saw some talk about whether or not Nintendo will stay with this kind of CPU for future hardware. I'm not sure; I think perhaps keeping this CPU core had more behind it than just backwards compatibility. Nintendo were already making a massive step for them by moving away from fixed function GPUs to a programmable unified shader architecture, plus they were also moving to a multi core CPU. That was obviously going to take some getting used to, so perhaps they decided they needed to keep the CPU core extremely familiar so as not to increase their learning curve even further. By the time their next hardware arrives programmable shaders will be very familiar to them, so the learning curve of a CPU core change on its own shouldn't be so scary anymore. Or maybe the next console will just use 9 Espresso cores at 2.2GHz with a few extra features added etc., who knows...
 
Disappointing that the first bit of interesting news about the CPU fizzled out after so little. The GPU gets discussed to death still.

No more reaction to the "bog standard" comment? I always took "based on the PPC750" to mean something like how the Core 2 was based on the Pentium 3, not that it would be nearly the exact same core (just relaid out for the new fab, with some additions for more cache/multicore) at higher clocks. The FPU seems to be the same one they developed for the Gamecube version, too (krizzx still hasn't responded to that, where was this "new" thing coming from?).

Which version of reality had all of you anticipated?

Gah, I wish that manual would leak.
I'm not sure what you're looking for? It's "bog standard" in that it's essentially still a Gekko/Broadway/CXe/CL. There have been changes besides SMP: the bigger cache is now four-way associative and uses the MERSI cache coherency protocol (as on the MPC74xx and PPC970) instead of the old MEI protocol (PPC750); the chip has also been re-synthesized from scratch using a contemporary library, which should improve its efficiency a little; and it has a few new registers, most (all?) of which seem related to SMP, the known changes to the cache subsystem, and backwards compatibility - and that's pretty much it. There have been no fundamental changes, there are no additional execution units like VMX or VSX, it's still 32-bit and so on...
 
The PowerPC 750 (a.k.a., the G3)

The PowerPC 750, known to Apple users as the G3, is a design based heavily on the 603/603e. Its four-stage pipeline is the same as that of the 603/603e, and many of the features of its front-end and back-end will be familiar from the previous article's discussion of the older processor. Nonetheless, the 750 sports a few very powerful improvements over the 603e that made it faster than even the 604e.

PowerPC 750 summary table

Introduction date: November 10, 1997
Process: 0.25 micron
Transistor Count: 6.35 million
Die size: 167mm2
Clock speed at introduction: 233-266MHz
Cache sizes: 64KB unified L1, 512KB L2
First appeared in: Power Macintosh G3/233

The 750's significant improvement in performance over the 603/603e is the result of a number of factors, not the least of which are the improvements that IBM made to the 750's integer and floating-point capabilities.

A quick glance at the 750's layout will reveal that its execution core is wider than that of the 603. More specifically, where the 603 has a single integer unit the 750 has two, a simple integer unit (SIU) and complex integer unit (CIU). The 750's complex integer unit handles all integer instructions, while the simple integer unit handles all integer instructions except multiply and divide. Most of the integer instructions that execute in the SIU are single-cycle instructions.

Like the 603 (and the 604), the 750's floating-point unit can execute all single-precision floating-point operations, including multiply, with a latency of three cycles. Unlike the 603, though, the 750 doesn't have to insert a pipeline bubble after every third instruction in its pipeline. Double-precision floating-point operations, with the exception of operations involving multiplication, also take three cycles on the 750. Double-precision multiply and multiply-add operations take four cycles, because the 750 doesn't have a full double-precision FPU.

The 750's load-store unit and system register unit perform the functions described above for the 603, so they don't merit further comment.

The 750's front end and instruction window

The 750 fetches up to four instructions per cycle into its six-entry instruction queue (c.f. the 603's six-entry IQ), and it dispatches up to two non-branch instructions per cycle from the IQ's two bottom entries. The dispatch logic follows the four dispatch rules described above when deciding when an instruction is eligible to dispatch, and each dispatched instruction is assigned an entry in the 750's six-entry reorder buffer (compare the 603's five-entry ROB).


Figure POWERPC.4: The PowerPC 750

As on the 603 and 604, newly-dispatched instructions enter the reservation station of the execution unit to which they have been dispatched, where they wait for their operands to become available so that they can issue. The 750's reservation station configuration is similar to that of the 603, in that with the exception of the two-entry reservation station attached to the 750's LSU, all of the execution units have a single-entry reservation station. And like the 603, the 750's branch unit has no reservation station.

Because the 750's instruction window is so small, it has half the rename registers of the 604. Nonetheless, the 750's six general-purpose and six floating-point rename registers still put it ahead of the 603's number of rename registers (five GPR and four FPR). Like the 603, the 750 has one rename register each for the CR, LR, and CTR.

You would think that the 750's smaller reservation stations and shorter ROB would put it at a disadvantage with respect to the 604, which has a larger instruction window. But the 750's pipeline is shorter than that of the 604, so it needs fewer buffers to track fewer in-flight instructions. Even more importantly, though, the 750 has one very clever trick up its sleeve that it uses to keep its pipeline full.

Branch prediction on the 750

In the previous article's discussion of branch prediction, we talked about how dynamic branch prediction schemes use a branch history table (BHT) in combination with a branch target buffer (BTB) to speculate on the outcome of branch instructions and to redirect the processor's front end to a different point in the code stream based on this speculation. The BHT stores information on the past behavior (i.e., taken or not taken) of the most recently executed branch instructions, so that the processor can determine whether or not it should take these branches if it encounters them again. The target addresses of recently taken branches are stored in the BTB, so that when the branch prediction hardware decides to speculatively take a branch it will have immediate access to that branch's target address without having to recalculate it. The target address of the speculatively taken branch is loaded from the BTB into the instruction register, so that on the next fetch cycle the processor can begin fetching and speculatively executing instructions from the target address.

The 750 improves on this scheme in a very clever way. Instead of storing only the target addresses of recently taken branches in a BTB, the 750's 64-entry branch target instruction cache (BTIC) stores the instruction that's located at the branch's target address. When the 750's branch prediction unit examines the 512-entry BHT and decides to speculatively take a branch, it doesn't have to go out to code storage to fetch the first instruction from that branch's target address. Instead, the BPU loads the branch's target instruction directly from the BTIC into the instruction queue, which means that the processor doesn't have to wait around for the fetch logic to go out and fetch the target instruction from code storage. This scheme saves valuable cycles, and it helps keep performance-killing bubbles out of the 750's pipeline.

PowerPC 750 conclusions

In spite of its short pipeline and small instruction window, the 750 packed quite a punch. It managed to outperform the 604, and it was so successful that a 604-derivative was scrapped in favor of just building on the 750. The 750 and its immediate successors, all of which went under the name of "G3," eventually found widespread use both in the embedded arena and across Apple's entire product line, from its portables to its workstations.

The G3 lacked one important feature that separated it from the x86 competition, though: vector computing capabilities. While comparable PC processors supported SIMD in the form of Intel's and AMD's vector extensions to the x86 instruction set, the G3 was stuck in the world of scalar computing. So when Motorola decided to develop the G3 into an even more capable embedded and media workstation chip, this lack was the first thing they addressed.

http://arstechnica.com/features/2004/10/ppc-2/
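To make the BTB vs. BTIC distinction above concrete, here's a minimal sketch in C. It's an illustrative model, not IBM's actual hardware: only the 64-entry size comes from the article, and the entry layout and indexing are assumptions:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a branch target instruction cache, not the 750's
   real structures. A plain BTB would store only target_pc; the BTIC also
   stores the instruction found at that target. */
#define BTIC_ENTRIES 64

typedef struct {
    bool     valid;
    uint32_t branch_pc;    /* address of the branch itself         */
    uint32_t target_pc;    /* where the branch jumps to            */
    uint32_t target_insn;  /* the instruction located at target_pc */
} btic_entry;

static btic_entry btic[BTIC_ENTRIES];

/* On a predicted-taken branch: a BTB-only front end gets target_pc back and
   still spends a fetch cycle reading the instruction; a BTIC hit also hands
   back target_insn, so it can go straight into the instruction queue. */
static bool btic_lookup(uint32_t branch_pc, uint32_t *target_pc,
                        uint32_t *target_insn)
{
    const btic_entry *e = &btic[branch_pc % BTIC_ENTRIES];
    if (e->valid && e->branch_pc == branch_pc) {
        *target_pc   = e->target_pc;
        *target_insn = e->target_insn;
        return true;
    }
    return false;
}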
 
So, for the laymen, what's the verdict on the CPU inside the wii U?

The easiest way to look at it is around the performance of six Wii CPUs (it's 3x Wii cores, but they're clocked quite a bit higher than the Wii and the improved cache will also increase performance further). Certainly not up there with an 8 core Jaguar, but still a nice little CPU with very solid performance per clock.
 
ahh IBM, best products, worst manuals.

A bit off-topic but I just remembered a sticker I saw at a Computer Fair years ago that looked like the Intel Inside stickers you used to see on shop-bought PCs except it said - 'Intel Inside - Idiot Outside' lololol
 
Two of the main designers behind the Gekko CPU giving some insights; it's an old interview (2001) but it shows the philosophy behind the Wii U CPU:
IGNcube: Talking a little bit about what went on inside the chip, we know you have a very large L2 cache in there. This too seems to be underrated by a lot of people. Could you describe what kind of benefits such a large L2 cache gives the Gekko?

IBM, Mike West:
Let me take it to a higher level. When you're dealing with media processing, there are basically four things that you need to consider. And some of these things are considered all the time by people writing about it, but others do ignore it. Obviously, compute is very important -- getting calculations done. Another part of it is control, and that's where you make sure things happen in the right order or in real-time or whatever. Then one of the major ones that is overlooked is data flow. One of the biggest things we've tried to achieve with this total system design is the avoidance of bottlenecks, whether it's memory bandwidth bottlenecks because data is in the wrong place or it's moving in the wrong direction. Part of that whole scheme was this really large cache. What's even more important is that this cache isn't just a cache. It can actually be managed in a number of different ways that are advantageous to a gaming, or indeed any multimedia environment. So we have made modifications to it. It's not like a standard processor cache. The L1 cache, that is.

IBM, Peter Sandon: So, the question of L2 cache size if any was one we looked at fairly carefully. We landed where we did because of the cost benefit issues. We wanted to keep enough state on chip to avoid the penalties of going over the bus, despite the fact that we actually improved the bus to the Flipper chip significantly. But, what we also did was to manage the cache as one would want to manage the cache. The cache is intended to hold data that you want to keep around for a while. And there is a lot of data in graphics and multimedia applications that doesn't stay around for any length of time.

IBM, Mike West: It reaches a point where it's no longer needed.

IBM, Peter Sandon: It's a huge block of data, so if you managed it the same way you managed the persistent data that you want on the caches, it would just wipe out that persistent data. So, in addition to adding the L2 cache we also allow -- under program control -- the developer to manage the L1 cache, such that the transient data -- these big blocks and streams of graphics data -- bypasses the L2, goes through the L1 and is maintained in what we call the locked side of the cache. So it's maintained by software instead of hardware. Again, Mike's point about data flow is the key. We put some effort into getting the data flow right, and that was adding the right size L2 cache, but on the other hand not clobbering all that data in the L2 cache by running the transient data through it.
You can read all the interview here, it's very interesting:
http://m.uk.ign.com/articles/2001/12/13/interview-ibm-details-gekko-part-i
And here for part 2:
http://m.uk.ign.com/articles/2001/12/13/interview-ibm-details-gekko-part-ii
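As an aside, the cache management Sandon describes - transient data bypassing L2 and living in a software-managed "locked" half of L1 - is essentially a scratchpad pattern. A minimal sketch of the idea in C (lock_scratch and dma_in are made-up stand-ins, not the real SDK calls):

#include <string.h>

/* Stand-ins for the hardware facilities described in the interview; on real
   Gekko/Broadway these would be locked-cache and DMA SDK calls. */
static char scratch[8 * 1024];                       /* the "locked" region */
static void lock_scratch(void)                       { /* no-op stand-in */ }
static void dma_in(void *d, const void *s, size_t n) { memcpy(d, s, n); }

void stream_blocks(char *dst, const char *src, int nblocks)
{
    lock_scratch();
    for (int i = 0; i < nblocks; i++) {
        /* Transient data streams straight into the locked region, bypassing
           L2, so it never evicts the persistent working set kept there. */
        dma_in(scratch, src + (size_t)i * sizeof scratch, sizeof scratch);
        /* ...transform the block in place, then write the results out... */
        memcpy(dst + (size_t)i * sizeof scratch, scratch, sizeof scratch);
    }
}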

I liked the part where they identify which effects are done by the CPU:

...Well it basically does the local lighting calculations. The lighting that's implemented in the graphics chip is for lights at infinity. So, if you want to get the effects that you get from so-called local lights that don't look like they're out at infinity, those calculations are done on Gekko.

In Rogue Leader when you go in a canyon or fly past a building and fire your guns you can actually see the tracer fire lighting up the edge of the buildings. I'm pretty sure that they've done that with local lighting on Gekko

...One is the dust rising when he's walking around the mansion and another is when he's just standing still he's breathing and the hose on the vacuum cleaner is bobbing up and down. Something is computing physics there to make it bob and that's Gekko.
 
BC aside, to this day ppc have certain advantages in the embedded market where performance/watt matters much more and software dev cycles are generally longer than in the desktop world (so developers can learn to make the most out of the hw). That said, AArch64 has been positioning itself as the architecture for the embedded world of the next decade. Not necessarily because it's so much better than ppc32/64, but because IBM (the biggest player behind ppc) totally dropped the ball on the ppc side so hard. Another potent embedded design was bound to fill that void and AArch64 does fit the bill (while Intel only wish they did that with Atom). Of course, Nintendo have the monetary power to order their own CPUs (just like Gekko's entire lineage was custom made for Nintendo by IBM) till thy kingdom come. Whether they'll want that is another matter. A custom(-ized) CPU design is one thing, the needed sw infrastructure (compilers, debuggers, profilers) - another.

Blu, where it says increased floating point performance, about how much would you estimate it as, and how much of a difference does going from 2-way associative to 4-way make in performance in real terms?

Would that lead to a doubled throughput?



Also, can someone explain MERSI to me?
 
Blu, where it says increased floating point performance, about how much would you estimate it as, and how much of a difference does going from 2-way associative to 4-way make in performance in real terms?
It's just a change in the way the cache behaves, so it can never double real-world performance unless that was the whole bottleneck. In short, it's like this: you made a new highway to the city (city = CPU pipeline) and now it has 4 lanes. Does that solve all the city's traffic? No, but if the problem was getting in and out of the city, it can improve things a little.

See, the CPU portion (the amount of dispatched and completed instructions per cycle) remains the same, the L2 cache just changed from 2-way into 4-way, and that comes in handy.


As for the improvement to be expected... I'll point you to the PPC 476FP, as I suspect a few things got backported between these chips - or at least taken into account, since if it wasn't for Nintendo's need to retain code compatibility with Gekko/Broadway they probably would have gone for it; see:

GC Gekko - 2.31 DMIPS/MHz
PPC 476FP - 2.71 DMIPS/MHz

Quite an improvement for a classic PPC pipeline, and probably the ceiling for what to expect here. That's the improvement I always expected as the best case scenario, since it comes from these small changes plus the increased cache and not from a full architecture redesign by any means, but it is more effective.
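Back-of-the-envelope from those figures: 2.71 / 2.31 ≈ 1.17, so at best around 17% more work per clock; at Espresso's ~1.24GHz that's the difference between roughly 2.31 × 1243 ≈ 2900 and 2.71 × 1243 ≈ 3400 DMIPS per core. Treat Dhrystone as a very rough yardstick, though - it's a ceiling for this comparison, not a prediction.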
 
Also, can someone explain MERSI to me?
It's a cache coherency protocol.

In a nutshell, it's how they keep caches in multi-core systems from disagreeing about what value is at a memory location that's being simultaneously referenced in caches for multiple cores.

Or in slightly more layman's terms, it's a mechanism to make sure that the caches in multi-core systems don't cause incorrect program behaviour.
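A toy illustration of the problem any such protocol solves: without coherency, the consuming core could keep reading a stale cached copy of the flag after the producing core has set it. A minimal sketch (C11 atomics plus pthreads just to mark the shared locations; the cross-core handshake is the point, not the API):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Toy producer/consumer handshake across two cores. A coherency protocol
   (MEI, MESI, MERSI...) is what guarantees the consumer's cached copy of
   these lines reflects the producer's writes rather than going stale. */
static atomic_int payload = 0;
static atomic_int ready   = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&payload, 42);      /* write the data...           */
    atomic_store(&ready, 1);         /* ...then publish the flag    */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load(&ready))     /* spin until the flag is seen */
        ;
    printf("got %d\n", atomic_load(&payload));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}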
 
It's a cache coherency protocol.

In a nutshell, it's how they keep caches in multi-core systems from disagreeing about what value is at a memory location that's being simultaneously referenced in caches for multiple cores.

Or in slightly more layman's terms, it's a mechanism to make sure that the caches in multi-core systems don't cause incorrect program behaviour.

How does this affect performance compared to using MEI?
 
krizzx, as previously stated, the difference is certainly not huge (in fact it should be mostly the same); see, G4's had MERSI in the spec, but if you look into the processor data sheets you'll come to the conclusion that no Apple-shipped G4 actually used or supported it, even though it was supported in silicon. Why? Because it didn't need it, and there wasn't much if any benefit in enabling it for single processor cores.

It was supposed to be really relevant had there been dual-core G4's, but they were canned.

It happens to be very important for Espresso because it ensures that code spanning multiple processors works as it's supposed to (otherwise you'd have to treat them all as separate processors that only access their own caches, as I understand it). From a developer standpoint it means more flexibility/options; they wouldn't really need it for single core "simple" code, but it's a very welcome addition to the SMP side of the CPU.

If anything, that's what makes this not just 3 Broadways duct-taped together and overclocked, but a real multicore CPU. The biggest gain being that it performs as expected.
 
Can anyone confirm this?

Porting effort takes time and adds dev costs. Even Ubisoft won't do that... it's lucky enough that they didn't cancel the game on Wii U to begin with. Thankfully the game is pretty good on the Wii U even with the frame rate; at least it consistently hovers around 22fps and isn't sporadic like Batman AC.

Keep in mind also that these Xbox 360 ports are not using the Wii U DSP for sound but wasting a CPU core on it, and are not using much or any of the Wii U's eDRAM, instead delegating things like texture fetching to main system memory a la Xbox 360 (Xbox 360 coders have to do this since that system only has 10MB of eDRAM). This makes these ports run worse despite usually having better picture quality and no screen tearing thanks to the more modern/powerful GPU in the Wii U. It makes sense when you think about it... better graphics... but worse performance?

Also, a question:

Do we know whether games supposedly suffering bad framerates suffer the same amount when played solely on the gamepad (meaning a lower resolution to render)? Has this been tested? Can this be tested? If it doesn't impact the framerate, I would indeed think the CPU (optimization) is the culprit.

I remember people saying the framerate issues in Batman AC disappeared when changing the Wii U output to 720p (instead of 1080p). How about playing on the gamepad (with nothing on the tv)?
 
Can anyone confirm this?

I seem to remember a bit of chatter pre- and just post-launch about something like this, but I'd be shocked if it was still the case. Unless third parties have stripped their Wii U port teams back to the absolute bare minimum and there just isn't the staffing or expertise to do much more than shunt a title over (i.e. no time, money or ability to take advantage of the different setup) so things like this are just being ignored or missed.

Which, given the way things seem to have gone post-launch, wouldn't surprise me in the slightest.

EDIT:

For example - Arkham Origins vs Arkham City. One hit at launch, developed on hardware in flux, with apparently poor documentation. It still ran better than a game that came nearly a year later. I strongly suspect that the team that worked on AO was smaller, possibly less experienced, and likely running to a tighter budget and it shows.
 
Can anyone confirm this?



Also, a question:



I remember people saying the framerate issues in Batman AC disappeared when changing the Wii U output to 720p (instead of 1080p). How about playing on the gamepad (with nothing on the tv)?
Maybe that's why I never noticed any framerate issues on wii u AssCreed, my tv is set to 720p
 
I also remember hearing of a lot of games running better after the Spring update.

I remember people talking about Lego City having huge slowdowns, then a few said that they ceased after the update.

I personally have never experienced a slowdown with Lego City. Also, I noticed that the game had fewer and shorter load times towards the later chapters, which I think was because of issues from early in development being worked out towards the end, or so I read.

I still want to know the real world capability of Espresso.

How does 1 Espresso core stack up to 1 Xbox One CPU core, and 1 PS4 CPU core?
 