LordOfChaos
Member
The PowerPC 750 (a.k.a., the G3)
The PowerPC 750, known to Apple users as the G3, is a design based heavily on the 603/603e. Its four-stage pipeline is the same as that of the 603/603e, and many of the features of its front end and back end will be familiar from the previous article's discussion of the older processor. Nonetheless, the 750 sports a few very powerful improvements over the 603e that made it faster than even the 604e.
PowerPC 750 summary table
Introduction date: November 10, 1997
Process: 0.25 micron
Transistor Count: 6.35 million
Die size: 67 mm²
Clock speed at introduction: 233-266MHz
Cache sizes: 64KB split L1 (32KB instruction, 32KB data), 512KB L2
First appeared in: Power Macintosh G3/233
The 750's significant improvement in performance over the 603/603e is the result of a number of factors, not the least of which are the improvements that IBM made to the 750's integer and floating-point capabilities.
A quick glance at the 750's layout will reveal that its execution core is wider than that of the 603. More specifically, where the 603 has a single integer unit the 750 has two, a simple integer unit (SIU) and complex integer unit (CIU). The 750's complex integer unit handles all integer instructions, while the simple integer unit handles all integer instructions except multiply and divide. Most of the integer instructions that execute in the SIU are single-cycle instructions.
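The SIU/CIU split described above can be sketched as a small routing function. This is only an illustration: the function names and the `busy` set are hypothetical, and the mnemonic list is a partial, illustrative set of PowerPC multiply/divide instructions rather than a complete one.

```python
from typing import List, Optional, Set

# Illustrative (incomplete) set of integer multiply/divide mnemonics.
MULTIPLY_DIVIDE = {"mulli", "mullw", "mulhw", "mulhwu", "divw", "divwu"}

def eligible_integer_units(mnemonic: str) -> List[str]:
    """Return the integer units that can execute the given instruction."""
    if mnemonic in MULTIPLY_DIVIDE:
        return ["CIU"]        # multiply/divide run only on the complex unit
    return ["SIU", "CIU"]     # everything else can run on either unit

def pick_unit(mnemonic: str, busy: Set[str]) -> Optional[str]:
    """Pick the first eligible unit that is not busy, or None to stall.

    Assumption: the simple unit is preferred when both are free.
    """
    for unit in eligible_integer_units(mnemonic):
        if unit not in busy:
            return unit
    return None
```

For example, `pick_unit("add", {"SIU"})` falls through to the CIU, while `pick_unit("divw", {"CIU"})` has nowhere to go and returns None.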
Like the 603 (and the 604), the 750's floating-point unit can execute all single-precision floating-point operations, including multiply, with a latency of three cycles. Unlike the 603, though, the 750 doesn't have to insert a pipeline bubble after every third instruction in its pipeline. Double-precision floating-point operations, with the exception of operations involving multiplication, also take three cycles on the 750. Double-precision multiply and multiply-add operations take four cycles, because the 750 doesn't have a full double-precision FPU.
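The latency figures quoted above fit in a tiny lookup. This is a rough sketch that encodes only the numbers given in the paragraph; the operation labels ("add", "mul", "madd") are hypothetical names chosen for illustration, and other operations are deliberately left out.

```python
def fp_latency(op: str, precision: str) -> int:
    """Return the 750's floating-point latency in cycles, per the text above.

    op: one of "add", "mul", "madd" (hypothetical labels)
    precision: "single" or "double"
    """
    if op not in {"add", "mul", "madd"}:
        raise ValueError(f"operation not modeled: {op}")
    if precision not in {"single", "double"}:
        raise ValueError(f"unknown precision: {precision}")
    # Double-precision multiply and multiply-add need an extra cycle.
    if precision == "double" and op in {"mul", "madd"}:
        return 4
    return 3
```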
The 750's load-store unit and system register unit perform the functions described above for the 603, so they don't merit further comment.
The 750's front end and instruction window
The 750 fetches up to four instructions per cycle into its six-entry instruction queue (cf. the 603's six-entry IQ), and it dispatches up to two non-branch instructions per cycle from the IQ's two bottom entries. The dispatch logic follows the four dispatch rules described above when deciding when an instruction is eligible to dispatch, and each dispatched instruction is assigned an entry in the 750's six-entry reorder buffer (compare the 603's five-entry ROB).
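To get a feel for how the fetch, dispatch, and ROB sizes interact, here is a toy cycle counter. The four size constants come from the paragraph above; everything else is an assumption made for the sketch: every instruction is treated as a non-branch that is eligible to dispatch as soon as it reaches the bottom of the IQ, and ROB entries are assumed to retire in order, up to `retire_width` per cycle, starting the cycle after they are allocated.

```python
from collections import deque

IQ_SIZE = 6          # instruction queue entries (from the text)
FETCH_WIDTH = 4      # instructions fetched per cycle (from the text)
DISPATCH_WIDTH = 2   # non-branch instructions dispatched per cycle (from the text)
ROB_SIZE = 6         # reorder buffer entries (from the text)

def cycles_to_dispatch(n_instructions: int, retire_width: int = 2) -> int:
    """Count cycles until n_instructions have all been dispatched."""
    if retire_width < 1:
        raise ValueError("retire_width must be at least 1")
    iq = deque()      # instruction queue; bottom of the queue is the left end
    rob = deque()     # reorder buffer, in program order
    next_fetch = 0    # index of the next instruction to fetch
    dispatched = 0
    cycle = 0
    while dispatched < n_instructions:
        cycle += 1
        # Retire: free ROB entries that were allocated in earlier cycles.
        for _ in range(min(retire_width, len(rob))):
            rob.popleft()
        # Dispatch: up to two instructions from the bottom of the IQ,
        # each of which needs a free ROB entry.
        for _ in range(DISPATCH_WIDTH):
            if iq and len(rob) < ROB_SIZE:
                rob.append(iq.popleft())
                dispatched += 1
        # Fetch: refill the IQ, limited by fetch width and free slots.
        count = min(FETCH_WIDTH, IQ_SIZE - len(iq), n_instructions - next_fetch)
        for _ in range(count):
            iq.append(next_fetch)
            next_fetch += 1
    return cycle
```

With the default settings the two-wide dispatch is the bottleneck; lowering `retire_width` to 1 lets the six-entry ROB fill up and stall dispatch on longer instruction streams.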
Figure POWERPC.4: The PowerPC 750
As on the 603 and 604, newly-dispatched instructions enter the reservation station of the execution unit to which they have been dispatched, where they wait for their operands to become available so that they can issue. The 750's reservation station configuration is similar to that of the 603, in that with the exception of the two-entry reservation station attached to the 750's LSU, all of the execution units have a single-entry reservation station. And like the 603, the 750's branch unit has no reservation station.
Because the 750's instruction window is so small, it has half the rename registers of the 604. Nonetheless, the 750's six general-purpose and six floating-point rename registers still put it ahead of the 603's number of rename registers (five GPR and four FPR). Like the 603, the 750 has one rename register each for the CR, LR, and CTR.
You would think that the 750's smaller reservation stations and shorter ROB would put it at a disadvantage with respect to the 604, which has a larger instruction window. But the 750's pipeline is shorter than that of the 604, so it needs fewer buffers to track fewer in-flight instructions. Even more importantly, though, the 750 has one very clever trick up its sleeve that it uses to keep its pipeline full.
Branch prediction on the 750
In the previous article's discussion of branch prediction, we talked about how dynamic branch prediction schemes use a branch history table (BHT) in combination with a branch target buffer (BTB) to speculate on the outcome of branch instructions and to redirect the processor's front end to a different point in the code stream based on this speculation. The BHT stores information on the past behavior (i.e., taken or not taken) of the most recently executed branch instructions, so that the processor can determine whether or not it should take these branches if it encounters them again. The target addresses of recently taken branches are stored in the BTB, so that when the branch prediction hardware decides to speculatively take a branch it will have immediate access to that branch's target address without having to recalculate it. The target address of the speculatively taken branch is loaded from the BTB into the instruction register, so that on the next fetch cycle the processor can begin fetching and speculatively executing instructions from the target address.
The 750 improves on this scheme in a very clever way. Instead of storing only the target addresses of recently taken branches in a BTB, the 750's 64-entry branch target instruction cache (BTIC) stores the instruction that's located at the branch's target address. When the 750's branch prediction unit examines the 512-entry BHT and decides to speculatively take a branch, it doesn't have to go to code storage to fetch the first instruction from that branch's target address. Instead, the BPU loads the branch's target instruction directly from the BTIC into the instruction queue, which means that the processor doesn't have to wait around for the fetch logic to go out and fetch the target instruction from code storage. This scheme saves valuable cycles, and it helps keep performance-killing bubbles out of the 750's pipeline.
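The BHT-plus-BTIC idea can be sketched in a few lines. The entry counts come from the text; the rest is assumed for illustration: 2-bit saturating counters in the BHT, a direct-mapped BTIC holding one instruction per entry, indexing by word address, and all class and method names. The real hardware's organization may well differ.

```python
BHT_ENTRIES = 512    # from the text
BTIC_ENTRIES = 64    # from the text

class BranchPredictor:
    """Toy model: a BHT decides taken/not-taken, a BTIC supplies the target instruction."""

    def __init__(self):
        # Assumption: 2-bit saturating counters, starting at "weakly not taken" (1).
        self.bht = [1] * BHT_ENTRIES
        # Assumption: direct-mapped; each entry is (branch_addr, target_instruction).
        self.btic = [None] * BTIC_ENTRIES

    def predict_taken(self, branch_addr: int) -> bool:
        # Instructions are 4 bytes, so index by word address.
        return self.bht[(branch_addr >> 2) % BHT_ENTRIES] >= 2

    def lookup_target_instruction(self, branch_addr: int):
        """Return the cached target instruction, or None on a BTIC miss."""
        entry = self.btic[(branch_addr >> 2) % BTIC_ENTRIES]
        if entry is not None and entry[0] == branch_addr:
            return entry[1]   # hit: goes straight to the instruction queue
        return None           # miss: the front end must fetch from code storage

    def update(self, branch_addr: int, taken: bool, target_instruction=None):
        """Record the actual outcome of a branch once it resolves."""
        i = (branch_addr >> 2) % BHT_ENTRIES
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
            if target_instruction is not None:
                slot = (branch_addr >> 2) % BTIC_ENTRIES
                self.btic[slot] = (branch_addr, target_instruction)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)
```

A plain BTB would store a target address in each entry instead, so a predicted-taken branch would still cost a trip to code storage before the first target instruction could enter the queue.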
PowerPC 750 conclusions
In spite of its short pipeline and small instruction window, the 750 packed quite a punch. It managed to outperform the 604, and it was so successful that a 604-derivative was scrapped in favor of just building on the 750. The 750 and its immediate successors, all of which went under the name of "G3," eventually found widespread use both in the embedded arena and across Apple's entire product line, from its portables to its workstations.
The G3 lacked one important feature that separated it from the x86 competition, though: vector computing capabilities. While comparable PC processors supported SIMD in the form of Intel's and AMD's vector extensions to the x86 instruction set, the G3 was stuck in the world of scalar computing. So when Motorola decided to develop the G3 into an even more capable embedded and media workstation chip, this lack was the first thing they addressed.

http://arstechnica.com/features/2004/10/ppc-2/
It's a bit like AMD's Bulldozer architecture, come to think of it: twice as many integer units per core as floating-point units.