Eric Mejdrich, Sr. Director of SOC Architecture and Principal Architect, Xbox, at Microsoft
Some of the following have been discussed by Sony game engine developers for the PS4. It's possible that some may show up in both consoles to give software developers an easier port between platforms.
In any case, I believe some of the following are the special sauce being discussed for the Xbox 3. The dates listed for the hardware features are all no later than Dec 2011, which is when SemiAccurate stated the Xbox 720 started tapeout.
http://www.faqs.org/patents/inventor/eric-o-mejdrich-2/
QOS software
20110285709
Allocating Resources Based On A Performance Statistic - A method includes rendering an object of a three dimensional image via a pixel shader based on a render context data structure associated with the object. The method includes measuring a performance statistic associated with rendering the object. The method also includes storing the performance statistic in the render context data structure associated with the object. The performance statistic is accessible to a host interface processor to determine whether to allocate a second pixel shader to render the object in a subsequent three-dimensional image. 11-24-2011
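To make the abstract a bit more concrete, here is a rough Python sketch of how I read it. Everything in it (the `RenderContext` fields, the 2.0 ms budget, the function names) is my own assumption for illustration and is not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class RenderContext:
    """Per-object render state; field names are hypothetical."""
    object_name: str
    shaders: int = 1              # pixel shaders currently assigned to the object
    last_render_ms: float = 0.0   # the stored performance statistic


def record_render(ctx, elapsed_ms):
    # The measured statistic is stored inside the object's render context.
    ctx.last_render_ms = elapsed_ms


def allocate_for_next_frame(ctx, budget_ms=2.0):
    """Host-side decision: add a second pixel shader if the object ran over budget."""
    if ctx.last_render_ms > budget_ms and ctx.shaders < 2:
        ctx.shaders = 2
    return ctx.shaders
```

The point is only that the decision for the next frame is driven by a number measured during the previous one.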
20110285710
Parallelized Ray Tracing - A method includes assigning a priority to a ray data structure of a plurality of ray data structures based on one or more priorities. The ray data structure includes properties of a ray to be traced from an illumination source in a three-dimensional image. The method includes identifying a portion of the three-dimensional image through which the ray passes. The method also includes identifying a slave processing element associated with the portion of the three-dimensional image. The method further includes sending the ray data structure to the slave processing element. 11-24-2011
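A minimal sketch of the routing idea, as I understand it: rays are taken in priority order and handed to whichever processing element owns the part of the image they touch. The tile lookup from the ray origin, the tile size and the worker names are simplifications I made up; the patent talks about the portion of the image the ray passes through.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Ray:
    priority: int                          # lower value = routed sooner (assumption)
    origin: tuple = field(compare=False)
    direction: tuple = field(compare=False)


def tile_for(ray, tile_size=4):
    # Hypothetical simplification: pick the tile from the ray origin's x/y position.
    x, y, _ = ray.origin
    return (int(x) // tile_size, int(y) // tile_size)


def route_rays(rays, tile_owner):
    """Send each ray, most urgent first, to the worker that owns its tile."""
    heap = list(rays)
    heapq.heapify(heap)
    queues = {}
    while heap:
        ray = heapq.heappop(heap)
        worker = tile_owner[tile_for(ray)]
        queues.setdefault(worker, []).append(ray)
    return queues
```

Each worker ends up with its own queue, already ordered by priority.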
20110289485
Software Trace Collection and Analysis Utilizing Direct Interthread Communication On A Network On Chip - Collecting and analyzing trace data while in a software debug mode through direct interthread communication ('DITC') on a network on chip ('NOC'), the NOC including integrated processor ('IP') blocks, routers, memory communications controllers, and network interface controllers, with each IP block adapted to a router through a memory communications controller and a network interface controller, where each memory communications controller controlling communications between an IP block and memory, and each network interface controller controlling inter-IP block communications through routers, including enabling the collection of software debug information in a selected set of IP blocks distributed through the NOC, each IP block within the selected set of IP blocks having a set of trace data; collecting software debugging information via the set of trace data; communicating the set of trace data to a destination repository; and analyzing the set of trace data at the destination repository. 11-24-2011
Software acceleration
20110316855
Parallelized Streaming Accelerated Data Structure Generation - A method includes receiving at a master processing element primitive data that includes properties of a primitive. The method includes partially traversing a spatial data structure that represents a three-dimensional image to identify an internal node of the spatial data structure. The internal node represents a portion of the three-dimensional image. The method also includes selecting a slave processing element from a plurality of slave processing elements. The selected processing element is associated with the internal node. The method further includes sending the primitive data to the selected slave processing element to traverse a portion of the spatial data structure to identify a leaf node of the spatial data structure. 12-29-2011
20110316864
MULTITHREADED SOFTWARE RENDERING PIPELINE WITH DYNAMIC PERFORMANCE-BASED REALLOCATION OF RASTER THREADS - A multithreaded rendering software pipeline architecture dynamically reallocates regions of an image space to raster threads based upon performance data collected by the raster threads. The reallocation of the regions typically includes resizing the regions assigned to particular raster threads and/or reassigning regions to different raster threads to better balance the relative workloads of the raster threads. 12-29-2011
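Here is a small sketch of what the rebalancing step could look like, assuming the image is split into horizontal bands of rows and each raster thread reports how long its band took last frame. The function name and the "rows per unit time" heuristic are my own guesses, not something the patent spells out.

```python
def rebalance_rows(row_counts, frame_times):
    """Resize each thread's band of rows so slower threads get fewer rows."""
    total_rows = sum(row_counts)
    # Rows per unit time each thread managed last frame.
    speeds = [rows / t for rows, t in zip(row_counts, frame_times)]
    total_speed = sum(speeds)
    new_counts = [int(total_rows * s / total_speed) for s in speeds]
    # Hand any rows lost to rounding to the fastest thread.
    new_counts[speeds.index(max(speeds))] += total_rows - sum(new_counts)
    return new_counts
```

A thread that took three times as long as its neighbour on the same number of rows ends up with a noticeably smaller band next frame, while the total row count stays the same.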
20110317712
Recovering Data From A Plurality of Packets - A method includes receiving a plurality of packets at an integrated processor block of a network on a chip device. The plurality of packets includes a first packet that includes an indication of a start of data associated with a pixel shader application. The method includes recovering the data from the plurality of packets. The method also includes storing the recovered data in a dedicated packet collection memory within the network on the chip device. The method further includes retaining the data stored in the dedicated packet collection memory during an interruption event. Upon completion of the interruption event, the method includes copying packets stored in the dedicated packet collection memory prior to the interruption event to an inbox of the network on the chip device for processing. 12-29-2011
Hardware patent
20110320719
PROPAGATING SHARED STATE CHANGES TO MULTIPLE THREADS WITHIN A MULTITHREADED PROCESSING ENVIRONMENT - A circuit arrangement and method make state changes to shared state data in a highly multithreaded environment by propagating or streaming the changes to multiple parallel hardware threads of execution in the multithreaded environment using an on-chip communications network and without attempting to access any copy of the shared state data in a shared memory to which the parallel threads of execution are also coupled. Through the use of an on-chip communications network, changes to the shared state data may be communicated quickly and efficiently to multiple threads of execution, enabling those threads to locally update their local copies of the shared state. Furthermore, by avoiding attempts to access a shared memory, the interface to the shared memory is not overloaded with concurrent access attempts, thus preserving memory bandwidth for other activities and reducing memory latency. Particularly for larger shared states, propagating the changes, rather than an entire shared state, further improves performance by reducing the amount of data communicated over the on-chip communications network. 12-29-2011
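A toy model of the idea, under my own assumptions: each hardware thread keeps a private copy of the shared state, and a change is streamed to every thread as a small delta instead of each thread re-reading shared memory. The class name and the use of a Python queue to stand in for the on-chip network are purely illustrative.

```python
from queue import SimpleQueue


class HardwareThread:
    def __init__(self, initial_state):
        self.local_state = dict(initial_state)   # private copy of the shared state
        self.inbox = SimpleQueue()               # stands in for the on-chip network link

    def apply_pending(self):
        # Update the local copy from received deltas; shared memory is never touched.
        while not self.inbox.empty():
            self.local_state.update(self.inbox.get())


def propagate(threads, delta):
    """Broadcast only the changed fields, never the whole state."""
    for t in threads:
        t.inbox.put(dict(delta))
```

With a large state and a one-field change, only that one field crosses the network, which is the bandwidth saving the abstract describes.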
20110320724
DMA-BASED ACCELERATION OF COMMAND PUSH BUFFER BETWEEN HOST AND TARGET DEVICES - Direct Memory Access (DMA) is used in connection with passing commands between a host device and a target device coupled via a push buffer. Commands passed to a push buffer by a host device may be accumulated by the host device prior to forwarding the commands to the push buffer, such that DMA may be used to collectively pass a block of commands to the push buffer. In addition, a host device may utilize DMA to pass command parameters for commands to a command buffer that is accessible by the target device but is separate from the push buffer, with the commands that are passed to the push buffer including pointers to the associated command parameters in the command buffer. 12-29-2011
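The two ideas in that abstract (batch the commands before transferring them, and keep the parameters in a separate buffer referenced by pointer) can be sketched like this. It is only a model: the `Host` class, the batch size and the list-index "pointers" are my assumptions, and a list `extend` stands in for the DMA block transfer.

```python
class Host:
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.pending = []        # commands accumulated before a transfer
        self.push_buffer = []    # what the target device reads
        self.param_buffer = []   # separate buffer holding command parameters
        self.transfers = 0       # number of block transfers issued

    def submit(self, opcode, params):
        # Parameters go to the side buffer; the command carries only a pointer.
        self.param_buffer.append(params)
        pointer = len(self.param_buffer) - 1
        self.pending.append((opcode, pointer))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.push_buffer.extend(self.pending)   # one block copy for the whole batch
            self.pending.clear()
            self.transfers += 1
```

Two commands with a batch size of two cost one transfer instead of two, and the push buffer entries stay small because the parameters live elsewhere.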
20110321049
Programmable Integrated Processor Blocks - An integrated processor block of the network on a chip is programmable to perform a first function. The integrated processor block includes an inbox to receive incoming packets from other integrated processor blocks of a network on a chip, an outbox to send outgoing packets to the other integrated processor blocks, an on-chip memory, and a memory management unit to enable access to the on-chip memory. 12-29-2011
The following is software and was discussed by a Sony game engine developer.
20120176364
REUSE OF STATIC IMAGE DATA FROM PRIOR IMAGE FRAMES TO REDUCE RASTERIZATION REQUIREMENTS - An apparatus, program product and method reuse static image data generated during rasterization of static geometry to reduce the processing overhead associated with rasterizing subsequent image frames. In particular, static image data generated one frame may be reused in a subsequent image frame such that the subsequent image frame is generated without having to re-rasterize the static geometry from the scene, i.e., with only the dynamic geometry rasterized. The resulting image frame includes dynamic image data generated as a result of rasterizing the dynamic geometry during that image frame, and static image data generated as a result of rasterizing the static image data during a prior image frame. 07-12-2012
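A tiny sketch of the reuse trick, assuming a one-row "image" and a toy rasterizer so it stays readable. The class and function names are hypothetical; the only thing it is meant to show is that the static geometry is rasterized once and later frames only rasterize what moves.

```python
def rasterize(geometry, width):
    """Toy rasterizer: each object is (x, color) and fills one pixel."""
    row = [None] * width
    for x, color in geometry:
        row[x] = color
    return row


class FrameRenderer:
    def __init__(self, static_geometry, width):
        self.width = width
        self.static_geometry = static_geometry
        self.static_cache = None   # static image data kept from a prior frame
        self.static_passes = 0     # how many times static geometry was rasterized

    def render(self, dynamic_geometry):
        if self.static_cache is None:
            self.static_cache = rasterize(self.static_geometry, self.width)
            self.static_passes += 1
        dynamic = rasterize(dynamic_geometry, self.width)
        # Dynamic pixels win; everything else comes from the cached static image.
        return [d if d is not None else s
                for d, s in zip(dynamic, self.static_cache)]
```

Rendering two frames with a moving object leaves `static_passes` at 1, so the second frame pays only for the dynamic geometry.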
Hardware feature
20110320771
INSTRUCTION UNIT WITH INSTRUCTION BUFFER PIPELINE BYPASS - A circuit arrangement and method selectively bypass an instruction buffer for selected instructions so that bypassed instructions can be dispatched without having to first pass through the instruction buffer. Thus, for example, in the case that an instruction buffer is partially or completely flushed as a result of an instruction redirect (e.g., due to a branch mispredict), instructions can be forwarded to subsequent stages in an instruction unit and/or to one or more execution units without the latency associated with passing through the instruction buffer. 12-29-2011
The following, I think, is the thread fabric control linking multiple IP blocks (CPUs and more). It's similar to what supercomputers use.
20120192202
Context Switching On A Network On Chip - A network on chip (NOC) that includes IP blocks, routers, memory communications controllers, and network interface controllers, each IP block adapted to the network by an application messaging interconnect including an inbox and an outbox, one or more of the IP blocks including computer processors supporting a plurality of threads, the NOC also including an inbox and outbox controller configured to set pointers to the inbox and outbox, respectively, that identify valid message data for a current thread; and software running in the current thread that, upon a context switch to a new thread, is configured to: save the pointer values for the current thread, and reset the pointer values to identify valid message data for the new thread, where the inbox and outbox controller are further configured to retain the valid message data for the current thread in the boxes until context switches again to the current thread. 07-26-2012
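As far as I can tell, the inbox side of that works roughly like the sketch below: a pair of pointers marks the valid message data for the running thread, and a context switch just saves and swaps the pointers while the messages stay where they are. The fixed per-thread regions and all names are my own assumptions to keep the toy safe from overwrites; the patent does not say how the boxes are partitioned.

```python
class Inbox:
    """Toy inbox: one slot array shared by all threads on an IP block."""

    def __init__(self, threads, slots_per_thread):
        self.slots = [None] * (threads * slots_per_thread)
        self.slots_per_thread = slots_per_thread
        self.saved = {}              # thread id -> (start, end) pointer pair
        self.start = self.end = 0    # window of valid data for the current thread
        self.regions_used = 1        # the first running thread owns region 0

    def write(self, message):
        if self.end - self.start >= self.slots_per_thread:
            raise OverflowError("inbox region for the current thread is full")
        self.slots[self.end] = message
        self.end += 1

    def valid_messages(self):
        return self.slots[self.start:self.end]

    def context_switch(self, old_thread, new_thread):
        # Save the outgoing thread's pointers; its messages are left in place.
        self.saved[old_thread] = (self.start, self.end)
        if new_thread not in self.saved:
            base = self.regions_used * self.slots_per_thread
            self.saved[new_thread] = (base, base)
            self.regions_used += 1
        # Reset the pointers to the incoming thread's valid data.
        self.start, self.end = self.saved[new_thread]
```

Switching from thread A to B and back again shows A's messages still sitting in the box, which is the "retain until the context switches again" part of the abstract.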
If you follow the LinkedIn profile for Eric Mejdrich, he worked for IBM on NOC until 2010 and then moved to Microsoft. NOC would be the thread routing "fabric" of a supercomputer (it would also support distributed computing). Is the NOC being used IBM IP?
I'm starting to come to the conclusion that Charlie was accurate in his Oban article: older, well-understood 32nm SOI and IP from multiple disciplines to accelerate a smaller GPU. The smaller GPU makes yield easier. I wonder about the DX 11.5 mentioned for the PS4. At the time that was posted, DX was further along than OpenGL, and the current DirectX standard is only 11.1. Could DX 11.5 be anticipating features coming in both consoles?