In the PowerVR architecture (Figure 3), the arrays are grouped into Unified Shading Clusters (USCs). The process of scanline rendering naturally results in a high degree of this type of coherence, so the arrays can be kept busy, with ALU latencies and the inevitable memory access latencies masked to a certain degree by task switching.
There are a number of data masters feeding into the schedulers to handle vertex-, pixel-, and compute-related tasks. Once the shading operation is done, the result is output into a data sink for further processing, depending on what part of the rendering pipeline is being handled.
The ray tracing unit (RTU) can be added to this list as both a data sink and a data master so that it can both receive (sink) new ray queries from the shaders and dispatch (master) ray/triangle intersection results back for shading. It contains registers for a large number of complete ray queries (with user data) attached to a SIMD array of fixed-function "Axis Aligned Bounding Box vs. Ray" testers and "Triangle vs. Ray" testers.
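To make the tester arrays concrete, the following is a minimal software sketch of what a per-ray query record might carry and of the two fixed-function tests, a slab-based AABB/ray test and a Möller-Trumbore triangle/ray test. The struct layout, names, and the use of precomputed reciprocal directions are assumptions for illustration, not the actual PowerVR register format or hardware data path.

```cpp
// Sketch only: plausible per-ray state plus the two fixed-function tests the
// RTU's SIMD testers evaluate. Names and layout are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <utility>

struct RayQuery {
    float origin[3];
    float dir[3];
    float inv_dir[3];    // precomputed 1/dir, convenient for slab tests
    float t_min, t_max;  // valid interval; t_max shrinks as closer hits are found
    uint32_t user_data;  // e.g., pixel index, carried through to shading
};

struct AABB { float lo[3], hi[3]; };

// "Axis Aligned Bounding Box vs. Ray" tester (slab method).
bool intersect_aabb(const RayQuery& r, const AABB& b) {
    float t0 = r.t_min, t1 = r.t_max;
    for (int a = 0; a < 3; ++a) {
        float t_near = (b.lo[a] - r.origin[a]) * r.inv_dir[a];
        float t_far  = (b.hi[a] - r.origin[a]) * r.inv_dir[a];
        if (t_near > t_far) std::swap(t_near, t_far);
        t0 = std::max(t0, t_near);
        t1 = std::min(t1, t_far);
    }
    return t0 <= t1;
}

// "Triangle vs. Ray" tester (Möller-Trumbore); writes the hit distance on success.
bool intersect_triangle(const RayQuery& r, const float v0[3], const float v1[3],
                        const float v2[3], float& t_out) {
    float e1[3], e2[3], s[3];
    for (int i = 0; i < 3; ++i) {
        e1[i] = v1[i] - v0[i];
        e2[i] = v2[i] - v0[i];
        s[i]  = r.origin[i] - v0[i];
    }
    const float p[3] = { r.dir[1]*e2[2] - r.dir[2]*e2[1],
                         r.dir[2]*e2[0] - r.dir[0]*e2[2],
                         r.dir[0]*e2[1] - r.dir[1]*e2[0] };
    const float det = e1[0]*p[0] + e1[1]*p[1] + e1[2]*p[2];
    if (std::fabs(det) < 1e-8f) return false;           // ray parallel to triangle
    const float inv_det = 1.0f / det;
    const float u = (s[0]*p[0] + s[1]*p[1] + s[2]*p[2]) * inv_det;
    if (u < 0.0f || u > 1.0f) return false;
    const float q[3] = { s[1]*e1[2] - s[2]*e1[1],
                         s[2]*e1[0] - s[0]*e1[2],
                         s[0]*e1[1] - s[1]*e1[0] };
    const float v = (r.dir[0]*q[0] + r.dir[1]*q[1] + r.dir[2]*q[2]) * inv_det;
    if (v < 0.0f || u + v > 1.0f) return false;
    const float t = (e2[0]*q[0] + e2[1]*q[1] + e2[2]*q[2]) * inv_det;
    if (t < r.t_min || t > r.t_max) return false;       // outside the valid interval
    t_out = t;
    return true;
}
```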
Importantly, there is a coherence gathering unit that assembles memory access requests into one of two types of coherency queue, intersection queues or shading queues, and then schedules them for processing. Intersection queues are scheduled onto the SIMD AABB or triangle testers; shading queues are mastered out to the USCs.
Intersection queues are created and destroyed on the fly; each represents a list of sibling Bounding Volume Hierarchy (BVH) nodes or triangles to be streamed in from off-chip memory. Initially, the queues tend to fill naturally, because the root BVH nodes span a large volume of the scene and most rays therefore hit them. When a full queue of rays is to be tested against the root of the hierarchy, the root nodes are read from memory and the hardware intersects the rays against nodes and/or triangles as appropriate.
For each node that is hit, a new intersection queue is dynamically created, and the rays that hit that node are placed into this new child queue. If the child queue is completely full (which is common at the top of the BVH), it is pushed onto a ready stack and processed immediately.
If the queue is not full (which happens a little deeper in the tree, especially with scattered input rays from the USC), it is retained in a queue cache until further hits against that same BVH node occur later. In this mode, a queue effectively represents an address in DRAM at which to start reading in the future. The net effect is that rays are coherence gathered into regions of 3D space, and the queue resources are dynamically spent on the areas of the scene where coherence is hardest to collect.
This process continues in a streaming fashion until the ray traverses to the triangle leaf nodes; when a ray is no longer a member of any intersection queue, the closest triangle has been found.
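A highly simplified software model of this gathering loop is sketched below under some assumptions: a fixed queue capacity, a hash map standing in for the queue cache, and no modeling of the hardware's eviction or flush policies. All names are invented for illustration.

```cpp
// Sketch only: full intersection queues go straight onto a ready stack for
// processing; partial queues are parked in a queue cache keyed by BVH node
// address until more rays hit the same node.
#include <cstdint>
#include <stack>
#include <unordered_map>
#include <vector>

constexpr size_t kQueueCapacity = 64;  // illustrative batch size

struct IntersectionQueue {
    uint64_t node_address;             // DRAM address of the sibling nodes/triangles to stream in
    std::vector<uint32_t> ray_ids;     // rays waiting to be tested against them
};

struct CoherenceGatherer {
    std::stack<IntersectionQueue> ready;                    // full queues, processed next
    std::unordered_map<uint64_t, IntersectionQueue> cache;  // partial queues, gathering coherence

    // Called for every ray that hits a BVH node during traversal.
    void on_node_hit(uint64_t node_address, uint32_t ray_id) {
        IntersectionQueue& q = cache[node_address];
        q.node_address = node_address;
        q.ray_ids.push_back(ray_id);
        if (q.ray_ids.size() == kQueueCapacity) {  // common near the root of the BVH
            ready.push(std::move(q));
            cache.erase(node_address);
        }
        // Otherwise the queue stays resident: in effect, a DRAM address that
        // will be read once enough rays have accumulated against it. A ray
        // that is no longer a member of any queue has finished traversal.
    }
};
```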
At this point, a new shading queue is created, but this time the coherence gathering is on the shading state associated with that triangle. Once a shading queue is full, it becomes a task that is scheduled for shader execution. Uniforms and texturing state are loaded into the common store and parallel execution of the shading task begins: each ray hit result represents a shading instance within that task.
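The shading side can be sketched the same way, under the same caveats: hit results are binned by an assumed shading-state identifier, and a full bin is dispatched as one task whose instances shade in parallel.

```cpp
// Sketch only: coherence gathering on shading state rather than on BVH nodes.
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr size_t kTaskWidth = 64;  // illustrative number of shading instances per task

struct HitResult {
    uint32_t ray_id;                 // carries the ray's user data back to shading
    uint32_t triangle_id;
    float t, u, v;                   // hit distance and barycentric coordinates
};

struct ShadingGatherer {
    // One queue per shading state (shader program + uniforms + texturing state).
    std::unordered_map<uint32_t, std::vector<HitResult>> queues;

    template <typename DispatchFn>
    void on_closest_hit(uint32_t shading_state_id, const HitResult& hit,
                        DispatchFn&& dispatch_task) {
        std::vector<HitResult>& q = queues[shading_state_id];
        q.push_back(hit);
        if (q.size() == kTaskWidth) {
            // The real hardware would load the uniforms/texturing state for this
            // shading state into the common store before the task runs on a USC.
            dispatch_task(shading_state_id, q);
            q.clear();
        }
    }
};
```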
The behavior is then identical to that of a rasterization fragment shader, with the added feature that shaders can create new rays, using an instruction added to the PowerVR shader instruction set, and send them to the RTU as new ray queries.
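As a rough illustration of that flow, the sketch below models a shading instance that, besides its ordinary work, spawns a secondary ray; emit_ray_query is a hypothetical placeholder for the dedicated shader instruction, whose real name and operands are not reproduced here.

```cpp
// Illustrative only: a C++ stand-in for a shading instance that emits a
// secondary ray back to the RTU. emit_ray_query() is hypothetical.
#include <cstdint>
#include <functional>

struct Ray {
    float origin[3];
    float dir[3];
    uint32_t user_data;  // e.g., pixel index, so the eventual result can be routed back
};

// Hypothetical hook standing in for "send this ray to the RTU as a new query".
using EmitRayQueryFn = std::function<void(const Ray&)>;

void shade_hit(uint32_t pixel_index, const float hit_point[3],
               const float reflect_dir[3], const EmitRayQueryFn& emit_ray_query) {
    // ... ordinary fragment-style shading work would happen here ...

    // Spawn a secondary (e.g., reflection) ray; the RTU traverses it and
    // eventually masters a new shading task back out for its hit.
    Ray secondary{};
    for (int i = 0; i < 3; ++i) {
        secondary.origin[i] = hit_point[i];
        secondary.dir[i]    = reflect_dir[i];
    }
    secondary.user_data = pixel_index;
    emit_ray_query(secondary);
}
```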
Because of the coherence gathering, the RTU returns ray/triangle intersection results to the shaders in a different order from the one in which they entered. A ray that enters the RTU early in the rendering of a frame may, depending on coherence conditions, be among the last to leave.
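Because results can come back in any order, the user data carried with each ray (assumed here to be a pixel index) is what ties a result back to the work that spawned it. A minimal, order-independent accumulation sketch:

```cpp
// Sketch only: contributions are summed into the pixel named by the ray's
// user data, so the order in which the RTU returns results does not matter.
#include <cstdint>
#include <vector>

struct ShadedResult {
    uint32_t pixel_index;  // user data carried through the RTU with the ray
    float rgb[3];          // contribution computed by the ray's shader
};

void accumulate(std::vector<float>& framebuffer,  // 3 floats per pixel
                const ShadedResult& r) {
    for (int c = 0; c < 3; ++c)
        framebuffer[3 * r.pixel_index + c] += r.rgb[c];
}
```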
This approach to dynamic coherence gathering has the effect of parallelizing over rays instead of pixels: even rays that originate from entirely different ray trees, belonging to other pixels, can be collected together to maximize the coherence available in the scene. This decouples the pipelines, creating a highly latency-tolerant system and enabling an extensive set of reordering possibilities.