
Cure For The Common Graphics Processing

Lazy8s

The ghost of Dreamcast past
Even though graphics are ultimately displayed two-dimensionally as pixels, the common approach to drawing them works three-dimensionally with polygons. Most graphics processors start rendering immediately, working through the stream of polygon data sent to them by the T&L unit even though that data arrives unordered. Because of the nature of 3D, polygons can sit behind and be obscured by other polygons, so rendering straight from an unordered stream produces pixels that get drawn over pixels which had already been drawn, wasting the work previously done to produce them.
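To make that concrete, here's a tiny illustrative sketch in C++ (a toy simulation, not how any real pipeline is wired): an immediate-mode renderer shades every fragment that passes the depth test at the moment it arrives, so fragments later covered by nearer ones were shaded for nothing.

Code:
#include <cstdio>
#include <limits>
#include <vector>

// Toy model of an immediate-mode renderer: fragments for one screen pixel
// arrive in submission order, and every fragment that passes the depth test
// at that moment gets shaded, even if a nearer fragment later replaces it.
struct Fragment { float depth; int color; };

int main() {
    // Unordered fragments landing on the same pixel (smaller depth = nearer).
    std::vector<Fragment> stream = { {0.9f, 1}, {0.5f, 2}, {0.7f, 3}, {0.2f, 4} };

    float zbuf = std::numeric_limits<float>::max();
    int color = 0, shaded = 0;

    for (const Fragment& f : stream) {
        if (f.depth < zbuf) {      // passes the depth test *right now*
            zbuf = f.depth;
            color = f.color;       // the expensive shading/texturing work happens here
            ++shaded;
        }
    }
    // Only one fragment is ultimately visible, but three were shaded.
    std::printf("fragments shaded: %d, visible: 1, final color: %d\n", shaded, color);
    return 0;
}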

To avoid this unnecessary work, a fundamentally suited approach to graphics processing must be used. The frontmost, visible pixels have to be identified before drawing, but this is a very involved procedure; it can only be completed in time if it is done very quickly.

Surprisingly, the factor that most limits performance in computing is not calculation speed but how slowly data can be moved around. There are hard limits on data transfer rates, so the key to good performance is minimizing the need to access information that resides externally.

Therefore, to identify the visible pixels quickly enough, the procedure needs to be executed within the processor's core. The amount of memory that can fit inside a core, however, is nowhere near large enough to hold all of the graphics data.

The job has to be handled in separate pieces, so the target space, the full area of the screen, must be split up into small enough tiles.

Determining which tile(s) each piece of polygon information belongs to is not immediately possible, because the incoming graphics data is unordered. So the stream of graphics data needs to be fully collected and sorted into lists that correspond to the appropriate tiles of screen area.
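A rough sketch of that binning step in C++ (purely illustrative; a real tile accelerator is far more exact than a bounding-box test): each incoming triangle is tested against the screen tiles it might touch and appended to the per-tile lists.

Code:
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int kScreenW = 640, kScreenH = 480, kTile = 32;
constexpr int kTilesX = (kScreenW + kTile - 1) / kTile;
constexpr int kTilesY = (kScreenH + kTile - 1) / kTile;

struct Tri { float x[3], y[3]; };                 // screen-space vertices

int main() {
    std::vector<std::vector<int>> tileLists(kTilesX * kTilesY);
    std::vector<Tri> scene = { {{10, 100, 50}, {10, 20, 200}},
                               {{300, 400, 350}, {240, 250, 400}} };

    for (int i = 0; i < (int)scene.size(); ++i) {
        const Tri& t = scene[i];
        float minX = std::min({t.x[0], t.x[1], t.x[2]});
        float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        float minY = std::min({t.y[0], t.y[1], t.y[2]});
        float maxY = std::max({t.y[0], t.y[1], t.y[2]});

        int tx0 = std::max(0, (int)minX / kTile), tx1 = std::min(kTilesX - 1, (int)maxX / kTile);
        int ty0 = std::max(0, (int)minY / kTile), ty1 = std::min(kTilesY - 1, (int)maxY / kTile);

        // Append this triangle to every tile list its bounding box overlaps.
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                tileLists[ty * kTilesX + tx].push_back(i);
    }

    int used = 0;
    for (const auto& list : tileLists) used += !list.empty();
    std::printf("%d of %d tiles reference at least one triangle\n", used, kTilesX * kTilesY);
    return 0;
}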

After the scene has been compiled and split into tile lists small enough to fit within the graphics core, processing can be fast enough to determine only the visible pixels of the image. The image can be rendered from there, and only the final pixels ever need to be drawn.
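And, continuing the sketch above (again a toy example, opaque geometry only), the per-tile resolve that follows: depth-test everything in a small on-chip buffer first, then shade and write out just the winners.

Code:
#include <array>
#include <cstdio>
#include <limits>
#include <vector>

constexpr int kTile = 32;

struct Frag { int x, y; float depth; int triId; };

int main() {
    // Fragments produced by rasterizing this tile's display list (toy data).
    std::vector<Frag> frags = { {5, 5, 0.8f, 0}, {5, 5, 0.3f, 1}, {6, 5, 0.6f, 2} };

    std::array<float, kTile * kTile> zbuf;   zbuf.fill(std::numeric_limits<float>::max());
    std::array<int,   kTile * kTile> winner; winner.fill(-1);

    // Pass 1: visibility only -- cheap, and it stays inside the core.
    for (const Frag& f : frags) {
        int i = f.y * kTile + f.x;
        if (f.depth < zbuf[i]) { zbuf[i] = f.depth; winner[i] = f.triId; }
    }

    // Pass 2: shade and write out only the visible fragments, once per pixel.
    int shaded = 0;
    for (int i = 0; i < kTile * kTile; ++i)
        if (winner[i] >= 0) ++shaded;        // texturing/shading would happen here

    std::printf("fragments submitted: %zu, pixels shaded and written: %d\n", frags.size(), shaded);
    return 0;
}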

By rendering the final image mostly from within the graphics core, the severest limitation on computing performance, external dependency, is addressed. External data traffic is minimized, which lets the system use less expensive memory types, and therefore be more cost effective, while also consuming less battery/socket power. Rendering happens at high internal precision without being compromised by external framebuffer settings, raising image quality for tasks like color blending and adding flexibility for object depth sorting. Extra samples of the image can be taken for anti-aliasing without requiring more from framebuffer memory. And overall operation becomes more effective because the data being processed keeps a high locality.

There is more chip-logic overhead in implementing this tile-based deferred rendering process, but the fillrate saved by never having to overdraw obscured pixels means fewer pixel pipelines and texture mapping units are needed to achieve the same effective performance, counteracting the overhead in chip size.

PowerVR is a designer of graphics architectures of this type. Its mobile graphics accelerator, MBX, was released in 2003 and is the solution chosen by almost every major chip maker in the world; it has been driving the visuals of several products over the past year and will be behind many future cell phones, PDAs, in-car GPS/infotainment systems, and other embedded and mobile devices.
 
From their site, here's the kind of performance difference that a tile-based deferred renderer like PowerVR's, which employs display lists to write out only the visible pixels, has afforded:

image006.gif

image008.gif

These graphs illustrate the memory bandwidth savings of a tile-based deferred renderer.

image010.gif

This graph shows that the memory bandwidth requirement for TBDRs stays low enough with increasingly complex graphics that it can use much less expensive RAM.
 
I think you've gone over the heads of most people :lol

You have to get their attention: PROOF THAT PS3 > XBOX 360. DS IS DEAD PSP 4 LYFE YO! NO GRASS IN NEW ZELDA, NEW ZELDA AM SHIT.

Btw, maybe this would've been better kept at beyond3d? :)
 
Interesting idea, but both of PowerVR's tile-based VGA graphics solutions (Kyro & Kyro II) were dismal commercial failures. The fact that the entire industry is focused on traditional graphics didn't help. The Kyro chips had great money-to-performance ratios, but typically had z-sorting issues - the kind their technobabble specifically states isn't a problem - and were unable to easily follow the rest of the graphics industry into shader-based rendering solutions. Although this idea may be revisited in the future (as it definitely has its merits), I'd say it's more likely to be refined and implemented by ATi or NVIDIA as a way to speed up some portion of the traditional pipeline, then integrated onto their chips and given a whiz-bang PR term - like 'lightspeed processing' or 'cineFX'. :D
 
This technology has always led its class in performance at each release by a margin that its competitors have recognized as being disruptive to their established advancement curve. They'll all likely switch to TBDR at some point. The difference from immediate mode rendering is fundamental in method, so it couldn't be adapted into an existing architecture.

The Dreamcast's graphics chip did per-pixel lighting with dot product bump mapping and supersampled anti-aliasing back in 1998, with internal accuracy for blending, floating-point z-buffering, and translucency sorting in hardware, at less than half the silicon expense.

The Kyro II PC graphics card had performance that transcended its class and actually competed with cards more than twice its price in some applications.
AnandTech's review: http://www.anandtech.com/video/showdoc.aspx?i=1435

MBX brings to mobile graphics programmable vertex shading, per-pixel lighting with DOT3, supersampled anti-aliasing practical enough to always be used, internal precision for blending and floating-point z-buffering, fractional tessellation and depth adaptation for curved surfaces, and anisotropic texture filtering, and it was released in 2003.

Joeholley:
both of PowerVR's tile-based VGA graphics solutions (Kyro & Kyro II) were dismal commercial failures.
Not really relevant to the technology, but Kyro II actually grabbed solid sales. The manufacturer, STM, ended up getting out of the sector as part of an operational restructuring.
The Kyro chips had great money-to-performance ratios,
That's the value of any technology.
but typically had z-sorting issues
There wouldn't be any hardware issues. It even had load/store functionality to operate like an external z-buffer for compatibility with depth-of-field and other z-sampled effects. Maybe there were driver issues.
were unable to easily follow the rest of the graphics industry into shader-based rendering solutions.
The technology is ideally suited for programmable shaders. Instead of saving mostly on texture overdraw, now there's significant shader overdraw savings too.
 
Interesting stuff. Are you inferring that since ATI bought PowerVR, we'll be seeing this tech in the NextCube or the Xbox720?
 
Lazy8s said:
This technology has always led its class in performance at each release by a margin that its competitors have recognized as being disruptive to their established advancement curve. They'll all likely switch to TBDR at some point. The difference from immediate mode rendering is fundamental in method, so it couldn't be adapted into an existing architecture.

No they won't, IMRs will not go away at this point. The best shot you had at high-end use of any form of region-based deferred renderer was with 3dfx, which had such an IC ready to tape out from a completed RTL.

There are too many problems with compatibility and the bounds on performance scaling compared to IMRs when looking forward that just don't make it feasible for a company the size of an ATI or nVidia to utilize. Architecturally, you can likely overcome them with a lot of pipeline reorganization, compression on the bins, I've heard of some interesting talk of how you can hedge on the spatial and temporal coherence between frames, et al; it's just not going to happen. And the days of the dark horse are over.

I know you're a PowerVR fan, and I can't blame you from an academic standpoint, but...
 
it's like i've gone back in time to 1998, and gaf has turned into the dreamcast technical pages! freaky!
 
HokieJoe:
Are you inferring that since ATI bought PowerVR, we'll be seeing this tech in the NextCube or the Xbox720?
No one bought Imagination Technologies (PowerVR). Others will switch to TBDR, or at least some strong hybrid renderer away from immediate mode rendering, at some point, because its principles "cure" the limitations in processing that other systems can only approach with increasingly cost-prohibitive, high-bandwidth parts.

Vince:
No they won't, IMRs will not go away at this point.
Not at this point, at some point.
There are too many problems with compatibility
TBDRs can be just as compatible with APIs, and are allowed to be just as incompatible with a previous generation of hardware, as IMRs.
and the bounds on performance scaling compared to IMRs when looking forward
While the operation of scene storage is added to the rendering process, it's not a limitation because IMRs will have already hit their maximum for scene complexity too by the time a TBDR is overflowed in a given hardware generation. Just in case, though, PowerVR employs scene management that lets it handle arbitrarily large complexities.

Even back when T&L became a hardware focus with the GeForce256 but not with Kyro, the Dagoth Moor application, which was supposed to be a showcase for hardware T&L, ran faster on Kyro.
that just don't make it feasible for a company the size of an ATI or nVidia to utilize.
On the contrary, their hefty sizes mean they can't ignore it for too long or else they'll continue to lose the share they need to support themselves in the emerging future of the mass market: the power consumption/heat sensitive, mobile and embedded spaces.
Architecturally, you can likely overcome them with a lot of pipeline reorganization, compression on the bins
Patchwork architectures will always lag behind fundamentally suited ones.
I've heard of some interesting talk of how you can hedge on the spatial and temporal coherence between frames
This important future principle is not mutually exclusive with TBDR -- a TBDR's data locality can even make it better suited for the approach.
 
In its newest generations, PowerVR's TBDR architecture has been licensed for use in dozens more 3D mobile phones and handhelds, Sega Sammy's performance line of arcade boards (the next NAOMI/Lindbergh), and Sega Sammy's value line of arcade boards (the next Atomiswave/Aurora).
 
Lazy8s said:
Not at this point, at some point.

Again, no they won't.

While the operation of scene storage is added to the rendering process, it's not a limitation because IMRs will have already hit their maximum for scene complexity too by the time a TBDR is overflowed in a given hardware generation. Just in case, though, PowerVR employs scene management that lets it handle arbitrarily large complexities.

Yes, it is a limitation because, as I stated, you need to cache the entire scene. As you scale geometry upwards, the cost of binning (e.g. scene buffer size) will more than likely equal or surpass that of an external Z-buffer, and there goes your bandwidth advantage. And you need a scene buffer large enough for the most pathological case, as the cost of going external is absolutely prohibitive. While, as I state later, I've heard of schemes to utilize the frame-to-frame spatial and temporal coherence, it would require a large investment in R&D since AFAIK the buffer size varies greatly over time. It's got to be somewhat deterministic, but how?

But this argument is moot, as region-based deferred renderers were intended to overcome a bound that's no longer there. With the move towards shading, the bound is on computation, not rasterization, and as we move forward the vast majority of the time per frame will be spent on computation and math ops, not RAM access. You'll respond by saying you save on shading fewer pixels since you eliminate OD, but this is a half-assed answer.

IMRs have adapted well to the scaling problem with Z-compression schemes and more advanced HSR, and most of all they benefit more from the eventual use of HOS. To abandon IMRs and their lineage (and better compatibility to boot) for something which imposes another bound (forming the display list in an HOS-rich environment), when the same hidden-surface savings can be had extremely fast on architectures like the NV40 by doing a first, Z-only pass, is stupid.
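For the record, here's a toy C++ simulation of that Z-only first pass (illustrative only, with made-up fragment data): pass one writes depth only, pass two shades only the fragments that match the stored nearest depth, so even an IMR ends up shading each pixel once.

Code:
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

struct Frag { int x, y; float depth; };

int main() {
    const int W = 4, H = 4;
    std::vector<Frag> frags = { {1, 1, 0.8f}, {1, 1, 0.3f}, {2, 1, 0.6f}, {1, 1, 0.5f} };

    std::vector<float> zbuf(W * H, std::numeric_limits<float>::max());

    // Pass 1: depth only, no shading.
    for (const Frag& f : frags)
        zbuf[f.y * W + f.x] = std::min(zbuf[f.y * W + f.x], f.depth);

    // Pass 2: shade only fragments that survive an EQUAL depth test.
    int shaded = 0;
    for (const Frag& f : frags)
        if (f.depth == zbuf[f.y * W + f.x]) ++shaded;   // expensive shading happens here

    std::printf("fragments: %zu, shaded after the Z-only prepass: %d\n", frags.size(), shaded);
    return 0;
}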

You mention the "scene manager" of Kyro, which at least puts a cap on what would otherwise be the unbounded increase (scales with polygon counts) I talked about earlier, but it does so at the expense of performance (see above) and a memory usage slowdown, as you need an external Z.

What's pretty funny is that the methods I posted, which I often heard post-3dfx|GP as ways to get around an RBDR's limitations (pipeline reorganization, compression on the bins as intermediates, and turning to the spatial and temporal coherence between frames in the long run), are things you said are "patchwork." So are you for or against TBDRs? You stated:

Patchwork architectures will always lag behind fundamentally suited ones.

*shrug* I was talking about TBDRs. And of course they're suited to T|RBDRs; that's why I mentioned them. Maybe you should reread what I said before attacking me... in the end, you attacked yourself.
 
Vince:
You'll respond by saying you save on shading fewer pixels since you eliminate OD, but this is a half-assed answer.
When the performance limitation changes to calculation time, the deferring half of a tile-based deferred renderer is just as complete a solution as the tiling half is for a bandwidth limitation.
What's pretty funny is that the methods I posted, which I often heard post-3dfx|GP as ways to get around an RBDR's limitations (pipeline reorganization, compression on the bins as intermediates, and turning to the spatial and temporal coherence between frames in the long run), are things you said are "patchwork."
You're right; I didn't read closely enough there at all. I don't see the need for changing the structure of the rendering process. Future requirements can be met with normal expansion of the pipeline. As for compression on the scene buffer, that's, of course, a welcomed optimization.
 
A display-list deferred rasterizer gets more space out of its memory, in addition to being able to afford more memory in total and/or a lower cost, because it achieves acceptable image quality without having to enlarge its external framebuffer. Just as it can supersample internally when performing anti-aliasing rather than wasting space on a higher-resolution target buffer, its ability to blend colors internally means its image quality stays acceptable without spending more RAM to move from a 16-bit framebuffer to 32 bits.
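Here's a quick single-channel illustration (toy C++ numbers, not measurements from any hardware) of why that matters: quantizing to 5 bits per channel after every blend accumulates error, while blending at full precision and converting once at the end does not.

Code:
#include <cmath>
#include <cstdio>

// Snap a [0,1] colour value to the nearest of 32 levels (5 bits per channel,
// as in a 16-bit framebuffer).
float quantize5(float c) {
    return std::round(c * 31.0f) / 31.0f;
}

int main() {
    const float src = 0.8f, alpha = 0.3f;    // layer colour and opacity
    const int layers = 8;

    float fb16 = 0.2f, internal = 0.2f;      // same starting background
    for (int i = 0; i < layers; ++i) {
        fb16     = quantize5(alpha * src + (1 - alpha) * fb16);    // rounds every pass
        internal =           alpha * src + (1 - alpha) * internal; // full precision
    }
    float finalInternal = quantize5(internal);                     // rounds once, at scan-out

    std::printf("16-bit framebuffer result:  %.4f\n", fb16);
    std::printf("internal-precision result: %.4f (error only from the final conversion)\n",
                finalInternal);
    return 0;
}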

This is illustrated in the following example which demonstrates 16-bit color on a display-list deferred rasterizer versus a typical system.

From the following scene rendered at 16-bit color, first by a typical system:
4.jpg


... and then at 16-bit color by a display-list deferred rasterizer:
5.jpg


A major disparity can be seen between the smoothness and precision of a typical system which progressively dithers the colors at each blend (4x magnification):
6.jpg


... and a display-list deferred rasterizer which always blends at internal precision (4x magnification):
7.jpg


Obviously, a game designer for a typical system would usually find its 16-bit color there to be unacceptable, so they'll have to spend more memory on a 32-bit buffer just to get an acceptable level of image quality.
 
For only about half the cost, a tile-based display list renderer can achieve the same level of performance as a conventional processor. That allows it to end up being either much more advanced at a comparable price point or sold at a much more accessible, mass market price.

A lot of the savings comes from the work it is spared, since no pixels are ever drawn over. The amount of savings depends on the 3D depth of the game's scene: the number of layers of surfaces behind the front one (the front of an object covering its backside, in front of another object, in front of some background detail, etc.). On average, a game scene has several such layers.

Taking even a seemingly simple scene from an old game like Quake 3 Arena, for instance:
image024.jpg


A high number of hidden surfaces submitted as part of the scene are revealed:
image026.jpg

On average, Quake 3 Arena requires conventional processors to draw more than three pixels for every point on the screen, even when using modern early depth-check techniques. Other games from that era, like Serious Sam, had average depth complexities of more than five, and modern games may go even higher while requiring far more textures to be transferred and shading to be produced for each surface.

Because a TBDLR draws only the one visible layer of pixels, each of its pixels is worth several conventionally drawn ones. Such a chip can reach comparable fillrates with several times fewer pipelines, allowing it to approach 50% of the size, and therefore cost, of a conventional chip, or to be much more powerful at a comparable expense. That cost effectiveness is reached when the savings from internally processing much of the graphics, which expends less bandwidth and so allows less costly RAM to be used, is also accounted for.
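As a back-of-the-envelope check (the depth complexities are the ones quoted above; the screen size and refresh rate are just example values), here's the fillrate equivalence being claimed:

Code:
#include <cstdio>

// At average depth complexity D, an immediate-mode renderer rasterizes roughly
// D times as many fragments per frame as a deferred renderer that draws each
// visible pixel exactly once. Illustrative arithmetic, not measured data.
int main() {
    const double pixels   = 640.0 * 480.0;   // example screen size
    const double fps      = 60.0;
    const double depths[] = { 3.0, 5.0 };    // ~Quake 3 Arena, ~Serious Sam averages

    for (double d : depths) {
        double imrFill  = pixels * d * fps;   // fragments/second an IMR must draw
        double tbdrFill = pixels * 1.0 * fps; // a TBDR draws each pixel once
        std::printf("depth complexity %.0f: IMR needs %.0f Mpix/s, TBDR needs %.0f Mpix/s "
                    "(%.1fx fewer pipelines for the same effective fillrate)\n",
                    d, imrFill / 1e6, tbdrFill / 1e6, imrFill / tbdrFill);
    }
    return 0;
}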

The difference that PowerVR's TBDLR architecture has made in practice with its most recent release, the MBX mobile graphics accelerator, is overwhelming. Its nearest challenger, despite being released many months after MBX, is a generation behind in functionality: no programmable vertex shading, fractional tessellation and depth adaptation for curved surface rendering, per-pixel lighting with DOT3, always-affordable supersampled anti-aliasing, internal-precision color blending, internal floating-point depth sorting, or anisotropic filtering. And it only matches MBX's fillrate by spending four times as many pipelines, so it costs a lot more and drains a lot more battery power.
 
Vince:
And you need a scene buffer large enough for the most pathological case, as the cost of going external is absolutely prohibitive
Even in such an unlikely, rare, worst-case scenario, it's not necessarily worse off than a conventional renderer by that point, and memory management techniques to reclaim display list space during processing are not necessarily prohibitive:

http://l2.espacenet.com/espacenet/viewer?PN=EP1287494&CY=gb&LG=en&DB=EPD
The best shot you had at high-end use of any form of region-based deferred renderer was with 3dfx...
It's coming in a new Sega Sammy arcade board. Don't be surprised when the graphics quality is unmatched by nVidia/Sony and ATi.
...and better compatibility to boot...
PC games may be more customized to the limits of conventional renderers, but that's not an issue of compatibility, nor would it necessarily go unaccounted for by a TBDLR, and nor will PC games make a significant difference. The games market is split mostly among a field of embedded systems, and compatibility isn't the issue there, just as it wasn't for Dreamcast.
 
Cost breakdown, illustrated, of the components which make up a graphics system, and the advantages of a TBDLR:

syscost.jpg


Vince:
it's just not going to happen.
I was recently reminded of techniques, already researched, which keep scene storage in check. From an Intel 2700G document overviewing the MBX Lite's operation:

"On a screen, the smallest visible element is a pixel. If a triangle fails to intersect with a pixel, it will not be drawn on the screen. Many triangles like this can occur in an image due to complex models that may be scaled, rotated, and transformed. The tile accelerator removes these unused triangles, reducing the number of triangles to be processed by the 3D core."

Kyro compacted triangles with vertex stripping and didn't even reach 10% of the framebuffer size with scene storage. Newer techniques like small object culling and a kind of hierarchical tiling are providing even more control for MBX and beyond.
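For illustration, a crude version of that small-object cull in C++ (hypothetical and simplified; real tile accelerators test more exactly): a triangle whose screen-space bounding box contains no pixel sample point can never produce a fragment, so it can be dropped before binning.

Code:
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Tri { float x[3], y[3]; };

// Conservative test: if the bounding box contains no pixel centre
// (centres sit at x + 0.5, y + 0.5), the triangle covers no sample.
bool coversNoSample(const Tri& t) {
    float minX = std::min({t.x[0], t.x[1], t.x[2]}), maxX = std::max({t.x[0], t.x[1], t.x[2]});
    float minY = std::min({t.y[0], t.y[1], t.y[2]}), maxY = std::max({t.y[0], t.y[1], t.y[2]});
    float cx = std::ceil(minX - 0.5f) + 0.5f;   // first pixel centre at or right of minX
    float cy = std::ceil(minY - 0.5f) + 0.5f;   // first pixel centre at or below minY
    return cx > maxX || cy > maxY;
}

int main() {
    Tri tiny  = {{10.1f, 10.3f, 10.2f}, {20.1f, 20.2f, 20.4f}};   // sub-pixel sliver
    Tri large = {{10.0f, 60.0f, 30.0f}, {10.0f, 15.0f, 70.0f}};

    std::printf("tiny sliver culled:    %s\n", coversNoSample(tiny)  ? "yes" : "no");
    std::printf("large triangle culled: %s\n", coversNoSample(large) ? "yes" : "no");
    return 0;
}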
 
On a high end card - like the latest ATI - you can draw in excess of 500 million triangles/second.
If each vertex is 24 bytes (xyz float, rgba byte, st float), you had a 1 vert/triangle ratio, and a frame rate of 60 Hz, you would need around 200 MBytes to store the deferred triangle lists. System bandwidth would go up as well, as you might need 12 GB/s to read the original list before transform, at least 12 GB/s to write out the transformed verts into bins, and then another 12 GB/s to read them in again when tiling.

It wouldn't be impossible to scale up TBDR - but several problems would need to be solved first to limit the bandwidth used before pixel shading :)
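For what it's worth, those figures follow directly from the stated assumptions (500 Mtri/s, 24 bytes per vertex, roughly one vertex per triangle, 60 Hz; none of this is measured data):

Code:
#include <cstdio>

// Reproducing the numbers above from the stated assumptions.
int main() {
    const double trisPerSec   = 500e6;
    const double bytesPerVert = 24.0;
    const double vertsPerTri  = 1.0;
    const double hz           = 60.0;

    double bytesPerFrame = (trisPerSec / hz) * vertsPerTri * bytesPerVert;
    double streamBW      = trisPerSec * vertsPerTri * bytesPerVert;   // one pass over the data

    std::printf("scene buffer per frame: ~%.0f MB\n", bytesPerFrame / 1e6);
    std::printf("bandwidth per pass over the vertex stream: ~%.0f GB/s "
                "(read, bin-write, and tile-read each cost about this much)\n",
                streamBW / 1e9);
    return 0;
}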
 
In terms of overdraw, I read somewhere that modern graphics cards are so fast that it's quicker to send the raw geometry to the GPU rather than spending CPU time trying to figure out which polygons should be drawn.

Portals and good LOD management are the way to go IMO.
 
In addition to object culling, triangle stripping, and indexing, visibility can be tested at T&L with bounding boxes. Combined with the extra RAM it can afford and the extra space left over after the framebuffer, a TBDLR can scale to high polygon counts as well. Not to such an indefinite degree as a conventional renderer, but that's not where graphics in the sector are headed. Improving image quality will bring greater returns than escalating polygon counts now, and TBDLR's advantage in anti-aliasing and rendering precision, especially with the onset of multiple render targets for some approaches, will continue to keep it the best suited solution.
 
hahahahahaha

tile-based deferred rendering is the new Timecube

all your traditional rendering have been cornered stupid

SuperKyro defeated Thomas Jefferson but your schools teach the lie of z-buffer evil
 
Crazyace said:
On a high end card - like the latest ATI - you can draw in excess of 500 million triangles/second.
If each vertex is 24 bytes (xyz float, rgba byte, st float), you had a 1 vert/triangle ratio, and a frame rate of 60 Hz, you would need around 200 MBytes to store the deferred triangle lists. System bandwidth would go up as well, as you might need 12 GB/s to read the original list before transform, at least 12 GB/s to write out the transformed verts into bins, and then another 12 GB/s to read them in again when tiling.

It wouldn't be impossible to scale up TBDR - but several problems would need to be solved first to limit the bandwidth used before pixel shading :)
I can't reproduce your numbers. Care to expand on your calculations? ;)

With a million unique vertices per frame (let's say each vertex is going to be reused at least 1.5 times, which would make a 1.5 MTriangle-per-frame scene!) and with each vertex taking 32 bytes of memory, you would need 32 MBytes to store vertices for a given frame; double that amount of memory if we want to use triple buffering.
To read, store, and read back 32 MBytes of memory at 60 Hz, one would need 5.7 GBytes/s of bandwidth, which sounds quite achievable!
Oh well.. TBDR makes so much more sense to me.. too bad we're not going to see a consumer product that uses this technique :(
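Checking that arithmetic from its stated assumptions (1M unique vertices per frame, 32 bytes each, 60 Hz, three passes over the data), it lands right around the ~5.7 GB/s figure quoted, give or take decimal-versus-binary rounding:

Code:
#include <cstdio>

// Reproducing the counter-estimate from its stated assumptions
// (read, write to bins, read again during tiling = three passes).
int main() {
    const double vertsPerFrame = 1e6;
    const double bytesPerVert  = 32.0;
    const double hz            = 60.0;
    const double passes        = 3.0;

    double sceneBuffer = vertsPerFrame * bytesPerVert;   // per-frame storage
    double bandwidth   = sceneBuffer * passes * hz;      // sustained traffic

    std::printf("scene buffer per frame: %.0f MB\n", sceneBuffer / 1e6);
    std::printf("bandwidth for %d passes at %.0f Hz: ~%.2f GB/s\n",
                (int)passes, hz, bandwidth / 1e9);
    return 0;
}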
 