Just wanted to respond to a few posts on the topic of the Primitive Discard Accelerator, and how it relates to Polaris and 14nm.
Just to clarify here, the quote in 10k's post describing the Primitive Discard Accelerator isn't from his source, it's from me (and I'm definitely not any kind of insider). His source had told him that the GPU included new features like the PDA (which AMD has publicly listed as one of the new features in Polaris), and he asked me what the PDA does, and whether it could be used in a 28nm GPU.
I should also note that although AMD have listed the PDA as one of the new features in their Polaris presentation, they haven't given any details on what it does or how it works. My description of its use is therefore purely my own assumption based on the name of the feature, although I can't really see how the words "primitive discard accelerator" could mean anything very different from what I described.
The "developer" also gave a description of what the PDA does and why it's useful, which ties in pretty much exactly with my understanding of how it would work. Which doesn't necessarily mean that the rumour is true, but it does mean that the person making the claim at least has a decent understanding of real-time 3D graphics.
In any case, the reason for discussing the PDA isn't that it's necessarily a revolutionary feature (although it could certainly be useful), but that if the source is real, then the GPU has at least some Polaris features, and is therefore very likely to be manufactured on 14nm.
If true, then it would be reasonable to assume that it has at least some other Polaris features, but it doesn't necessarily guarantee that it has all of them. For example, let's say the improved Command Processor, Geometry Processor and PDA for Polaris had all been all but finished in late 2014 when work on NX started, but the improved CUs weren't finished until much later. It's possible that Nintendo would have been able to make use of several of the new features of Polaris, but would still use GCN 1.2-era CUs. Both PS4 and XBO's APUs sort of sit between generations like this, so it wouldn't be all that out of the ordinary. On the other hand, there are reports that Polaris has been pretty much finished for a long time now, and was delayed first by the abandonment of 20nm and then by the slow ramp of 14nm yields. In that case Nintendo may have had almost the entire suite of Polaris features to use when building the NX chip.
The other thing I'd like to mention is that wccftech's description of the Primitive Discard Accelerator is actually wrong, at least as far as my understanding goes. They claim it will be used to implement conservative rasterisation (a new DX12 technique which currently only works on Nvidia cards), but there's nothing about conservative rasterisation that I would describe as "discarding primitives". In fact, one of the main use cases of conservative rasterisation (ray-traced shadows) uses it for precisely the opposite effect, to prevent sub-pixel triangles from being thrown away.
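To make the contrast concrete, here's a toy sketch of the two coverage rules. This is entirely my own illustration (the struct and function names are made up, and real conservative rasterisation hardware works on edge equations rather than bounding boxes, so the conservative test below is a deliberately crude over-estimate). Under the standard rule a pixel is only covered if its centre lies inside the triangle, so a sub-pixel triangle can produce nothing at all; under a conservative rule any overlap counts, so the same triangle survives.

```cpp
// Toy contrast between the normal pixel-centre coverage rule and a crude,
// over-estimating conservative coverage rule. My own illustration only, not
// the DX12 spec or any vendor's hardware.
#include <algorithm>
#include <cstdio>

struct Vec2 { float x, y; };

// Standard rule: the pixel is covered only if its centre lies inside the triangle.
bool centreInsideTriangle(Vec2 p, Vec2 a, Vec2 b, Vec2 c)
{
    auto edge = [](Vec2 e0, Vec2 e1, Vec2 q) {
        return (e1.x - e0.x) * (q.y - e0.y) - (e1.y - e0.y) * (q.x - e0.x);
    };
    float w0 = edge(a, b, p), w1 = edge(b, c, p), w2 = edge(c, a, p);
    return (w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0);
}

// Conservative rule (over-estimated): the pixel is covered if the triangle's
// bounding box overlaps the pixel's square at all, so nothing sub-pixel is lost.
bool overlapsPixel(Vec2 pixelMin, Vec2 a, Vec2 b, Vec2 c)
{
    float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
    return maxX > pixelMin.x && minX < pixelMin.x + 1.0f &&
           maxY > pixelMin.y && minY < pixelMin.y + 1.0f;
}

int main()
{
    // A sub-pixel triangle tucked into the corner of pixel (0, 0), away from
    // the pixel centre at (0.5, 0.5).
    Vec2 a{0.05f, 0.05f}, b{0.25f, 0.05f}, c{0.05f, 0.25f};
    Vec2 centre{0.5f, 0.5f};
    std::printf("standard rasterisation keeps it:     %d\n", centreInsideTriangle(centre, a, b, c)); // 0
    std::printf("conservative rasterisation keeps it: %d\n", overlapsPixel({0.0f, 0.0f}, a, b, c));  // 1
    return 0;
}
```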
The reason I bring this up is that wccftech is a tech website that's heavily focussed on graphics technology. If they can get something like this wrong, then it's all the more impressive that someone claiming to have insider information would get it right. That doesn't mean it's any more than a rumour, but it does set a relatively high threshold of knowledge for anyone trying to fake this.
As mentioned above, I wouldn't consider a Primitive Discard Accelerator a major feature (although it could be a useful one); my point is just that the source claimed NX has it, which would indicate that the NX APU is on 14nm and using a Polaris-based GPU.
That said, although you're right that GPUs do implement small-triangle culling, orientation culling, etc. all in hardware, they only do it at the rasterisation stage (and afaik this is true for both AMD and Nvidia). What this means is that the triangle has to run through the geometry front-end, the vertex shaders and the rasteriser before it actually gets thrown out. Even if we're just talking about orientation culling, regardless of what kind of game it is or how well optimised your engine is, that means you're wasting about half of the throughput of your geometry front-end on unnecessary triangles and doubling your vertex shading workload for no benefit. If you bring that culling right to the very start of the pipeline then you stand to make some fairly nice efficiency gains, and small-triangle, frustum and hi-Z culling could also benefit (although these will be more dependent on how well optimised games already are in each regard).
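For what it's worth, the orientation test itself is cheap; it's just a sign check on the triangle's screen-space area. Here's a minimal sketch (my own illustration with made-up names, obviously not AMD's hardware logic or anyone's actual engine code). The point of an early-culling scheme is to run something like this in a pre-pass over transformed positions, so back-facing triangles never reach the geometry front-end or the full vertex shader (the pre-pass only needs positions).

```cpp
// Rough sketch of an orientation (backface) cull test done before rasterisation.
// Illustration only: names and structure are mine, not any real hardware or engine.
#include <cstdio>

struct Vec2 { float x, y; };

// Signed area of the triangle in screen space (after projection and perspective
// divide). For counter-clockwise front faces, a non-positive area means the
// triangle is back-facing (or degenerate) and can be dropped before it costs
// any front-end or shading work.
bool isFrontFacing(Vec2 a, Vec2 b, Vec2 c)
{
    float signedArea = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
    return signedArea > 0.0f;
}

int main()
{
    Vec2 ccw[3] = { {0.0f, 0.0f}, {4.0f, 0.0f}, {0.0f, 3.0f} };  // front-facing -> kept
    Vec2 cw[3]  = { {0.0f, 0.0f}, {0.0f, 3.0f}, {4.0f, 0.0f} };  // back-facing  -> culled
    std::printf("ccw kept: %d\n", isFrontFacing(ccw[0], ccw[1], ccw[2]));
    std::printf("cw kept:  %d\n", isFrontFacing(cw[0], cw[1], cw[2]));
    return 0;
}
```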
I'd recommend having a read through the slides of a recent GDC talk given by one of Frostbite's senior rendering engineers (PDF link here). It does a good job of running through the benefits of early culling on GCN-based hardware, and the performance gains from Frostbite's compute shader solution. They also implement occlusion culling in their solution, but I wouldn't expect to see acceleration for that at the front of the pipeline, as it requires pre-Z. That said, I could certainly see developers being happy about a solution which brings efficient orientation culling and small-triangle culling to the start of the pipeline. If you're working on a game where geometry throughput is your main bottleneck (say an open-world game with sub-optimal LOD) then it could make your life quite a bit easier.
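The small-triangle part is also a simple test once you have screen-space positions. Below is a rough paraphrase in C++ of the kind of bounding-box snap test the talk describes (names and structure are mine, not their shader code): if the triangle's bounds don't enclose any pixel centre, it can't produce coverage and can be dropped up front, assuming no MSAA and no conservative rasterisation.

```cpp
// Sketch of a small-triangle cull test, paraphrased from the compute-culling
// approach discussed in the Frostbite talk. Illustration only, with made-up names.
#include <algorithm>
#include <cmath>
#include <cstdio>

struct Vec2 { float x, y; };

// Screen-space positions, pixel centres assumed at half-integer coordinates
// (0.5, 1.5, ...); ties exactly on a centre are ignored for simplicity.
// If the bounding box snaps to the same cell in either axis, no pixel centre
// lies inside it, so the triangle can be culled before any further work.
bool smallTriangleCulled(Vec2 a, Vec2 b, Vec2 c)
{
    float minX = std::min({a.x, b.x, c.x}), maxX = std::max({a.x, b.x, c.x});
    float minY = std::min({a.y, b.y, c.y}), maxY = std::max({a.y, b.y, c.y});
    bool noCentreX = std::round(minX) == std::round(maxX);
    bool noCentreY = std::round(minY) == std::round(maxY);
    return noCentreX || noCentreY;
}

int main()
{
    // Fits between pixel centres -> culled.
    Vec2 tiny[3] = { {10.1f, 10.1f}, {10.4f, 10.1f}, {10.1f, 10.4f} };
    // Spans several pixel centres -> kept.
    Vec2 big[3]  = { {10.0f, 10.0f}, {14.0f, 10.0f}, {10.0f, 13.0f} };
    std::printf("tiny culled: %d\n", smallTriangleCulled(tiny[0], tiny[1], tiny[2]));
    std::printf("big culled:  %d\n", smallTriangleCulled(big[0], big[1], big[2]));
    return 0;
}
```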