I see a lot of bullshit flying around in all the threads about how RT is implemented in NV/AMD hardware and how one is superior to the other or vice versa.
I hope I can put all of these arguments to rest once and for all. (err, who am I kidding)
TL;DR NV and AMD solutions are almost exactly the same. Nothing to see here.
Now, if we actually open the Nvidia whitepaper on ray tracing, we can see (roughly) how it is built.
Each Turing SM (Streaming Multiprocessor) contains a piece of special silicon called the "RT core" (I'll call it an RT unit below). It sits close to the TEX fetch units, in fact just before them: it is on the VRAM->L1 cache path, right in front of the texture units.
And each group of 4 texture units has 1 RT unit serving it.
But what's the performance of an RT unit?
We do not know.
But we do know the texture unit performance: for a typical 2080 Ti in boost mode it's ~420 GTex/sec (272 texture units at ~1.545 GHz).
Because RT units sit on the path to the texture cache, they cannot possibly fetch from VRAM faster than that.
I suspect NV doesn't state actual max perf numbers simply because the intersection check itself is nearly instant (probably a handful of clocks) once the sample is loaded from memory.
How do I know it?
From the same presentation we can see that the RT core accelerates ray->AABB/triangle intersection tests within a BVH structure.
What's a BVH?
It's a tree of bounding boxes enclosing every "object"/"group of objects" in the scene; once a box encloses a sufficiently small object, the leaves of that subtree are the triangles the object is built from.
So essentially the BVH covers your entire scene, everything and every fucking triangle.
Which in turn means BVH structures are huge: a lot of memory, how much depends on how precise you want your effects to be, but still.
The size and maintenance of the BVH (you need to add new objects to the tree and restructure it whenever an object moves) is the first stumbling block of real-time raytracing.
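To make the structure concrete, here is a minimal software sketch of a BVH node and the ray-vs-box test. This is illustrative Python under my own simplifications, not any vendor's actual layout: real driver-built BVHs pack boxes and child pointers into compact GPU-friendly blocks, but the idea is the same.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified BVH node: inner nodes carry children,
# leaf nodes carry the triangles their box finally encloses.
@dataclass
class BVHNode:
    lo: tuple                                      # AABB min corner (x, y, z)
    hi: tuple                                      # AABB max corner (x, y, z)
    children: list = field(default_factory=list)   # inner node: child nodes
    triangles: list = field(default_factory=list)  # leaf node: triangle refs

def ray_hits_aabb(origin, inv_dir, lo, hi):
    """Classic "slab" test: the ray hits the box iff its entry/exit
    intervals along all three axes overlap. inv_dir is 1/direction."""
    tmin, tmax = 0.0, float("inf")
    for a in range(3):
        t1 = (lo[a] - origin[a]) * inv_dir[a]
        t2 = (hi[a] - origin[a]) * inv_dir[a]
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax
```

Note how tiny the test itself is, a few multiplies and compares per axis, which is exactly why fetching the node, not the math, ends up being the expensive part.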
So, to test a ray against the BVH, we ask the texture cache to load a small BVH "node" into our RT unit, check for intersection, then load the next node, and so on.
Nowhere does Nvidia mention any special cache for the BVH, or any cache inside the RT unit whatsoever, so until further notice we should assume no internal memory in RT units.
That's why each RT unit is effectively bottlenecked by how fast "textures" (in our case, pieces of the BVH tree) can be fetched from VRAM.
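The fetch-then-test loop described above can be sketched like this. All names here (`fetch_node`, `hits_box`, `hit_triangles`) are hypothetical stand-ins for the hardware's memory path and intersection logic, not real interfaces; the point is the data dependency: every step must fetch a node through the cache path before the cheap test can run.

```python
def trace(root_id, fetch_node, hits_box, hit_triangles):
    """Sketch of BVH traversal. Throughput is capped by how fast
    fetch_node can deliver nodes, not by the intersection tests."""
    closest = None                        # nearest hit distance found so far
    stack = [root_id]
    while stack:
        node = fetch_node(stack.pop())    # memory fetch: the limiting step
        if not hits_box(node["box"]):     # ray vs AABB: prune whole subtree
            continue
        if "tris" in node:                # leaf: test the actual triangles
            for t in hit_triangles(node["tris"]):
                if closest is None or t < closest:
                    closest = t
        else:                             # inner node: descend into children
            stack.extend(node["children"])
    return closest
```

Each iteration is serialized behind a memory access, so with no internal cache the RT unit can only go as fast as nodes arrive from the texture path.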
Now, we do have another number for the 2080 Ti: "10 Grays/sec". How was it calculated?
According to the same whitepaper, it's the best synthetic benchmark result they could achieve on primary-ray intersections against a specific curated set of BVHs.
If we look at the more realistic scenario of a multi-ray benchmark here, we can see that performance drops further, to 3-3.5 Grays/sec.
These are synthetic. Actual games will have even less performance available.
So what will happen in actual games?
Let's return to the whitepaper: the RT core is invoked by scheduling an instruction from the shader, and the result is returned to the shader engine (probably in a register).
That means "rasterization" isn't going anywhere: once we have the intersection, it's up to the shader itself to decide what to do with it, how to color the pixel, how to render the shadow, etc.
RT units accelerate BVH traversal and nothing else; all the usual shaders still have to run, burning plain unaccelerated FLOPS to render the final image.
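A rough sketch of that division of labor, with all names hypothetical: the shader issues a trace, the RT unit only answers "what did the ray hit?", and everything after that, shading, shadow decisions, writing the pixel, is ordinary shader ALU work.

```python
def shade_pixel(ray, rt_trace, materials):
    """rt_trace stands in for the hardware-accelerated part: it does BVH
    traversal and nothing else. Everything below it is regular shader math."""
    hit = rt_trace(ray)                  # accelerated: find what the ray hit
    if hit is None:
        return (0, 0, 0)                 # ray escaped the scene: background
    # From here on it's normal, unaccelerated shader FLOPS:
    mat = materials[hit["tri"]]
    occluder = rt_trace(hit["shadow_ray"])   # shadow rays loop back through RT
    light = 0.0 if occluder is not None else 1.0
    return tuple(c * light for c in mat)
```

Note that even the shadow test goes back through the RT unit for traversal, but deciding what the shadow *means* for the pixel is still the shader's job.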
What about AMD?
Let's check the AMD patent here.
What do we see? "Intersection engines" (IE) colocated with the texture cache units.
Each TEX unit gets an intersection engine that receives a BVH node from the cache and returns the result to the shader.
It's exactly the same path, but with 1 IE per 1 TEX unit.
The only real difference is that the "NV RT" unit sits before the cache, while the "AMD RT" engine sits after it.
So what about bottlenecks?
It's exactly the same story. For the XSeX RDNA2 GPU, the texel rate is 208 TEX units × 1.825 GHz = 379.6 GTex/sec.
And we have a number from MSFT: "380 billion BVH traversals per second".
Ring a bell? Yep. We are still limited by the ~380 GTex/sec.
The same as NV.
Can we compare 2080Ti to XSeX?
Yep, now we can.
We can roughly estimate the theoretical difference in max RT performance between the 2080 Ti and the XSeX: 420 vs 380 GTex/sec, i.e. 10 vs ~9 Grays/sec.
Pretty close. But again, actual in-game numbers will be much, much, much lower.
Probably to the point that there is no difference at all.
What about PS5?
Simple: 144 TEX units × 2.23 GHz = 321 GTex/sec (yes, that's a boost clock, but we used the boost clock for NV too).
Which puts it at a theoretical ~7.6 Grays/sec. Not bad, but lower than the other two.
For reference, NV states 6 Grays/sec for the 2070, so the PS5 still comes out ahead of that.
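Here is the back-of-the-envelope arithmetic from this post collected in one place. The TMU counts and clocks are the ones used above (272 × ~1.545 GHz for the 2080 Ti is the standard boost-spec figure), and the Grays/sec estimates simply scale Nvidia's 10 Grays/sec claim by texel rate, which only holds under the fetch-bound assumption argued earlier.

```python
# Estimated peak ray rates, assuming BVH traversal is fetch-bound and
# therefore scales with texel fetch rate. Nvidia's 10 Grays/sec claim
# for the 2080 Ti is used as the anchor point.

def texel_rate(tmus, clock_ghz):
    return tmus * clock_ghz               # GTex/sec

rtx2080ti = texel_rate(272, 1.545)        # ~420 GTex/sec at boost
xsex      = texel_rate(208, 1.825)        # 379.6 GTex/sec
ps5       = texel_rate(144, 2.23)         # ~321 GTex/sec

def est_grays(gtex):
    return 10.0 * gtex / rtx2080ti        # scale from the 2080 Ti anchor

print(round(est_grays(xsex), 1))          # ~9.0 Grays/sec
print(round(est_grays(ps5), 1))           # ~7.6 Grays/sec
```

These are ceilings for a synthetic best case; as noted above, real in-game numbers land far below all three.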
Questions?