Oxide: Nvidia GPUs do not support DX12 Asynchronous Compute/Shaders.

Though to be honest, if anyone is trying to offload a 980 Ti on the cheap (Europe) tell me your price :D

I want to see more Beyond3D benchmarks, those guys are really trying to test both architectures and trying to get some results out of it.
 
Fable beta would definitely be interesting. A shame it is locked behind NDA atm.


BTW, earlier you replied with a .gif to my mention of Crysis 2 and tessellation but never actually responded. Just so you know, the tessellation myth in Crysis 2 was disproved years ago, by CryEngine devs themselves among many others. A shame that silly WCCFTech report taking pictures and videos in debug mode ever came out... they just did not understand how CryEngine works...

I know the water under the level bit was debunked, but how about the concrete barriers that had 76778767777343973 polygons to render a flat surface?
 
Though to be honest, if anyone is trying to offload a 980 Ti on the cheap (Europe) tell me your price :D

I want to see more Beyond3D benchmarks, those guys are really trying to test both architectures and trying to get some results out of it.

Can you link me the "bench"? I could test and post as well if you would like (Titan X).
I know the water under the level bit was debunked, but how about the concrete barriers that had 76778767777343973 polygons to render a flat surface?

Phong tessellation costs very little in comparison to other types (which is why it has that pattern stretch over it), and those images were taken in debug mode, which ALSO disables tessellation distance scaling and LODs. Those blocks only tessellate heavily (using the cheapest tessellation mode) when you are right next to them. It is a non-issue. The best way to test how little performance it costs is by loading up the game on an older AMD card and toggling tessellation cvars in the console. It is easy to see how little it costs.

Sadly, people assumed the game running poorly in DX11 was down to tessellation, rather than looking at the super expensive post-processing and pixel shaders.
 
I don't think so. Fable Legends uses UE4 which seemingly isn't doing anything interesting in this area atm.

It is only interesting as a test of a different dx12 workload.

Well, we'll see a victory for NV under dx12 then ;) and the drama is on the other foot until the next DX12 game, which will probably replicate that result. I'm not sure there is a situation outside this case that will see AMD ahead. I do think AMD owners will just be grateful to get away from DX11 overheads.
 
Compute shaders, yes. You have to look at them talking about it in the context of dx11 hardware. The code dealing with it could still be structured for serial execution.

Compute shaders ≠ Async Compute/Shaders

Compute shaders have been used for a while now. And as the blog post says, they are aiming for DX11-level hardware.

I was under the impression the game is DX12 only currently? And wouldn't AMD cards be better at a default compute (non-async)/serial one anyway?
 
I know the water under the level bit was debunked, but how about the concrete barriers that had 76778767777343973 polygons to render a flat surface?

Exactly. Water under the level has always been part of CryEngine since day one and didn't impact performance in-game (AFAIK it can now be disabled in the latest version). All the other crap (concrete slabs, pavements, brick walls...) was intentional, to crush performance on Radeons (as seen in Crysis 3, where everything is toned down to acceptable levels while still looking good).
 
Exactly. Water under the level has always been part of CryEngine since day one (AFAIK it can now be disabled in the latest version). All the other crap (concrete slabs, pavements, brick walls...) was intentional, to crush performance on Radeons (as seen in Crysis 3, where everything is toned down to acceptable levels while still looking good).

No it wasn't. Water under the level never rendered in view. That is not how occlusion culling works in the engine. And you can go into the game yourself with some 5870 or whatever and test how little tessellation costs. It is seriously cheap.

Do you have an AMD GPU now? You can totally try it out with a CVAR to see how little it affects overall framerate.
 
I was under the impression the game is DX12 only currently? And wouldn't AMD cards be better at a default compute (non-async)/serial one anyway?

Sure, but the game has been in development for a while now. You have to wonder how much of it is dx11 code transitioned to dx12.
 
Async compute would have to give AMD a 30-40% advantage in most DX12 games for me to care...

Typical 980 Ti OC is around 20% faster than an OC'd Fury at 1440p, so it may only really put them on par for me. I still play a lot of DX11 games, so performance in those is still important. I'll trade 10% less performance in DX12, if that's what it ends up being, for that.
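Just to spell out the arithmetic behind that, with hypothetical round numbers (a toy sketch, not benchmark data):

#include <cstdio>

int main()
{
    // Hypothetical relative DX11 performance at 1440p, normalized to an OC'd Fury.
    double fury  = 1.00;
    double ti980 = 1.20;   // ~20% faster, as stated above

    // If async compute gave the Fury a 30% uplift in some DX12 title:
    double furyDx12 = fury * 1.30;
    printf("Fury / 980 Ti in that DX12 title: %.2f\n", furyDx12 / ti980);  // ~1.08
    // i.e. even a 30% async win only puts the Fury ~8% ahead in that one title,
    // which is why I'd treat it as roughly a wash once DX11 performance is factored in.
}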
 
No it wasn't. Water under the level never rendered in view. That is not how occlusion culling works in the engine. And you can go into the game yourself with some 5870 or whatever and test how little tessellation costs. It is seriously cheap.

Do you have an AMD GPU now? You can totally try it out with a CVAR to see how little it affects overall framerate.

I've used the CryEngine SDK on several occasions and know how its rendering works. Water under the level has never impacted performance. As a matter of fact, the first thing you are presented with when you launch a new project is endless water, on top of which you build your level (yeah, it sounds ridiculous).

And yes, ironically, I still have a 5870, on which I ran C2 back then. Simply looking at a tessellated brick wall or concrete slab made the FPS tank into single digits.
 
I was under the impression the game is DX12 only currently? And wouldn't AMD cards be better at a default compute (non-async)/serial one anyway?

It's Windows 10 only, but that doesn't stop them from targeting DX11 hardware, which makes up the bulk of the machines being upgraded. We kind of forget here that not everyone splurges on a new card every 6 months or buys a console day one.
 
Except it's also completely misleading. All execution on modern vec1 SIMDs is done in a serial fashion, so there is no "8 roads with 8 lanes for trucks which can be used to move freely"; there are 8 roads with 8 lanes which are waiting to be picked from by the execution pipeline. The more you have, the higher execution efficiency you may achieve.
That's closer to my understanding of how this stuff works. Basically, you've got turnstile-style access to a fixed pool of resources; the various math units on the GPU, each with its own specialty. So think of it like loading a roller coaster. Every cycle, the system hangs the next rendering job on the GPU, occupying some or all of those specialized units. These jobs are the people who paid for VIP passes. Then the system looks at the math units that haven't been assigned jobs, compares that to the 64 jobs waiting at their respective turnstile — all managed by eight line attendants — and lets in whatever punters best fill the remaining seats before dispatching the train.

It sounds like NV do something similar, but instead of filling empty seats every cycle with jobs from the 31 compute queues, they actually alternate job types, pulling a job from the render queue on even cycles and a job (or more?) from the compute queues on the odd cycles. Then they're saying, "Well, at the end of the day, everybody gets to ride." While it's true they're seamlessly pulling jobs from both queue types, because they can't pull from multiple queue types simultaneously, they're not actually doing much to increase utilization. Any math unit not used in a given render operation remains idle; it just gets used on the following cycle. I'm assuming they'd at least be able to pull from all 31 queues on the compute cycle to attempt to fully saturate the math units, but they'd still have a lot of idle units on the render cycle.

Another question is does Maxwell even need async compute to keep its utilization in DX12 at peak?
Is this a trick question because adding async to the mix "just" increases your peak utilization? It will have empty spaces in its rendering pipeline that need filling, just like any other GPU, if that's what you're asking.
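For what it's worth, here's a toy C++ model of the two strategies described above: letting compute fill leftover slots in the same cycle versus alternating whole cycles between the two queue types. The unit count, job widths and scheduling rules are all made up purely for illustration; neither vendor's hardware actually works at this level of simplicity.

#include <algorithm>
#include <cstdio>

// Toy model: each cycle the GPU has UNITS execution slots to hand out.
// Graphics jobs rarely occupy every slot; the question is whether compute
// work may occupy the leftovers in the same cycle or has to wait its turn.
constexpr int UNITS = 64;

static int cyclesNeeded(int gfxJobs, int gfxWidth, int computeJobs, bool shareCycle)
{
    int cycles = 0;
    while (gfxJobs > 0 || computeJobs > 0) {
        ++cycles;
        // In "alternate" mode, even cycles are reserved for compute (if any is left).
        bool computeOnlyCycle = !shareCycle && (cycles % 2 == 0) && computeJobs > 0;
        int freeSlots = UNITS;
        if (gfxJobs > 0 && !computeOnlyCycle) {      // graphics gets first pick
            freeSlots -= gfxWidth;
            --gfxJobs;
        }
        if (shareCycle || computeOnlyCycle || gfxJobs == 0) {
            computeJobs -= std::min(freeSlots, computeJobs);  // compute fills what's left
        }
    }
    return cycles;
}

int main()
{
    // 100 graphics jobs that each occupy 48 of the 64 slots, plus 800 tiny compute jobs.
    printf("fill leftover slots each cycle : %d cycles\n", cyclesNeeded(100, 48, 800, true));
    printf("alternate graphics/compute     : %d cycles\n", cyclesNeeded(100, 48, 800, false));
}

With these made-up numbers the "fill the gaps" strategy finishes in 100 cycles while the "alternate" one takes 113, simply because the slots a wide graphics job leaves empty go unused on its cycle.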
 
How likely is it that Pascal will have Asynchronous Compute?

In general, and aside from moving from 28nm to 16nm FF+, having HBM2 and double rate FP16, shouldn't Pascal have more architectural changes over Maxwell than Maxwell had over Kepler?
 
How likely is it that Pascal will have Asynchronous Compute?

In general, and aside from moving from 28nm to 16nm FF+, having HBM2 and double rate FP16, shouldn't Pascal have more architectural changes over Maxwell than Maxwell had over Kepler?

Not according to what we know and the rumors.
 
I don't think so. Fable Legends uses UE4 which seemingly isn't doing anything interesting in this area atm.

It is only interesting as a test of a different dx12 workload.

Actually, Unreal Engine 4 supports async compute on XB1, and I think it was Lionhead who implemented it in Unreal Engine. There was also a gameplay video where they talked about it; I believe it was in the context of the PC version.

Yup, here you go.
https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html
 
Actually, Unreal Engine 4 supports async compute on XB1, and I think it was Lionhead who implemented it in Unreal Engine. There was also a gameplay video where they talked about it; I believe it was in the context of the PC version.

Yup, here you go.
https://docs.unrealengine.com/lates...ing/ShaderDevelopment/AsyncCompute/index.html

Thanks. However, they only mention Xbox One, and the documentation talks about it being disabled for PC in dx11.1 (possibly means that the documentation hasn't been updated).
 
Thanks. However, they only mention Xbox One, and the documentation talks about it being disabled for PC in dx11.1 (possibly means that the documentation hasn't been updated).

Yeah, it says it's not supported in d3d11.1, because it literally isn't part of d3d 11.
 
Yeah, it says it's not supported in d3d11.1, because it literally isn't part of d3d 11.

Of course. There is no indication that it will be enabled for PC on dx12. The fact that this documentation is tagged for UE 4.9, which was just released, probably means it isn't available for PC. That is, if we're going by documentation, which we all know isn't always the most reliable thing.
 
Of course. There is no indication that it will be enabled for PC on dx12. The fact that this documentation is tagged for UE 4.9, which was just released, probably means it isn't available for PC. That is, if we're going by documentation, which we all know isn't always the most reliable thing.
Why would it be available on Bone but not PS4?
 
Why would it be available on Bone but not PS4?

Probably because Lionhead did all the legwork here, coming up with a solution tailored to the Xbox One for their game. It also says that Epic integrated it and will work on making it multiplatform, if I'm interpreting that last paragraph correctly.
 
I may not understand all the computer science involved but it's a good thing I have an intimate understanding of floating fanboy units or I'd be totally lost in here.

Keep being all technical, guys. Learning is fun!

 
Wait, so Maxwell is fully DX12 compliant but does not have async compute like AMD cards have? Does this mean that PS4 is almost DX13 levels then due to having this feature as well as hUMA and a supercharged PC architecture which DX12 does not have? If so I can easily see PS4 competing with the next gen Xbox which will assumedly be based on DX13 further delaying the need for Sony to launch a successor. Woah. If this is true I can easily see PS4 lasting a full ten years. Highly interesting development, I can't wait to see what Naughty Dog and co do with this new found power.

This could be your best work yet.
 
Thanks. However, they only mention Xbox One, and the documentation talks about it being disabled for PC in dx11.1 (possibly means that the documentation hasn't been updated).

Of course. There is no indication that it will be enabled for PC on dx12. The fact that this documentation is tagged for UE 4.9, which was just released, probably means it isn't available for PC. That is, if we're going by documentation, which we all know isn't always the most reliable thing.

You'd expect that if they've added the feature set for X1 they'd implement it on PC though; wasn't that the point of Windows 10, code once and deploy to many devices?
 
How likely is it that Pascal will have Asynchronous Compute?

In general, and aside from moving from 28nm to 16nm FF+, having HBM2 and double rate FP16, shouldn't Pascal have more architectural changes over Maxwell than Maxwell had over Kepler?
Should be 100% chance. It is part of DX12 after all.

There's a chance of seeing history repeat itself, with Maxwell getting Kepler-ed in favor of Pascal. Should be fun.
 
Should be 100% chance. It is part of DX12 after all.

There's a chance of seeing history repeat itself, with Maxwell getting Kepler-ed in favor of Pascal. Should be fun.

I wouldn't be so sure of it.
Pascal was supposed to be just Maxwell with HBM. Don't forget that, for Nvidia, Maxwell is fully DX12 compliant.
 
Also, guys, correct me if I'm wrong, but aren't those homebrew tests proving that Maxwell does have async compute, just limited to 31 queues? The test works by submitting 128 threads. GCN takes them all with the same latency, but Maxwell can only take 31 (it has 31+1 queues) before processing the next load.

The only thing it shows is that it has lower latency for low queue counts and an equivalent latency compared to AMD.

They do less, but more quickly. GCN does more, but slower.
 
Seems unlikely, because I think at this point it's pretty cut and dried that Nvidia's hardware is poorly suited for async compute, but it'd be interesting to hear an official response from Nvidia.
 
Also, guys, correct me if I'm wrong, but aren't those homebrew tests proving that Maxwell does have async compute, just limited to 31 queues? The test works by submitting 128 threads. GCN takes them all with the same latency, but Maxwell can only take 31 (it has 31+1 queues) before processing the next load.

The only thing it shows is that it has lower latency for low queue counts and an equivalent latency compared to AMD.

They do less, but more quickly. GCN does more, but slower.

To me it looks like the tests aren't showing that at all. Nvidia cards are finishing the graphics + compute test in the same time as the graphics test and the compute test combined, while AMD cards are finishing the graphics + compute test in the same time as the compute test alone.
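Put differently, with made-up per-pass times just to illustrate that reading of the results (not numbers from the actual benchmark):

#include <algorithm>
#include <cstdio>

int main()
{
    // Hypothetical per-pass times in milliseconds.
    double graphicsOnly = 10.0;
    double computeOnly  = 14.0;

    // Serialized behaviour (what the Maxwell numbers resemble): the combined
    // graphics + compute pass takes roughly the sum of the two separate passes.
    double serialized = graphicsOnly + computeOnly;            // ~24 ms

    // Overlapped behaviour (what the GCN numbers resemble): the combined pass
    // takes roughly as long as the longer of the two passes, here the compute one.
    double overlapped = std::max(graphicsOnly, computeOnly);   // ~14 ms

    printf("serialized: %.0f ms, overlapped: %.0f ms\n", serialized, overlapped);
}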
 
I have a doubt. Wouldn't this have a significant impact on the running temperature of the chips?

Yep. Sebs said he expects some games to hit FurMark levels of load. Maybe only on the XB1 though; I've heard the PS4 already gets pretty loud.
 
To me it looks like the tests aren't showing that at all. Nvidia cards are finishing the graphics + compute test in the same time as the graphics test and the compute test combined, while AMD cards are finishing the graphics + compute test in the same time as the compute test alone.
That would seem to confirm the idea NV are simply alternating between the two job types rather than blending them.
 
That would seem to confirm the idea NV are simply alternating between the two job types rather than blending them.

It reminds me of Intel's HT working better on the i7 than AMD FX's CMT because of Intel's overall IPC superiority. And as long as Nvidia can tweak their async implementation to be decent, around 70% of the same-in-class competition, it doesn't matter how they get there, unless you want more transparency from their marketing guys.

Hopefully everyone gets a win. If things stand as they are, then it's playing pretty loose with DX12 compliance.
 
Async compute would have to give AMD a 30-40% advantage in most DX12 games for me to care...

Typical 980 Ti OC is around 20% faster than an OC'd Fury at 1440p, so it may only really put them on par for me. I still play a lot of DX11 games, so performance in those is still important. I'll trade 10% less performance in DX12, if that's what it ends up being, for that.

The performance of a video game is defined by more than just frames per second or frametimes. Asynchronous compute allows for higher throughput at lower latencies which easily makes it one of the most important features for VR gaming. Remember the beginning of this gen when Mark Cerny explained again and again the importance of async compute for the future of video games? That was before Morpheus was announced. Two years later it all makes sense.
 
Yeah, the times add up on Nvidia's graphs, but it's still quicker at compute than AMD for lower batch counts.

Don't know if it's relevant or not.
 
Glad I didn't fork out a ludicrous £550 for a 980 Ti that will struggle with demanding DX12 games in the next couple of years. Because if I spent that kind of money, I'd expect not to have to upgrade for 2+ years. But then again, my buying habits are different from most who like to upgrade yearly.

What I've done instead is fork out a large sum for an EVGA 980 that will probably last less time :p

Anyway, these are all good cards for a while yet at least.
 
The performance of a video game is defined by more than just frames per second or frametimes. Asynchronous compute allows for higher throughput at lower latencies which easily makes it one of the most important features for VR gaming. Remember the beginning of this gen when Mark Cerny explained again and again the importance of async compute for the future of video games? That was before Morpheus was announced. Two years later it all makes sense.

What about normal games? Should we expect better performance in them? Will the PS4 perform as well as a higher end PC graphics card?
 
Glad I didn't fork out a ludicrous £550 for a 980 Ti that will struggle with demanding DX12 games in the next couple of years. Because if I spent that kind of money, I'd expect not to have to upgrade for 2+ years. But then again, my buying habits are different from most who like to upgrade yearly.

What I've done instead is fork out a large sum for an EVGA 980 that will probably last less time :p

Anyway, these are all good cards for a while yet at least.

It depends on the volume of async calls being used. As said earlier in the thread, devs aren't going to gimp their games on the larger ~80% of hardware, and even if they did you'd probably see comparable performance under DX11 without the effects async would be used for anyway. This is getting silly; it's being pushed hard on consoles because the CPUs are weak. We'll see devs like DICE play with async, as they are already leveraging it, but whether it reaches the levels this fuss is over is really doubtful.

What about normal games? Should we expect better performance in them? Will the PS4 perform as well as a higher end PC graphics card?

Normal games? All games are normal unless you're referring to dudebro AAA.

No, they'll leverage it for more pretty than performance.

No.
 
What about normal games? Should we expect better performance in them? Will the PS4 perform as well as a higher end PC graphics card?

Already answered on the first page:
Wait, so Maxwell is fully DX12 compliant but does not have async compute like AMD cards have? Does this mean that PS4 is almost DX13 levels then due to having this feature as well as hUMA and a supercharged PC architecture which DX12 does not have? If so I can easily see PS4 competing with the next gen Xbox which will assumedly be based on DX13 further delaying the need for Sony to launch a successor. Woah. If this is true I can easily see PS4 lasting a full ten years. Highly interesting development, I can't wait to see what Naughty Dog and co do with this new found power.
 
So, no one has entertained the very distinct possibility that this one developer was either using a poor implementation of Async Compute or encountered a driver bug? Seems far more likely than Maxwell 2 not supporting the feature it claims to.
Nvidia itself specifically told the developer to turn off the feature for their cards. If it were just a bug, they could just fix it instead.
 
First of all, there is no need for personal attacks.

A CPU is a processor that consists of a small number of big processing cores. A GPU is a processor that consists of a very large number of small processing cores. Therefore most home PCs have multiple processors. "APU" is a marketing term for a single processor that consists of different kinds of processing cores. In the case of the PS4, the APU has two Jaguar modules with four x86 cores each and 18 GCN compute units with 64 shader cores each. Processors can be categorized as follows: single core (like an Intel Pentium), multi core (an Intel Core i7 or any GPU), hetero core (APUs like the one in the PS4) and cloud core (Microsoft Azure, for example).

If you take a look back, the evolution of computer technology has always been about maximum integration. The reason for that is you want to minimize latency as much as possible. A couple of years ago, GPUs only had fixed-function hardware. That means that every core of the GPU was specialized for a certain task. That changed with the so-called unified shader model. Today, the shader cores of a modern GPU are freely programmable. Just think of them as extremely stupid CPU cores. The advantage of a freely programmable GPU, however, is that you have thousands of those cores. The PS4 has 1152 shader cores. That makes a GPU perfectly suited for tasks that benefit from mass parallelization, like graphics rendering. You can also utilize them for general-purpose computations (GPGPU), which, in theory, opens up a whole new world of possibilities, since the brute force of a GPU is much higher than the computational power of a traditional CPU. In practice, however, the possibilities of GPGPU are limited by latency.

If you want to do GPGPU on a traditional gaming PC, you have to copy your data from your RAM pool over PCIe to your VRAM pool. The process of copying costs latency. A roundtrip from CPU -> GPU -> CPU usually takes so long that the performance gain from utilizing the thousands of shader cores gets immediately eaten up by the additional latency: even if the GPU is much faster at solving the task than the CPU, the process of copying the data back and forth will make the GPGPU approach slower than letting the CPU do it on its own. That's the reason why GPGPU today is only used for things that don't need to be sent back to the CPU. The possibilities on a traditional PC are very limited.

The next step in integration is the so-called hetero core processor. You integrate the CPU cores as well as the GPU shader cores on a single processor die and give them one unified RAM pool to work with. That allows you to get rid of that nasty copy overhead. To this day, the PS4 has the most powerful hetero core processor (2 TFLOPS @ 176 GB/s) available. Not only that, since the APU in the PS4 was built for async compute (see the Cerny interviews), it can do GPGPU without negatively affecting graphics rendering performance. It's a pretty awesome system architecture, if you want my opinion.

The only problem is that PC gamers don't have a unified system architecture. The developers of multiplatform engines have to consider that fact. 1st-party console devs can fully utilize the architecture, though.
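To put a rough number on the copy-overhead argument above, here's a back-of-the-envelope model. Every figure in it is invented purely for illustration; the only point is that transfer time plus synchronization latency can swallow the GPU's raw speed advantage on a discrete-GPU PC, which is exactly the overhead a unified-memory APU avoids.

#include <cstdio>

int main()
{
    // Invented numbers for a small, latency-sensitive GPGPU task on a discrete GPU.
    double dataMB        = 64.0;   // data shipped each way over PCIe
    double pcieGBperSec  = 12.0;   // assumed effective PCIe bandwidth
    double syncLatencyMs = 0.5;    // assumed submission/sync overhead per direction
    double cpuTimeMs     = 6.0;    // time for the CPU to just do the work itself
    double gpuKernelMs   = 1.0;    // time for the GPU to do the actual work

    double copyMs = 2.0 * (dataMB / (pcieGBperSec * 1024.0) * 1000.0 + syncLatencyMs);
    double gpuRoundtripMs = copyMs + gpuKernelMs;

    printf("CPU only      : %.1f ms\n", cpuTimeMs);
    printf("GPU roundtrip : %.1f ms (%.1f ms of that is copy/sync)\n",
           gpuRoundtripMs, copyMs);
    // Even though the GPU does the work 6x faster here, the roundtrip loses to
    // the CPU. Drop the copy term (unified memory) and the GPU wins easily.
}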

Good post; this is pretty much what it is, and where the benefits of the PS4 architecture come from. People should know that async compute is being used on PS4, yes, but still only marginally. That sort of stuff takes time.

Although ISS used it, I'm going to say that Uncharted 4 will be the first game to use the feature (at least on the level that Cerny alluded to), and it will be used even more by first-party Sony games going forward, from March 2016 when that game launches and beyond. I believe that's when we will witness some consistent results in games relative to GPGPU computing.

Also, good contributions by ServerSurfer and Arkanius..........
 
[image: DX12 feature support comparison table]


I found this in the comments thread of WCCFTech. It looks to be about the most comprehensive listing of DX12 feature support.
 