Actually there is a hardware decompression block, I remembered reading something like that. So there is a component dedicated to uncompressing the data.
The hardware decompression block plays a vital role, allowing games to consume less space via compression on the SSD. That hardware is devoted to tackling run-time decompression, keeping games running smoothly without giving more work to the CPU. It uses Zlib, a general-purpose data-compression library, and a mysterious new system named "BCPack," geared to GPU textures.
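To make "software doing it" concrete: this is roughly the work a CPU core would burn on every read without such a block. A minimal sketch using zlib's one-shot API (the buffer contents and sizes are made up for illustration):

    #include <zlib.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Pretend asset data, then round-trip it the way a software
        // fallback path would have to on a CPU core.
        std::vector<unsigned char> original(64 * 1024, 'A');
        uLongf compLen = compressBound(original.size());
        std::vector<unsigned char> compressed(compLen);
        compress(compressed.data(), &compLen, original.data(), original.size());

        std::vector<unsigned char> restored(original.size());
        uLongf restoredLen = restored.size();
        // uncompress() is the one-shot zlib inflate; a dedicated HW block
        // does this same work without occupying a CPU core.
        if (uncompress(restored.data(), &restoredLen,
                       compressed.data(), compLen) == Z_OK)
            std::printf("decompressed %lu bytes\n", (unsigned long)restoredLen);
        return 0;
    }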
That question is irrelevant. It's like asking: where does the data in RAM go when you transfer it to the GPU? Yeah, where does it go? The answer is the same for the XSX transferring from the SSD, because the system sees the mapped 100GB on the SSD as RAM, as has been postulated a gazillion times.
To me, it doesn't sound like you're disagreeing with her at all.
What is Xbox Series X Velocity Architecture? (www.windowscentral.com): Xbox Velocity Architecture promises no-compromise gameplay and cuts load times, drawing the best from Xbox Series X.
It all comes down to what priorities they focused on in their design of the APUs.
7:00: An 18GB/s or something SSD for PC; there are also 2 fast SSDs, probably won't be cheap.
Right. I'm just wondering if the GPU accesses the decompression block along with the CPU (which would probably mean they share the access path, similar to how they do on main system RAM), or if it's just the CPU that accesses it. Things like that haven't been disclosed yet, so we kinda don't have a way of knowing.
Panajev2001a Yeah, more or less; however, I think what Trueblakjedi was referring to wasn't the GPU accessing more than 10 GB of RAM at a time, but being able to access the four 1 GB modules it exclusively has access to while the CPU accesses its 6 GB on the 2 GB modules (and while the CPU accesses those, the GPU cannot access the other 6 GB of RAM).
I'm curious if that's a possible feature they've added in terms of bus access; 4 GB isn't a lot of RAM, but if even that can still be accessed while the other 6 GB is being accessed by CPU and other components like audio, that would help with some of the bus contention issues APUs inherently bring.
CPUs don't read anything directly; they read from RAM or an L1–L3 cache. Cache on a CPU is not directly manageable. On a GPU it is; we'll get back to this.
The rest of what you said is nonsense.
She's talking about the following.
Flushing cache on SSD read: this really doesn't mean anything / is nonsense. Flushing a cache is verboten on an actual working system unless you are in serious trouble.
She says basically that cache misses are somehow OK because "I know I will miss and read from RAM". Again, that doesn't really make sense when the cache isn't the issue; accessing the RAM is.
Downplaying granular cache flushes as somehow not needed? Granular cache kills improve the cache-hit ratio and improve performance. This person doesn't understand caching at all.
You never want to flush active cache unless everything has become invalid. Then you get a stampeding-herd problem straight after.
The caches used by GPUs will not be loaded from the SSD directly; they are usually filled AFTER the data is in RAM and you've actually created a vertex buffer, or a shader, and used the texture.
Some of these caches are measured in KB. What does an SSD give you here?
Typically, when removing items from a cache, it's hard to know what to target, as you would need some sort of record, and that record itself becomes a bottleneck. You can use some heuristics to target some regions of the cache and kill all entries in there, but you will be removing hot items from that cache, which isn't ideal and will cause excess misses. A dedicated system for this which does it asynchronously, without a dev managing it, is gold.
Thank you for your analysis. Can you describe the utility of the GPU cache scrubbers in this scenario?

Either indirectly requesting an asset be deleted without knowing the lookup, in combination with an LRU when under memory pressure. If I can use indirection to request an asset be deleted, I don't need to keep memory addresses around. Deleting an entry directly means I can keep good cache around so I don't get a bunch of misses. I imagine something similar to a consistent-hash method over metadata about the asset, or a HW lookup table similar to how HW PRT works.

So if the GPU knows that the data resident in the cache isn't useful anymore, can it successfully flush with no penalty, or, instead of flushing, simply overwrite?

The GPU asks for the data or asks to remove it, but the cache scrubbers know what to do and work in the background, and other processes may have asked for a cache kill too. There will just suddenly be a cache miss, then a fallback to request the original asset, which may or may not be in RAM either. This then overwrites the previous location. An LRU usually works by removing the oldest and least-used entries to make room for new entries.
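To pin down the LRU part of that description, here's a minimal sketch of the policy; the class and the string keys are hypothetical, not anything from either console's API:

    #include <list>
    #include <string>
    #include <unordered_map>

    // Least-recently-used eviction: a hit moves an entry to the front,
    // and inserting past capacity drops whatever sits at the back.
    class LruCache {
        size_t capacity_;
        std::list<std::string> order_;  // front = hottest entry
        std::unordered_map<std::string, std::list<std::string>::iterator> index_;
    public:
        explicit LruCache(size_t capacity) : capacity_(capacity) {}

        void touch(const std::string& key) {  // a cache hit
            auto it = index_.find(key);
            if (it != index_.end())
                order_.splice(order_.begin(), order_, it->second);
        }

        void insert(const std::string& key) {  // a cache fill
            if (index_.count(key)) { touch(key); return; }
            if (order_.size() == capacity_) {  // evict the coldest entry
                index_.erase(order_.back());
                order_.pop_back();
            }
            order_.push_front(key);
            index_[key] = order_.begin();
        }

        // Targeted invalidation: what a scrubber-like mechanism gives you,
        // killing one stale entry without flushing the hot ones around it.
        void invalidate(const std::string& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return;
            order_.erase(it->second);
            index_.erase(it);
        }
    };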
This time the die-size difference is all about the CUs. XSX has 45% more CUs than the PS5, an even bigger factor than PS4 over XBO. We would be having a completely different conversation this coming gen had the clock frequencies been similar, but increasing the clock was the only way Sony could compete with the power difference, and it's one of the things you can adjust last minute (the other being RAM). In fact, it was expected for Sony to do this as an answer. I can confidently bet that it was never Sony's intention to go with clocks as high as 2.23GHz from the beginning.
I do honestly think Sony was looking for at least a 2 GHz clock on the GPU, since they decided from the get-go on a 36 CU GPU, and that meant they could only get the desired performance with high clocks, with a big focus on the cooling system. Maybe they even planned for variable frequency much earlier on (they could not test that on Ariel, though, since it was an RDNA1 chip, and at least two of the early Oberon revisions were using a fixed-frequency setup; Cerny seemed to have suggested this himself).
Oh I agree, they locked in at 36 CUs way before, with the intent of pushing frequencies to a considerable 2 GHz; the 9.2-teraflop figure was no mistake. But I do believe that they pushed harder once MS revealed their 12 TF figure; they HAD to.
So Sony pushed a GPU to a variable 2.23 GHz in 3 months, just like that, since MS confirmed 12 TF in December last year? Yeah, no! What you said doesn't make any sense at all. BTW, they patented their cooling solution back in August last year. Just because GitHub didn't have any data for 2.23 GHz doesn't mean Sony pushed the GPU to 2.23 GHz after that. It doesn't make any sense. AT ALL!
I'm not buying the "panicked Sony" theory either.
If the PS5 already had a cooling system that could handle those clocks then I don't see why they would go with lower clocks in the first place.
Every patent needs to be tested. It also needs certification. And that takes time.
I didn't mention the CPU at any point here, so I'm not sure why you're bringing it up. XVA seems to be targeted at feeding specifically the GPU, not the CPU. And it will primarily be used for high-quality mip pages.

Yes, it is usually done after the data is in RAM. The SSD would give you a reduction in RAM requirements. Why do you specifically look at the smallest caches rather than the largest?

That's exactly what the sampler feedback is for.

I'm not sure you've read everything she says. I still get the sense that you two are saying the same thing. Let me quote the whole thing, just to make sure everyone is on the same page...

- Regarding XVA vs the PS5 I/O engine: each is designed in a different way, leading to systems that are better/worse at certain things. PS5 is better for no-processing, direct-to-RAM I/O and dumping raw data into RAM; XVA is better when you need to process the data in question.
- And remember the PS5's cache scrubbers? These are a thing due to a weakness of the PS5 I/O engine that isn't an issue in the first place in XVA.
- When you overwrite data in RAM, it's possible that data is mirrored in the GPU cache. But it's also possible something else is in the cache and is not the data being overwritten.
- The obvious solution would be flushing GPU caches when the SSD is read; that way, no matter what, the GPU doesn't get a cache miss (it knows the cache is clear, and to ignore cache and look in RAM). But (as Cerny says) this will really hurt GPU performance. Solution? Cache scrubbers / coherency engine.
- The CE tells the cache scrubbers what part of RAM was overwritten, and they check the caches for the mirroring of said data, wiping it if found.
- XVA inherently avoids this issue by feeding the CPU/GPU directly with SSD data. Since the data is either discarded by the CPU/GPU or written back to RAM, the CPU/GPU always know what data need be wiped from cache. No need for scrubbers.
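Nobody outside Sony has published the scrubbers' exact mechanism, but the behaviour those tweets describe amounts to range-based invalidation; a minimal sketch, with every name hypothetical:

    #include <cstdint>
    #include <vector>

    // One cache line tagged with the RAM address it mirrors.
    struct CacheLine {
        std::uint64_t addr = 0;
        bool valid = false;
    };

    // Scrub pass: kill only the lines mirroring [base, base + size),
    // leaving every other (still hot) line untouched. A full flush
    // would instead clear all lines and force misses across the board.
    void scrub(std::vector<CacheLine>& cache, std::uint64_t base, std::uint64_t size) {
        for (auto& line : cache)
            if (line.valid && line.addr >= base && line.addr < base + size)
                line.valid = false;  // stale: its backing RAM was overwritten
    }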
Pretty much. If they were to increase the clocks, they would have to test the system to make sure it doesn't suffer from overheating issues. And if it does, they either have to lower the clocks or change the system's design.
The only way that I can see them boosting the clocks at the last minute is if they overbuilt their cooling system. This is similar to what Microsoft did with the X1. However, there were rumors of the systems overheating, so that contradicts this theory.
I think that rumor was being pushed by Jez Corden. Later he backpedalled. Why? Read the thread in the spoiler (it isn't large, a few pages). It's hilarious. You'll get it.
Rumor: The PS5 final console design has not been shown because Sony is trying to solve heating issues with the console
https://www.tweaktown.com/news/71606/playstation-5s-rumored-heat-issues-should-be-solved-in-final-console/index.html (via www.resetera.com): "New reports suggest Sony is currently wrangling PS5 overheating issues. Supposed unnamed dev sources tell reporters like Windows Central's Dan Rubino and Jez Corden that the..."
From "Ronaldo8" on beyond3d.All data needs to be processed, You have HW for certain processes, software for others. Decompression on HW is something you want otherwise you have software doing it.
Any single decompression job you can think of, unless specially written for multicore decompression with synchronization between threads (this is rare), is a single-threaded operation. Kraken is one of these rare examples, in that you can use up to 2 cores in special situations. Otherwise you have the same level playing field: Zen 2 processors doing "stuff".
Cache invalidation is an issue in every system you can think of. These scrubbers are a pretty good solution. There are plenty of ways to handle cache invalidation, though, but it still needs to be done. Why is it not an issue for XVA? Well, you ask about small caches; maybe you think we should store stack data on the SSD.
No, you do not flush cache. Ignoring cache and hitting RAM is a negative, not a positive. I think they think cache is fed from an SSD. If you miss cache, you hit RAM, or you read the SSD into RAM, then into your code, which then utilizes cache. You try to avoid this as much as possible. You do not flush cache on some arbitrary action like "reading an SSD". Remember, they say "you flush cache and you know it's clear, so you don't get a miss". That's actually a miss. The next few operations will have no cache and will all stampede the RAM / SSD at once (thundering-herd problem, or distributed locking! yay).
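To spell out that miss path: check the cache, fall back to RAM, fall back to the SSD, and fill each level on the way back up. A toy read-through hierarchy (types are hypothetical):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Key = std::uint64_t;
    using Block = std::vector<std::uint8_t>;

    struct Tier {
        std::unordered_map<Key, Block> store;  // one level of the hierarchy
    };

    // Read-through lookup: each level is only filled on the way back up
    // after a miss. Flushing a level doesn't avoid misses, it guarantees
    // them, and concurrent readers then stampede the slower tier together.
    Block read(Key key, Tier& cache, Tier& ram, Tier& ssd) {
        if (auto it = cache.store.find(key); it != cache.store.end())
            return it->second;                 // cache hit: fastest path
        if (auto it = ram.store.find(key); it != ram.store.end()) {
            cache.store[key] = it->second;     // fill the cache on the miss
            return it->second;
        }
        Block b = ssd.store.at(key);           // slowest tier: storage
        ram.store[key] = b;                    // stage into RAM first...
        cache.store[key] = b;                  // ...then warm the cache
        return b;
    }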
Coherency is about managing cache in multiple locations. The CE wouldn't tell the scrubber anything; if anything, it would ensure that if L1 wrote to cache, L2 replicated it for subsequent reads, otherwise you get stale reads. The scrubber deletes cache. The scrubbers might use the CE to find locations of the cache, in case it does sit in multiple locations. Distributed cache is hard across multiple levels.
No, the CPU is not reading registers from the SSD. Does it write them back too? Does the SSD store stack and heap? How large is this cache table?

No, this is what RAM is for. The CPU / GPU executes instructions with data loaded onto a stack, then moves on to the next instruction. You are not writing to this.

This is how every CPU, CD-ROM, or any device that accesses memory in the world works. The smaller cache is orders of magnitude faster. A typical single operation uses a few bytes of memory.

No, sampler feedback is to tell the engine, or the API in this case, what the next textures / MIP levels to fetch are, based on current RAM residency and the needs of the engine, and maybe delivering less than the engine requests based on previous requests and use of said textures. Absolutely nothing to do with cache. It's kind of a weird understanding you have; when you have 96 KB of L1 cache in these things, what do you imagine an SSD is doing? Is it just imagination?
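As a rough model of that feedback loop on the engine side (a sketch of the concept, not the actual D3D12 sampler-feedback API; the struct and fields are made up):

    // Per-texture streaming state: what the sampler said it wanted last
    // frame versus the finest mip actually resident in memory.
    struct TextureStreamState {
        int requestedMip;  // finest mip feedback says was needed (0 = largest)
        int residentMip;   // finest mip currently loaded
    };

    // Turn the feedback into a streaming decision: only fetch when sampling
    // wanted something finer than what is already resident.
    int nextMipToLoad(const TextureStreamState& s) {
        if (s.requestedMip < s.residentMip)
            return s.residentMip - 1;  // stream in one finer level
        return -1;                     // nothing to do (or an eviction candidate)
    }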
I have no dog in this fight; both consoles are awesome. The XSX will kill the PS5 in framerates, and most of the benefits will be seen in first-party titles for either console.
I'm interested in tech. I've worked as a principal architect in a company building massively distributed systems for the last decade, and I have built games as a hobby for the 360 and PC. I don't pretend to know everything, but the stuff we are talking about really has nothing to do with games; it's the typical architecture of any scalable system. I really feel people are looking for some magic bullet here to push their preferred massive corporation's new product because they have decided to buy it. That Twitter account is probably a gfx-enthusiast college student who wrote some VB and has a GitHub account with 0 code in it (as far as I saw, that was the case). I commend their enthusiasm; I do not appreciate matter-of-fact nonsense, though.
If you want to go into any of these concepts, we don't even need to use video games as a jump off point.
I posted the tweets in sequential order so you could get the whole picture, not pick them apart individually. The part about not flushing cache because it tanks performance is true, and she basically says that too at #5.

You're still thinking from the perspective of the traditional setup. What happens if the system sees 100GB of the SSD as RAM, having a total RAM pool of 116GB?

Yes, but if you're going to feed the GPU from the SSD, you're not going to do it directly to the smallest cache, obviously. You don't do that from RAM either.

Again, you're still seeing the SSD as storage rather than as extended RAM. Additionally, RAM is technically just a higher level of cache, being larger and slower.

I'm interested in tech as well. The PS5 SSD has been talked about so much that we mostly understand it. XVA is another story. I'm focusing on trying to understand it. Some people confuse that with hyping up the Xbox.

That's virtual memory: you can use mmio to reference a page of the 100GB directly, or it will swap out the page when you access outside of the resident memory. This isn't new; any database you can think of does the same, or you can use memory-mapped files for massive files.
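On a POSIX system, the equivalent of that "SSD seen as RAM" view is just a memory-mapped file: the OS pages data in from storage on first touch. A minimal sketch (the file name and sizes are made up):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Hypothetical giant asset package; nothing is read from disk yet.
        int fd = open("assets.pak", O_RDONLY);
        if (fd < 0) return 1;
        const size_t len = 4ull << 30;              // pretend a 4 GB file
        auto* base = static_cast<unsigned char*>(
            mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0));
        if (base == MAP_FAILED) return 1;

        // First touch of an address faults, and the OS pulls that 4 KB
        // page in from storage: the drive addressed like memory.
        unsigned char byte = base[(3ull << 30)];    // reads deep into the file
        std::printf("%u\n", byte);

        munmap(base, len);
        close(fd);
        return 0;
    }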
As for the coherency engine not telling the scrubber anything, how would the scrubber know what to delete if the coherency engine is not managing it? Even if the coherency engine is about managing cache in multiple locations, it can't properly manage it if the scrubber can delete what it wants. Whether the scrubber uses the CE or the CE commands the scrubber, the end result is the same. But they have to work together somehow. I think her explanation of the CE and scrubbing is more poor wording than a lack of understanding.
You read through the cache, miss, hit the RAM, then hit the cache with whatever the instruction / data was. You rarely prewarm caches in such a volatile environment.

You are right, in that it's a higher level, but I understand virtual memory very well. That's all this is. PS5 supports mapping the entire drive so....

Ok... And... You can't think of that having any benefit on the XSX, particularly due to the SSD being used as virtual memory rather than an HDD?

Agreed. But then again, the XSX is a bit complicated here, because its memory setup itself is also a bit peculiar, not counting any SSD virtual memory.

Well yeah. But that's kind of overkill, don't you think? No game is going to use the entire drive. At least I hope not lol.
Since you understand virtual memory, let's imagine this scenario...
Your GPU needs a certain high-level mip. It reads through the different levels of cache, misses, 'arrives' at RAM, and in the 16GB RAM pool there is physically only the low-level mip. However, the SSD is virtual RAM, and obviously the high-level mip is there, which means the GPU thinks it is available in RAM. How does the transfer of that high-level mip take place?
This is paging. This is something your phone does. The XSX is not doing anything any other machine cannot do. The PS5 supports this. This is typically slow on most machines. Do you see why these new machines benefit from their SSDs, and in particular... latency? Especially pulling in a small file!

Let me ask the same question another way. If the system cannot differentiate between the virtual memory and the actual RAM, would the required high-level mip in the scenario above be transferred as:
a) SSD -> RAM -> GPU
or
b) SSD -> GPU

a) Swapped into RAM (which in both of these consoles' cases is the GPU RAM, although with XSX you have two options, slow and fast).

Well, if the XSX works like that too, we're back at square one on trying to figure out what the Velocity Architecture actually is.
Both will use different paths to get there (one via a pure HW path, one staged via HW then the CPU).

For the caller, they do not know if the data is resident; the system does, though, via a lookup table from virtual memory into physical (where a miss may mean swapping in from disk).
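A toy version of that lookup table, assuming a flat page map (all structures and names hypothetical):

    #include <cstdint>
    #include <unordered_map>

    struct PageEntry {
        std::uint64_t physical = 0;  // frame address once resident
        bool resident = false;       // false => the data lives only on the SSD
    };

    // The caller just uses a virtual address; residency is the system's
    // problem. A miss on this table is what a swap-in from disk services.
    std::uint64_t translate(std::unordered_map<std::uint64_t, PageEntry>& table,
                            std::uint64_t virtualPage) {
        PageEntry& e = table[virtualPage];
        if (!e.resident) {
            // Page-fault path: read the page from the SSD into a free frame,
            // then mark it resident. The I/O itself is elided here.
            e.physical = 0;  // frame chosen by the pager (placeholder)
            e.resident = true;
        }
        return e.physical;
    }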
From "Ronaldo8" on beyond3d:
Begin quote
There seems to be a lot of misconceptions about the Xbox Velocity Architecture. The goal of the PS5's and the Series X's I/O implementations is to increase the complexity of the content presented on screen without a corresponding increase in load times/memory footprint, but they go about it in totally different ways. Since the end of the cartridge era, an increase in geometry/texture complexity was usually accompanied by an increase in load times. This was because while RAM bandwidth might be adequate, the throughput of the link feeding the RAM from the HDD was not. Hence, HDDs and the associated I/O architecture were the bottleneck.

One way to address this issue was to "cache" as much as possible in the RAM so as to get around the aforementioned bottleneck. However, this solution comes with its own problem in that the memory footprint just kept ballooning ("MOAR RAM"). This is brilliantly explained by Mark Cerny in his GDC presentation with the 30s-of-gameplay paradigm. PlayStation's answer to this problem is to increase the throughput to the RAM in an unprecedented way. Thus, instead of caching for the next 30s of gameplay, you might only need to cache for the next 1s of gameplay, which results in a drastic reduction in memory footprint. Indeed, the point of it all is that for a system with the old HDD architecture to have the same jump in texture and geometry complexity, either the amount of RAM needed for caching will have to be exorbitant, or frametime will have to be increased to allow enough time for the textures to stream in (low framerates), or gameplay design will have to be changed to allow for texture loading (long load times). The PS5 supposedly will achieve all of this with none of those drawbacks thanks to alleviating the bottleneck between persistent memory and RAM (the bottleneck still exists, because RAM is still quicker than the SSD, but it is good enough for the PS5's rendering capacity and hence doesn't matter anyway. You just don't load textures from the SSD to the screen.)

We can now see why the throughput from the SSD to RAM has become the one-and-only metric for judging the I/O capability of next-gen systems in the minds of gamers. After all, it does make perfect sense. BUT... is there an alternative way of doing things? Microsoft went in a completely different direction. Is the persistent memory to RAM throughput still the bottleneck? Yes! Why is more throughput needed? To stream more textures, evidently. The defining question is then: how much of it is actually needed? After careful research assessing how games actually utilise textures on a per-frame basis, MS seems to have come to a surprising answer: not that much, actually.

Indeed, by loading more detailed MIPs than necessary while keeping the main memory - RAM throughput constant, load times/memory footprint are increased. Let's quote Andrew Goossen in the Eurogamer deep-dive for reference:
"We observed that typically, only a small percentage of memory loaded by games was ever accessed," reveals Goossen. "This wastage comes principally from the textures. Textures are universally the biggest consumers of memory for games. However, only a fraction of the memory for each texture is typically accessed by the GPU during the scene. For example, the largest mip of a 4K texture is eight megabytes and often more, but typically only a small portion of that mip is visible in the scene and so only that small portion really needs to be read by the GPU."
The upshot of it all is that by knowing what MIP levels are actually needed on a per-frame basis and loading only that, the amount that needs to be streamed is radically reduced, and so is the throughput requirement of the SSD-RAM link as well as the RAM footprint. Can this just-in-time streaming solution be implemented via software? MS indeed acknowledges that it is possible to do so, but concedes that it is very inaccurate and requires changes to shader/application code. The hardware implementation of determining residency maps associated with partially resident textures is sampler feedback.

While sampler feedback is great, it is not sampler feedback streaming. You now need a hardware implementation for:
(1) transitioning from a lower MIP level to a higher one seamlessly;
(2) falling back to a lower MIP level if the requested one is not yet resident in memory, and blending back to the higher one when it becomes available after a few frames.

Microsoft claims to have devised a hardware implementation for doing just that. This is the so-called "texture filters" described by James Stanard. Do we have more information about Microsoft's implementation? Of course we do. SFS is patented hardware technology and is described in patent US10388058B2, titled "Texture residency hardware enhancements for graphics processors", with co-inventors Mark S Grossman and... Andrew Goosen.

Combined with DirectStorage (presumably a new API that revamps the file system, but information about it is sparse) and the constant high throughput of the SSD, this is how Microsoft claims to achieve a 2x-3x increase in efficiency. Hence, the "brute force" meme about the Series X is wildly off-base.
As for which of the PS5 or Series X I/O system is better? I say let the DF face-offs begin.
End quote
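To put rough numbers on the 30s-of-gameplay point above (the demand figure is invented for illustration; only the 30s and 1s horizons come from Cerny's talk): if traversal can require up to D GB of fresh assets per second and storage forces you to plan H seconds ahead, the resident cache must hold roughly D x H.

    #include <cstdio>

    int main() {
        // Illustrative only. Suppose traversal can demand up to this much
        // brand-new asset data per second of gameplay:
        const double demand = 0.5;        // GB of fresh assets per second

        // The pre-cache horizon is however far ahead the storage forces
        // you to plan: ~30 s for an HDD-era pipeline, ~1 s with these SSDs.
        const double hdd_horizon = 30.0;  // seconds (Cerny's figure)
        const double ssd_horizon = 1.0;   // seconds

        std::printf("HDD-era resident cache: %.1f GB\n", demand * hdd_horizon); // 15.0
        std::printf("SSD-era resident cache: %.1f GB\n", demand * ssd_horizon); //  0.5
        return 0;
    }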
New quote (starts with quoting ShiftyGeezer)
I will quote your own thoughts on the matter as response (from the UE5 thread):
"The moment the data is arranged this way, we can see how virtualised textures would also apply conceptually to the geometry in a 2D array, along with how compression can change from having to crunch 3D data. You don't need to load the whole texture to show the model, but only the pieces of it that are viewable, which is the same problem as picking which texture tiles with virtual texturing.
Very clever stuff."
Ronaldo8:
The Unreal Engine team has devised a software solution for a problem that Microsoft has resolved in hardware.
But sampler feedback in truth answers two questions:
(1) What MIP level was ultimately sampled (the LOD problem), i.e. what MIP level to load next.
(2) Where exactly in the resource it was sampled (which tiles were sampled). This is based on what's visible to the camera: basically, which tiles of that MIP to load next.
SFS is the streaming of only visible assets at the correct level of details. So yeah, software implementation of a solution already found in hardware.
End new quote
Read this article:
Coming to DirectX 12— Sampler Feedback: some useful once-hidden data, unlocked - DirectX Developer Blog (devblogs.microsoft.com): "Why Feedback: A Streaming Scenario. Suppose you are shading a complicated 3D scene. The camera moves swiftly throughout the scene, causing some objects to be moved into different levels of detail. Since you need to aggressively optimize for memory, you bind resources to cope with the demand for..."
And one more quote from Scott_Arm
Start quote
Microsoft's solution is virtual texturing with sampler feedback for accurate mip and tile selection, plus some hardware filters to blend from a low-resolution mip to a high-resolution mip in case the high-resolution mip is not loaded in time for the current frame. So they have some guarantee of the low-quality mip arriving on time, and then blend to the high-quality one if it's late, so you don't notice pop-in. It should be overall more efficient in making sure they don't waste memory on pages they don't need.
End quote
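The blend-on-late-arrival trick reads like plain interpolation once the fine mip lands; a sketch of the idea (not Microsoft's actual filter hardware; names are made up):

    #include <algorithm>

    // Per-texture blend state while a late high-resolution mip streams in.
    struct MipBlend {
        bool highResident = false;  // set once the fine mip finally loads
        float blend = 0.0f;         // 0 = coarse mip only, 1 = fine mip only
    };

    // Called once per frame: ramp the factor over a few frames so the
    // switch to the fine mip never reads as pop-in.
    float sampleWeight(MipBlend& s, float framesToBlend = 4.0f) {
        if (s.highResident)
            s.blend = std::min(1.0f, s.blend + 1.0f / framesToBlend);
        return s.blend;  // shader does lerp(coarseSample, fineSample, blend)
    }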
This sums everything up perfectly, thank you for that. Basically the difference isn't as big as the SSD speeds would have you think, and we can look forward to amazing games utilising these fast solutions. June/July game events can't come soon enough.
The Ronaldo user from B3D is speculating and comparing apples to oranges: the scenario he paints puts Sony as brute-forcing it by pushing the SSD I/O throughput through the roof, while only XVA would be doing hardware-assisted virtual texturing / texture streaming, making the gap appear much, much smaller, if not absent.

He clearly explains the two different approaches, and you might read it as a brute-force approach, but that's not how I read it. Just two different approaches to solving the same problem. The fact that Microsoft says that they can instantly access 100GB should tell you enough about the magnitude of the implementation.
To clarify, the XSX memory is reported to be six 2GB modules, each consisting of an upper and lower memory address range (= 12 GB), and four 1GB memory modules, making 10 memory segments.

These 10 modules are accessible by 2 bidirectional 16-bit lanes per module (32 bits x 10 lanes = 320-bit bus). Here is where the info gets a little murky about the architecture:

Those 16-bit lanes should be able to access all memory addresses to which they are attached.

But the way the architects describe it, the CPU is assigned access to 6 out of the 10 lanes. The CPU is reserved access to the upper memory address range of the 2GB modules only.

The lower 1GB memory address range on those 2GB chips is reserved for GPU work (6GB).

The four 1GB modules are reserved for the GPU at all times... So the lower 6 GB of the 2GB modules plus the 4 x 1 GB = 10GB of VRAM. The 2GB modules might be subject to contention, depending on whether accessing the upper 1GB of a 2GB module uses both lanes at full speed (both 16-bit lanes at 56GB/s) or half speed (one 16-bit lane at 28GB/s).

This setup is quite confusing, because it seems that if the GPU were to use full-band access to all 10 modules, the entire system bandwidth would be used and the CPU wouldn't have access to the other 6 GB.

I can't find a permutation of accesses by the CPU and GPU where they both can use their full bandwidth simultaneously. The most I could come up with is the GPU taking the full bandwidth of the 4 x 1GB chips (4 x 56 = 224 GB/s) plus using half the bandwidth to access the lower memory addresses of the 2GB chips (6 x 28 = 168 GB/s), for 392 GB/s max bandwidth without contention. The CPU would be the consumer of the remaining bandwidth (168 GB/s).
Sorry for the long post.
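For what it's worth, those sums check out against the published figures: each 32-bit GDDR6 channel at 14 Gbps moves 56 GB/s. A quick sketch of the same arithmetic:

    #include <cstdio>

    int main() {
        const double per_channel = 14.0 * 32 / 8;  // GB/s per 32-bit channel = 56
        const double total = 10 * per_channel;     // 560 GB/s on the 320-bit bus

        // The poster's no-contention split: GPU takes the four 1 GB chips
        // at full rate plus half rate on the six 2 GB chips, CPU the rest.
        const double gpu = 4 * per_channel + 6 * per_channel / 2;  // 224 + 168 = 392
        const double cpu = total - gpu;                            // 168

        std::printf("channel %.0f, total %.0f, gpu %.0f, cpu %.0f GB/s\n",
                    per_channel, total, gpu, cpu);
        return 0;
    }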
This isn't anything unique; I can guarantee the PS5 will do the same thing. It's not some magic, it just works better with fast storage.
I think he is comparing apples to oranges; Sony's approach seems roughly similar to the approach MS took with XVA, with likely more logic spent to accelerate SSD I/O and higher throughput (GB/s).
XVA is a marketing term for the solution Sony has no name for; maybe they should have called it Lightning Data Transfer Architecture or Infinite RAM Architecture.
The block Cerny described in his presentation seems not to be trivially designed, but something they spent quite a few transistors and quite a bit of R&D time on, resources they did not add to the GPU, hence the TFLOPS gap... They are probably happy, as they were able to drive a bigger gap in I/O than the one they lost to the XSX in the TFLOPS war.
Yeah I get that. It's just that Microsoft is basically saying we believe you don't need such a fast I/O when applying our XVA solution. But the games will tell.
Exactly.
I think I should visit Beyond3D more, seems to be far more information there without the obvious agenda.
This is what you said earlier, and what that Ronaldo poster from B3D implied: that the I/O gap was actually really narrow because of XVA or something. I do not think that is the case, but if you have evidence to the contrary I would be happy to discuss it over here, please.

The evidence is the interviews with Microsoft engineers saying that 100 GB is instantly accessible. Unless you are saying their engineers are lying? Obviously I can't prove what they are saying, because I don't have the hardware, so I can't run tests myself. So if they say that this is possible, I just believe Microsoft, since we can't do our own tests at the moment.
They reserve 100GB for paging. That's it. Nothing is stopping Sony from doing the same.
Why is the SSD supplying data to the GPU directly a good thing? The SSD is still nowhere near the performance of GDDR6. Shouldn't the SSD supply data to the VRAM, and then to the GPU?
It's delivering to VRAM directly; this doesn't usually happen on your PC. It's directly mappable to the VRAM, and the SSD can be used as a fast swap space.
And on XSX you are limited to 100GB to do this, but on PS5 you should have the entire SSD to do it instantly, right?

You typically reserve space on the drive for this, or a dedicated partition. If you did this to your whole drive, well, then you would probably not have many games there.
Perhaps you could show how you arrive at this conclusion....
Bumping the clocks up to 2.1 GHz on the RX 5700 (an 18 percent boost over stock) generally yields just 5-10 percent higher performance. This is due to various reasons, including sensitive boosting behavior.