DirectStorage speculations & myths


So, it’s been 2 months since the release of the DirectStorage GitHub repos, and there have been no major updates since.

I deferred this post for a little while waiting for more updates, but I guess I’ll just have to post it with what has been made available so far.

Myths? At least so far

Although I often enjoy the Linus Tech Tips channel, especially Anthony’s videos, their video on DirectStorage is blatantly wrong.

However, I believe it represents the idea that I (and probably everyone else) had in mind before its announcement.

I want to point out two things that are completely wrong in that video:

  • Contrary to its initial announcement, and from what has been revealed so far, DirectStorage does not provide a direct path from the SSD to the GPU. Perhaps this will change in the future, but I doubt it.
  • Anthony makes a flawed performance comparison. He compares a raw disk load vs decompressing on the fly (on the CPU), sees the raw load is 3x faster, and incorrectly concludes DirectStorage will offer a 3x performance improvement. That’s comparing apples to machine guns. And it wasn’t even comparing DS vs traditional APIs; it was DirectStorage vs DirectStorage.

Existing IO problems that DirectStorage wants to address

There are various problems related to IO.

Security and access synchronization

This applies to PC but can be ignored on consoles. It is the reason I seriously doubt DS will ever give the SSD -> GPU direct path we were promised. Ensuring the following is a nightmare:

  • The GPU respects file access permissions
  • The GPU doesn’t accidentally perform out of bound reads (i.e. read the next file on disk it’s not supposed to; or read filesystem metadata of random files like timestamps or filenames)
  • The GPU doesn’t somehow become able to write to the disk
  • The CPU doesn’t write to the filesystem sections the GPU is currently reading
  • Motherboard bus errors aren’t exposed (contrary to popular belief they’re RAMPANT; we just don’t hit them because our SW is designed to avoid them. But if we give widespread access, malware will find a way to trigger them)

If we could ignore this problem, then direct SSD -> GPU transfers could happen. This is possible in locked down platforms like consoles or iOS; but security is a concern in more open platforms like PC.

Spectre

Spectre has been a thorn in everyone’s side. New mitigations keep getting released and everyone is sick of it, to the point that people have started speculating it’s a conspiracy to sell new CPUs (unfortunately, it’s not).

In terms of IO, Spectre mitigations affect performance because every time we need to read from (or write to) disk we need a user -> kernel transition, and then a kernel -> user transition back to return to the program with the requested data.

mmap helps because it avoids this problem by mapping disk pages to virtual addresses. Hence one can read from disk as if reading directly from RAM.

But mmap has a problem: accessing unloaded pages results in a page fault. This is slow; especially if we’re reading the whole disk sequentially. It’s not a magic bullet.
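As a minimal sketch of the idea (POSIX, error handling omitted): the read loop below never calls into the kernel explicitly, but every first touch of a non-resident page does, via a page fault.

```cpp
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Minimal POSIX sketch: map a file and read it as if it were RAM.
// There is no read() call per access, but touching a page that isn't
// resident yet triggers a page fault so the kernel can bring it in from disk.
uint64_t sumFileBytes( const char *path )
{
    int fd = open( path, O_RDONLY );
    struct stat s;
    fstat( fd, &s );

    const uint8_t *data =
        (const uint8_t *)mmap( nullptr, s.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );

    uint64_t sum = 0u;
    for( off_t i = 0; i < s.st_size; ++i )
        sum += data[i];  // first touch of each non-resident page stalls on a page fault

    munmap( (void *)data, s.st_size );
    close( fd );
    return sum;
}
```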

Accessing lots of small files

The dreaded “4kb random reads” from benchmark utilities.

Accessing lots of files concurrently runs into two problems:

  • Spectre again. If we need to access 100 files, we need at least 100 transitions in, 100 transitions out (see the sketch right after this list).
  • SSD HW scheduling. Some of these files may actually live together in the same NAND module(s). This means that if an algorithm is clever enough to notice, and knows beforehand that all of these 100 files will be accessed, the SSD can load the whole NAND chip once and get all the data it needs, instead of re-loading the NAND cells multiple times.
    • Such an algorithm needs to sort this out faster than it takes to reload the NAND cells multiple times
    • Such an algorithm shouldn’t consume lots of CPU resources
    • It needs to know these accesses beforehand (or intentionally add latency waiting for multiple requests to stack up)
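To make the first point concrete, this is roughly what the traditional “one file at a time” pattern looks like (POSIX-flavoured sketch, error handling omitted). Every call below is a separate user -> kernel round trip, and the drive only ever sees one request from us at a time:

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>
#include <vector>

// Naive per-file loop: 100 small files means hundreds of kernel transitions
// through the Spectre-mitigated syscall path, and the SSD never gets a chance
// to schedule more than one of our requests at once.
void loadFilesNaively( const std::vector<std::string> &paths,
                       std::vector<std::vector<char>> &outBuffers )
{
    for( const std::string &path : paths )
    {
        int fd = open( path.c_str(), O_RDONLY );   // kernel transition
        struct stat s;
        fstat( fd, &s );                           // kernel transition
        std::vector<char> buffer( s.st_size );
        read( fd, buffer.data(), buffer.size() );  // kernel transition
        close( fd );                               // kernel transition
        outBuffers.push_back( std::move( buffer ) );
    }
}
```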

Lots and LOTS of caches

Traditionally, IO problems were solved with caches. If it wasn’t working fast enough, the solution would be to add more caches:

  1. The disk has an internal cache to avoid accessing the physical medium again (whether SSD NAND cells or the HDD platters).
  2. The OS has its own filesystem cache, keeping data in RAM to avoid hitting the disk entirely.
  3. The FILE API implementations have their own cache as well, to reduce the number of user <-> kernel transitions.

This worked well in the past because… HDDs could read at a meager 100 MB/s; SSDs at around 500MB/s; while RAM could read at >= 20GB/s speeds. And Spectre mitigations weren’t around either.

Caches were OK. RAM is much faster than disk, and its access latency is orders of magnitude better than that of HDDs.

But now that Gen4 SSDs can perform 6-8GB/s reads and current RAM is around 50GB/s, let’s put that in numbers using theoretical peaks:

  • RAM has 50GB/s of BW available.
  • OS has its own cache. We now have left 42GB/s of RAM BW (8GB/s to write to the OS cache)
  • FILE API has its own cache. We now have left 26GB/s (8GB/s to read from the OS cache, 8GB/s to write to the API cache)
  • The data is memcpy’d to the application-provided pointer. We now have left 10GB/s (8GB/s to read from the API cache, 8GB/s to write to the app’s pointer)
  • The application does ‘something’ with that data by reading it. We now have left 2GB/s

We’re left with a measly 2GB/s. We kept copying the same data around just to cache it, without doing any actual work, and there’s barely anything left for everything else the application needs to do!
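Put differently, still using worst-case numbers that ignore CPU caches: servicing a single 8GB/s stream from disk to the application costs 8 (disk -> OS cache) + 8 + 8 (OS cache -> API cache) + 8 + 8 (API cache -> app pointer) + 8 (the app reads it) = 48GB/s of RAM traffic, out of the 50GB/s available.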

Of course this is a theoretical worst-case scenario and isn’t realistic:

  • CPU caches (L1/L2/L3) will mitigate the impact when copying from the OS cache into the API cache, and from the API cache to the app pointer.
  • The API layer usually doesn’t have such a large cache, and will skip it if the transfer is big enough. How the caching algorithm is tuned plays a big role.

Anyway, regardless of actual real-world numbers, it’s becoming clear that as the BW gap between RAM and SSDs has shortened, caches may start hurting more than they help.

The solution DirectStorage proposes

DirectStorage goes in the direction I proposed in 2018 (skip to “Reevaluating API decisions”):

  1. Make the IO API more like Vulkan / DX12, but for disks
  2. Bake all the disk operations the app wants to perform into one or more command buffers
  3. Submit the commands together in one API call
  4. Wait for the results to arrive

This solves all of the mentioned problems:

Let’s say we need to access 100 files. It doesn’t matter if they’re big or small. We make a command requesting the data of all 100 files in one go:

  1. In relation to Spectre, this is solved. Instead of performing 100 transitions in, 100 transitions out; we only perform ONE transition in, ONE transition out.
  2. All 100 file accesses are known beforehand, thus any algorithm can sort out an optimal route to minimize unnecessary NAND activations; no need to add latency either.
  3. DS command queues map to NVMe Native Command Queues. This means the request created by the application can be handed directly to the HW-accelerated chips when present, freeing the CPU from doing that job.
  4. DS returns direct paths to data, with little or no caching in between, thus avoiding unnecessary copies. If we had 50GB/s of RAM BW available, we should now have somewhere around 34GB/s of BW left for the application (8GB/s consumed by DS writing into RAM, 8GB/s consumed by the application reading the data to do whatever it needs; and that’s without considering that CPU caches minimize the impact)
  5. The API has a place left for HW-accelerated decompression (like the PS5 does). But this is not yet implemented and so far only SW emulation is provided as a proof of concept.

This solves A LOT of the problems we enumerated. Of course if we want to load from disk -> GPU, we still waste BW by having the data travel first to system RAM (CPU) and then to the GPU; this is probably unavoidable due to security concerns.

But it’s still a lot better than either traditional filesystem APIs or mmap.

And it also works on any kind of medium. DirectStorage as presented so far can be used on an HDD as well as SSD. It would be a problem if the API only supported one type of hardware.
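To make the “pack everything into one submission” idea concrete, here is a rough sketch using the DirectStorage interfaces published so far (IDStorageFactory, IDStorageQueue, a D3D12 fence); treat the exact struct fields as approximate, and note that error handling and buffer setup are omitted:

```cpp
#include <d3d12.h>
#include <dstorage.h>
#include <wrl/client.h>
#include <cstdint>
#include <vector>

using Microsoft::WRL::ComPtr;

// One entry per file we want to read into a CPU-visible buffer.
struct BatchEntry
{
    const wchar_t *path;        // e.g. L"Folder/Level5.pak"
    uint32_t       sizeBytes;   // how much to read
    void          *destination; // application-provided destination
};

// Rough sketch: enqueue N reads and submit them all in ONE call.
void loadBatch( ID3D12Device *device, IDStorageFactory *factory,
                const std::vector<BatchEntry> &batch )
{
    DSTORAGE_QUEUE_DESC queueDesc = {};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue( &queueDesc, IID_PPV_ARGS( &queue ) );

    ComPtr<ID3D12Fence> fence;
    device->CreateFence( 0u, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS( &fence ) );

    std::vector<ComPtr<IDStorageFile>> files( batch.size() );

    for( size_t i = 0u; i < batch.size(); ++i )
    {
        factory->OpenFile( batch[i].path, IID_PPV_ARGS( &files[i] ) );

        DSTORAGE_REQUEST request = {};
        request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
        request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_MEMORY;
        request.Source.File.Source        = files[i].Get();
        request.Source.File.Offset        = 0u;
        request.Source.File.Size          = batch[i].sizeBytes;
        request.Destination.Memory.Buffer = batch[i].destination;
        request.Destination.Memory.Size   = batch[i].sizeBytes;
        request.UncompressedSize          = batch[i].sizeBytes;
        queue->EnqueueRequest( &request ); // records the command; no IO happens yet
    }

    // ONE submission for the whole batch: a single user -> kernel round trip.
    queue->EnqueueSignal( fence.Get(), 1u );
    queue->Submit();

    // Block until everything has arrived (a real engine would poll the fence instead).
    HANDLE event = CreateEventW( nullptr, FALSE, FALSE, nullptr );
    fence->SetEventOnCompletion( 1u, event );
    WaitForSingleObject( event, INFINITE );
    CloseHandle( event );
}
```

The key detail is that EnqueueRequest merely records commands; nothing hits the disk until Submit(), which is what lets the whole batch be scheduled at once.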

It’s not automatic

A common mistake would be to think DirectStorage will make games’ (and non-games’) disk IO automatically faster.

It’s not. The engine needs to be designed around it. Why? Well, so far the example I gave has a key assumption I didn’t point out:

  • The application knows beforehand it will be reading 100 files.

Game Engines (and normal applications) are often chaotic. They randomly ask to load a file and don’t really track who’s reading what.

If a game adopts DirectStorage but performs 100 submits of 1 file each, instead of packing 100 files into 1 submission, it’s going to be slow as heck. You’d be better off using traditional file APIs.

Another problem: if we pack 100 files into 1 submission but one of those files is high priority, the application needs to wait for all 100 files to be available.

If a file is more important than the rest, then the application needs to perform 2 submissions: placing the few high-priority files in the first submission, and the rest of the files in the next one (see the sketch below).
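Continuing the earlier sketch, such a split could look roughly like this (same caveats: exact details approximate, error handling omitted):

```cpp
// Rough sketch: split a batch into a small high-priority submission and a bulk one.
// 'criticalRequests' and 'remainingRequests' are DSTORAGE_REQUESTs filled as before.
void submitWithPriority( IDStorageQueue *queue, ID3D12Fence *fence,
                         const std::vector<DSTORAGE_REQUEST> &criticalRequests,
                         const std::vector<DSTORAGE_REQUEST> &remainingRequests )
{
    // Submission #1: only the file(s) we need right away.
    for( const DSTORAGE_REQUEST &request : criticalRequests )
        queue->EnqueueRequest( &request );
    queue->EnqueueSignal( fence, 1u ); // fence reaches 1 as soon as the critical data is in
    queue->Submit();

    // Submission #2: the remaining files, signalled with a later fence value.
    for( const DSTORAGE_REQUEST &request : remainingRequests )
        queue->EnqueueRequest( &request );
    queue->EnqueueSignal( fence, 2u ); // fence reaches 2 once everything else is in
    queue->Submit();

    // The app can start using the critical data when the fence hits 1,
    // without waiting for it to hit 2.
}
```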

Therefore I wouldn’t expect DirectStorage to be an automatic performance improvement; and please don’t blame a developer for not adding DirectStorage support because ‘they’re lazy’ and ‘have no excuse’. It won’t be a matter of just plugging it in.

Depending on the engine’s initial design choices, significant work may or may not be needed to take advantage of it.

A lot of applications are not actually IO bound

It’s common to blame an IO bottleneck without actually checking. Now that SSDs are becoming really fast, it’s also coming to the surface that applications which were apparently “IO bottlenecked” were actually just CPU bottlenecked, or doing something stupid like locking disk reads to VSync because of a progress bar indicator.

DirectStorage can’t help you there (except in perhaps realizing IO wasn’t the problem).

There are uses outside gaming

Version Control (Git, Mercurial, Subversion) is a prime candidate for DirectStorage-like APIs: version control systems need to recursively enumerate all files in a folder to look for modifications.

I can’t think of a better use case than that. The entire point of a VCS is to do disk operations in bulk.

Going file by file calling stat() is a complete waste of computing resources.

Database applications (e.g. SQL) could probably benefit from it as well; although it’s harder because DirectStorage right now is read-only, whereas databases need to do a lot of reading and writing.

Nonetheless the API could eventually evolve to support write operations, just like D3D12/Vulkan support reading from and writing to the GPU.

Why is it Microsoft-only?

I haven’t heard of any open alternative yet. DirectStorage so far doesn’t seem to fundamentally use anything that is HW specific (although there are plans, e.g. for HW-accelerated decompression), except that the data structures must map to NVMe NCQ so that they can be handed over directly without translation.

To be honest I’d expect the Linux server community to be much more interested in financing this type of tech development than the gaming community. Especially after how hard SQL performance was hit by the Spectre mitigations.

IO operations take a heavy toll on servers, particularly database servers. Apache’s probing of .htaccess files in every folder is a significant burden on shared webhosting.

Normal servers are typically advised to disable .htaccess when it’s not needed because of its performance impact; but shared webhosting providers often enable it because it gives customers a user-friendly way to customize their websites on a per-folder basis.

Ultimately the DirectStorage API boils down to being a Vulkan-like API to perform IO operations in bulk: submit commands and wait for them, with little caching in between.

There are other issues in the DS API that are gaming-centric. For example, DS assumes the application knows beforehand which files need to be accessed.

UPDATE: It has been pointed out to me that Linux has io_uring.
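For reference, the same “batch many reads, submit once” model with io_uring would look roughly like this (via liburing, error handling omitted):

```cpp
#include <liburing.h>
#include <vector>

// Rough liburing sketch: queue many reads, submit them with ONE syscall,
// then reap the completions. Conceptually the same model as a DS command queue.
void readBatch( const std::vector<int> &fds, std::vector<std::vector<char>> &buffers )
{
    io_uring ring;
    io_uring_queue_init( 256, &ring, 0 );

    for( size_t i = 0u; i < fds.size(); ++i )
    {
        io_uring_sqe *sqe = io_uring_get_sqe( &ring );
        io_uring_prep_read( sqe, fds[i], buffers[i].data(), buffers[i].size(), 0 );
        // (per-request user data to identify completions omitted for brevity)
    }

    io_uring_submit( &ring ); // one kernel transition for the whole batch

    for( size_t i = 0u; i < fds.size(); ++i )
    {
        io_uring_cqe *cqe = nullptr;
        io_uring_wait_cqe( &ring, &cqe );
        // cqe->res holds the number of bytes read (or a negative errno)
        io_uring_cqe_seen( &ring, cqe );
    }

    io_uring_queue_exit( &ring );
}
```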

API flaws I see in DirectStorage

So far I can see DS has the following flaws:

  1. It assumes we know the file location beforehand (e.g. located at Folder/Level5.pak). This is fine for most games, but non-games usually want to probe the contents of a folder and concatenate a few commands (e.g. find all files in a folder and fetch metadata such as last-modified timestamp, file size and name; signal the application if the application-provided buffer was filled and there is more data to fetch). See the hypothetical sketch after this list.
    • Such an API, if complex, would look very similar to GPU-driven rendering via indirect draws.
    • Such an API, if kept simple, would simply return all stat’ed files under a folder and finish when done or when the buffer is full
      • If the buffer is full, an API handle can be used to later resume the operation once the app consumes that buffer’s contents and enqueues a resume operation.
  2. IDStorageFile::GetFileInformation returns immediately, which means that either GetFileInformation is a blocking call, or IDStorageFactory::OpenFile is. That’s convenient, but bad.
    • File opening should be requested through commands, e.g. like IDStorageQueue.
  3. There is no IDStorageFactory::OpenFiles (plural).
  4. DirectStorage’s current API assumes reading a single file from disk is expensive, but ignores that opening a file is expensive too (fetching metadata, sorting out permissions; on Windows, each file open needs to synchronize access to prevent writes to opened files).
    • This is fine for videogames that package everything into a few multi-GB-sized tar-like files. But useless anywhere else.
  5. The API assumes file handles will remain open for a long time. This may not be the case for tools like git and other VCSs.
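As promised above, here is a purely hypothetical sketch of what a simple, bulk folder-enumeration command could look like if it followed the same enqueue/submit/fence model. None of these names exist in DirectStorage today:

```cpp
#include <cstdint>

// HYPOTHETICAL: none of these types or methods exist in DirectStorage.
// A bulk "stat a whole folder" command, following the 'kept simple' variant of flaw #1.
struct DirEntry
{
    wchar_t  name[260];     // filename
    uint64_t fileSizeBytes;
    uint64_t lastModified;  // FILETIME-like timestamp
};

struct EnumerateFolderRequest
{
    const wchar_t *folderPath;      // e.g. L"Folder/"
    bool           recursive;
    DirEntry      *outEntries;      // application-provided buffer
    uint32_t       maxEntries;      // capacity of outEntries
    uint32_t      *outNumWritten;   // how many entries were actually filled
    void         **outResumeHandle; // set if the buffer filled up; enqueue again to resume
};

// Usage would mirror the read path:
//   queue->EnqueueEnumerateFolder( &request ); // hypothetical call
//   queue->EnqueueSignal( fence.Get(), value );
//   queue->Submit();
```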