TSANCHEZ'S BLOG

Musing on Virtual File Systems

Virtual File Systems are everywhere. Each has its strengths and weaknesses, but they all provide a level of abstraction over resource access. In game development, you want that abstraction to remove dependencies in your loading process. But why not just use fopen? What can a VFS do for a video game? Let's discuss.

First off, let's get some definitions out of the way. Wikipedia's definition is fairly bland, but it shows how you end up using a VFS every day. In particular, FUSE, Daemon Tools, and DOSBox may be instances you've run across in your daily usage. The idea is to translate from one source to another in order to provide access to a resource in a common way. DOS games can't read your NTFS files, but DOSBox translates. Your games can't read .iso images, but Daemon Tools presents them as a CD drive. And FUSE lets you remap just about anything to just another directory or file. So, how does this help your games? If you've ever explored the file structure of a video game, it is hard not to notice files floating around like ".wad", ".pak", ".zip", etc. The developers pack up all their source assets into just a few archive files and need a way to access these in game. Many of the same ideas apply.

Now all of this is starting to sound like a lot of trouble for some simple file accesses. It would be simple enough to just have your "monster.tga" and "monster.md5" floating around in a directory and use "fopen" and "fread" to get at the data. But then you lose all the benefits of the archives!

  • Compressing your files into archives saves disk space. This can become important on large projects with limited disk sizes. Downloadable content is limited in size by several major publishers. Formats like the Sony UMD only provide ~2GiB of storage. If you're still shipping on CD-ROM, you've only got 700MiB.
  • Compressing the files also saves on loading time. Disk drives are slow. Regular 7200RPM spinning-platter disks max out around 60MiB/sec. The newest Solid State Disks are pushing 200+MiB/sec. Your DDR3 RAM, on the other hand, is pushing 10GiB/sec. That's a 50x difference even against an SSD, and you can use that to speed up your loading time. You're limited by the slowest input to the system (the HDD), so by reducing the amount of data you need from the HDD, you've sped up your load.
  • Packing together related files means less seeking around. Each file read can include a seek to the file table, a seek to the file, and a few seeks to get the file fragments. Having all your assets in one file removes the first two seeks, since you can just load up the archive once into memory.
  • Packing together related files helps budget memory and reduce load times. It usually ends up being very important to know how much memory your game needs. Having a few "common.pak" files to store stuff that only needs to be loaded once helps to solidify your memory budget. This gives you more room for your "level.pak" data and removes a chunk of data that no longer needs to be loaded all the time.
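To make the compression arithmetic above concrete, here is a back-of-the-envelope load-time model. The numbers (a 600MiB asset set, 60MiB/sec disk, 2:1 compression, a decompressor emitting 400MiB/sec) are illustrative only, not measurements:

```cpp
// Toy model: seconds to load dataMiB of assets, given disk throughput,
// compression ratio, and decompressor output throughput.
double loadSeconds(double dataMiB, double diskMiBps,
                   double compressionRatio, double codecMiBps) {
    double readTime = (dataMiB / compressionRatio) / diskMiBps;  // less data comes off the disk
    double decodeTime = dataMiB / codecMiBps;                    // CPU cost to inflate it
    return readTime + decodeTime;  // serial worst case; chunked reads can overlap these
}
```

With those numbers, the uncompressed load is 600/60 = 10 seconds, while the compressed one is 300/60 + 600/400 = 6.5 seconds, even before overlapping the decompression with the reads.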

There are a ton of other benefits, but that covers the major points. Still, it would be simple enough to just have PAK_Open(), PAK_Read(), etc. That, however, is exactly how you can limit yourself too early. In exactly the same way Daemon Tools presents your .iso as "just another directory", it'd be nice to present your pak files as just another directory for your "fopen" call. Then, while you CAN pack up your data, you don't have to during development. It could take a lot of time to re-pack a 100MiB archive just because you edited a 4KiB .tga in Photoshop. It would be really nice to just edit your .tga file and have the game reload only those changes. And what about mods and patches? It is VERY powerful to be able to specify "data/monster.tga" in your level's config and not have to know where or how that is stored. It is also VERY powerful to provide a unified code path for any coder looking to read a file. And thus, you need a VFS.
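The "just another directory" idea boils down to a mount table. Here is a minimal sketch of one; the names `Vfs`, `mount`, and `resolve` are mine for illustration, not from any particular library:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: map virtual path prefixes onto backing sources
// (a loose directory during development, a .pak in release). Callers
// only ever see the virtual path, e.g. "data/monster.tga".
class Vfs {
public:
    void mount(std::string prefix, std::string source) {
        mounts_.emplace_back(std::move(prefix), std::move(source));
    }

    // Resolve a virtual path against the mounts, most recent first,
    // so a loose dev directory can shadow an already-mounted pak.
    std::string resolve(const std::string& virtualPath) const {
        for (auto it = mounts_.rbegin(); it != mounts_.rend(); ++it) {
            if (virtualPath.compare(0, it->first.size(), it->first) == 0)
                return it->second + virtualPath.substr(it->first.size());
        }
        return {};  // no mount covers this path
    }

private:
    std::vector<std::pair<std::string, std::string>> mounts_;
};
```

During development you'd mount the loose asset directory after the pak, so edited files win; in the shipping build you'd simply skip that second mount and the same `resolve("data/monster.tga")` call lands in the archive.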

The Game Specifics

Some wonderful coders out there have put together different aspects of this over the years. boost::filesystem provides a lot of utilities to abstract away the host's physical filesystems and the differences that crop up between platforms like Windows and Linux. PhysFS takes a step in the direction of what I've been talking about. It provides access to files, folders, and archives with a unified interface, though it is lacking a few key items in the name of portability, items that tend to crop up in big game projects. I'd still fully endorse using PhysFS, as it is a solid library on the whole.

I've been working on updating my own VFS library to use a more C++-style interface, so I'll muse on some of the items lacking in PhysFS that I've personally found come up in game development.

Firstly, as I mentioned, memory is a big problem in games, and so is load time. It is often very useful to just load an archive into memory in one big chunk, then "cast" or "fix up" files in place. You take the data where it sits, call it a "Mesh *", maybe do some translation to fix up pointers and endian issues, and then your mesh is good to use. This entirely sidesteps a filesystem's "Read" operations on the child files.

Secondly, compression can save you a lot of time in loading your data. It can save you a lot more time if you decompress in parallel with the file reading. Instead of reading in your archive in one big read call, you want to chunk it up into smaller bits. If you read those smaller bits using an asynchronous read method (i.e. another thread, IOCP, boost::asio), then your code has a chance to decompress one chunk while waiting for the next chunk to finish reading. If it takes 5 seconds to read your file and 5 seconds to decompress it, your load time would be 10 seconds using the un-chunked version. With the chunked version, the decompress and read times overlap, so the whole load takes only slightly more than 5 seconds.
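The pipelining structure looks something like this. The `readChunk` and `decompress` bodies here are stand-ins (an in-memory source and a dummy codec) so the overlap pattern itself is the point, not the I/O:

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Stand-in for an async disk read; a real version would issue IOCP /
// boost::asio reads against the archive.
std::string readChunk(const std::vector<std::string>& source, std::size_t i) {
    return source[i];
}

// Stand-in for the real codec.
std::string decompress(const std::string& chunk) {
    return chunk + "!";
}

// Kick off the read of chunk i+1 *before* decompressing chunk i, so the
// disk and the CPU work in parallel instead of taking turns.
std::string loadPipelined(const std::vector<std::string>& source) {
    std::string out;
    if (source.empty()) return out;
    auto pending = std::async(std::launch::async, readChunk,
                              std::cref(source), std::size_t{0});
    for (std::size_t i = 0; i < source.size(); ++i) {
        std::string chunk = pending.get();       // wait for chunk i
        if (i + 1 < source.size())               // start reading chunk i+1 now
            pending = std::async(std::launch::async, readChunk,
                                 std::cref(source), i + 1);
        out += decompress(chunk);                // decompress while i+1 reads
    }
    return out;
}
```

With real 5-second reads and 5-second decompresses, every decompress after the first happens while the next read is in flight, which is where the "slightly more than 5 seconds" comes from.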

Thirdly, filters can be powerful. Sometimes you just want to treat some files differently from others, but you don't need to change the final interface to do so. I've built on top of the iostreams library for my VFS, and using std::streambuf, was able to provide a way to filter file data. These filters can be anything, including compression, buffering, encryption, checksums, or any other transformation you can think of needing. I also included in those filters a way to limit the range of the filter. So above the filter you seek in the range [0, size), while the filter knows to map that to the range [128, size + 128), where size may be less than the whole file's size. In this way a filter can hide headers, or, in the case of an archive, map a portion of the archive as the actual file being read.
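As a rough illustration of the range idea (this is my own minimal sketch over an in-memory buffer, not the author's actual filter code), a `std::streambuf` can expose a sub-range of a larger buffer so that callers above it seek in [0, size):

```cpp
#include <cstddef>
#include <ios>
#include <istream>
#include <streambuf>

// Minimal range filter: callers seek in [0, size); the buffer maps that
// onto [offset, offset + size) of the backing data, hiding a header or
// carving one file out of an archive.
class RangeBuf : public std::streambuf {
public:
    RangeBuf(char* data, std::size_t offset, std::size_t size) {
        setg(data + offset, data + offset, data + offset + size);
    }

protected:
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode) override {
        char* base = eback();
        char* target = (dir == std::ios_base::beg) ? base + off
                     : (dir == std::ios_base::cur) ? gptr() + off
                                                   : egptr() + off;
        if (target < base || target > egptr())
            return pos_type(off_type(-1));
        setg(base, target, egptr());
        return pos_type(target - base);  // position reported relative to the range
    }

    pos_type seekpos(pos_type pos, std::ios_base::openmode mode) override {
        return seekoff(off_type(pos), std::ios_base::beg, mode);
    }
};
```

Stacking this under a decompression or checksum streambuf gives you the filter chain without the reader ever knowing there was a header or an archive underneath.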

Lastly, there isn't just one place that files are read, but there should be. A proper VFS centralizes all the knowledge about file access and acts as a central point for access control and data gathering. Optimizing load times often requires intimate knowledge of the order and frequency of file accesses. Placing files on disk in the order they are read reduces seeking. Caching data that isn't "read once" can help reduce read times. But most important of all, scheduling your disk access will keep everything smooth. When you issue a disk read, the OS doesn't know how important that read is, so it runs a scheduler to keep all outstanding disk reads happy. But sometimes you want to issue a background level load while streaming in HUD and music. Without your direct intervention, the OS is likely to over-subscribe the disk, causing your music to stutter and your HUD to take forever to load. The thing is, your background level isn't nearly as important right this second, so it should have been put on hold for a moment. A well-designed VFS should give you the capability of scheduling disk IO without any of your game's subsystems having to know exactly what other disk reads might be happening. Without that, your HUD, music, and level loader all have to talk to one another directly. With it, each subsystem only has to describe how important its file accesses are, and the VFS can schedule around that.
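The core of such a scheduler can be as simple as a priority queue that the VFS drains one request at a time. This is a hypothetical sketch of the shape of it (names and priorities are mine), not a complete async implementation:

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Subsystems submit reads tagged with a priority; the VFS serves the
// highest priority first, so the music stream and HUD never wait
// behind a background level load.
struct ReadRequest {
    int priority;       // higher = more urgent
    std::string path;
    bool operator<(const ReadRequest& rhs) const {
        return priority < rhs.priority;  // max-heap on priority
    }
};

class IoScheduler {
public:
    void submit(int priority, std::string path) {
        queue_.push({priority, std::move(path)});
    }

    // Drain in priority order. A real scheduler would issue one read at
    // a time and re-check priorities between reads, so a late urgent
    // request still jumps the queue.
    std::vector<std::string> drain() {
        std::vector<std::string> order;
        while (!queue_.empty()) {
            order.push_back(queue_.top().path);
            queue_.pop();
        }
        return order;
    }

private:
    std::priority_queue<ReadRequest> queue_;
};
```

Each subsystem only states how urgent its reads are; the level loader never needs to know the music streamer exists.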

To be continued…. (once I get all my code zipped up in a stand-alone compile).

Copyright © 2002-2019 Travis Sanchez. All rights reserved.