Thoughts on Game Data Formats

I’ve been thinking about the representation and loading of game data recently. This has largely been the result of stumbling upon a variety of blog posts and libraries such as:

http://gamedevcoder.wordpress.com/2013/02/16/c-plus-plus-rtti-for-games/

http://kentonv.github.io/capnproto/install.html

http://blinkprotocol.org/

http://www.altdevblogaday.com/2012/06/04/read-my-lips-no-more-loading-screens/

Desired characteristics

So, the basic problem is that you want to store a large variety of “resources” for a game “level” (or “sector” or whatever), describing all sorts of things such as the location of every object, the vertices of every mesh, and so on. There are a number of desirable characteristics for this data format, such as:

  • Efficient on-disc representation
  • Efficient run-time and load-time access. Ideally you can load the data structure into memory and just use it directly without any “translation” step. At the very least any load-time cost should be cheap.
  • Some way of reflecting the data at run-time. Ideally this is something you don’t need for the final release game, at least for most objects, but it’s extremely useful during development to be able to essentially “attach” a visual debugger to a running game and get a description of every object in the level. E.g. for console development your editor could run on the PC, while the actual game runs on the console. The console would reflect its actual runtime state and keep the editor up to date, and the editor would send modifications as messages to the runtime.
  • Some way of evolving the representation of an object. Ideally a new runtime of the game should be able to just transparently load old resources without any fuss, or rebuilding. This is particularly useful for post-release patching, but even during development “rebuilding” thousands of objects (with whatever synchronization overhead and dependency tracking you might also need to do to avoid stomping on other developers’ toes) can quickly become an issue.
  • Multiple artists should be able to work on a level at the same time.

I haven’t implemented anything that provides all these benefits simultaneously in a shipping game, but with hindsight, and after thinking about some of the newfangled “protocol” formats that have become popular recently, I think I have a good approach in mind.

Overview

Levels are just folders of resources in the file system. Every single “object” in the level corresponds to a file, e.g. “tree_001291.inst”. The exact internal organization can vary – perhaps sub-folders refer to individual “sectors” (for e.g. streaming purposes), or perhaps the sub-folders are completely arbitrary and used by the designers to organize things however they want. The point is that there is no overall “level file” that anyone has to lock to modify the level. This means multiple artists can add to and modify different parts of the level. Just use normal source control, with exclusive locks, to manage the individual files. Any per-level properties would be stored in tiny little resources that get dropped into the level like anything else. E.g. you might have a “physics properties” resource that stores a per-level set of parameters like gravity. Keep even these global resources small and focused, to avoid contention. This does mean you’ll have many thousands of files per level, but it’s also super simple to inspect and debug when anything goes wrong.
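
As a purely illustrative example (all names and extensions invented), a level folder might look something like this:

    levels/forest_02/
        physics.props          <- tiny per-level “global” resource (gravity etc.)
        sector_a/
            tree_001291.inst
            tree_001292.inst
            goblin_000017.inst
        sector_b/
            rock_000532.inst
            ambient_wind.inst

Each file is one resource managed with normal source control, so two people only conflict if they touch the exact same object.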

Resources consist of “instance” data and a “schema”, both of which have a source representation and a runtime representation. For example, the schema might be represented in a simple JSON-like format, while the instance source data can vary (e.g. a Maya .mb file, or a .wav file). The runtime representation is binary and efficient, produced at build-time from the source using an appropriate “builder”. It’s likely that many different kinds of instances use the same builder (e.g. simple “property bag” type resources might just take their input as plain JSON, even if each one represents a completely different kind of game object), with a few custom ones requiring heavier processing.

The schema describes the data layout of each resource type. I.e. you might have a thousand “goblin” resources in the level, but there will only be a single schema describing the layout of all those instances. The reason for storing the schema separately, as opposed to having each instance be self-describing, is to avoid duplicating type metadata and field offsets etc. that don’t change, and also because it gives you a convenient place to hang additional (optional) metadata off of, such as RTTI information (and on the editor side of things, UI metadata).

The instance data is basically just the actual bits that make up a single instance of that resource. E.g. a single health pack, or monster spawn location, or rigid body, or a mesh, or whatever. It contains no “metadata”. It’s a compact format, but stored in a “native” way so that it’s cheap to read at runtime directly, without first having to convert it to a runtime representation. Keeping the instance data in this “native” form is important because it means most resource types wouldn’t need any separate “runtime version” of the on-disc data – they would just use the instance data directly. Some types would probably have some additional “companion” runtime data layered on top, created at load-time – e.g. D3D header objects and such.

A particular version of a schema is immutable. I.e. once you’ve created (and checked in) a schema you can’t ever change it. You can only create a new version of it (and you can only do that according to specific rules that don’t invalidate previous instances and schemas). Each instance refers to a particular schema of a particular version. In other words, the game can load any version of any instance, because it uses the schema the instance was built against to interpret the data. So what I said above about there only being one schema for thousands of instances of a given resource type was a lie – there can be multiple versions of each schema. In practice you’ll only have a few versions actively used at any one time, because any time you modify an instance, it gets saved out using the most recent schema (and you can periodically run a “refresh” batch job to rebuild instances).

The schema’s runtime asset is simple. Each field in the structure has a unique integer identifier starting from zero, with no gaps. This is the only way to identify a field. These identifiers cannot change when you create new versions of a schema. You can add new fields, and you can mark old fields as “deleted” (which means they no longer take up space in the instance), but you can’t ever reuse an old identifier. Lots of deleted fields imply some overhead in the schema runtime asset, but not in the instance data. These rules ensure that creating a new version of a schema won’t invalidate old instances.
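
As a made-up illustration of those rules, a “goblin” schema’s field identifiers might evolve like this across two versions:

    #include <cstdint>

    // Field identifiers start at zero with no gaps, never change, and are never reused.
    enum GoblinFieldId : uint32_t
    {
        kGoblin_Transform  = 0,  // since v1
        kGoblin_Health     = 1,  // since v1
        kGoblin_Aggression = 2,  // v1 only – marked “deleted” in v2, but the identifier stays reserved
        kGoblin_LootTable  = 3,  // added in v2
    };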

The runtime structure of a schema is basically just an array of offsets, one for each field, indicating where that field is stored within the instance data. There’s also some additional metadata containing the name of each field etc., used for RTTI (ideally stored in a separate chunk that can be skipped if you don’t need it).

Each schema corresponds to a generated C++ class header. This class simply provides a bunch of accessor methods. Each one looks up the offset for the field in question, using the field’s identifier and the offset array from the schema, and returns the memory from the instance data at that location cast to the right type. This does mean an additional indirection for field access, but it should still be reasonably fast (especially since the schema’s offset table is likely to be hot in the cache quite frequently). If the field doesn’t exist in the schema (e.g. it was added after the schema was created – so its identifier is larger than the size of the offset array, or the offset array contains the “deleted” marker), the generated code simply returns the default value (stored inline as a constant in the generated code). This is the migration path – if you add a new field to a schema, all you need to do is make sure you add the corresponding code to “do the right thing” with that new field, including doing something sensible with default values. Old code will simply ignore the new field (its version of the generated class won’t have it, nor any code that relies on it!), and if the new code reads an old instance it will just get the default value for that field.
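
As a rough sketch of what the schema runtime asset and the generated code might look like (all names, types, and defaults here are invented for illustration):

    #include <cstdint>

    // Sketch of a schema’s runtime asset: one offset per field identifier.
    // A sentinel value marks fields that were “deleted” in this version.
    static const uint32_t kDeletedField = 0xFFFFFFFFu;

    struct Schema
    {
        uint32_t        version;       // schema version this table describes
        uint32_t        fieldCount;    // number of entries in fieldOffsets
        const uint32_t* fieldOffsets;  // byte offset of each field within the instance data
        // (optional RTTI chunk with field names etc. could hang off here)
    };

    // Hypothetical generated accessor class for a "Goblin" resource type.
    class GoblinView
    {
    public:
        GoblinView(const Schema* schema, const uint8_t* instance)
            : m_schema(schema), m_instance(instance) {}

        // Generated accessor for field 1 ("health"). If the field is newer than the
        // schema, or was deleted, fall back to the default baked into the code.
        int32_t health() const
        {
            const uint32_t fieldId  = 1;    // fixed identifier, never reused
            const int32_t  kDefault = 100;  // default stored inline as a constant
            if (fieldId >= m_schema->fieldCount)
                return kDefault;            // field added after this schema version
            const uint32_t offset = m_schema->fieldOffsets[fieldId];
            if (offset == kDeletedField)
                return kDefault;            // field deleted in this schema version
            return *reinterpret_cast<const int32_t*>(m_instance + offset);
        }

    private:
        const Schema*  m_schema;    // shared by all instances of this (type, version)
        const uint8_t* m_instance;  // this instance’s raw data
    };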

Further details and rationale

So that’s the basic idea. There are some other details that are important to note, though.

So clearly you don’t want to be loading thousands of individual files all the time, which is why there needs to be some kind of “bundling” process that groups resources together – probably sorted by (schema, version), so that you load all objects with a given schema at the same time. This wouldn’t just be for release builds; you’d regularly build these bundles even during development. So that modifying a single resource doesn’t invalidate a whole bundle and revert you to excruciatingly slow load times, there needs to be a way for the loader to override individual resources from a bundle.
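
A minimal sketch of that override rule, using a stand-in Bundle type (not a real API):

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Stand-in bundle: a map from resource path to its built instance data.
    // A real bundle would be sorted by (schema, version) and read in big chunks.
    struct Bundle
    {
        std::unordered_map<std::string, std::vector<uint8_t>> entries;
    };

    // During development, a loose file on disk overrides the copy in the bundle,
    // so modifying one resource doesn’t invalidate the whole bundle.
    std::vector<uint8_t> LoadResource(const Bundle& bundle, const std::string& path)
    {
        if (std::filesystem::exists(path))
        {
            std::ifstream file(path, std::ios::binary);
            return std::vector<uint8_t>(std::istreambuf_iterator<char>(file),
                                        std::istreambuf_iterator<char>());
        }
        return bundle.entries.at(path);  // normal fast path
    }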

One likely objection is that it’s much simpler to just enforce header/data consistency and always just do a plain memcpy-style resource load. That way there’s no need for the indirection via the schema. Why go through the trouble? Well, in a system with strict header/data dependencies, if you change the layout, you have to rebuild every resource of that type. I really like the simplicity of this approach, but I’ve been down that road and had to live with resource rebuilding being a major iteration hurdle for everyone involved.

The bottom line is that if you have thousands of assets of a particular resource type, every tiny little inefficiency in your build-time code gets multiplied by that number, making your performance margins razor-thin. Even 100 ms of build time for a resource can quickly become unacceptable if you have ten thousand of them that need to be rebuilt because you added a flag. It’s not hard to hit those kinds of times from just loading DLLs and hitting the disk once or twice. Any slight inefficiency quickly adds up to many seconds, or even minutes, of waiting around. I don’t know about you, but I’d rather not have to worry about a couple of milliseconds here and there on the build side of my code. Plus, this approach opens up a pretty nice patching/DLC story.
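
(To put a number on it: 100 ms × 10,000 instances is 1,000 seconds – roughly 17 minutes of waiting because someone added a flag.)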

Another objection may concern the fact that we’re generating code. Why not let our schema source asset simply be our C++ header file and parse it to build the runtime asset for a schema? Or why not annotate it with preprocessor macros and generate the offsets directly in code? Well, in a word – simplicity. Writing parsers is pretty hard. Writing a parser with good error messages is even harder. Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination. You do not want to be in this business! You might be able to restrict the kind of C++ you allow in this “schema header”, but inevitably someone will mess up and put unsupported code in there, and your cobbled together mess of preprocessor defines, build-time parsers, and maybe even template meta programming will lead to the worst error messages you’ve ever seen. I’ve been down this road. There be dragons. Maybe you can avoid them if you think hard about it going in, but honestly I don’t see any major upside to the approach that pays for the extra risk and complexity.

In contrast, generating code is child’s play. The accessor methods in the generated code are trivial. Anyone can understand and debug them. There’s absolutely nothing clever going on when you’re generating code – the code that generates the C++ is simple, and the generated code is simple too. Everything is just plain old dumb C++. In my opinion, of the common choices for these things, code generation is the right one. It retains maximum flexibility, since you can generate any code you want (you’re not restricted to what you can do with preprocessor macros or templates or whatever), without costing you any simplicity.

You may worry about the overhead of these accessor methods. While it’s reasonably cheap, there’s still a range check on the offset, and an extra indirection. I’m not overly concerned about that. Part of it is because, in my experience, a huge chunk of the data you actually touch lives in a relatively small set of game types – e.g. matrices, quaternions, vectors, colors, etc. These types never actually change representation in practice, so they can simply be hard-coded as “native” types instead of being represented as schemas of their own. In other words, accessing the top-level “transform” field does require the extra range check and indirection, but each member of that structure (e.g. 16 matrix elements) can be accessed directly, as it’s just a plain old C struct. I’d estimate you end up with a few dozen “native” types like this, which improves the constant factors involved by quite a bit.
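
For illustration, continuing the hypothetical GoblinView sketch from earlier, a generated accessor for a “native” transform type might look like this (Float4x4 is a made-up native type):

    // Hypothetical “native” type: its layout is hard-coded and never versioned via a schema.
    struct Float4x4 { float m[16]; };

    // Generated accessor for field 0 ("transform"): one offset lookup for the
    // top-level field, then all 16 members are plain C struct accesses.
    const Float4x4& transform() const
    {
        const uint32_t fieldId = 0;
        static const Float4x4 kIdentity = {{1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1}};
        if (fieldId >= m_schema->fieldCount)
            return kIdentity;
        const uint32_t offset = m_schema->fieldOffsets[fieldId];
        if (offset == kDeletedField)
            return kIdentity;
        return *reinterpret_cast<const Float4x4*>(m_instance + offset);
    }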

Plus, if accessor overhead really is an issue, you could make sure your release builds never load old instances (by running a global “refresh” job on all instances as you produce your final builds) and hard-code the offsets in the generated accessor code instead of going via the schema. I’d avoid this, because it means you may have issues using the “upgrade path” for future patches. Another option is to tag certain types as having an “efficient” format: this would make the schema builder also produce a plain C struct for the type, plus a loader method that simply copies each member at load time (using the schema offset table and instance data) – once that’s done it’s just plain old C data that can be accessed at full speed. This costs you load-time performance and simplicity, but does offer a nice “backup plan” (e.g. if you want to modify the runtime data and save it back out to the instance source format, for live editing purposes).

So what about on-disc storage? If we need to store all our data in a “native” format, won’t we waste a whole bunch of memory? For example, using 32 bits to store an integer when it could be compressed to 10 bits is a bit wasteful, right? Yes, this is true, and it may be worth worrying about. However, it’s not uncommon to have some kind of transparent compression at the IO-system level anyway (e.g. based on zlib, or Snappy), so combined with the bundling system above most of the redundancy should be compressed away. If not, Cap’n Proto has a clever trick where the stored instance data is the actual instance data XOR’d with the default instance (which I would store in the schema). This means most instances consist of tons of zeroes (since most fields aren’t actually changed from their defaults, in practice). Combine this with some really cheap zero-aware compression where you only store non-zero bytes, and you get great on-disc storage characteristics. Of course, the next step is to skip the XOR step altogether: when you compress the instance you do it using a “default-aware” compressor instead of a zero-aware one – i.e. skip storing any bytes that are identical to the default instance, and at load time initialize the instance by loading bytes either from the resource instance or from the default instance, based on some control stream stored in the resource. Of course, you should measure carefully whether this is worth it. There’s something to be said for loading directly from disk into the final destination buffer, without having to even involve the CPU.
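
A minimal sketch of the XOR-with-default idea (the control-stream variant would replace the XOR with a skip/copy decision per run of bytes):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Encoding and decoding are the same operation: XOR the instance bytes against
    // the schema’s default instance. Fields left at their defaults become zero bytes,
    // which a cheap zero-aware compressor can then strip out almost entirely.
    std::vector<uint8_t> XorWithDefault(const uint8_t* instance,
                                        const uint8_t* defaultInstance,
                                        size_t size)
    {
        std::vector<uint8_t> out(size);
        for (size_t i = 0; i < size; ++i)
            out[i] = instance[i] ^ defaultInstance[i];
        return out;
    }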

Another interesting thing I’d like to try is to separate mutable from immutable data on the level of the schema (with immutable being the default, and this affecting the accessors being generated). Typically the vast majority of the data you’d load this way is immutable (in terms of size) – e.g. vertex data, audio data, texture data, etc., with only a few things actually being modified at runtime. Since we’re generating code from a schema, we can make these kind of global experiments pretty easily.

Anyway, this seems to combine the best features from all the different systems I’ve worked with, read about, or implemented myself. Thoughts?

8 thoughts on “Thoughts on Game Data Formats”

  1. Cool. Pretty much my thinking (and indeed, doing). I also think that making the instance documents text docs and/or human readable is a massive benefit – e.g. XML, JSON, etc. There shouldn’t be much cost (especially if this is coupled with some bundling/packaging step) in converting these to their binary equivalent during a deploy step, and keeping these in text lets you leverage all the text-based tools (perl, diff tools, but let’s not underestimate the power of find-and-replace-in-files), supports quick and dirty hacks during development, etc. Note that I think this applies to 3rd-party tools where possible too – i.e. don’t use .mb files, use .ma: while writing a mel-script to change/add/remove attributes across an entire depot is not hard, it’s not as easy as just finding and replacing strings.

    As to parsing this binary XML at runtime, I think it’s best not to sweat the small stuff. For games, structurally complex files are in the minority – loading time is dominated by bulk resources like textures and geometry. These will almost always be in their native formats (although during development you might want to support rebuilding VBs if shader definitions change?) and the cost of loading them will be pure IO.

    How should schemas be expressed? Unlike the instance docs (which should be human readable, and only human written in edge cases), schemas should be easily human writeable – i.e. not XML. It’s so easy to throw together a DSL for a simple declarative language that there’s almost no reason not to. Probably the best thing to do is just pick the declarative sub-set of a language that people are already familiar with and where the grammar is well-documented (i.e. don’t make up your own!), and paste that grammar into your parser generator of choice. Lots of code paths to test, but in the overall scheme of things, a parser for a declarative DSL is easy to test.

    • Hi, must’ve missed this reply in the deluge of traffic on my other recent post!

      I do agree that instance *sources* should be human readable. This may be JSON or something like it most of the time (generated by humans, or by a level editor, or whatever), but sometimes the instance source will be a .ma file, or indeed something that isn’t human readable (e.g. .wav). Of course, you could choose to convert your .mb file to JSON, and then convert from JSON to the binary instance format as a pre-deploy step (this step should be fast – all the processing happens in the mb -> JSON step).

      I think there are appealing benefits to keeping the data in its final runtime representation on disk, though (i.e. no parsing, no pointer relocation, etc.).
      1. Performance. Yes, you often have idle time while loading, but it’s not trivial to keep the IO busy when you get a burst of a thousand tiny instances (taking very little IO time, but needing a bunch of low-IPC patch-up work). Clever file-system logic and concurrency can avoid this, but being able to just DMA the data straight to its final location with no CPU intervention required is very appealing to me for its simplicity. Especially if you do this while the CPUs are busy running the actual game (background streaming).
      2. Runtime tweaking. If the data used at runtime is identical to the data stored on disk, it’s very easy to modify it and just save it back out again (e.g. save games, persistent damage to worlds, etc.). You may not need a binary->XML serializer at all. It could all be strictly one-way (just make sure any dev-time editing happens on the source instance data, which then gets converted to the binary representation and reloaded, rather than tweaking the runtime data directly).

      Yes, in my experiment I just wrote a very simple DSL for the schema representation. Each struct is its own file, so it’s basically just a list of (name,index,type,default) tuples. The default gets copied into the generated source code as the fallback for when the field doesn’t exist. Trivial to parse even with a non-recursive single-pass manual parser, without any parser generator needed. It also supports copying in C++ verbatim in delimited sections (to add custom methods, etc.).

  2. The approach I’m currently exploring is using a specialized file system, similar to Linux ext2, to lay out a DBMS.
    1. Define a Package as a sequence of pages from 0 to N.
    2. A Package is laid out as an extent-based file system, allowing the creation of a number of sub-files.
    3. Each sub-file is used to store custom data according to the needs of the game.
    4. Use a frame buffer to transparently load/store data between disk and memory.

    The pluses and cons that I’ve found with this approach are…

    Pluses:
    – Direct and transparent mapping of game data from disk to memory.
    – A unified approach to data and memory management (basically everything is “seen” as being in memory) but streamed in real time with careful use of pre-fetching
    – Easy to design custom data structures to support the specific needs of the game once the system is set up + all the classic data structures such as R-Trees, Heap Tables, Linear/Extensible Hashing, B-Trees; custom ones might be designed to store voxels, or other more game-industry-specific things
    – Easy to switch to more cache-friendly formats such as column stores, or PAX
    – Easy to add reflection and dynamic schema changes; basically just add a Catalog such as Table of Tables, Table of Columns, … and any other Meta-Table as required
    – Easy to take advantage of immutability

    Cons:
    – Complete rewrite of all data structures into a “page”-aware form (all paged, using standard DB operators such as Map/Pin/Unpin/MarkDirty)
    – Complex implementation to add parallel load and store

    References for some ideas on the DB part.
    http://www.amazon.com/Database-Management-Systems-Raghu-Ramakrishnan/dp/0072465638/ref=sr_1_1?s=books&ie=UTF8&qid=1371685718&sr=1-1&keywords=database+management+systems+3rd+edition+by+ramakrishnan+and+gehrke+r+g

    http://www.amazon.com/Database-Systems-Complete-Book-2nd/dp/0131873253/ref=sr_1_6?s=books&ie=UTF8&qid=1371685778&sr=1-6&keywords=database+System+Implementation

    http://www.amazon.com/Foundations-Multidimensional-Structures-Kaufmann-Computer/dp/0123694469/ref=sr_1_1?s=books&ie=UTF8&qid=1371685850&sr=1-1&keywords=multidimensional+data+structures

    Any good OS book, such as Tanenbaum’s, for the file system part.

    FilasienoF

    • Yes, a lot of it is similar to what we did at Rare! With some major differences, of course. I now think that you should use purely code generation to (de-)serialize data, that schemas should be *copied* into any other schema using them instead of using a runtime lookup, that the on-disc structure should be directly accessible (through a single “offset table” where each field has an index) rather than generating a runtime struct that you have to copy in and out of, etc.

  3. I was thinking the same, Tom! This bit in particular made me chuckle: “Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination”…

    I agree with you on all this, Sebastian. I’m glad you mentioned Cap’n Proto; I’ve been looking at it recently wondering if it would be useful for game assets, and I think it’s not far off!

  4. There are two main issues with Cap’n Proto. First, it duplicates schema data, since each instance is self-describing. That may not be an issue, but it seems silly not to split out the “shape” information when it’s so easy. Especially since it makes the accessors cheaper too.

    The second is the whole streaming stuff. It sounds neat at first, but it makes everything more complicated, and it’s actually not needed once you think about it. You rarely want to load half an asset. Build streaming on top of the asset system, don’t try to make it part of the format (you’ll want custom strategies for different asset types anyway).

    For writing messages, use a 64-bit process, allocate 1TB of address space or something ridiculous like that, and lazily commit physical pages. That way you get the same benefits (no need to know the size in advance, no extra copies), but you let the OS handle the indirection, and it becomes hardware accelerated too.

  5. Nice read, thanks. Another approach I’ve been thinking about: use a SQLite database. It would probably work only for smaller things where performance and storage footprint are not critical. On the other hand: very easy to implement schema migrations, standard tools to manipulate and query the data. Keep a SQL dump in the repository to enable versioning…
