Language Design Deal Breakers

I’m a bit of a programming language nerd. I love to learn new languages. That said, I spend most of my days writing C++. It’s a truly awful language, with many warts and problems, but I know it well and with enough effort and pain you can get the job done. The biggest benefit of C++ is purely an historical accident: it’s great because it’s popular. There’s no trouble finding people who know it, or libraries that work with it.

The point is that in order for me, or anyone else, to actually switch from C++ to a different language, there has to be a substantial improvement to overcome the inertia that C++ has built up over the years.

I find that when I look at a new language, there are a number of “deal breakers” which simply mean that regardless of all its other nice features I will never take it as a serious contender. Note that this isn’t a fair fight. Something can be a deal breaker even if C++ doesn’t have that feature either. Any language has to be so much better than C++ that the benefits outweigh the inertia. A deal breaker is a feature that either A) puts it at a disadvantage compared to C++ or B) puts it on equal footing with C++ in an area where C++ is doing particularly poorly.

Here’s my list of language design requirements which, if unmet, would be deal breakers for me. What’s yours?

1. Must be statically typed

I know you’re supposed to be diplomatic and claim that there’s two sides to this story, and no real right answer, but really people who think dynamic typing is suitable for large scale software development are just nuts. They’ll claim it’s more flexible and general but that’s just nonsense in my opinion. It may be true compared to utter strawman examples of static typing. If you think this is an argument, go learn Haskell, and then get back to me.

There’s a wonderful correspondence between programs that can be statically typed, and programs that can be statically understood by other people. In other words, programs that you can understand without having to run a debugger or REPL, by just reading code. If you’re writing programs that rely on dynamic typing to work, you’re probably writing exceedingly difficult to understand code. If you are writing code that can be statically understood by other people, you’re probably writing down comments to indicate the allowable types and preconditions anyway (or you should) – so you might as well have those properties checked by the compiler.

Most code does not rely on dynamic typing to work. It could be written just as well in a statically typed language. In that case it’s just a matter of convenience – make specifying types light-weight enough that the extra syntactic overhead doesn’t outweigh the benefits (type inference helps; even the simple “local” kind that C# and C++11 have cuts down on much of the noise).

Then there’s the correctness issue. Static typing catches mistakes early. This is a good thing. Now, dynamic typing advocates will say that of course any real program will also have a unit testing suite, so types can be checked that way! This is obviously naïve in the extreme. In the real world writing unit tests for everything very frequently fails the bang-for-buck test – the cost is just too high. This cost comes in two forms: first the cost of writing the tests, and second the friction it introduces to future modifications and refactorings – after all, if you have to update 20 unit tests to refactor a messy interface, you’re a lot less likely to do it. Examples of code you probably don’t want to write unit tests for include gameplay logic which gets written and rewritten twenty times before ship as the result of iteration. Having to write test harnesses and tests for all this throwaway code is madness. Then of course there’s code that’s just plain hard to unit test, such as graphics. The point is that pervasive unit testing is a fantasy – it may be feasible in some domains, but it’s certainly not the case everywhere. Static typing is a great compromise – you basically document the things you would’ve put in comments anyway, but you do so in a standardized way that can be checked by the compiler. This catches many errors for very little cost. When unit tests make sense (as they sometimes do), you can add those too for even more coverage, but when they don’t you still have a basic sanity check.

Finally, performance. Yes, JITers can be very clever and employ tracing and stuff to close some of the gap here, but of course those techniques could be used by a statically typed language implementation as well. If we can statically decide what methods to call and what types to use we will always be at an advantage compared to a language where this has to be determined at runtime. Not to mention that statically typed languages will encourage you to write code that can be executed efficiently (e.g. variables don’t change types over the course of a program). Also, static typing allows you to make performance guarantees that won’t be randomly broken tomorrow. For example you can store an unboxed array of values, or store a value on the stack or embedded within another object. You don’t have to rely on a runtime tracing system to uncover and apply these optimizations (with typically no way to ensure that it actually occurred), which also means you don’t have to worry that you’ll fall off a performance cliff in the future when something subtle changed which caused the JITer to fail to optimize something. The programmer is in control, and can statically enforce important performance-related choices.

Of course if you want to allow dynamic types as an option for a subset of cases where it makes sense (e.g. dealing with untyped data read from external sources), that’s fine.

2. Memory safety

C++ fails this one, but it’s still a deal breaker. If I was content with worrying about memory scribbles and random access violations I’d just stick to C++. In order to offer enough benefits to convince me to switch from C++, your language needs to be memory safe.

Now, let me also state that I think the language needs an “escape hatch” where you can tag some function or module as “unsafe” and get access to raw pointers. The point is that this needs to be something optional that you have to explicitly enable (using a compiler switch, ideally) for very isolated cases, and you should never be required to use it for typical “application level code”.

This also implies some kind of guarantee about memory reclamation… We’ll get to that.

3. Yes, memory safety – no null pointer exceptions!

I use a broader definition of memory safety than some. In my view, if a program follows a reference and crashes because the reference is invalid, it’s not a memory safe language. This implies that in addition to making sure you never point to stale memory or memory of the wrong type (through type safety and automatic storage reclamation), you also have to make sure that a reference never points to null.

Null pointer exceptions have no business in modern languages. Getting rid of them costs no performance, and statically eliminates the one big remaining cause of runtime crashes that we still see in otherwise “modern” languages. Yes, it requires some thought (particularly w.r.t. how you do object construction), but this is a solved problem. There is only one valid reason not to do this and that’s legacy code – perhaps you didn’t know what you were doing when you started off, not realizing the issue and how simple the solution is, and now you have too much legacy code to fix it. A language without legacy issues should get this right, and yet many don’t.

Note: it’s not enough to simply have a “non-nullable-pointer” type, not even if it’s checked at compile time. You have to make sure that nullable pointers cannot be dereferenced. In other words, using the pointer-dereferencing operator on a nullable pointer should literally be a compile-time type error. You should have to do a check of the pointer first which introduces a dynamic branch where on one of the sides the type of the pointer has been changed to be non-nullable, which enables the dereferencing operator again.
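
This can’t be fully enforced in C++, but to make the shape of the idea concrete, here’s a rough sketch of my own (not taken from any particular language): the dereference operator only exists on a non-nullable wrapper, and the only way to reach the pointee of a nullable one is through a check that hands the non-null branch a non-nullable value.

#include <cstdio>

// Sketch only: NotNull<T> is the only type with a dereference operator.
template <typename T>
class NotNull
{
public:
    explicit NotNull(T* p) : p_(p) {}    // assumed non-null by construction
    T& operator*() const { return *p_; }
    T* operator->() const { return p_; }
private:
    T* p_;
};

// Nullable<T> has no dereference operator at all. The only way to reach the
// pointee is match(), which gives the non-null branch a NotNull<T>.
template <typename T>
class Nullable
{
public:
    Nullable(T* p = nullptr) : p_(p) {}
    template <typename IfSome, typename IfNone>
    void match(IfSome if_some, IfNone if_none) const
    {
        if (p_) if_some(NotNull<T>(p_)); else if_none();
    }
private:
    T* p_;
};

int main()
{
    int x = 42;
    Nullable<int> maybe(&x);
    maybe.match([](NotNull<int> p) { std::printf("%d\n", *p); },
                []                 { std::printf("null\n"); });
}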

Having every few statements potentially trigger a runtime crash is an extremely expensive price to pay… for nothing! We’re not talking about a pragmatic tradeoff here, where you pay the cost of reduced reliability but get something else in return – there’s just no real upside to allowing dereferencing of potentially null pointers.

4. Efficient storage reclamation

I mentioned above that I require automatic storage reclamation; well, I also require it to be efficient. Too many modern languages fail this. I don’t really mind how they achieve performance, but two ways you could do it are to support pervasive use of value types (i.e. types that get embedded within their owners – which drastically reduces the number of pointers in your heap, and therefore the cost of GC), along with thread-local heaps. Concurrent collection is another solution, though you’d have to be careful not to incur too much of an overhead on the mutator with whatever memory barrier you use. I’ve written about this issue twice in the past: one, two

5. Great Windows support

Windows is still the most popular OS by far. If you don’t give me installers, or easily-buildable code, for Windows, I’m not buying. I realize a lot of academia is used to *nix, which causes a disproportionate preference for Linux and Mac in projects such as this, but in the real world your customers actually use Windows so you should make sure it works.

Conclusion

The most common one of these deal breakers is probably dynamic typing. I usually just close the tab immediately when I see that a language is dynamically typed. So while strictly speaking this is the most common issue, it doesn’t feel like it because I never invest any amount of time even considering those languages anymore. They may have some benefits in terms of theoretical elegance (e.g. Lisp, or even Lua or IO), but for actual work I’ve seen too much of the benefits of static typing, and too much of the horrors of not having it. It’s so obvious, that it barely qualifies as a deal breaker – it’s not even in the running to begin with.

So really, the most common deal breaker is probably point 3 above – null pointer exceptions. Whenever I see promising languages like Go, D or Nimrod that still repeat Hoare’s billion dollar mistake, I get a bit sad and frustrated. It’s like seeing a beautiful, elegant car that has just one tiny flaw: the wheels occasionally fall off. It simply doesn’t matter how many awesome and innovative features you have if you don’t get the fundamentals right.

Robin Hood Hashing should be your default Hash Table implementation

There’s a neat variation on open-addressing based hash tables called Robin Hood hashing. This technique isn’t very well-known, but it makes a huge practical difference because it both improves performance and space utilization compared to other “standard” hash tables (e.g. chaining). It also eliminates the major downside of other open addressing hash tables.

Here are the benefits, summarized:

  • High load factors can be used without seriously affecting performance. 0.9 is perfectly reasonable as a default (0.95 or higher works too, it mainly affects insertion cost a bit).
  • No linked lists or other extra pointers. This reduces cache misses and storage overhead. Your underlying structure can be a simple flat array since it’s just open addressing under the hood.
  • Lookup and insertion logic is fast. Again, no linked lists to traverse or other complications, just a linear probe sequence with a few checks per element.
  • Unlike other open addressing schemes, looking for non-existent elements is still fast.

For a simple benchmark where I inserted 100k English words into a map, then deleted 10% of them, and then looked them all up again, the timings for my simple Robin Hood hash table were 23%, 66% and 25% lower for insertions, deletions and lookups respectively, compared to the VS 2012 std::unordered_map. It did this while using 30% less memory overall.

It’s all about the variance

So the basic idea is to take normal open addressing, but use one clever trick in order to drastically reduce the variance of the expected average and maximum probe lengths. We’ll see later why reducing variance is so important. To give you an idea of how Robin Hood hashing improves things, the probe length variance for a RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it’s 16.2. At a load factor of 0.99 it’s 1.87 and 194 respectively.

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way, elements that were inserted early and thus “lucked out” on their probe lengths will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name – the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and for that probe count to fit within one cache line.

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) – it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early out after 1 probe, but there are also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.

Furthermore, the expected value of the longest probe sequence approaches about 6 for a load of 0.9 (it’s not actually constant, but grows very, very slowly – see this paper), which is pretty reasonable.

What about failed lookups?

So now you’re probably thinking that this sounds good, but in the back of your mind you know that normal open addressing tables do pretty well too, yet have a major downside: searching for non-existing elements is troublesome to handle. To allay your fears I’ll first tell you a very simple (but totally practical) solution that is enabled by the low variance, and then we’ll see the even better version later.

First a recap. The usual problem is this: since the search algorithm terminates when you find an “empty” slot in the underlying array, it can take a very long time to determine that an element doesn’t exist in the array when the table grows full.

How does Robin Hood hashing solve this? Well the simplest solution is to exploit the fact that the expected longest probe count is low (~6). Just modify the standard search algorithm to ignore empty slots (rather than terminate the search) and keep going until you’ve probed longer than the known maximum probe length for the whole table. This maximum probe length will be small, with a very high probability, even for very loaded tables.

This solution would obviously be inefficient for the standard open addressing scheme, since the expected longest probe count is much higher (e.g. at a load of 0.9 and ~64K elems it is about 70).
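
Here’s a minimal sketch of that simpler variant (my own illustration – it borrows the helper names from the implementation shown below, and assumes the table tracks a max_probe_length that is updated on every insertion):

// Hypothetical lookup using a table-wide maximum probe length.
// Note that empty slots do NOT terminate the search here.
int lookup_index_simple(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    for (int dist = 0; dist <= max_probe_length; ++dist)
    {
        if (elem_hash(pos) == hash && buffer[pos].key == key)
            return pos;
        pos = (pos + 1) & mask;
    }
    return -1; // probed further than any element ever has: not present
}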

The details

In order to know what the probe count of an existing element is (which is key to the algorithm) we could just re-hash the element to re-compute its “desired” slot index and then subtract this from its actual location. A faster way is to simply cache the hash value for each key.

Storing the hash value is generally a sensible strategy anyway, because it means you have a fast early-out for comparisons, so we take that approach here. In other words, the hash table elements consist of a 32 bit hash value, the key, and the value. For my implementation I stored the hash values in a separate array in order to get more hash probes per cache line (at the expense of a second mandatory cache miss to actually fetch the key). This gave a small speed-up for my test where the key size was large (sizeof(std::string)), but there’s a #define in the code to put the hash value alongside its key.

In order to indicate that a slot is unused, we modify the hash function to never return 0, and use a stored hash value of 0 to mean “uninitialized slot”.
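
For reference, here’s one plausible way to define the small helpers used by the code below (the linked implementation may differ in detail – the top bit of the hash is reserved for the tombstone marker described in the deletion section further down):

static uint32_t hash_key(const Key& key)
{
    uint32_t h = static_cast<uint32_t>(std::hash<Key>()(key));
    h &= 0x7fffffff;   // clear the top bit, which we reserve for "deleted"
    h |= (h == 0);     // never return 0; 0 means "uninitialized slot"
    return h;
}

static bool is_deleted(uint32_t hash)
{
    return (hash >> 31) != 0;   // tombstone bit set?
}

int desired_pos(uint32_t hash) const
{
    return hash & mask;         // capacity is a power of two
}

int probe_distance(uint32_t hash, int slot_index) const
{
    // How far slot_index is from the slot this hash wants, modulo capacity.
    return (slot_index + capacity - desired_pos(hash)) & mask;
}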

First, let’s look at the insertion algorithm, since this is where the actual Robin Hood trick comes in.

void insert_helper(uint32_t hash, Key&& key, Value&& val)
{
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {			
        if(elem_hash(pos) == 0)
        {			
            construct(pos, hash, move(key), move(val));
            return;
        }

        // If the existing elem has probed less than us, 
        // then swap places with existing
        // elem, and keep going to find another slot for that elem.
        int existing_dist = probe_distance(elem_hash(pos), pos);
        if (existing_dist < dist)
        {	
            if(is_deleted(elem_hash(pos)))
            {
                construct(pos, hash, move(key), move(val));
                return;
            }			
            swap(hash, elem_hash(pos));
            swap(key, buffer[pos].key);
            swap(val, buffer[pos].value);
            dist = existing_dist;
        }

        pos = (pos+1) & mask;
        ++dist;			
    }
}

The algorithm is pretty simple. We simply loop until we’ve found an uninitialized slot (hash value == 0). If we found an existing slot whose probe distance is less than our current probe count (‘dist’), we swap places and continue.
Note: using move semantics here matters (e.g. for efficient swapping).

In order to delete an element, we follow the same strategy as for normal open addressing and mark the slot as a tombstone. Tombstones are treated specially in the insert algorithm. We overwrite the tombstone only when we would’ve wanted to swap anyway (see the check for is_deleted above).

We must mark the tombstone using a whole bit rather than just reserving a single hash value (like we did for uninitialized slots), because we need to know the probe count of the tombstone.

bool erase(const Key& key)
{
    const uint32_t hash = hash_key(key);
    const int ix = lookup_index(key);

    if (ix == -1) return false;

    buffer[ix].~elem();
    elem_hash(ix) |= 0x80000000; // mark as deleted
    --num_elems;
    return true;
}

This is all pretty straightforward. We first find the element, then we call its destructor, and finally set the “tombstone” bit.

And lastly, the lookup algorithm:

int lookup_index(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {							
        if (elem_hash(pos) == 0) 
            return -1;
        else if (dist > probe_distance(elem_hash(pos), pos)) 
            return -1;
        else if (elem_hash(pos) == hash && buffer[pos].key == key) 
            return pos;				

        pos = (pos+1) & mask;
        ++dist;
    }
}

This just finds the desired position of the element using the hash, then probes linearly from there on in. There are two conditions to signify that the element doesn’t exist, and finally one check to see if we’ve found the element.

The first exit condition checks for completely uninitialized elements. This is because if the element we are looking for had probed an uninitialized slot during insertion, it would simply have been placed there. So the existence of an uninitialized slot along our probe sequence means the element must not be in the table.

The second condition is more interesting. This is our replacement for simply checking the probe distance against a table-wide maximum probe count. We know that when we probe an element during insertion, the one with the longer probe count of the two gets to keep the slot. So if we’re looking for an element that exists, we should expect to never see an existing element with a shorter probe count than our current count (if that had happened, there would’ve been a swap during insertion!).

In other words, we early out as soon as our current probe count exceeds the probe count of the stored element. Since the average probe count for stored elements is 2.55, we can exit pretty quickly for non-existent elements (much earlier than stopping after a table-wide maximum probe count).

This second condition is why we need to maintain the hash value even for tombstones. Imagine what would happen if we simply marked the slot as uninitialized when it was deleted – the next insertion that comes across it would simply occupy it, thereby getting an unfairly low probe count, and more importantly messing up the second condition above. Instead, we just flag deleted elements as tombstones, and only reuse the slot in the insertion algorithm if a swap would’ve happened anyway.

Finally, the last condition simply compares the hashes and the keys and returns the found element.

Here’s the full code for this blog post: code. I should note that I wrote this specific implementation for this blog post, so it hasn’t been extensively used or tested (or optimized). It’s quite likely that there are bugs.

Thoughts on Game Data Formats

I’ve been thinking about the representation and loading of game data recently. This has largely been the result of stumbling upon a variety of blog posts and libraries such as:

http://gamedevcoder.wordpress.com/2013/02/16/c-plus-plus-rtti-for-games/

http://kentonv.github.io/capnproto/install.html

http://blinkprotocol.org/

http://www.altdevblogaday.com/2012/06/04/read-my-lips-no-more-loading-screens/

Desired characteristics

So, the basic problem is that you want to store a large variety of “resources” for a game “level” (or “sector” or whatever), describing all sorts of things such as the location of every object, the vertices of every mesh, and so on. There are a number of desirable characteristics for this data format, such as:

  • Efficient on-disc representation
  • Efficient run-time and load-time access. Ideally you can load the data structure into memory and just use it directly without any “translation” step. At the very least any load-time cost should be cheap.
  • Some way of reflecting the data at run-time. Ideally this is something you don’t need for the final release game, at least for most objects, but it’s extremely useful during development to be able to essentially “attach” a visual debugger to a running game and get a description of every object in the level. E.g. for console development your editor could run on the PC, while the actual game runs on the console. The console would reflect its actual runtime state and keep the editor up to date, and the editor would send modifications as messages to the runtime.
  • Some way of evolving the representation of an object. Ideally a new runtime of the game should be able to just transparently load old resources without any fuss, or rebuilding. This is particularly useful for post-release patching, but even during development “rebuilding” thousands of objects (with whatever synchronization overhead and dependency tracking you might also need to do to avoid stomping on other developers’ toes) can quickly become an issue.
  • Multiple artists should be able to work on a level at the same time.

I haven’t implemented anything that provides all these benefits simultaneously in a shipping game, but with hindsight, and after thinking about some of the newfangled “protocol” formats that have become popular recently, I think I have a good approach in mind.

Overview

Levels are just folders of resources in the file system. Every single “object” in the level corresponds to a file, e.g. “tree_001291.inst”. The exact internal organization can vary – perhaps sub-folders refer to individual “sectors” (for e.g. streaming purposes), or perhaps the sub-folders are completely arbitrary and used by the designers to organize things however they want. The point is that there is no overall “level file” that anyone has to lock to modify the level. This means multiple artists can add to and modify different parts of the level. Just use normal source control, with exclusive locks, to manage the individual files. Any per-level properties would be stored in tiny little resources that get dropped into the level like anything else. E.g. you might have a “physics properties” resource that stores a per-level set of parameters like gravity. Keep even these global resources small and focused, to avoid contention. This does mean you’ll have many thousands of files per level, but it’s also super simple to inspect and debug when anything goes wrong.

Resources consist of “instance” data and a “schema”, both of which have a source representation and a runtime representation. For example, the schema might be represented in a simple JSON-like format, while the instance source data can vary (e.g. a Maya .mb file, or a .wav file). The runtime representation is binary and efficient, produced at build-time from the source using an appropriate “builder”. It’s likely that many different kinds of instances use the same builder (e.g. simple “property bag” type resources might just take their input as plain JSON, even if each one represents a completely different kind of game object), with a few custom ones requiring more severe processing.

The schema describes the data layout of each resource type. I.e. you might have a thousand “goblin” resources in the level, but there will only be a single schema describing the layout of all those instances. The reason for storing the schema separately, as opposed to having each instance be self-describing, is to avoid duplicating type metadata and field offsets etc. that don’t change, and also because it gives you a convenient place to hang additional (optional) metadata off of, such as RTTI information (and on the editor side of things, UI metadata).

The instance data is basically just the actual bits that refer to a single instance of that resource. E.g. a single health pack, or monster spawn location, or rigid body, or a mesh, or whatever. It contains no “meta data”. It’s a compact format, but stored in a “native” way so that it’s cheap to read at runtime directly without first having to convert it to a runtime representation. Keeping the instance data in the “native” form is important because it means most resource types wouldn’t have any “runtime version” of the on-disc data – it would just use the instance data directly. Some types would probably have some additional “companion” runtime data layered on top, and created at load-time – e.g. D3D header objects and such.

A particular version of a schema is immutable. I.e. once you’ve created (and checked in) a schema you can’t ever change it. You can only create a new version of it (and you can only do that according to specific rules, that don’t invalidate previous instances and schemas). Each instance refers to a particular schema of a particular version. In other words, the game can load any version of any instance, because it uses the schema the instance was built against to interpret the data. So what I said above about there only being one schema for thousands of instances of a given resource type was a lie – there can be multiple versions of each schema. In practice you’ll only have a few versions actively used at any one time, because anytime you modify an instance, it would be saved out using the most recent schema (and you can periodically run a “refresh” batch job to rebuild instances).

The schema’s runtime asset is simple. Each field in the structure has a unique integer identifier starting from zero, with no gaps. This is the only way to identify a field. These identifiers cannot change when you create new versions of a schema. You can add new fields, and you can mark old fields as “deleted” (which means they no longer take up space in the instance), but you can’t actually reuse an old identifier. Lots of deleted fields imply some overhead in the schema runtime asset, but not in the instance data. These rules ensure that creating a new version of the schema won’t cause old instances to become invalid.

The runtime structure of a schema is basically just an array of offsets, one for each field, indicating where that field is stored within the instance data. There’s also some additional meta-data containing the string representation of each field etc., used for RTTI (ideally stored in a separate chunk that can be skipped if you don’t need it).

Each schema corresponds to a generated C++ class header. This class simply provides a bunch of accessor methods. Each one looks up the offset for the field in question using the offset array from the schema, and the identifier for the field, and returns the memory from the instance data at that location cast to the right type. This does mean an additional indirection for field access, but it should still be reasonably fast (especially since the schema’s offset table is likely to be hot in the cache quite frequently). If the field doesn’t exist in the schema (e.g. it was added after the schema was created – so its identifier is larger than the size of the offset array, or the offset array contains the “deleted” marker), the generated code would simply return the default value (stored inline as a constant, in the generated code). This is the migration path – if you add a new field to a schema, all you need to do is make sure you add the corresponding code to “do the right thing” with that new field, including doing something sensible for default values. Old code will simply ignore that field (its version of the generated class won’t have it, nor any code that relies on it!), and if the new code reads an old instance it will just get the default value for that field.
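
To make that concrete, here’s a rough sketch of what the runtime schema and a generated accessor class could look like (the names, field ids and defaults are made up for illustration – this isn’t a real implementation):

#include <cstdint>
#include <cstring>

// The schema's runtime asset: one byte offset per field identifier, with a
// sentinel for fields that are deleted or newer than this schema version.
// Name strings and other RTTI metadata would live in a separate chunk.
struct Schema
{
    static const uint32_t kMissing = 0xffffffff;
    uint32_t num_fields;
    const uint32_t* offsets;   // offsets[field_id] = offset into instance data

    uint32_t offset_of(uint32_t field_id) const
    {
        return field_id < num_fields ? offsets[field_id] : kMissing;
    }
};

// What generated code for a hypothetical "goblin" resource might look like:
// each accessor looks up the field's offset in the schema and reads the bytes
// at that offset, falling back to a default baked into the generated code.
class GoblinResource
{
public:
    GoblinResource(const Schema* schema, const uint8_t* instance)
        : schema_(schema), instance_(instance) {}

    // Field id 0: present since the first schema version, default 100.
    int32_t health() const { return read<int32_t>(0, 100); }

    // Field id 3: added in a later schema version, default 5.0f. Instances
    // built against an older schema simply return the default.
    float aggro_radius() const { return read<float>(3, 5.0f); }

private:
    template <typename T>
    T read(uint32_t field_id, T default_value) const
    {
        const uint32_t off = schema_->offset_of(field_id);
        if (off == Schema::kMissing)
            return default_value;
        T value;
        std::memcpy(&value, instance_ + off, sizeof(T));
        return value;
    }

    const Schema* schema_;
    const uint8_t* instance_;
};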

Further details and rationale

So that’s the basic idea, there are some other details that are important to note, though.

So clearly you don’t want to be loading individual files all the time, which is why there needs to be some kind of “bundling” process that groups resources together. It would probably sort them by (schema, version), so that you load all objects with a given schema at the same time. This wouldn’t just be for release builds; you’d regularly build these bundles even during development. In order that you don’t invalidate a whole bundle and revert to excruciatingly slow load times if you modify a single resource, there needs to be a way for the loader to basically override individual resources from a bundle.

One likely objection is that it’s much simpler to just enforce header/data consistency and always just do a plain memcpy-style resource load. That way there’s no need for the indirection via the schema. Why go through the trouble? Well, in a system with strict header/data dependencies, if you change the layout, you have to rebuild every resource of that type. I really like the simplicity of this approach, but I’ve been down that road and had to live with resource rebuilding being a major iteration hurdle for everyone involved.

The bottom line is that if you have thousands of assets of a particular resource type, every tiny little inefficiency in your build-time code gets multiplied by that number, making your performance margins razor-thin. Even 100ms of build time for a resource can quickly become unacceptable if you have ten thousand of them that need to be rebuilt because you added a flag. It’s not hard to hit those kind of times from just loading DLLs, and hitting the disk once or twice. Any slight inefficiency quickly adds up to many seconds, or even minutes of waiting around. I don’t know about you, but I’d rather not worry about a couple of milliseconds here and there on the build-side of my code. Plus, this approach opens up a pretty nice patching/DLC story.

Another objection may concern the fact that we’re generating code. Why not let our schema source asset simply be our C++ header file and parse it to build the runtime asset for a schema? Or why not annotate it with preprocessor macros and generate the offsets directly in code? Well, in a word – simplicity. Writing parsers is pretty hard. Writing a parser with good error messages is even harder. Voluntarily writing a parser with good error messages for C++ is pretty much grounds for termination. You do not want to be in this business! You might be able to restrict the kind of C++ you allow in this “schema header”, but inevitably someone will mess up and put unsupported code in there, and your cobbled together mess of preprocessor defines, build-time parsers, and maybe even template meta programming will lead to the worst error messages you’ve ever seen. I’ve been down this road. There be dragons. Maybe you can avoid them if you think hard about it going in, but honestly I don’t see any major upside to the approach that pays for the extra risk and complexity.

In contrast, generating code is child’s play. The accessor methods in the generated code are trivial. Anyone can understand and debug them. There’s absolutely nothing clever going on when you’re generating code (the code that generates C++ is simple, and the generated code is simple too). Everything is just plain old dumb C++. In my opinion, of the common choices for these things, code generation is the right choice. It retains maximum flexibility since you can just generate any code you want (you’re not restricted to what you can do with preprocessor macros or templates or whatever), without costing you any simplicity.

You may worry about the overhead of these accessor methods. While it’s reasonably cheap, there’s still a range-check on the offset, and an extra indirection. I’m not overly concerned about that. Part of it is because in my experience it turns out a huge chunk of the data you actually touch lives in a relatively small set of game types. E.g. matrices, quaternions, vectors, colors, etc. These types never actually change representation in practice, so can simply be hard coded as “native” types instead of being represented as schemas of their own. In other words, accessing the toplevel “transform” field does require the extra range check and indirection, but each member of that structure (e.g. 16 matrix members) can be accessed directly as it’s just a plain old C struct. I’d estimate you end up with a few dozen “native” types like this, which improves the constant factors involved by quite a bit.

Plus, if accessor overhead is really an issue, you could make sure your Release builds will never load old instances (by running a global “refresh job” on all instances as you produce your final builds) and hard-code the offset in the generated accessor code instead of going via the schema. I’d avoid this, because it means you may have issues using the “upgrade path” for future patches. Another option is to tag certain types as having an “efficient” format; this would make the schema builder also produce a plain C struct for the type, and a loader method that simply copies each member at load time (using the schema offset table and instance data) – once that’s done it’s just plain old C data that can be accessed at full speed. This costs you load-time performance and simplicity, but does offer a nice “backup plan” (e.g. if you want to modify the runtime data and save it back out to the instance source format, for live editing purposes).

So what about on-disc storage? If we need to store all our data in a “native” format, won’t we waste a whole bunch of memory? For example, using 32 bits to store an integer when it could be compressed to 10 bits is a bit wasteful, right? Yeah, this is true, and it may be worth worrying about. However, it’s not uncommon to have some kind of transparent compression at the IO-system level anyway (e.g. based on zlib, or Snappy), so combined with the bundling system above most of the redundancy should be compressed away. If not, Cap’n Proto has a clever trick where they store the instance data as the actual instance data XOR’d with the default instance (which I would store in the schema). This means most instances consist of tons of zeroes (since most fields aren’t actually changed from their defaults, in practice). Combine this with some really cheap zero-aware compression where you only store non-zero bytes, and you get great on-disc storage characteristics. Of course, the next step is to skip the XOR step altogether. When you compress the instance you do it using a “default-aware” compressor instead of a zero-aware one – i.e. skip storing any bytes that are identical to the default instance, and at load time initialize the instance by either loading bytes from the resource instance or from the default instance, based on some control stream stored in the resource. Of course, you should measure carefully whether this is worth it. There’s something to be said for loading directly from disk into the final destination buffer, without having to even involve the CPU.
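
As a purely hypothetical sketch of that last idea (ignoring how the control stream gets produced at build time), the load step could be a simple run-length merge of resource bytes and default-instance bytes:

#include <cstddef>
#include <cstdint>

// Hypothetical "default-aware" decompression: the control stream alternates
// run lengths of "copy from the default instance" and "copy from the resource".
// Bytes equal to the default were never stored in the resource at build time.
void load_instance(uint8_t* dst, size_t size,
                   const uint8_t* default_instance,
                   const uint8_t* resource_bytes,
                   const uint16_t* control_stream)
{
    size_t written = 0;
    while (written < size)
    {
        uint16_t from_default = *control_stream++;
        for (uint16_t i = 0; i < from_default && written < size; ++i, ++written)
            dst[written] = default_instance[written];

        uint16_t from_resource = *control_stream++;
        for (uint16_t i = 0; i < from_resource && written < size; ++i, ++written)
            dst[written] = *resource_bytes++;
    }
}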

Another interesting thing I’d like to try is to separate mutable from immutable data on the level of the schema (with immutable being the default, and this affecting the accessors being generated). Typically the vast majority of the data you’d load this way is immutable (in terms of size) – e.g. vertex data, audio data, texture data, etc., with only a few things actually being modified at runtime. Since we’re generating code from a schema, we can make these kind of global experiments pretty easily.

Anyway, this seems to combine the best features from all the different systems I’ve worked with, read about, or even implemented myself. Thoughts?

On GC in Games (response to Jeff and Casey)

So it turns out youtube comments suck. I’ll write my response to Jeff and Casey’s latest podcast in blog form instead of continuing the discussion there. View it here: http://www.youtube.com/watch?v=tK50z_gUpZI

Now, first let me say that I agree with 99% of the sentiment of this podcast. I think writing high performance games in Java or C# is kinda crazy, and the current trend of writing apps in HTML5 and JavaScript and then running it on top of some browser-like environment is positively bonkers. The proliferation of abstraction layers and general “cruft” is a huge pet peeve of mine – I don’t understand why it takes 30 seconds to launch a glorified text editor (like most IDEs – Eclipse, Visual Studio, etc.), when it took a fraction of a second twenty years ago on hardware that was thousands of times slower.

That said, I do think their arguments against GC aren’t quite fair (and note that GC has nothing to do with JITs or VMs). They pile up roughly the right things in the “cons” column, but they completely ignore the “pros” column, and as a result act baffled that anyone would ever think GC is appropriate for any reason.

Before I get into it, I should probably link to my previous post on GC where I spend a large chunk of time lamenting how poorly designed C# and Java are w.r.t. GC in particular. Read it here.

To summarize: no mainstream language does this “right”. What you want is a language that’s memory safe, but without relegating every single allocation to a garbage collected heap. 95% of your memory allocations should either be a pure stack allocation, or anchored to the stack (RAII helps), or be tied uniquely to some owning object and die immediately when the parent dies. Furthermore, the language should highly discourage allocations in general – it should be value-oriented like C so that there’s just plain less garbage to deal with in the first place. Rust is a good example of a language of this kind.

You’ll note that most of Jeff and Casey’s ranting is not actually about the GC itself, but about promiscuous allocation behavior, and I fully agree with that, but I think it’s a mistake to conflate the two. GC doesn’t imply that you should heap allocate at the drop of a hat, or that you shouldn’t think about who “owns” what memory.

Here’s the point: Garbage collection is about memory safety. It’s not about convenience, really. Nobody serious argues that GC means you don’t have to worry about resource usage. If you have type safety, array bounds checks, null safety, and garbage collection, you can eliminate memory corruption. That’s why people accept all the downsides of GC even in languages where it comes with much higher than necessary penalties (e.g. Java, C#, Ruby, Lua, Python, and so on… pretty much all mainstream languages).

A couple of weeks ago I spent several days tracking down a heap corruption in a very popular third party game engine. I haven’t tracked down who’s responsible for the bug (though I have access to their repository history), or exactly how long it’s been there, but from the kind of bug it was I wouldn’t be surprised if it’s been there for many years, and therefore in hundreds (or even thousands?) of shipped games. It just started happening after several years for no real reason (maybe the link order changed just enough, or the order of heap allocations changed just enough, to make it actually show up as a crash).

The main thing to say about this bug (I won’t detail it here because it’s not my code) is that it was caused by three different pieces of code interacting badly, where no single piece was necessarily doing anything stupid. I can easily see very smart and professional programmers writing these three pieces of code at different times, perhaps going through a few iterations, and all of a sudden there’s a perfect storm, and a latent memory corruption is born.

I mention this because it raises a few important points:

  • Memory corruption is not always caught before you ship. Any argument about manual memory management not being so bad because it’s at least transparent and debuggable, unlike the opaque GC, falls flat on its face for this reason. Yes, you have all the code, and it’s not very complicated, but how does that help you if you never even see the bug before you ship? Memory corruption bugs are frequently difficult to even repro. They might happen once every thousand hours due to some rare race condition, or some extremely rare sequence of heap events. You could in principle debug it (though it often takes considerable effort and time), if you knew it was there, but sometimes you just don’t.
  • Memory corruption is often very hard to debug. Often this goes hand in hand with the previous point. Something scribbles to some memory, and forty minutes later enough errors have cascaded from this to cause a visible crash. It’s extremely hard to trace back in time to figure out the root cause of these things. This is another ding against the “the GC is so opaque” argument. Opacity isn’t just about whether or not you have access to the code – it’s also about how easy it is to fix even if you do. The extreme difficulty of tracking down some of the more subtle memory corruption bugs means that the theoretical transparency you get from owning all the code really doesn’t mean much. With a GC at least most problems are simple to understand – yes you may have to “fix” it by tuning some parameters, or even pre-allocating/reusing memory to avoid the GC altogether (because you can’t break open the GC itself), but this is far less effort and complexity than a lot of heap corruption bugs.
  • Smart people fuck up too. In the comments there were a number of arguments that essentially took the form “real programmers can deal with manual memory management”*. Well, this is an engine developed by some of the best developers in the industry, and it’s used for many thousands of games, including many AAA games. Furthermore, there was absolutely nothing “stupid” going on here. It was all code that looked completely sane and sensible, but due to some very subtle interactions caused a scribble. Also, it’s not hard to go through the release notes for RAD-developed middleware and find fixes for memory corruption bugs – so clearly even RAD engineers (of whom I have a very high opinion) occasionally fuck up here.

With memory safety, most of these bugs simply disappear. The majority of them really don’t happen at all anymore – and the rest turn into a different kind of bug, which is much easier to track down: a space leak (a dangling pointer in a memory safe language just means you’ll end up using more memory than you expected, which can be tracked down in minutes using rudimentary heap analysis tools).

In other words: memory safety eliminates a whole host of bugs, and improves debuggability of some other bugs. Even when a GC causes additional issues (which they do – there’s a real cost to GC for sure) they at least do so before you ship, unlike the bugs caused by not having memory safety. This is a very important distinction!

Yes, you should be careful, and I’m certainly not advocating Java or C# here, but when you do consider the tradeoffs you should at least be honest about the downsides of not having memory safety. There is real value in eliminating these issues up front.

In current languages I would probably almost always come down on the side of not paying the cost of GC for high-performance applications. E.g. I’ll generally argue against any kind of interpretation or VM-based scripting altogether (DSLs that compile to native is a different issue), especially if they require a GC. However, I don’t think you need to overstate your case when making the tradeoff.

If I could pay a small fixed cost of, let’s say 0.5ms per frame, but be guaranteed that I’m not going to have to worry about any memory corruption ever again I’d totally take that tradeoff. We’re not there yet, but we really aren’t that far off either – the problem isn’t intrinsic to GC. Plenty of high performance games, even 60Hz ones, have shipped with GC’d scripting languages, and while they don’t usually collect the whole heap, a lot of them do manage to keep the GC overhead around that level of cost. So maybe in the future, instead of paying 0.5ms to GC a small heap that’s completely ruined by a shitty language that generates too much garbage, we could instead GC the whole heap and end up with similar levels of complexity by just creating less garbage in the first place (using a non-shitty language).

*Side note: I really hate arguments of the form “real programmers can deal with X” used to dismiss the problem by basically implying that anyone who has a problem just isn’t very good. It’s incredibly insulting and lazy, and no discussion was ever improved by saying it. In my opinion hubris, or extrapolating too far from your own experience, is a far more common sign of incompetence or inexperience than admitting that something is hard.

Runtime Code Specialization

Specializing code is nothing new, but I’m surprised that modern programming languages don’t attempt to make this easier.

Just to give you some context, the idea is that whenever you have an algorithm that takes multiple parameters where one or more of the parameters are fixed for some period of time, you specialize the code for the fixed parameters in order to achieve greater optimization at runtime. A key idea is that you may not know at compile time what the fixed value is, and the duration for which a value is fixed could be shorter than the execution of the program (mere milliseconds, even).

One example would be to compile a regular expression into machine code – at runtime – whose function is to take the input string and match it against the (fixed) regular expression. This can be many times faster than an algorithm based on interpreting the regular expression as you go.

Other examples are algorithms where there are runtime inefficiencies due to combinatorial explosion, such as converting one image format to another. If you have enough different image formats it’s infeasible to write one function for each pair of formats, so you would typically write a single conversion function that converts from the source format into some high precision intermediate format, and then calls another function to convert to the destination format. This can be inefficient because you have to switch on the data format inside the conversion loop. With code specialization you look up the two conversion functions once, and then call them directly (or even inline them, then simplify away the intermediate format when possible).
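
As a rough C++ sketch of just the lookup-hoisting half of that idea (hypothetical types and names – real specialization would go further and inline the two functions and simplify away the intermediate format):

#include <cstddef>

struct RGBA32F { float r, g, b, a; };   // high-precision intermediate format

using DecodeFn = RGBA32F (*)(const void* src, size_t i);
using EncodeFn = void (*)(void* dst, size_t i, const RGBA32F& px);

// The per-format decode and encode functions are looked up once, outside the
// loop, instead of switching on the formats for every pixel.
void convert_image(const void* src, DecodeFn decode,
                   void* dst, EncodeFn encode, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i)
    {
        RGBA32F px = decode(src, i);
        encode(dst, i, px);
    }
}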

Now, all this can be done manually, and while things haven’t improved as much as I would like, things are getting better. In C# you could always use the System.Reflection.Emit namespace to generate MSIL at runtime, and then compile it into a method. This was tedious and low level, but better than generating x86 assembly at least! Starting with .NET 3.5 you can instead use expression trees. This is much better. Not only can you compile simple lambda expressions directly to expression trees, but even building them manually is much higher level.

For example, starting with a simple function to match “glob” patterns (e.g. “f?o.*”) like this:

bool match(string pattern, string testString, int i, int j)
{
    if (i == pattern.Length) 
        return j == testString.Length;

    switch (pattern[i])
    {
        case '?'    : 
            return match(pattern, testString, i+1, j+1);
        case '*'    :                    
            for (int k = j; k <= testString.Length; ++k)
            {
                if (match(pattern, testString, i+1, k))
                      return true;                        
            }
            return false;                

        default     : 
            return j < testString.Length && 
                   pattern[i] == testString[j] && 
                   match(pattern, testString, i+1, j+1);
    }
}

We can (manually) convert this to a function that, passed an expression holding the string to be tested and the starting index, generates an expression representing whether or not the pattern matches. We do this by interpreting the “pattern”, and generating code that just tests a passed in string expression against it. We don’t call any functions in the generated expression, but rather keep traversing the pattern doing as much of the testing as we can ahead of time, only putting the stuff that absolutely depends on the passed in test string into the expression tree:

// Generates an expression of type bool
static Expression GenerateMatchExpression(string pattern, int i, ParameterExpression str, Expression j )
{
    var strLenExp = Expression.Property(str, StringLengthProp);

    // If the pattern is empty, we match if we've reached the end of the string
    if (i == pattern.Length) 
        return Expression.Equal(j, strLenExp);

    // Shave off a single char from the string
    if (pattern[i] == '?') 
        return GenerateMatchExpression(pattern, i+1, str, Expression.Increment(j));
    
    // Remove the wild card, and try to match all substrings
    if (pattern[i] == '*')
    {
        var subStringStart = Expression.Variable(typeof(int));              
        var breakLabel = Expression.Label(typeof(bool));
        return Expression.Block(typeof(bool),
            new ParameterExpression[] { subStringStart },
            // start at current index, we'll increment this 
            // to check all "tails", including empty string
            Expression.Assign(subStringStart, j), 
            Expression.Loop(
                Expression.Block(                        
                    // strStart > str.Length
                    Expression.IfThen(Expression.GreaterThan(subStringStart, strLenExp),
                            Expression.Break(breakLabel, Expression.Constant(false))
                    ),

                    //check substring
                    Expression.IfThen(GenerateMatchExpression(pattern, i + 1, str, subStringStart), 
                        Expression.Break(breakLabel, Expression.Constant(true))
                    ),
                    Expression.Assign(subStringStart, Expression.Increment(subStringStart))
                ),
                breakLabel
            )
        );
    }
    
    // Check a single character.
    return  Expression.AndAlso(  
                Expression.LessThan(j, strLenExp), 
                Expression.AndAlso(  
                    Expression.Equal( Expression.Constant(pattern[i]), 
                                      Expression.Call(str, GetCharsMethod, j)),
                    GenerateMatchExpression(pattern, i + 1, str, Expression.Increment(j))
                )
            );
    
}

Then we pass in the string and index by wrapping this expression in a lambda like so:

var strParam = Expression.Parameter(typeof(string));
var exp = Expression.Lambda<Func<string,bool>>(
                GenerateMatchExpression(pattern, 0, strParam, Expression.Constant(0)), strParam
          );        
Func<string, bool> match = exp.Compile();

For example, if we pass in “a*b*c?”, we get an expression tree that looks like this:

.Lambda #Lambda1(System.String $var1) {
    0 < $var1.Length && 'a' == .Call $var1.get_Chars(0) && .Block(System.Int32 $var2) {
        $var2 = .Increment(0);
        .Loop  {
            .Block() {
                .If ($var2 > $var1.Length) {
                    .Break #Label1 { False }
                } .Else {
                    .Default(System.Void)
                };
                .If (
                    $var2 < $var1.Length && 'b' == .Call $var1.get_Chars($var2) && .Block(System.Int32 $var3) {
                        $var3 = .Increment($var2);
                        .Loop  {
                            .Block() {
                                .If ($var3 > $var1.Length) {
                                    .Break #Label2 { False }
                                } .Else {
                                    .Default(System.Void)
                                };
                                .If (
                                    $var3 < $var1.Length && 'c' == .Call $var1.get_Chars($var3) && .Increment(.Increment($var3)) == $var1.Length
                                ) {
                                    .Break #Label2 { True }
                                } .Else {
                                    .Default(System.Void)
                                };
                                $var3 = .Increment($var3)
                            }
                        }
                        .LabelTarget #Label2:
                    }
                ) {
                    .Break #Label1 { True }
                } .Else {
                    .Default(System.Void)
                };
                $var2 = .Increment($var2)
            }
        }
        .LabelTarget #Label1:
    }
}

Note how we get one loop generated for each ‘*’ character, in a nested fashion, and how the entire thing is inlined and specialized. This should compile to pretty efficient code. Of course it could be even more efficient if we first generated a DFA from this and then compiled that, but the point here is code specialization, not finite automata.

Now, this works, but holy hell that took a lot of work to get right, and this was for a fairly trivial example. Imagine if this was something more complicated! Imagine if you had to maintain this! The horror!

So, what do we actually need? We need a language that allows you to specialize (almost) arbitrary functions for some of their parameters, returning a new function with all the constants folded away, virtual functions inlined, etc. If this was as easy as writing down some simple attributes you could imagine such a language would be able to achieve far superior performance to even native languages like C++ simply because it’s infeasible to do this kind of code specialization on a wide scale unless there’s language support for it. You could potentially write a C# function that does this. I.e. create an expression tree by wrapping the code to be specialized in a lambda, then pass this expression to a function that massages the resulting tree given that some of the parameters are constant. Perhaps the JIT compiler can even be convinced to do some of the optimizations for you once it knows that some of the inputs are constants, but more likely than not you have to manually propagate the fact that these things are truly fixed through the expression tree.

This probably isn’t as simple as I make it out to be. For example, you may want to specialize an animation function on the blend tree, while allowing the specific blend values of each node to vary each time you invoke it. This could be done by having the programmer manually separate these two pieces of data out and only specializing the first, but ideally you could mix and match “fixed” data from “varying” data and have specialization only “boil away” the fixed stuff, while storing references to the varying stuff.

My contention is that client-facing apps have a lot of code that could benefit from this sort of strategy. Thinking about something like a game engine running at 60Hz it’s amazing just how much time you spend doing the exact same thing over and over again every frame where the only thing that’s different from frame to frame is the specific integer or float that gets passed into the ALU, but most branches and vtables go the same way every time. Being able to flag things as being constant for some duration of time and avoid all that “interpretation overhead” would be a huge win in a bunch of cases.

The next step would be to have a runtime that automatically traces the actual execution and specializes things for you as you go, much like JavaScript JIT compilers do. However, I’m talking about applying this to highly optimized ahead-of-time compiled programs, so the set of things the tracer would have to look out for would presumably be smaller than for a dynamic language (e.g. we’re not looking to discover that a variable is an integer – we know that already). We’re basically just looking to do some constant folding, some vtable lookups (and subsequent inlining) and stuff like that. The static compiler could flag where it might be profitable to do runtime tracing, perhaps.

I kinda like the idea of having the programmer explicitly define specializations, though, because it means you can put the onus on the programmer to ensure certain things that a compiler wouldn’t want to assume. Like non-divergence of recursive functions, which means that e.g. a tree traversal function could be specialized with each recursive call being blindly inlined. The programmer would be responsible for proving that, taking the specified parameters as constants, the recursion will eventually stop.

Are there any languages out there that do this?

Space-efficient Resizable Arrays


This post is about resizable arrays that would serve as an alternative to std::vector. You’re probably familiar with geometrically growing arrays like std::vector, so I won’t waste time describing them in detail. Briefly, they offer amortized constant-time append and lookups, but with a pretty severe cost in space overhead (as low as 33% utilization after shrinking until you hit a realloc, or 50% if all you ever do is append). Another issue is that the elements of the array get copied every so often. The exact number of times an element gets copied depends on the initial size and total size of the vector, but roughly speaking expect north of two extra copies per element, in addition to any copy you had to do to get the data into the vector in the first place. This can be very expensive for big objects, and perhaps more importantly it isn’t very consistent – occasionally you get a big hiccup as every single element is copied.

The data structure I’m describing has constant-time random access, append, etc. like std::vector, but at a fraction of the storage overhead, and with no extra copies of the elements. There are occasional spikes, but they amount to a single allocation (or deallocation) – no copying. The tradeoff is a more complicated procedure for random access (still efficient, and constant time, but certainly much more complicated than just a simple pointer addition like std::vector can do).

I ran into the paper “Resizable Arrays in Optimal Space and Time” a couple of weeks ago, and thought all this sounded pretty great, especially for modern architectures where ALU work is cheap, relatively speaking. The basic idea is to represent the array as an array of pointers to data blocks. This first array is called the “index block”. The data blocks get bigger and bigger as the total number of elements increases; they’re roughly O(√n)-sized. This is enough that the size of the index block itself quickly becomes irrelevant, because the data blocks dwarf the cost of tracking them in the index block. For all this to be practical, we need an efficient, constant-time method to discover which data block a given “virtual index” resides in. Then we can fetch from that data block using the index block as an indirection.
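To make that concrete, the overall shape is roughly this (a sketch only; the member names are mine, not the ones used in compact_vector):

#include <cstddef>

// Rough shape of the data structure: the index block is a plain array of
// pointers, each referring to a separately allocated data block.
template <typename T>
struct compact_vector_sketch {
    T**         index_block;     // pointers to the data blocks
    std::size_t index_capacity;  // allocated slots in the index block
    std::size_t num_blocks;      // data blocks currently in use
    std::size_t size;            // total number of elements stored
    // Only the last data block may be partially full; block sizes grow
    // roughly as O(sqrt(n)) following the super block rule from the paper.
};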

The paper describes the basic structure of this, as well as some motivation and proof of its asymptotic properties, so I won’t repeat it here. The key point is that each block is roughly √n-sized (with n being the size of the array at the time that block was allocated – early blocks are small, and later blocks are large). Only the last block is non-full, so the total amount of space overhead at any one time is O(√n). Because we only ever add (or remove) blocks, never reallocate them, we don’t need to copy any elements. This non-moving property also makes this data structure suitable for memory pools and such (growing the array won’t cause existing elements to move around – which is the usual problem with using std::vector for pools).

Looking around on the web, though, I couldn’t find anyone who had actually implemented this. This seemed curious and slightly worrying to me. I decided to implement a basic version in C++ whenever I had time. Well, after attempting an implementation I discovered that the paper has a critical bug in the core algorithm. I also found that others had indeed attempted to implement this and run into the same issue (see for example this email exchange). Turns out that the computation of p in the Locate algorithm is just plain incorrect (likely a typo – it corresponds to computing the number of data blocks for a single super block, rather than the total sum of data blocks for all super blocks up until that super block index).

Anyway, after much hassle I was able to arrive at a corrected version of this algorithm. WordPress won’t let me paste code in here (even when editing the markup manually with the code tag), so I’ll just refer you to the locate_data_block_and_elem method. While I used some MSVC intrinsics to do a couple of the bit-fiddling operations a bit more efficiently, I haven’t spent any major energy optimizing it (this is the place to optimize though, I’d say). Feel free to improve on this in the comments :)
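For reference, here’s roughly what the corrected computation looks like (a sketch in plain C++ of my reading of the algorithm, not the actual locate_data_block_and_elem code, which does the bit fiddling with MSVC intrinsics). The convention used here, matching my reading of the paper, is that super block k holds 2^floor(k/2) data blocks of 2^ceil(k/2) elements each.

#include <stdint.h>

// Map a virtual index to (data block, element within block).
struct block_and_elem {
    uint32_t block; // index into the index block
    uint32_t elem;  // index within that data block
};

// Portable stand-in for a "find highest set bit" intrinsic (x must be non-zero).
static inline uint32_t floor_log2(uint32_t x)
{
    uint32_t k = 0;
    while (x >>= 1) ++k;
    return k;
}

block_and_elem locate(uint32_t i)
{
    uint32_t r  = i + 1;          // the leading 1 bit of r identifies the super block
    uint32_t k  = floor_log2(r);  // super block index
    uint32_t hi = (k + 1) / 2;    // ceil(k/2): bits holding the element index
    uint32_t lo = k / 2;          // floor(k/2): bits holding the block-within-super-block index

    uint32_t e = r & ((1u << hi) - 1);         // element index within the data block
    uint32_t b = (r >> hi) & ((1u << lo) - 1); // data block index within super block k

    // Total number of data blocks in all super blocks before super block k.
    // This is the term the paper computes incorrectly.
    uint32_t p = (1u << hi) + (1u << lo) - 2;

    block_and_elem result = { p + b, e };
    return result;
}

For example, virtual index 7 gives r = 8, so k = 3, e = 0, b = 0 and p = 4: the element lands at the start of the fifth data block, i.e. the first block of super block 3 (which holds two blocks of four elements each).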

Here’s a link to the implementation: compact_vector. Note that this is very basic and probably has bugs. It does the main things, but there’s plenty of stuff you’d need to add to make it industrial strength (e.g. right now I rely on elements having a default constructor, there’s no support for custom allocators, etc.). It basically follows the main ideas of the paper, except that instead of tracking the size of the current block and super block (doubling each alternately), I just compute these two values from the number of super blocks each time they’re needed (this turned out to be no slower, reduced the number of book-keeping variables, and simplified the code). I also keep the “spare” empty data block in a separate pointer instead of storing it in the index block, to keep the code a bit cleaner (no need to keep checking for this special case – the index block only refers to data blocks in actual use).

So here are some quick benchmarks. This is on one machine, without any real attempt to be super-rigorous, etc. etc., your mileage may vary, yadda, yadda. All benchmarks are for 50k elements. First, let’s look at appending 4 byte integers.

Array type                              Cycles per element
Compact vector append time (4 bytes)    35.8
std::vector append time (4 bytes)       45.0

So we’re already 20% faster than std::vector for insertions. Likely because we don’t have to do all those extra copies. By the way, the space overhead for the compact vector here is about 2%.

Using some fatter elements, say 16 bytes, the difference is even more pronounced:

Array type                              Cycles per element
Compact vector append time (16 bytes)   152.2
std::vector append time (16 bytes)      283.6

That’s about 46% faster. Okay, so now that we have the good news out of the way, what’s the price we pay for this? I mean, look at that function mentioned above: clearly we’re paying some cost to index into this thing, right? Well, yes, of course, there’s no free lunch. For random access we see the following picture:

Array type                               Cycles per element   Ratio to std::vector
Compact vector random access (4 bytes)   16.6                 8.3
std::vector random access (4 bytes)      2.0
Compact vector random access (16 bytes)  18.1                 7.9
std::vector random access (16 bytes)     2.3
Compact vector random access (64 bytes)  20.6                 4.0
std::vector random access (64 bytes)     5.1

This is for a loop that just goes straight through the array and sums up the elements, using random access reads. This access pattern should be a worst-case scenario for the compact vector relative to the much simpler std::vector, since it reduces the impact of cache misses (and indeed the relative cost difference goes down as the element size goes up).

However, rather than just focus on that massive 8x ratio, I think it may be more interesting to look at the absolute values of the timings here. We’re talking about 17 cycles per element, including the loop overhead and the actual addition itself. In many cases where you do random access reads on std::vectors you end up doing far more work than that per element – especially if the accesses are truly random, so that you incur a cache miss per access (although technically for the compact vector you may end up with two cache misses, the index block is small so you’re a lot more likely to get a hit there than on the full array).

For coherent reads, i.e. using iterators, we can do much better. We don’t need to translate the virtual index to a data block and intra-block index for each element; we can just step through the data blocks directly instead. We still need a check for each element to make sure we’re still in the same data block, and some minor computation to advance from one data block to the next (relating to computing the next block’s size), but the fast path is a quick pointer increment and compare, which saves a lot of overhead (there’s a rough sketch of this after the table):

Array type                                    Cycles per element
Compact vector iterator scan time (4 bytes)   6.9
std::vector iterator scan time (4 bytes)      1.51
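For illustration, the fast path could look something like this (a sketch only: the member names and the advance logic are mine rather than lifted from compact_vector, and end-of-array handling is omitted):

#include <cstddef>

template <typename T>
struct compact_iterator_sketch {
    T*          cur;         // current element
    T*          block_end;   // one past the end of the current data block
    T**         index_pos;   // current position in the index block
    std::size_t superblock;  // super block the current data block belongs to
    std::size_t blocks_left; // data blocks left in this super block, including the current one

    compact_iterator_sketch& operator++() {
        if (++cur != block_end)       // fast path: still inside the same data block
            return *this;

        // Slow path: advance to the next data block.
        if (--blocks_left == 0) {     // the next block starts a new super block
            ++superblock;
            blocks_left = std::size_t(1) << (superblock / 2);               // 2^floor(k/2) blocks...
        }
        std::size_t block_size = std::size_t(1) << ((superblock + 1) / 2);  // ...of 2^ceil(k/2) elements
        cur       = *++index_pos;
        block_end = cur + block_size;
        return *this;
    }

    T& operator*() const { return *cur; }
};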

Here’s a graph of total storage used over time:

This is fairly typical behavior. The standard std::vector gets very good efficiency when the number of elements happens to be close to, but below, the current size of the underlying array, but overall you’re oscillating between 50% and close to 0% wasted space, averaging out at 25% overhead. In contrast, compact_vector has a bit more fixed overhead, plus some extra overhead for the index block, but its storage requirements grow relatively smoothly (you can’t see it in the graph, but it does oscillate, just not nearly as wildly). For what it’s worth, compact_vector hits ~25% memory overhead at around 200 elements (for the 64-bit version), and by 1000 elements we’re already down to around 10% (these figures get better for larger elements, since the index block part of the overhead depends on element count and pointer size, not element size).

Anyway, I’ll stop talking now. Hope this is useful to someone!

UPDATE: And here’s a view of the actual overhead zoomed in to the first 4000 insertions. This shows more clearly that memory use isn’t just a difference in “spikiness” (allocation rate); the compact_vector genuinely tends toward much lower overhead asymptotically than std::vector. The latter’s overhead averages out at 25% in the graph below (as expected), whereas the compact_vector quickly converges on single-digit percent overheads.

Garbage collection thoughts


One of the things that’s annoyed me for a while is that there aren’t any high performance, high level languages really being seriously developed. It seems to me that most of the benefits of high level languages, if the language is carefully designed, have no or very limited performance implications, and yet we get either languages that largely abandon safety and productivity concerns even when there’s no reason to (iterations on C/C++ mostly, including Go and D), or high level languages that throw out performance considerations left and right for minor convenience reasons (e.g. C#, Java).

Luckily this appears to be changing. E.g. languages like Rust seem very promising.

This brings me to one of the more controversial features of high level languages: garbage collection. IMO this is almost a no-brainer. Automatic memory management is such a productivity and safety booster that any high level language should have it, and I’m fairly convinced that it can be done in a high performance manner as well. We’ve just been conditioned to see problems with it because the host languages haven’t really tried to do this efficiently by making the right tradeoffs.

The first rather obvious optimization is to avoid GC work whenever possible – the easiest way to do this is to reduce the number of allocations. Languages like C# and Java get this entirely wrong. They heavily encourage heap allocations, so you tend to get them all over the place. Each object consists mostly of pointers to other heap-allocated objects. Where’s the actual data, dammit? As a result, the GC has a crazy amount of work to do. C++ gets this right, and so does Rust. Most “member fields” in data types should just be stored in-place rather than behind a heap-allocated pointer. This complicates the language if you want GC (especially with regard to taking pointers to internal parts of a data structure), but it’s worth the extra complexity. In practice this means that rather than each object having a massive fan-out where every member is another allocation, most objects will have only a couple of heap-allocated members – in aggregate this drastically reduces the heap graph and therefore the amount of work the GC has to do.
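As an illustration (the types below are invented, and std::unique_ptr merely stands in for “a separate heap allocation”), compare a pointer-heavy layout, which gives the collector several extra edges to chase per object, with an in-place layout, which gives it none:

#include <array>
#include <memory>
#include <string>

// Pointer-heavy layout: four extra heap allocations per object, and the
// collector (or allocator) has to chase every one of them.
struct EntityPointerHeavy {
    std::unique_ptr<std::string>           name;
    std::unique_ptr<std::array<float, 3>>  position;
    std::unique_ptr<std::array<float, 3>>  velocity;
    std::unique_ptr<std::array<float, 16>> transform;
};

// In-place layout: one allocation for the whole thing (plus whatever the
// string needs internally); the object graph the collector has to walk is
// drastically smaller.
struct EntityInPlace {
    std::string           name;
    std::array<float, 3>  position;
    std::array<float, 3>  velocity;
    std::array<float, 16> transform;
};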

Another optimization, which Rust and C++ alike make, is to have “unique” pointers that refer to objects that are uniquely owned by their pointer. These could live in a separate heap altogether. Since these signify unique ownership, the object dies when its parent dies. Ownership can be transferred, but never shared (and this is statically ensured in Rust). This is particularly useful for reference counting based GC, since there’s no RC overhead for these pointers, but tracing collectors can also often stop tracing early when they hit a unique pointer, if the type being pointed to has no GC pointers in its transitive “type graph”.

These two things basically boil down to this: the language should allow you to express simple memory ownership when you have it, and only resort to arbitrary (shared) heap allocations in extreme circumstances. The language should prefer and promote stack allocation, and in-place allocation of sub-objects. When you do need heap allocations, the language should nudge you towards unique ownership for most of them.

Even D, which is ostensibly a language prioritizing both efficiency and safety, doesn’t get this right with its struct/class separation (I want to very carefully elect to put stuff on the heap only for the small subset of data where it’s absolutely necessary, rather than being forced – or at least heavily encouraged – into heap allocations because something happens to be a class).

The third and maybe most important optimization is to make garbage collection a per-thread operation. You simply don’t allow garbage collected heap allocations to transfer between threads. This means you can use efficient “stop the world” collectors (and compactors), without actually stopping the whole world – only stop one thread. Assuming threads are cheap and ubiquitous (they should be) we may have hundreds or even thousands of isolated heaps – each of which may be stopped for a few milliseconds at a time, but “on average” there are always enough threads that can run that we keep the mutator resources (CPU cores) busy doing actual work and make overall progress. You might call it “concurrent garbage collection in aggregate”. Plus, most of these threads are probably blocked at any one time – running GC on a blocked thread is basically a ghetto version of concurrent GC (from “inside the thread” it looks as if GC just happened while you weren’t looking, just like a concurrent GC, but without expensive read or write barriers).

Anyway, the point is that before we even start talking about GC, the assumption is that we are in a language designed with memory efficiency in mind, and have far less garbage than any existing mainstream high level languages, and that we can run the garbage collectors on small, isolated islands without stopping the entire world (or resorting to expensive concurrent collectors requiring mutator memory barriers).

There are two issues that come up here that I’ve been pondering for a while: external fragmentation due to all these heaps, and tracing collection versus reference counting.

External fragmentation

A lot of high performance garbage collectors (such as Immix etc.) gain their performance largely due to their heap organization. Typically the heaps are allocated in big “regions”, e.g. 32KB each. This is problematic for per-thread GC since it implies a minimum heap size, which in turn implies that having tons (e.g. thousands) of tiny threads will lead to substantial memory waste in the form of external fragmentation. Even if each heap is efficient on its own (e.g. runs compaction periodically), for the overall system the empty space at the end of each heap will add up to substantial fragmentation that cannot be reclaimed.

Ideally, each heap would start small, like a few hundred bytes or so, and then grow using bigger and bigger regions as the overall heap size increases. I haven’t really seen any GC papers do this, they all seem to pick a fixed region size up front. The reasoning is probably quite simple: with a fixed region size you can find the start of the region by looking at the pointer itself (power of two sized regions are cheaper).

I only really mention this because I think I have an efficient way of identifying the region a pointer belongs to, even when region sizes vary, that I haven’t seen mentioned anywhere. Perhaps it’s well known, and I wouldn’t be surprised if people have been doing it for decades, but it’s not completely obvious, and it’s been conspicuously absent from the GC books and papers I’ve read. So I should probably write it down in case it’s helpful to someone.

So the standard way of partitioning the heap is to have power-of-two region sizes so you can mask the lower order bits of the pointer to get a pointer to the start of the region (you could place per-region meta data here, or store a pointer from the region to a meta data block “off to the side” – the latter reduces contention for the ways in a set associative cache). The problem with this in the presence of varying region sizes is that you don’t know ahead of time how many bits you need to mask out (the number of bits to mask out depends on the region size, after all).

The basic idea is to simply try all the different region sizes and see if the proposed region is actually valid; the trick is to eliminate false positives. The way to verify that an arbitrary pointer actually refers to a region is to store a pointer to every region in a separate array that user code cannot write to. If the proposed region is not in this array, then it is not the start of a valid region. Call this the “region pointer array”. To make the lookup in this array fast, we also store, as the first DWORD of each actual region, an index into this array. Call this the “region index”. We store the region size in the meta data too (along with all the other stuff, such as occupancy bits, mark bits, reference count bits, and whatever else you may store per region).

To figure out which region a pointer refers to in this scheme, we simply loop through the different region sizes that our allocator allows, mask the appropriate number of bits from the pointer to get the start of the proposed region, read the region index in the first DWORD of this memory area (this may be garbage if we guessed the wrong size), use it to index into the region pointer array (taking care to avoid buffer overruns from garbage indices), and check that the pointer stored in that element refers right back to the proposed region. Finally, we must also ensure that the size of the region at this address actually contains the original pointer.
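A sketch of what that lookup could look like (all names and constants below are invented, and it assumes candidate region addresses are safe to read, e.g. because all regions are carved out of one reserved address range):

#include <stdint.h>

// Per-region meta data stored at the start of each region.
struct RegionHeader {
    uint32_t region_index; // index into g_region_table
    uint32_t size_log2;    // log2 of this region's size in bytes
    // ... occupancy bits, mark bits, reference count bits, etc. would follow
};

static const int     kMinRegionLog2 = 8;       // assumed smallest region: 256 bytes
static const int     kMaxRegionLog2 = 20;      // assumed largest region: 1 MB
static RegionHeader* g_region_table[1 << 16];  // the "region pointer array" (not writable by user code)
static uint32_t      g_region_count = 0;

// Find the region containing ptr, or null if ptr isn't in any registered region.
RegionHeader* find_region(const void* ptr)
{
    uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
    for (int log2 = kMinRegionLog2; log2 <= kMaxRegionLog2; ++log2) {
        uintptr_t candidate = p & ~((uintptr_t(1) << log2) - 1);  // mask the low bits
        const RegionHeader* h = reinterpret_cast<const RegionHeader*>(candidate);

        uint32_t idx = h->region_index;     // may be garbage if we guessed the wrong size
        if (idx >= g_region_count)
            continue;                       // guard against bogus indices

        RegionHeader* region = g_region_table[idx];
        if (reinterpret_cast<uintptr_t>(region) != candidate)
            continue;                       // the table doesn't point back here

        if (p - candidate >= (uintptr_t(1) << region->size_log2))
            continue;                       // real region is too small to contain ptr

        return region;
    }
    return nullptr;
}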

This scheme would allow us to have very small initial region sizes that quickly grow as the heap does, while only doing very simple tests to find which region a pointer belongs to. No, it’s not as cheap as a simple bitwise and, but it should essentially optimize to that in the common case due to allocation coherency. For example, take the tracing loop in a mark and sweep collector: you can imagine having special-cased tracing loops for each region size (or at least the most common ones), and jumping from one loop to the next when the region size changes. The common path stays fast because you can just mask and compare the next pointer from the tracing stack (mask with a constant, compare against a register) to check that you’re still in the same region as before, falling back to the generic region-finding algorithm above only when that test fails.

Tracing GC vs Reference counting

The next question is whether we should be using a tracing GC or reference counting. The conventional wisdom seems to be that reference counting isn’t performance-competitive, but keep in mind that:

  • Most GC research seems to happen in the context of Java. I.e. an extremely heap-heavy language, where each object is small and consists largely of pointers to other heap objects. This favors tracing over RC.
  • Even in this context, modern reference counting optimizations can reclaim the performance difference vs. tracing collectors. See Down for the Count? Getting Reference Counting Back in the Ring.
  • For our hypothetical language we will have large objects, and few heap pointers, so the overhead for reference counting will be lower both in space and time.
    • Few pointers mean fewer reference counts to keep up to date.
    • Few allocations mean fewer reference count fields that take up space.
  • Per-thread GC heaps means we don’t need atomic reference count operations.
  • Reference counting costs are proportional to the size of garbage, rather than the size of the heap. This seems better for scalability in the future where heap sizes will keep getting bigger and bigger.
  • A strongly typed high level language should be immutable by default, or at the very least track mutability, which means we can statically know that large object graphs are acyclic (as immutable values must be, in a strict language) – no need to trace these in any cycle collector.

One of the benefits RC has that I’m not too concerned about is promptness of deallocation: the idea that as soon as a sub-graph goes out of scope, memory is immediately reclaimed. This is appealing, but it doesn’t eliminate pause times as some might think (you tend to get big “bubbles” that burst as a linchpin object is deallocated, causing the mutator to stall for a long time as deallocations cascade). Many RC optimizations rely on giving up this benefit in order to be competitive with tracing GC, so it’s a benefit I’m prepared to sacrifice.

Basically, I think that with a language that has the properties I’ve outlined for a hypothetical modern and performance oriented language (or the non-hypothetical language Rust), reference counting may well come out ahead.

There’s one final bit of conjecture: if immutable values are the default and therefore common in this language, the vast majority of heap pointer mutations will happen on the stack (i.e. in local variables). In other words, you will grab some pointers to the heap, shuffle them around on the stack for a while in various algorithms, and then they go out of scope. This may not be true, but it seems very plausible. It implies that alongside various static analyses to avoid reference counting where possible (such as Rust’s “borrowed pointers”, which amount to safe versions of C’s naked pointers), you can simply add the most naïve form of deferred reference counting and eliminate most of the remaining reference counting operations that way.

The idea is basically this. Don’t do any reference counting operations as the result of writing to stack variables; only do them for heap variables. In other words, grabbing a pointer to a heap allocation and storing it in a local variable does not introduce a reference counting operation; only mutating a heap object’s pointer to another heap object does (and these operations should be rare, largely because mutable pointer fields in data structures would be rare). If a reference count goes to zero during execution we can’t just delete the object immediately, in case there are also stack variables referring to it; instead we put a pointer to it in a thread-local “zero count buffer”, or ZCB. Put all newly allocated heap objects in this buffer too (unless we can statically ensure that they will be immediately referenced by another heap object).

Once this buffer is full, or some other heuristic is met (e.g. a certain number of bytes has been allocated), stop the current thread, scan the stack for references into the GC heap, and increment the reference counts of those objects. This restores accurate reference counts to all objects that are potentially dead. Then scan the ZCB and delete any objects whose reference counts are still zero. Finally, go through the stack again and decrement the reference counts you just incremented, so that reference counts once again do not take the stack into account.

The benefit is twofold. First, we only perform reference counting for stack-rooted pointers periodically, rather than every time one is written, which by the conjecture above should eliminate the vast majority of the remaining reference counting operations. Second, we batch our deallocations into groups instead of deallocating throughout normal execution, which should improve locality. The cost of writing an object to the ZCB when its reference count goes to zero is also low: if we enforce that the ZCB always has at least one free element left, we can safely do an unconditional write into it and test for a full ZCB afterwards, without stalling the subsequent code. Modern out-of-order CPUs will correctly predict the branch and keep going on the assumption that the ZCB is not full, which is the common case.
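Here’s a sketch of how that could look (every name below is invented; in a real runtime the write barrier and the stack walking would be generated by the compiler, and the ZCB would be thread-local):

#include <cstddef>
#include <vector>

struct GcObject {
    int  refcount = 0;
    bool in_zcb   = false;  // guards against duplicate ZCB entries
    // ... payload, plus maps telling the stack scanner where its heap pointers are
};

static const std::size_t kZcbCapacity = 4096;
static std::vector<GcObject*> zcb;  // zero count buffer (per-thread in the real thing)

// Stub: a real runtime would walk the thread's stack and registers using
// compiler-emitted maps, calling f on every live pointer into the GC heap.
template <typename F> void for_each_stack_root(F f) { (void)f; /* walk stack maps here */ }

// Stub: a real implementation would run destructors and decrement the counts
// of the object's heap children (possibly enqueueing them in the ZCB too).
void free_object(GcObject* obj) { delete obj; }

void zcb_flush()
{
    // 1. Temporarily add the stack's references so counts are accurate.
    for_each_stack_root([](GcObject* o) { ++o->refcount; });

    // 2. Anything in the ZCB still at zero is genuinely dead.
    for (std::size_t i = 0; i < zcb.size(); ++i) {
        if (zcb[i]->refcount == 0)
            free_object(zcb[i]);
        else
            zcb[i]->in_zcb = false;  // survived; may be enqueued again later
    }
    zcb.clear();

    // 3. Remove the stack's contribution again, so normal execution keeps
    //    ignoring stack-rooted pointers. Anything that drops back to zero is
    //    only referenced from the stack, so keep it around for the next flush.
    for_each_stack_root([](GcObject* o) {
        if (--o->refcount == 0 && !o->in_zcb) {
            o->in_zcb = true;
            zcb.push_back(o);
        }
    });
}

void maybe_enqueue(GcObject* obj)
{
    if (obj->in_zcb) return;
    obj->in_zcb = true;
    zcb.push_back(obj);            // unconditional write; capacity check afterwards
    if (zcb.size() >= kZcbCapacity)
        zcb_flush();
}

// The only reference counting happens on heap-to-heap pointer writes.
// Writes to stack variables do nothing at all; that's the whole point.
void write_heap_field(GcObject** field, GcObject* new_value)
{
    GcObject* old = *field;
    *field = new_value;
    if (new_value) ++new_value->refcount;
    if (old && --old->refcount == 0)
        maybe_enqueue(old);        // possibly dead; decided at the next flush
}

// Newly allocated objects also start out in the ZCB.
GcObject* gc_alloc()
{
    GcObject* obj = new GcObject();
    maybe_enqueue(obj);
    return obj;
}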

There are other common reference counting optimizations in the literature, such as coalesced reference counting. I believe that in a language such as Rust these would have low value (there being less RC traffic to start with), and that a very simple deferred scheme like the one above would get most of the benefit while avoiding the cost of per-object dirty bits and the like.

Anyway. This post is getting long. It may also be the worst kind of blog post – too specific to be interesting for most people, and written by a non-expert in the area so that people who would be interested think it’s either obvious or stupid. Let me know which in the comments!