I’ve moved!


I’ve switched away from wordpress.com. Please use www.sebastiansylvan.com instead. All old posts have been copied over there as well.


Brackets are awesome, don’t use them!


This is a syntax bike-shedding post, where I will try to explain why I think languages should stop using curly brackets for code blocks. I don’t expect anyone to come away completely blown away by this or anything, but I’m a language nerd and often end up discussing in-development languages with their designers. And since approximately 90% of all language discussion is about syntax, I’ll just write a reusable note on brackets here.

Why brackets are awesome

Brackets are awesome because you can use them to group things in a way that visually resembles the contents being “grabbed” from both directions. This serves to isolate the contents visually, and allows for nesting. Observe:

((a b) c)

How awesome is that? It’s immediately clear just visually that ‘a’ and ‘b’ belong together, and that they’re nested inside another pair with ‘c’. There are several kinds of brackets; we could use curly brackets and they work just as awesomely at grouping and nesting things:

{{a b} c}


The reason this works is that the brackets visually look like they are physically containing any horizontal content between them, due to the “grasping” nature of the brackets themselves and their mirror-symmetry.

When brackets are not so awesome

It’s no coincidence that the previous two examples look visually pleasing; the brackets were designed with that use in mind. Observe what happens if we make a slight adjustment:


Ugh! Not nearly as nice! It gets worse when we flip it around 90 degrees:

Suffice it to say, brackets only really work if you put stuff on a horizontal line with the brackets oriented the right way around.

This brings me to how curly brackets are used in programming languages:

while (running) {
    update();
    render();
}
Here we’re really no better off than the previous example. Because the content is laid out vertically, the brackets are not oriented the right way around. The symmetry of the brackets is along the wrong axis, and they do not visually “grasp” the contents.

So is this some kind of horrible atrocity that leads any curly-bracket language to unavoidable doom? Well, no, obviously not. You learn to see the curly brackets as “start” and “end” tokens for code blocks, so the fact that the brackets are no longer awesome at grouping the content isn’t such a huge deal.

So why the fuss? Well because brackets are awesome for so many other things! There’s a ton of things in a programming language that could use a set of brackets to group and nest things (indexing, record literals, tuple literals, function arguments, generics, lifetime annotations, etc. etc.). But if we’re “wasting” one of the bracket types on vertical content then it’s no longer available for uses where it would be more valuable (unless you overload them to mean multiple different things in different contexts, which IMO makes things look more “samey” and cluttered – see round brackets in Lisp).

Alternatives to curly brackets for code blocks

I propose languages use simple “do”/“end” keywords for code blocks. They can be typed about as quickly as brackets, and IMO reduce the “line-noise” quality of the language. Plus, they lend themselves better to a unified indentation style – you avoid the wasted lines of Allman style without spurious arguments about symmetry (since they’re just keywords, people are less likely to object to the “do”/“end” not lining up):

if (x == y) do
    printf("equal\n");
end

For language constructs that always have a code block “hanging off of them” the “do” keyword could be implicit without introducing any ugly asymmetry. For example, these C-style constructs could all be valid even though for each code block we’ve left off the initial “do” for brevity.

for (int i = 0; i < n; ++i)
    printf("%d\n", i);
if (x == y)
    printf("equal\n");
void func(int x)
    printf("%d\n", x);

This of course requires that each of these constructs have a clear syntactic “end point” before the code block. E.g. you could choose to get rid of the round brackets in an if-statement and make the “do” required:

if x == y do

Why (most) High Level Languages are Slow


In the last month or two I’ve had basically the same conversation half a dozen times, both online and in real life, so I figured I’d just write up a blog post that I can refer to in the future.

The reason most high level languages are slow usually comes down to two things:

  1. They don’t play well with the cache.
  2. They have to do expensive garbage collection.

But really, both of these boil down to a single reason: the language heavily encourages too many allocations.

First, I’ll just state up front that for all of this I’m talking mostly about client-side applications. If you’re spending 99.9% of your time waiting on the network then it probably doesn’t matter how slow your language is – optimizing network usage is your main concern. I’m talking about applications where local execution speed is important.

I’m going to pick on C# as the specific example here for two reasons: it’s the high level language I use most often these days, and if I used Java I’d get a bunch of C# fans telling me how C# has value types and therefore doesn’t have these issues (this is wrong).

In the following I will be talking about what happens when you write idiomatic code. When you work “with the grain” of the language. When you write code in the style of the standard libraries and tutorials. I’m not very interested in ugly workarounds as “proof” that there’s no problem. Yes, you can sometimes fight the language to avoid a particular issue, but that doesn’t make the language unproblematic.

Cache costs review

First, let’s review the importance of playing well with the cache. Here’s a graph based on this data on memory latencies for Haswell:

The latency for this particular CPU to get to memory is about 230 cycles, meanwhile the cost of reading data from L1 is 4 cycles. The key takeaway here is that doing the wrong thing for the cache can make code ~50x slower. In fact, it may be even worse than that – modern CPUs can often do multiple things at once so you could be loading stuff from L1 while operating on stuff that’s already in registers, thus hiding the L1 load cost partially or completely.

Without exaggerating we can say that aside from making reasonable algorithm choices, cache misses are the main thing you need to worry about for performance. Once you’re accessing data efficiently you can worry about fine tuning the actual operations you do. In comparison to cache misses, minor inefficiencies just don’t matter much.

This is actually good news for language designers! You don’t have to build the most efficient compiler on the planet, and you totally can get away with some extra overhead here and there for your abstractions (e.g. array bounds checking). All you need to do is make sure that your design makes it easy to write code that accesses data efficiently, and programs in your language won’t have any problems running at speeds that are competitive with C.

Why C# introduces cache misses

To put it bluntly, C# is a language that simply isn’t designed to run efficiently with modern cache realities in mind. Again, I’m now talking about the limitations of the design and the “pressure” it puts on the programmer to do things in inefficient ways. Many of these things have theoretical workarounds that you could do at great inconvenience. I’m talking about idiomatic code, what the language “wants” you to do.

The basic problem with C# is that it has very poor support for value-based programming. Yes, it has structs which are values that are stored “embedded” where they are declared (e.g. on the stack, or inside another object). But there are several big issues with structs that make them more of a band-aid than a solution.

  • You have to declare your data types as struct up front – which means that if you ever need this type to exist as a heap allocation then all of them need to be heap allocations. You could make some kind of class-wrapper for your struct and forward all the members but it’s pretty painful. It would be better if classes and structs were declared the same way and could be used in both ways on a case-by-case basis. So when something can live on the stack you declare it as a value, and when it needs to be on the heap you declare it as an object. This is how C++ works, for example. You’re not encouraged to make everything into an object-type just because there’s a few things here and there that need them on the heap.
  • Referencing values is extremely limited. You can pass values by reference to functions, but that’s about it. You can’t just grab a reference to an element in a List<int>, you have to store both a reference to the list and an index. You can’t grab a pointer to a stack-allocated value, or a value stored inside an object (or value). You can only copy them, unless you’re passing them to a function (by ref). This is all understandable, by the way. If type safety is a priority, it’s pretty difficult (though not impossible) to support flexible referencing of values while also guaranteeing type safety. The rationale behind these restrictions doesn’t change the fact that the restrictions are there, though.
  • Fixed-size buffers don’t support custom types and also require you to use the unsafe keyword.
  • Limited “array slice” functionality. There’s an ArraySegment<T> type, but it’s not really used by anyone, which means that in order to pass a range of elements from an array you have to create an IEnumerable, which means allocation (boxing). Even if the APIs accepted ArraySegment parameters it’s still not good enough – you can only use it for normal arrays, not for List<T>, not for stack-allocated arrays, etc.

The bottom line is that for all but very simple cases, the language pushes you very strongly towards heap allocations. If all your data is on the heap, accessing it is likely to cause cache misses (since you can’t decide how objects are organized in the heap). So while a C++ program poses few challenges to ensuring that data is organized in cache-efficient ways, C# typically encourages you to allocate each part of that data in a separate heap allocation. This means the programmer loses control over data layout, which means unnecessary cache misses are introduced and performance drops precipitously. It doesn’t matter that you can now compile C# programs natively ahead of time – the improvement in code quality is a drop in the bucket compared to the cost of poor memory locality.

Plus, there’s storage overhead. Each reference is 8 bytes on a 64-bit machine, and each allocation has its own overhead in the form of various metadata. A heap full of tiny objects with lots of references everywhere has a lot of space overhead compared to a heap with very few large allocations where most data is just stored embedded within their owners at fixed offsets. Even if you don’t care about memory requirements, the fact that the heap is bloated with header words and references means that cache lines have more waste in them, which in turn means even more cache misses and reduced performance.

There are sometimes workarounds you can do, for example you can use structs and allocate them in a pool using a big List<T>. This allows you to e.g. traverse the pool and update all of the objects in-bulk, getting good locality. This does get pretty messy though, because now anything else wanting to refer to one of these objects has to have a reference to the pool as well as an index, and then keep doing array-indexing all over the place. For this reason, and the reasons above, it is significantly more painful to do this sort of stuff in C# than it is to do it in C++, because it’s just not something the language was designed to do. Furthermore, accessing a single element in the pool is now more expensive than just having an allocation per object – you now get two cache misses because you have to first dereference the pool itself (since it’s a class). OK, so you can duplicate the functionality of List<T> in struct-form and avoid this extra cache miss and make things even uglier. I’ve written plenty of code just like this and it’s just extremely low level and error prone.

Finally, I want to point out that this isn’t just an issue for “hot spot” code. Idiomatically written C# code tends to have classes and references basically everywhere. This means that all over your code at relatively uniform frequency there are random multi-hundred cycle stalls, dwarfing the cost of surrounding operations. Yes there could be hotspots too, but after you’ve optimized them you’re left with a program that’s just uniformly slow. So unless you want to write all your code with memory pools and indices, effectively operating at a lower level of abstraction than even C++ does (and at that point, why bother with C#?), there’s not a ton you can do to avoid this issue.

Garbage Collection

I’m just going to assume in the following that you already understand why garbage collection is a performance problem in a lot of cases – pausing randomly for many milliseconds is usually unacceptable for anything with animation. I won’t linger on it; instead I’ll move on to explaining why the language design itself exacerbates this issue.

Because of the limitations when it comes to dealing with values, the language very strongly discourages you from using big chunky allocations consisting mostly of values embedded within other values (perhaps stored on the stack), pressuring you instead to use lots of small classes which have to be allocated on the heap. Roughly speaking, more allocations means more time spent collecting garbage.

There are benchmarks that show how C# or Java beat C++ in some particular case, because an allocator based on a GC can have decent throughput (cheap allocations, and you batch all the deallocations up). However, this isn’t a common real world scenario. It takes a huge amount of effort to write a C# program with the same low allocation rate that even a very naïve C++ program has, so those kinds of comparisons are really comparing a highly tuned managed program with a naïve native one. Once you spend the same amount of effort on the C++ program, you’d be miles ahead of C# again.

I’m relatively convinced that you could write a GC more suitable for high performance and low latency applications (e.g. an incremental GC where you spend a fixed amount of time per frame doing collection), but this is not enough on its own. At the end of the day the biggest issue with most high level languages is simply that the design encourages far too much garbage being created in the first place. If idiomatic C# allocated at the same low rate a C program does, the GC would pose far fewer problems for high performance applications. And if you did have an incremental GC to support soft real-time applications, you’ll probably need a write barrier for it – which, as cheap as it is, means that a language that encourages pointers will add a performance tax to the mutators.

Look at the base class library for .Net – allocations are everywhere! By my count the .Net Core Framework contains 19x more public classes than structs, so in order to use it you’re very much expected to do quite a lot of allocation. Even the creators of .Net couldn’t resist the siren call of the language design! I don’t know how to gather statistics on this, but using the base class library you quickly notice that it’s not just in the choice of value vs. object types that the allocation-happiness shines through. Even within this code there’s just a ton of allocations. Everything seems to be written with the assumption that allocations are cheap. Hell, you can’t even print an int without allocating! Let that sink in for a second. Even with a pre-sized StringBuilder you can’t stick an int in there without allocating using the standard library. That’s pretty silly if you ask me.

This isn’t just in the standard library. Other C# libraries follow suit. Even Unity (a game engine, presumably caring more than average about performance issues) has APIs all over the place that return allocated objects (or arrays) or force the caller to allocate to call them. For example, by returning an array from GetComponents, they’re forcing an array allocation just to see what components are on a GameObject. There are a number of alternative APIs they could’ve chosen, but going with the grain of the language means allocations. The Unity folks wrote “Good C#”, it’s just bad for performance.

Closing remarks

If you’re designing a new language, please consider efficiency up front. It’s not something a “Sufficiently Smart Compiler” can fix after you’ve already made it impossible. Yes, it’s hard to do type safety without a garbage collector. Yes, it’s harder to do garbage collection when you don’t have uniform representation for data. Yes, it’s hard to reason about scoping rules when you can have pointers to random values. Yes, there are tons of problems to figure out here, but isn’t figuring those problems out what language design is supposed to be? Why make another minor iteration of languages that were already designed in the 1960s?

Even if you can’t fix all these issues, maybe you can get most of the way there? Maybe use region types (a la Rust) to ensure safety. Or maybe even consider abandoning “type safety at all costs” in favor of more runtime checks (if they don’t cause extra cache misses, they don’t really matter… and in fact C# already does similar things – see covariant arrays, which are strictly speaking a type system violation and can lead to runtime exceptions).

The bottom line is that if you want to be an alternative to C++ for high performance scenarios, you need to worry about data layout and locality.

Variable Refresh Monitors and Animation


So variable refresh rates have a lot of hype surrounding them now, with representatives from GPU and monitor vendors saying that they will just magically make existing games better with no effort required on the part of the developer. This is bullshit and you shouldn’t believe them. If you’re a developer you need to be careful to handle these new refresh rates, and if you’re a customer you should consider just locking your monitor to 60Hz for games that don’t do the right thing.

The problem

The issue is that games do animation. In order to compute a frame for display, the game essentially needs to predict when that frame will be presented, and then compute the positions of all the objects at that time. This is only partially true because there’s lag in the display pipeline that we don’t know anything about, but since that lag affects every frame the same you can typically just ignore it, advance the game state by 1/60th of a second each frame, and your animation will look smooth.

However, if the update frequency is variable then the amount of time it takes to render and display a frame changes every frame. So when we’re computing the next frame we don’t know how far ahead to “step”. It may have been 16ms last frame, but this frame it may be 15 or 17 – we just can’t tell in advance because we don’t know how long it will take the CPU and GPU to finish the frame. This means that objects no longer move smoothly in the world, they will move in a jerky fashion because the amount they get advanced each frame isn’t consistent with how long it took to actually display the frame.


So the first solution here is to just lock your framerate at some fixed number. 60Hz is a decent option, but if you have variable refresh rates maybe you’ll go for something odd like 45Hz or whatever. In fact, I would be surprised if the sweet spot for smoothness/performance just happens to be 60Hz by sheer coincidence. However, this is troublesome because our graphics APIs don’t give us an easy way to do this right now – perhaps these fancy new monitors could just let us set the refresh rate at arbitrary intervals instead and we can go back to using vsync, perhaps with some kind of clever fallback (if you miss the vsync, it uses variable refresh to present the frame as quickly as possible and then “resets” the framerate at that point, so that the next “vsync” happens 1/refresh later).

However, picking the refresh rate up front is tricky, even if we can pick any number we want. The game is going to run better on some scenes than others, and it will also vary from system to system. Maybe this could be automatic? The game could render each frame with a “fixed” target in mind, this target is set in such a way that we’re very confident we can hit it, and all the physics and animation is updated by that amount of time. If we can simultaneously tune this target based on how long it typically takes to update the frames, we’ll end up running faster where we can, and slower where we need to. So the idea is this:

  1. Render the frame.
  2. If the current frame time is less than the target frame time, do a busy wait then present just at the right time (this should be the overwhelming majority of frames).
  3. If the current frame time is more than the target frame time, present immediately and increase the target time.
  4. If the frame time was significantly less than the target frame time, and has been for some time, slightly decay the target frame time so that we get better fluidity.

There are many ways you could tweak this. For example, maybe you pick a target frame time based on a histogram of the most recent 100 frame times, or something – picking a time so that 99 of them would’ve come in under budget. Or maybe you fit a normal distribution to it and pick the 99th percentile target frame time.

Or maybe something simpler could work. For example, we could make sure our target frame rate is always 10% higher than the longest frame in the last second (or something). This means that as our actual frame times drop the target frame time will drop too (after a second) and we get better fluidity. If our frame times creep up slightly we will pre-emptively bump up our target to 10% on top of that to give ourselves breathing room in case they keep going up. If we do end up with a frame spike that’s worse than 10% of our worst frame in the last second, we will immediately bump up the target frame time to 10% above that new worst case. You would tune that percentage based on your game – if your frame times vary slowly you could use a lower buffer percentage, and if it varies quickly you may need a larger buffer percentage.

The biggest problem with this scheme is step 2 above: the busy wait. In a typical renderer this means we can’t actually start doing the rendering for the next frame, because we have to baby-sit the present for the previous frame to make sure it doesn’t fire too early. This wastes valuable CPU cycles and GPU/CPU parallelism. Maybe in DX12 we could have a dedicated “presenter thread” that just submits command buffers and then does the busy-wait to present, while the main render thread actually does the rendering into those command buffers. Really what we want here is a way to tell the OS to sync to an arbitrary time period. For example, if the last couple of frames have all taken around 20 ms (which we can tell by measuring the CPU time, as well as inspecting some GPU profiling queries a frame later to see how long the GPU took) we could call some API to set the vsync period to 22 ms just to give us some buffer and avoid jittering. This way the OS could potentially use a HW timer to avoid eating up CPU cycles.

Either way, let’s at least stop pretending that variable refresh rates are magic that can just be enabled for any game and look good. The developer needs to know when these things are enabled and take steps to make sure animation and physics remains smooth. It would be ideal if future APIs could actually help us out in doing this, but either way we need to do something to handle this.

Flags are a code smell


In most major code bases, it’s common to have objects with lots of Boolean flags indicating what state they’re in. E.g. is an enemy targeting the player or not? Is it trying to flee or not? If you have a set of mutually exclusive flags an enum can be used instead, but this is often not possible, and doesn’t change much in terms of the downsides.

In my opinion these are code smells. Smells don’t necessarily indicate actual issues, just patterns that are likely to be used incorrectly and thus warrant extra care.

In many cases if you think about how to avoid flags you’re forced to change the code in ways that result in clearer code (specifically several simple functions rather than one general function that checks a dozen flags), as well as improved performance due to fewer branches and cache misses. Let me stress that the main gain here is the improved code quality, so don’t be fooled into thinking this is some kind of optimization – better performance is a happy side effect. And yes, there are cases where flags are the clearer option as well (see below).

An example

For a somewhat contrived example, consider a game rendering system. Each object in the world has some rendering state associated with it, this would be things like what meshes and materials to use, but also various flags determining how it’s to be rendered.

Let’s say we have a flag called isDynamic, which tells us if the object can move or if it’s static. We can change the flag throughout the life of the object, but when it’s set to false we assume it’s going to be static for some time (so we pay some up-front cost to pre-calculate some acceleration structures that pay off during actual rendering).

Sometimes you want to temporarily toggle rendering for an object so we check this in the renderer as well by adding an isVisible flag.

In addition, the object as a whole may have an isEnabled flag which lets us disable the whole object (physics, animation, rendering all ignore it when it’s not enabled).

OK, so far we have three flags, which means we have 2^3 = 8 possible combinations of flags – but do we really have eight separate legal states? No! E.g. an object which is not enabled can’t be visible. You could also argue that an object which is disabled or invisible is neither dynamic nor static. So we have significantly more possible states than legal states, and it’s very easy to accidentally end up in an illegal state by forgetting to check a flag in one of the many branches throughout the renderer (e.g. perhaps you go down the “isDynamic” branch somewhere for an object which is invisible, thus getting yourself into an inconsistent state). As you add more flags the set of possible states explodes, and it’s common for the interdependencies to be complex and easy to mess up in all the places you have to check.

An alternative

Instead of storing all these flags, take a step back and think about what the point of each flag is. The isDynamic flag is there to distinguish between dynamic and static objects. A different way of doing this is for the renderer to simply keep an array of dynamic objects and an array of static objects.

The isVisible flag is different: it means the object should be in neither the dynamic nor the static array. For the whole-object enabled flag we can do something similar, but for each sub-system.

The exact way you avoid any given flag will differ depending on what the purpose of the flag is, but you can usually find a way to avoid it.

So what’s the point of all this? Well, the main benefit is that all the code spread out through the renderer can now be simpler. Instead of checking flags for a complicated set of legal states all over the code, we just have completely separate functions for each kind of state. E.g. there’s one method to go through all the static objects and render them, and another to go through all the dynamic ones. These functions may call into shared helper functions here and there (to avoid duplication), but if you want to know how dynamic objects get rendered there’s a single place to start reading, and it will be easier to understand because the code is handling fewer cases.

The complexity isn’t entirely gone, it’s just moved. It’s now in methods like “SetVisible”, “SetDynamic” and so on. This is far better, since the complexity is now localized, making it easier to ensure correctness.

A brief note on performance: as we’re now separating things into different arrays to keep the logic clearer, you can often also split the data up across different arrays, which improves the cache locality for the functions that deal with that particular part of the state. For example, objects may use different data depending on whether they’re static or dynamic, so if you can avoid having to “step over” the static data in the dynamic code-path that’s going to save lots of cache misses.

What about checking which state an object is in?

If you use the proposed scheme, most code never actually has to check any per-object flags, because the data already comes pre-sorted to the “processing” code (and is kept sorted by the state transition code). However, there are exceptions. Sometimes you really do want to check what state a specific object is in. This means doing an array lookup in each of the different arrays to see if the object is there.

To make this fast, you could store the index for each array in the object, which gives you fast lookups. This does mean you’re still storing state-related information in the object itself (one index per array), but it can be private information, so you still get the other benefits – the renderer still doesn’t have to check flags.

Another option is to make sure each object uses the same index in all the arrays it’s in. You generate a unique ID for each object, and use this as the location for the object in all arrays. The obvious way to do this is to allocate MAX_OBJECT_COUNT-sized arrays for all possible states, but that can of course be wasteful (and costly – gaps in the array cost cache misses when traversing). A better way is to use a “sparse array”, which only stores objects that are actually in the array, but still allows O(1) lookup. A hash table is a reasonable backing-store for this, but you can do better if you let your IDs be unique ints with uniform distribution (no need to actually hash anything, use the ID as-is in a Robin-hood style hash table).

When not to use this

I said above that flags are a code smell, not code bugs! There’s plenty of cases where using flags is entirely appropriate. E.g. if you set flags multiple times per frame, then the cost of moving objects between different arrays all the time likely makes this a bad idea. Also, if you use the sparse array idea above, checking a flag is a lot more expensive, so if you do this frequently you should probably at least cache the result of the lookup in a flag. Another scenario is if you only have one of something; there’s really no reason to have multiple arrays in that case, since most arrays will be empty. Also, sometimes the added cognitive overhead of multiple arrays just isn’t worth it if the code is trivial. Last but not least, sometimes it’s just damn difficult to figure out how to avoid a flag, and spending too much effort jumping through hoops to satisfy some rule of thumb is just dogmatic.

Even if you go with the suggested approach above, you may still need an actual flag to handle the state transitions appropriately. E.g. if you make an object visible after it’s been invisible you probably need to know if you should add it to the dynamic or static array now that it’s visible. The point isn’t to have no flags at all (despite the title), it’s to move the resulting complexity to the code that actually deals with the state transitions, instead of spreading (and duplicating) that same complexity all over the code.

More resources

If this approach appeals to you, you really should look into data-oriented design. I’m a fan of this approach (first and foremost consider the data, and how it gets processed, don’t obsess over object identity), but I would advocate moderation. There are often benefits to having some of the state with the object, instead of having everything be implicit based on what component arrays it’s in (in terms of readability, and sometimes even performance). Resist the temptation to be an extremist about this, that’s really no different than the programmer who uses design patterns every chance they get.

For more on data oriented design, check out http://www.dataorienteddesign.com/dodmain/ .

Another good intro to this and other topics, from a performance perspective, is “Code Clinic: How to Write Code the Compiler can Actually Optimize” by Mike Acton at GDC 2014 (available at the GDC Vault with membership – you really need to see the talk here though, the slides alone aren’t enough).

Another practical case study is Pitfalls of Object Oriented Programming by Tony Albrecht.


The Perils of Future-Coding



In my opinion, one of the most insidious forms of technical debt is what I like to call future-coding. You might also call it over-engineering, second-system syndrome, etc. It’s code written in an overly elaborate and general way in order to, the future-coder reasons, pre-emptively handle use cases that we’ll “probably see in the future”.

I find it more treacherous than plain “stupid code” because it predominantly affects smart people. It’s mostly smart people with limited experience – once you see the downsides of future-coding a few times, you tend to knock it off (although there are powerful confirmation biases at work) – but these are still bright people you’d want to hire and keep around in your organization, because they’ll do great things in the future.

Let me illustrate my point using a hypothetical example with two equally smart programmers, where one just has a little bit more experience than the other and is less prone to future-coding.

The Tale of Two Street Lights

Let’s say you’re making a game where a large portion of the time the player is navigating an urban environment. One day the artists ask for a feature to make it easier to place street lights in a level. They say: “Can we have a way to click a path through the level and have the game place lights at an even spacing along it?” This seems like an easy enough request so you start implementing it.

Here’s where our story forks into two divergent paths.

First option – the pragmatic coder

The pragmatic coder writes some quick UI in the editor where the artist clicks on points in the level, which get stored in an array.

On the runtime side he writes a dozen-line for-loop that simply walks the path segments at equal spacing and spawns a street light each time. Job done, the pragmatic coder moves on to the next task.
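That dozen-line loop really can be that small. Here’s a minimal sketch of the idea (the types, names, and 2D simplification are my own, not from any real codebase):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// 2D for brevity; a real level would use 3D points.
struct Vec2 { float x, y; };

// Walk the clicked path at a fixed spacing, invoking `spawn` at each
// placement position. `spawn` stands in for whatever actually creates
// a street light entity.
template <class SpawnFn>
void place_along_path(const std::vector<Vec2>& pts, float spacing, SpawnFn spawn)
{
    float next = 0.0f; // distance into the current segment for the next light
    for (size_t i = 0; i + 1 < pts.size(); ++i) {
        const Vec2 a = pts[i], b = pts[i + 1];
        const float seg_len = std::hypot(b.x - a.x, b.y - a.y);
        if (seg_len == 0.0f) continue; // skip degenerate segments

        float t = next;
        while (t <= seg_len) {
            const float u = t / seg_len; // interpolation factor along segment
            spawn(Vec2{ a.x + (b.x - a.x) * u, a.y + (b.y - a.y) * u });
            t += spacing;
        }
        next = t - seg_len; // carry the remainder into the next segment
    }
}
```

No abstractions, no factories – just a loop over segments that carries the leftover distance across segment boundaries so the spacing stays even around corners.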

Second option – the future-coder

The future-coder writes the same UI as the pragmatic coder, but for the runtime he decides upon a different approach. After all, he’s read books about design patterns and large scale software design, and he knows about the importance of the Single Responsibility Principle, and keeping code factored for changes that might occur in the future.

Thus, he decides to abstract over the exact type of path used. Sure, the artists only asked for a simple path today, but in the future they might want a B-spline or some other kind of curve, so he pulls out the logic for the path into its own Path object that can step along the curve in some kind of iterator fashion. Of course, that’s not enough, in order to change the behavior of the curve he puts this Path object behind an interface and adds a factory to spawn the right kind of path based on some kind of configuration parameter. He also updates the UI to allow for future kinds of paths. Maybe he even implements a B-spline option just so that he can test this properly with more than one option.

Now, the code to spawn street lights first creates an abstract path from a factory, but how does he determine the spacing of the objects? The artists asked for simple equal spacing today, but in the future they might want something different, so he writes another abstract class that handles the spacing policy (with a factory to go with it).

Street lights in your game happen to be reasonably symmetric, so today just spawning them with their default orientation works fine, but in the future the artists might want to use this system to spawn objects whose orientation needs to depend on the curve, so our future-coder updates the path object to also supply tangents and bi-tangents, and writes an abstract class to handle the specific placement policy in order to take this new information into account.

Phew, that was a lot of work – he’s added a lot of new classes and glue code, and the code that actually uses it is a lot more opaque and hard to read than what the pragmatic programmer came up with. It will be worth it in the future, however, when this extra flexibility will pay off. The future-coder goes home late today, but feels good that he’s written an elegant system that will stand the test of time, rather than just a dumb one-off implementation.

Reality Kicks In!

Let’s go through a few scenarios of what might happen next.

Good news, no changes!

Turns out this simple system was all the artists ever needed. They never ask for any changes, and the code is shipped unchanged from the original implementation.

The pragmatic coder feels good that his code did exactly what was needed and never caused any issues or technical debt in the project.

The future-coder feels bad that he spent all this time on an extensible system and nobody ever made the most of the flexibility in his elegant design, but at least it was used at all and the artists liked it, so it’s not too bad.

The feature gets cut

A week later the artists decide that they don’t mind placing street lights manually after all, since it gives them extra control. The automatic street light placing feature goes unused, and is eventually deleted.

The pragmatic coder barely remembers the feature and couldn’t give a damn.

The future-coder feels bad that his elegant system was completely deleted. He grumbles to the other programmers about the awful artists and designers who always keep changing the requirements and wasting so much of his time asking for stuff that they end up not using.

Requirements change

The artists are happy with the feature, but they have one minor request: “Yes, I know we asked for equal spacing of the street lights, but actually we want to make sure that there’s always a street light on each corner of the path, and that the spacing within each path segment is slightly adjusted to ensure this”. To the artists this feels like a simple change so they don’t feel too bad about asking for it.

For the pragmatic coder this is a trivial change. He updates his dozen-liner to first compute the number of street lights required for each segment by dividing the length by the desired spacing and rounding it to the nearest integer, then computing an updated spacing by dividing the segment length by this count. This takes a few minutes of his time and he moves on to other things.

The future-coder starts thinking about how to accommodate this new request into his streetlight-placing framework. Shit-shit-shit, he thinks, a frustrated panic starts to spread in his mind. He never anticipated this change and can’t really see how to update his system to handle it. What does it even mean to place a light at “each corner” when the path is completely abstract? After all, if a future path is defined with a spline, the “corner” may not have a control point on it.

And even if you did have a way to define what a “corner” means, how do you feed that data to the object that governs the spacing policy, now that it’s completely decoupled from the path object? This will require some major re-architecting he thinks, and postpones the task until he can find a larger slot in his schedule. Later that night the future-coder sits in a bar with a programmer coworker, complaining about those damn artists that keep changing the specs on him, oblivious to the irony that it’s precisely his overly complicated attempt to be robust to spec-changes that is causing him problems in the first place.

Ah, a solution! He decides to define “corners” in terms of the curvature of the path, with a configurable threshold, and then feeds this data to the spacing policy object so it can use it to compute the appropriate spacing. The code gets kind of messy since the path and spacing classes are now getting more coupled, and there’s a lot of tunable parameters involved, but it does the job so he checks it in and goes home.

The next day the artists inform him that his solution doesn’t work at all. They want to use this as a way to ensure exact placement of street lights even on straight-line stretches of road, by placing dummy control points in the path. So this notion of using curvature to define street light positions just doesn’t work.

Let’s leave the future-coder to work out a solution to this self-inflicted problem. I have no idea what he could be doing to rescue his system at this point, but whatever he comes up with it’s unlikely to be pretty, leading to even more technical debt and complexity.

Requirements change – redemption for our future-coder!

In this scenario the requirements change in a different way. The artists now want to have the option of using either a straight path with sharp corners, or a smooth B-spline path.

The pragmatic coder adds a flag to the UI, and a switch in his runtime code that either calls the old placement code, or calls some new code that interprets the control points as a B-spline and steps through them using his math library’s B-spline code. The code is still easy to understand, almost certainly contains no bugs, and it didn’t take too long to update.

The future-coder feels redeemed! He may even have the B-spline option already implemented, so he can simply point the artists at the appropriate UI option. But even if he has to implement the B-spline option from scratch, it’s a simple matter of implementing a new Path object that iterates over a B-spline and updating all the relevant factories and configuration options. Yes, it’s more code than what the pragmatic coder needed to write to make this change, it’s still much harder to read and navigate, and it’s not at all as clear that it doesn’t contain bugs. But look at the UML diagram! It’s so elegant, and follows the single responsibility principle, and dammit, we’ve already needed one additional option, so in the future when the number of options grows to a dozen or so we’ll really see the benefit of this elegant framework!

The future-coder wins the lottery

This time the scenario is that the artists keep asking for more and more kinds of paths over time, with increasingly elaborate spacing options, and it becomes one of the most reused systems in the whole project.

After a few different path options have been added, each having a few different spacing options, the pragmatic coder starts seeing the limits of his original design, and decides that a more abstract and generic design now has greater benefits than downsides. So he refactors his code so that the path is specified by an abstract path object, and pulls the spacing policy out into its own object too. At the end of the day he ends up with a system much like the future-coder’s, but doesn’t feel too bad, because he spent almost no time on the original system in the first place, and its simplicity meant he could get the initial system going quickly and respond to change requests faster than the future-coder could’ve.

The future-coder feels like a goddamn programming genius. He totally predicted this and it paid off! He is now more certain than ever that just spending some time up front predicting the future will pay off down the line.

He’s blind to the fact that for every time he happens to get it right, there are a hundred other instances where his failed predictions led to unnecessarily complicated code, time wasted debugging because you can’t tell that the code obviously has no deficiencies, and additional development effort because the extra scaffolding required by his general systems made it harder to address changes that weren’t anticipated.

It’s like he’s won at the roulette table in Vegas – he’s convinced that he’s just that good, that he’s found a system. He’ll keep spending his money at the casino, without ever really doing the math on whether his overall winnings are higher than his losses. Everyone says they’re up on average in Vegas, right? It’s a mystery that casinos still make money!


We probably all have a bit of the future-coder in us, but we must resist him! The bottom line is that the future-coder almost never wins. In reality, predicting the future is virtually impossible, so don’t even try; it hardly ever pays off. In the best case you’ve merely wasted a lot of time, but in practice the abstraction adds cognitive overhead throughout the project whenever someone reads or debugs the code, can (ironically) make it harder to deal with unanticipated changes, and can cause performance issues. And even when it does pay off, it rarely buys you much, and may still be a net negative because it skews your intuition into thinking that future-coding is a good thing.

The human brain didn’t evolve to think in abstract terms. You will probably have noticed this when trying to understand a new mathematical concept – it’s much easier if you start with concrete examples rather than the fully general definitions and proofs! Simple code that just does one thing in the dumbest way possible takes very little time for someone to understand, whereas if you have to jump through abstract classes and factories (each named in suitably abstract ways) it can take an absurd amount of time to understand something that really isn’t all that complicated.

All these extra virtual calls act as barriers to inlining, in addition to the overhead of the virtual function calls themselves, which cause performance issues. Yes, I know you’ve been told that you shouldn’t worry about performance for code that isn’t in a hotspot, but the reality is that for many applications there aren’t any real hotspots anymore. It’s depressing to look at the profiler for gameplay code and see that your warmest “hotspot” takes 5% of your frame time, and has already been optimized to death. By that time it’s often too late – performance has been siphoned off by tens of thousands of small inefficiencies added over many years, all because people didn’t want to worry about minor performance hits for code that wasn’t on the mythical “hot-path”. Simple code tends to be cheaper to execute, most of the time, in addition to being better in almost every other way as well.

So am I saying that we should just hack things together in the quickest way possible, with no regard for future maintainability? No, of course not. I’m advocating doing things in the simplest way possible, not the quickest way possible. Often, the simplest way requires removing a bunch of cruft that has accumulated over time, deleting stateful flags, pulling logic out into clearly named functions etc. Sometimes, it even requires an abstract factory.

You should optimize for simplicity, readability and modifiability. If you do, you will also end up with code that’s more maintainable, easy to modify and optimize in the future. Take the example above where our heroes needed to add support for a B-spline. Let’s assume that there was no library for interpolating a B-spline already so that they had to add it. I’d expect both types of programmers to pull that code out into its own function. My motivation for doing this isn’t that I’m trying to predict the future, though! You pull it out into its own function because doing so makes the code easier to read today, by making sure the B-spline functionality is hidden behind a sensibly named function instead of cluttering up the “flow” of the main function with unnecessary details. The minor cognitive cost of the extra function call is vastly outweighed by the increased readability of hiding unimportant details. It’s not about predicting the future, it’s about increasing readability.

Sensibly named function calls without side effects almost always add more to readability than they cost; if they’re virtual, less so; if they have side effects, even less so; if they’re in a class, even less so; and so on. It’s all a tradeoff, and the break-even point for any given abstraction will change depending on the specifics of what you’re doing.

Very few abstraction techniques have no merit at all, but the road to programmer hell is paved with best-practices, applied too early. You should postpone applying these abstraction techniques until the cognitive and other costs of using them are clearly outweighed by their benefits (e.g. when you have demonstrated reusability for a component). There are often genuine systems even in game engines where it’s perfectly reasonable to apply many of the abstraction techniques you’ve read about in books about design patterns and large scale software design etc., but the vast majority of code is better off if you keep things simple, and avoid trying to predict the future.

More on Robin Hood Hashing


I posted about Robin Hood hashing in a previous post. In the comments to that post David Wragg pointed out a flaw in my implementation: if you repeatedly delete and reinsert elements into a hashtable, the average probe count keeps rising. This is a good observation, and it’s true. Repeated deletions make the table slower, at least until you insert enough elements to trigger a rebuild of the table, which will filter out any tombstones (note: the number of tombstones does not affect the rate of rebuilds).

Now, first of all, it’s worth noting that the probe count grows slowly – at least after a slight tweak to the insertion method (if the existing element is deleted, allow replacement even when the probe counts are merely equal, rather than requiring the existing one to be strictly smaller). In David’s test case it grew to about 30 after a million deletions (with my hash function it’s more like 20).

The other point to make is that we already accept periodic “batch” operations such as the table-rebuilding that happens whenever you hit the max load factor. For that reason, it’s relatively reasonable to just reinsert all items again (skipping tombstones) whenever some number of deletions have occurred (e.g. 5N or something – it doesn’t have to be very frequent since the average probe count grows so slowly with the number of deletions).
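The bookkeeping for this is trivial. Here’s one way it could be structured (the names and the exact policy are my own sketch, not taken from my actual implementation):

```cpp
#include <cassert>
#include <cstddef>

// Count deletions and signal a rebuild once they reach some multiple
// of the table size. On rebuild, the caller reinserts all live
// elements into a fresh array, discarding tombstones and resetting
// probe distances.
struct DeletionTracker {
    size_t deletions = 0;

    // Returns true when the caller should rebuild the table.
    // threshold_factor = 5 corresponds to the "5N" suggestion above.
    bool on_delete(size_t num_slots, size_t threshold_factor = 5) {
        if (++deletions >= threshold_factor * num_slots) {
            deletions = 0;
            return true;
        }
        return false;
    }
};
```

Since the rebuild already exists for load-factor growth, this reuses machinery the table has anyway; the only new cost is one counter increment per deletion plus the occasional amortized rebuild.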

So, there’s a simple and fairly bullet proof solution to this problem, but what if we prefer an incremental approach, where we make each deletion slightly slower, but never have to run a big batch job to improve the average probe count?

I’m not sure if there are existing solutions with established asymptotic properties, but I tried a simple strategy that seems to work pretty well in my test. The idea is this: whenever we delete an element, we scan ahead for a small number of slots and rearrange them to limit (or remove) the impact of the delete on the average probe count. So long as this operation tends to improve the average probe count at a faster rate than the probe count would otherwise grow, you’ll end up with an average probe count that stays reasonable.

Here’s the code that runs at the very end of the “erase” method (after the element has already been destroyed):

int cur_ix = found_ix;
for (int i = 0; i < NUM_DELETE_FIXUPS; ++i)
{
    const uint32_t next_ix = (cur_ix + 1) & mask;
    if (next_ix == found_ix) break; // looped around

    // if the next elem is empty or has a zero probe dist,
    // then any query will always fail on the next elem anyway,
    // so might as well clear the current one too.
    if (elem_hash(next_ix) == 0 ||
        probe_distance(elem_hash(next_ix), next_ix) == 0)
    {
        elem_hash(cur_ix) = 0;
        return true;
    }

    // Now, the next elem has a probe count of at least 1,
    // so it could use our deleted slot to improve its
    // query perf.
    // First, pretend we're a dupe of the next elem.
    elem_hash(cur_ix) = elem_hash(next_ix);

    // Since we're now a dupe of the next elem, ordering is
    // arbitrary, which means we're allowed to swap places.
    // Note: current slot has been destructed, hence 'new'.
    new(&buffer[cur_ix]) elem(std::move(buffer[next_ix]));

    cur_ix = next_ix;
}
elem_hash(cur_ix) |= 0x80000000; // mark as deleted
return true;

You can choose a small number for NUM_DELETE_FIXUPS, or remove the condition in the loop (it’ll go through one pass of the table at worst, although that’s highly unlikely since it’ll stop on any empty element or an element with a zero probe distance). In my tests I’ve found that 2 will bound the variance and average probe count, but 3 will keep them slightly lower at marginal extra cost so that’s what I’ve set it to.

Again, this is just me messing around – I haven’t made any real attempt to prove the properties of this approach. There may be more rigorously investigated approaches in the literature somewhere. Given that there exists a simple fix with very low amortized cost (periodic reinsertion of elements), I don’t really consider this issue a deal-breaker even if this incremental approach doesn’t work out.