Garbage collection thoughts


One of the things that’s annoyed me for a while is that there aren’t any high performance, high level languages being seriously developed. It seems to me that most of the benefits of high level languages, if carefully designed, have no or very limited performance implications, yet we get languages that either largely abandon safety and productivity concerns even when there’s no reason to (iterations on C/C++ mostly, including Go and D), or high level languages that throw out performance considerations left and right for minor convenience gains (e.g. C#, Java).

Luckily this appears to be changing – languages like Rust seem very promising.

This brings me to one of the more controversial features of high level languages: garbage collection. IMO this is almost a no-brainer. Automatic memory management is such a productivity and safety booster that any high level language should have it, and I’m fairly convinced that it can be done in a high performance manner as well. We’ve just been conditioned to see problems with it because the host languages haven’t really tried to do this efficiently by making the right tradeoffs.

The first rather obvious optimization is to avoid GC work whenever possible – the easiest way to do this is to reduce the number of allocations. Languages like C# and Java get this entirely wrong. They heavily encourage heap allocations, so you tend to get them all over the place. Each object consists mostly of pointers to other heap allocated objects. Where’s the actual data, dammit? As a result, the GC has a crazy amount of work to do. C++ gets this right, and so does Rust. Most “member fields” in data types should just be stored in-place rather than behind a heap allocated pointer. This complicates the language if you want GC (especially with regard to taking pointers to internal parts of a data structure), but it’s worth the extra complexity. In practice this means that rather than each object having a massive fan-out where every member is another allocation, most objects will have only a couple of heap-allocated members – in aggregate this drastically reduces the heap graph and therefore the amount of work the GC has to do.
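To make the contrast concrete, here’s a small C++ sketch of the two layouts (the types are made up for illustration):

```cpp
#include <memory>
#include <string>

// Java-style layout: nearly every member is a separate heap allocation,
// so one logical object becomes many nodes in the heap graph.
struct BoxedParticle {
    std::unique_ptr<double> x, y, z;    // three pointers, three allocations
    std::unique_ptr<std::string> name;  // and another one
};

// In-place layout: the same data stored inline. The collector sees a
// single node, and only `name` owns any heap memory at all.
struct Vec3 { double x, y, z; };

struct Particle {
    Vec3 position;     // stored in-place, not behind a pointer
    Vec3 velocity;     // likewise
    std::string name;  // one of the few members that touches the heap
};
```

A collector walking a BoxedParticle has four heap edges to follow per object; walking a Particle it mostly sees the data itself.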

Another optimization, which Rust and C++ alike make, is to have “unique” pointers that refer to objects uniquely owned by their pointer. These could live in a separate heap altogether. Since these signify unique ownership, the object dies when its parent dies. Ownership can be transferred, but never shared (and this is statically ensured in Rust). This is particularly useful for reference counting based GC, since there’s no RC overhead for these pointers, but tracing collectors can often stop tracing early too, when they hit a unique pointer whose type has no GC pointers in its transitive “type graph”.
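In C++ terms (Rust’s owned pointers are the analogue, with the transfer-only discipline checked statically), a sketch:

```cpp
#include <memory>
#include <utility>

// A node that uniquely owns its children: no reference counts anywhere,
// and the whole subtree dies when the root dies.
struct Tree {
    int value = 0;
    std::unique_ptr<Tree> left, right;  // unique ownership, never shared
};

// Ownership can be transferred, but never duplicated. Because Tree has
// no shared (GC'd) pointers in its transitive type graph, a tracing
// collector that reaches a unique_ptr<Tree> can skip the whole subtree.
inline std::unique_ptr<Tree> adopt(std::unique_ptr<Tree> child) {
    auto parent = std::make_unique<Tree>();
    parent->left = std::move(child);  // the caller's pointer becomes null
    return parent;
}
```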

These two things basically boil down to this: the language should allow you to express simple memory ownership when you have it, and only resort to arbitrary (shared) heap allocations in extreme circumstances. The language should prefer and promote stack allocation, and in-place allocation of sub-objects. When you do need heap allocations, the language should nudge you towards unique ownership for most of them.

Even D, which is ostensibly supposed to be a language prioritizing both efficiency and safety, doesn’t get this right with its struct/class separation (I want to very carefully elect to put stuff on the heap only for the small subset of data where it’s absolutely necessary, rather than being forced – or at least heavily encouraged – into heap allocations because something happens to be a class).

The third and maybe most important optimization is to make garbage collection a per-thread operation. You simply don’t allow garbage collected heap allocations to transfer between threads. This means you can use efficient “stop the world” collectors (and compactors) without actually stopping the whole world – you only stop one thread. Assuming threads are cheap and ubiquitous (they should be), we may have hundreds or even thousands of isolated heaps – each of which may be stopped for a few milliseconds at a time, but “on average” there are always enough runnable threads that we keep the mutator resources (CPU cores) busy doing actual work and make overall progress. You might call it “concurrent garbage collection in aggregate”. Plus, most of these threads are probably blocked at any one time – running GC on a blocked thread is basically a poor man’s concurrent GC (from “inside the thread” it looks as if GC just happened while you weren’t looking, just like with a concurrent GC, but without expensive read or write barriers).

Anyway, the point is that before we even start talking about GC, the assumption is that we are in a language designed with memory efficiency in mind, and have far less garbage than any existing mainstream high level languages, and that we can run the garbage collectors on small, isolated islands without stopping the entire world (or resorting to expensive concurrent collectors requiring mutator memory barriers).

There are two issues that come up here that I’ve been pondering for a while: external fragmentation due to all these heaps, and tracing collection versus reference counting.

External fragmentation

A lot of high performance garbage collectors (such as Immix etc.) gain their performance largely due to their heap organization. Typically the heaps are allocated in big “regions”, e.g. 32KB each. This is problematic for per-thread GC since it implies a minimum heap size, which in turn implies that having tons (e.g. thousands) of tiny threads will lead to substantial memory waste in the form of external fragmentation. Even if each heap is efficient on its own (e.g. runs compaction periodically), for the overall system the empty space at the end of each heap will add up to substantial fragmentation that cannot be reclaimed.

Ideally, each heap would start small, like a few hundred bytes or so, and then grow using bigger and bigger regions as the overall heap size increases. I haven’t really seen any GC papers do this; they all seem to pick a fixed region size up front. The reasoning is probably quite simple: with a fixed region size you can find the start of a region by looking at the pointer itself (power-of-two sized regions make this cheap).

I only really mention this because I think I have an efficient way of identifying the region a pointer belongs to, even when region sizes vary, that I haven’t seen mentioned anywhere. Perhaps this is well known – I wouldn’t be surprised if people have done it for decades – but it’s not completely obvious, and it has been conspicuously absent from the GC books and papers I’ve read. So I should probably write it down in case it’s helpful to someone.

So the standard way of partitioning the heap is to have power-of-two region sizes so you can mask off the low-order bits of the pointer to get a pointer to the start of the region (you could place per-region metadata here, or store a pointer from the region to a metadata block “off to the side” – the latter reduces contention for the ways in a set-associative cache). The problem with this in the presence of varying region sizes is that you don’t know ahead of time how many bits to mask out (that depends on the region size, after all).

The basic idea is to simply try all the different region sizes and see if the proposed region is actually valid. The trick is to eliminate false positives. The main way to verify that an arbitrary pointer actually refers to a region is to store a pointer to every region in a separate array that user code cannot write to. If the current proposed region is not in this array, then it is not the start of a valid region. Call this the “region pointer array”. To make the lookup in this array fast, we also store, as the first DWORD of each actual region, an index into this array. Call this the “region index”. We store the region size in the metadata too (along with all the other stuff, such as occupancy bits, mark bits, reference count bits, and whatever else you may store per region).

To figure out which region a pointer refers to in this scheme, we simply loop through the different region sizes that our allocator allows, mask the appropriate number of bits from the pointer to get the start of the proposed region, read the region index in the first DWORD of this memory area (this may be garbage if we have the wrong size), use it to index into the region pointer array (taking care to avoid buffer overruns from garbage indices), and ensure that the pointer stored in that element refers right back to the proposed region. Finally, we must check that the size of the region at this address actually contains the original pointer.

This seems like a scheme that would allow us to have very small initial region sizes that quickly grow as the heap does, while only doing very simple tests to find which region a pointer belongs to. No, it’s not as cheap as a simple bitwise AND, but it should essentially optimize to that in the common case due to allocation coherency. Take the tracing loop in a mark-and-sweep collector, for example: you can imagine having special-cased tracing loops for each region size (or at least the most common ones), and jumping from one loop to the next when the region size changes. The common path stays fast because you can just mask and compare the next pointer from the tracing stack (mask with a constant, compare against a register) to ensure you’re still in the same region you were, and fall back to the generic region-finding algorithm above only when that test fails.

Tracing GC vs Reference counting

The next question is whether we should be using a tracing GC or reference counting. The conventional wisdom seems to be that reference counting isn’t performance competitive, but keep in mind that:

  • Most GC research seems to happen in the context of Java – an extremely heap-heavy language, where each object is small and consists largely of pointers to other heap objects. This favors tracing over RC.
  • Even in this context, modern reference counting optimizations can reclaim the performance difference vs. tracing collectors. See “Down for the Count? Getting Reference Counting Back in the Ring”.
  • For our hypothetical language we will have large objects, and few heap pointers, so the overhead for reference counting will be lower both in space and time.
    • Few pointers mean fewer reference counts to keep up to date.
    • Few allocations mean fewer reference count fields that take up space.
  • Per-thread GC heaps means we don’t need atomic reference count operations.
  • Reference counting costs are proportional to the size of garbage, rather than the size of the heap. This seems better for scalability in the future where heap sizes will keep getting bigger and bigger.
  • A strongly typed high level language should be immutable by default, or at the very least track mutability, which means we can statically know that large object graphs are acyclic (as immutable values must be, in a strict language) – no need to trace these in any cycle collector.

One of the benefits of RC that I’m not too concerned about is promptness of deallocation – the idea that as soon as a sub-graph goes out of scope, memory is immediately reclaimed. This is appealing, but it doesn’t eliminate pause times as some might think (you tend to get big “bubbles” that burst as a linchpin object is deallocated, causing the mutator to stall for a long time as deallocations cascade). Many RC optimizations rely on giving up this benefit in order to be competitive with tracing GC, so it’s a benefit I’m prepared to sacrifice.

Basically, I think that with a language that has the properties I’ve outlined for a hypothetical modern and performance oriented language (or the non-hypothetical language Rust), reference counting may well come out ahead.

There’s one final bit of conjecture: if immutable values are the default and therefore common in this language, the vast majority of heap pointer mutations will happen on the stack (i.e. in local variables). That is, you will grab some pointers to the heap, shuffle them around on the stack for a while in various algorithms, and then they go out of scope. This may not be true, but it seems very plausible. It implies that alongside various static analyses to avoid reference counting where possible (such as Rust’s “borrowed pointers”, which amount to safe versions of C’s naked pointers), you can simply add the most naïve form of deferred reference counting and eliminate most remaining reference counting operations that way.

The idea is basically this: don’t do any reference counting operations as the result of writing to stack variables; only do them for heap variables. In other words, grabbing a pointer to a heap allocation and storing it in a local variable does not introduce a reference counting operation – only mutating a heap object’s pointer to another heap object does (and these operations should be rare, largely because mutable pointer fields in data structures would be rare).

If a reference count goes to zero during execution we can’t just delete the object immediately, in case there are also stack variables referring to it; instead we put a pointer to it in a thread-local “zero count buffer”, or ZCB. Newly allocated heap objects go in this buffer too (unless we can statically ensure they will be immediately referenced by another heap object). Once this buffer is full, or some other heuristic is met (e.g. a certain number of bytes has been allocated), stop the current thread, scan the stack for references into the GC heap, and increment the reference counts for those objects. This restores accurate reference counts to all objects that are potentially dead. Then scan the ZCB and delete any objects whose reference counts are still zero. Finally, go through the stack again and decrement the reference counts you just incremented, so that reference counts once again do not take the stack into account.

The benefit is twofold. First, we only perform reference counting for stack-rooted pointers periodically, rather than every time one is written, which by the conjecture above should eliminate the vast majority of remaining reference counting operations. Second, we batch our deallocations into groups instead of deallocating throughout normal execution, which should improve locality.

The cost of writing an object to the ZCB when its reference count goes to zero is low. If we enforce that the ZCB always has at least one free element left, we can safely do an unconditional write into it and perform the test for a full ZCB afterwards, without stalling the subsequent code. Modern out-of-order CPUs will correctly predict the branch and keep going on the assumption that the ZCB is not full, which is the common case.
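A compressed sketch of the scheme (names are mine; the stack scan is simulated with an explicit root list, and “freeing” just sets a flag so the effect is observable):

```cpp
#include <vector>

struct Obj {
    int rc = 0;         // counts heap-to-heap references only, never the stack
    bool freed = false; // stands in for actual deallocation, for observability
};

static std::vector<Obj*> zcb;  // "zero count buffer": potentially dead objects

// New objects start at rc == 0 and go straight into the ZCB.
static Obj* gc_new() {
    Obj* o = new Obj();
    zcb.push_back(o);
    return o;
}

// Only heap slots pay for RC traffic; writes to stack variables are free.
static void heap_write(Obj** slot, Obj* value) {
    if (value) value->rc++;
    if (Obj* old = *slot)
        if (--old->rc == 0) zcb.push_back(old);  // maybe dead -- defer the decision
    *slot = value;
}

// Periodic collection: temporarily add the stack's references back in,
// reap whatever in the ZCB is still at zero, then take them out again.
static void collect(const std::vector<Obj*>& stack_roots) {
    for (Obj* o : stack_roots) o->rc++;  // counts are now accurate
    for (Obj* o : zcb)
        if (o->rc == 0 && !o->freed) o->freed = true;  // a real GC would free here
    zcb.clear();
    for (Obj* o : stack_roots) o->rc--;  // the stack is untracked again
}
```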

There are other common reference counting optimizations in the literature, such as coalesced reference counting. I believe that in a language such as Rust these will have low value (there’s less RC traffic to start with), and that a very simple deferred scheme like the above would get most of the benefit while avoiding the cost of per-object dirty bits and the like.

Anyway. This post is getting long. It may also be the worst kind of blog post – too specific to be interesting for most people, and written by a non-expert in the area so that people who would be interested think it’s either obvious or stupid. Let me know which in the comments!



48 thoughts on “Garbage collection thoughts”

  1. Interesting.

    My position is that languages with GC for high performance apps have two issues: Policy and Data placement.

    Policy: if the developer knows the allocation patterns of an algorithm, a custom GC policy can be designed to optimize storage allocation. For instance: no deallocations -> arena/stack allocator; few deallocations -> two-space copying collector. If a developer uses C++, an optimized allocation strategy can be used for each specific module of code.

    Data Placement: where is data stored when an instance is created? Cache misses are a big hit on performance. Developers need to place data according to algorithm data access patterns. GC languages (Java, C#) typically do not have this kind of flexibility.

    GC research seems to focus on getting the most out of the average case in the context of GC languages such as functional (Haskell, Lisp, OCaml, etc.) and dynamically typed languages (Python, Ruby, Lua, etc.).

    For high performance apps where allocation policy and data access patterns can be predicted or designed, I believe that custom allocation/deallocation wins over GC. I might be wrong, but I can’t see how a GC app can outperform a carefully designed app without GC where data allocation/deallocation and data access patterns are known in advance and taken advantage of.

    • While custom allocators can outperform a generic allocator, the actual situations where this happens are few and far between, so long as you’re using a modern allocator with all the associated man-hours put into it.

  2. I guess the way a GC app outperforms a carefully designed app without GC is by having 100% memory safety, unconditionally, eliminating a whole class of bugs and freeing up time to optimize more code. It’s a fallacy to think that those access violations and memory scribbles don’t have a performance cost – they do, because there’s never enough time to optimize, and crashes take priority.

    One upside of per-thread GC (and lots of cheap threads, basically one per “system” at least) would be that you potentially have more GC algorithms to choose from, since each just needs to run on a tiny little heap that’s isolated from the rest of the program.

    I agree with data placement being an issue. It’s crucial, and that’s why I had a big caveat at the start that we would need to be talking about a language that doesn’t make stupid tradeoffs in that area for questionable convenience gains (like C#, Java). You absolutely should be able to choose how data is laid out in great detail, and use arena allocators with “borrowed” pointers to pass memory around (to use Rust’s terminology – which does implement arena allocation as a library) and things like that. In most cases you would try to keep allocation simple – heap memory is owned and released by something specific (e.g. using unique pointers). I stress: in most cases you would *not* allocate shared (i.e. GC’d) memory. You’d stay away from it as much as possible.

    GC is for the pedestrian glue code and other places where data really does need to be passed around in a shared way for a while before going away once enough systems no longer care, and where worrying about ownership gets hairy (and uninteresting, from a performance standpoint). These are the places where memory crashes creep in today in C++ anyway, IME. In places where memory usage has been carefully considered and implemented using some custom algorithm, there’s typically not an issue, because someone actually sat down and thought about it. It’s the little bits in between, where you just need some memory to pass around for a while and then go away – that’s where you need the language and runtime to just deal with it for you.

  3. You should not treat C# the same as Java. Like D (or D like C#), there is a clear separation between heap-allocated classes and stack-allocated structs and value types. In Java everything is an object; in C# it is not. Both reference types and value types can have attributes that define their memory layout and alignment. You can pass value types by reference to avoid unnecessary copying. These are several things you can’t do in Java, so C# is less heap-heavy.
    You can control the lifetime of an object or struct through well-designed use of the IDisposable pattern. It has to be used explicitly, but can be ‘automated’ to be called when out of scope with the using keyword.
    You can also control ownership of an object with careful use of weak references and strong (regular) references. I’ve not thought enough if this is all that is needed for various kinds of ownerships, but often it is enough.

    Even if you really want more control, you can mark your assembly as ‘unsafe’. Then, you can use pointers, references, unsafe casting, and stack allocations (through stackalloc).

    All of this, of course, requires that you are aware of these features of the language and know how to use them. It is very easy to use them badly (or not use them at all). But it is also very easy to program badly in any language.

    As an aside, in the CLR v4+ there is a concurrent GC that can walk the reference graph of all threads completely in parallel, pausing only a single thread (not all of them) when it is absolutely necessary.

    I admit that C# is not a language as flexible (in terms of memory and performance) as C/C++, and that there are things that would be good to add (e.g. struct destructors, more deterministic finalizers, etc.), but as a high-level ‘managed’ language it is very powerful and efficient (if used correctly). It is a more performance oriented way of programming than Java (by which C# was partially inspired).

    • While C# has structs, they’re really a red-headed stepchild of the language. You don’t get far trying to use structs for everything (e.g. you can’t heap allocate them, or store and manipulate references to them, etc.). I vehemently disagree with the policy of separating types into “heap” vs “stack” types (which D does as well). Sometimes I want stuff on the stack, and sometimes on the heap; it makes no sense to have to make one decision up front for all future uses.

      All you have to do is look at actual C# programs and see how many classes vs structs you see. I think you’ll find that C# is very much a heap-oriented language. You can jump through hoops to avoid heap allocations with structs, and sometimes you can make it work, but it’s clearly not designed to favor that idiom IMO. So at the end of the day I think the difference between Java and C# here is mostly academic – idiomatic code in either will be heavily heap oriented.

      • D doesn’t use struct/class to distinguish between stack/heap types, respectively. It uses struct/class to distinguish between value/reference types. Yes, classes ‘default’ to heap storage and structs to stack storage, but the scope keyword allows stack allocation of classes and the new keyword will heap allocate any type, including structs. BTW, D has become expressive enough that scope could be implemented as a library function.

      • I think you’re a little off on C#; saying idiomatic C# is heap oriented ignores the large number of people using it to do high performance computing and native interop, who largely focus on using pooled allocators tuned to their problem set (usually with value types). Your statements concerning structs are patently false as well; there are a number of ways to allocate a struct on the heap: as a member of a class, when closed over by a lambda, when lifted by an async block, or using unsafe blocks, to name the canonical examples that come to mind.

        Using the Marshal class you can obtain references to a struct or class on the heap in a ‘safe’ way as well, should you desire to manipulate it.

        In an unsafe context you can do anything you would expect in a low level language; allowing the hot spots of the code to do things like reuse memory intelligently or allocate new space on the stack with stackalloc.

      • Ron, I think saying that idiomatic C# is “heap heavy” is compatible with loads of people jumping through the necessary hoops to avoid heap allocations. I’m aware of the unsafe keyword, and the low-level stuff (e.g. various marshaling stuff) you can do if you really need to, and I realize you can wrap a struct in a class to put it on the heap, but I’m saying these things are cumbersome workarounds that only solidify my assessment that programming with structs is not the “encouraged” approach (in the sense that the language semantics and syntax gently guide you towards it as the most convenient way to do things – it may be the encouraged approach in the sense that it’s what you need to do to get the best performance).

        Sandford, interesting. I’ll have to look into this in greater detail. You’re saying I can use “scope” to put a class instance as an “inline” member inside another class as well, or is it just for the stack? Or are there any other tricks for collapsing large object graphs into “fat” objects where members are stored inline instead of behind a pointer?

  4. You should seriously try Go. You mention it in one place alongside C/C++ and D, but what you’re saying is not quite true of Go.

    1. Go is a safe language. It’s so safe that Google’s App Engine runs it – we have a sandbox as well, which not many other languages do. Yes, Go has pointers, but the major sources of unsafety are things like out-of-bounds array access, pointer arithmetic, and unsafe type casts. Go’s type system is very strict, arrays are bounds-checked (always), and there is no pointer arithmetic.

    2. Go is productive. Productivity is not just about writing code, it’s about reading it as well. You can’t call C# and Java productive – they are not. Both languages are super verbose and hard to read. Go is easy to write; many say it’s just like a scripting language. Go is easy to read, with a very lightweight syntax and simple enough semantics. Not to mention that it has very fast compilation. I’m saying this from my personal experience (I’ve written > 30k lines of Go code).

    3. Go actually does reduce generated garbage. Unlike Java, Go has very explicit control over memory layout. Structs can be part of other structs, just like in C/C++. Sure, there are no user-defined constructors and destructors in Go. But Go has the “defer” statement, which gets executed when the function returns – somewhat like D’s scope, except it’s dynamic and not tied to a scope.

    4. Yes, Rust, unlike Go, has this whole notion of task-local stuff. Some people think it’s a solution to the garbage collector problem, because the semantics of the language allow task-local garbage collection cycles; others (including me) think that this idea greatly complicates the semantics and brings little benefit.

    I tried writing code in Rust and I find it extremely unpleasant to work with – where’s the productivity you’re talking about?

    • I wouldn’t consider a language where every reference is potentially null “memory safe”, and I wouldn’t consider a language with shared mutable state “concurrency safe”. So while Go is better than some other alternatives, it too falls into the trap of giving up safety for no real reason (e.g. statically eliminating null dereferences costs you no performance).

      I also don’t like their approach to default values. If you have a “Person” object whose “name” field contains the string “”, then that’s really just as incorrect as if it had contained the string “?:(#*S)” (i.e. random garbage). In other words, allowing uninitialized memory to be read is bad – even if you marginally improve things by introducing some determinism using default values. Again, there really isn’t a good reason to allow this – you wouldn’t lose any performance or productivity by not leaving this hazard in the language.

      • Statically eliminating null dereferences is impossible if you want to have null at all (real null, not just an Option type).
        You are suggesting that you wouldn’t lose any performance by not allowing access to uninitialized variables – how would you implement that? I suspect that reading from a struct in Go is as fast as in C.

      • Statically eliminating null dereferences is not impossible at all; plenty of languages do it. Just make sure that pointers can’t be accessed before they’re initialized, and that null isn’t a valid initializer. If you do that, then transitively a pointer can never be null when you dereference it. An option type over a pointer could be implemented as a pointer with a magic “null value” under the covers, if you’re worried about performance. That makes it just as “real” as any other nullable pointer type; it’s just clear from the type that it can be null (instead of every single pointer potentially being null).

        Disallowing access to uninitialized variables would happen statically, using simple (conservative) data flow analysis. Similar to how warnings for accessing uninitialized variables are implemented in most compilers these days.
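A C++ sketch of that distinction (hypothetical types; Rust gets this effect with its non-nullable pointers plus Option, applying the “magic null” representation automatically):

```cpp
#include <cstddef>

// A pointer that is non-null by construction: there is no way to build
// one from nullptr, so dereferencing it never needs a runtime check.
template <typename T>
class NonNull {
    T* p_;
public:
    explicit NonNull(T& ref) : p_(&ref) {}  // only from an actual object
    T& operator*() const { return *p_; }
    T* get() const { return p_; }
};

// The opt-in nullable case: still a single machine word under the
// covers, with nullptr as the magic "none" value -- zero space overhead.
template <typename T>
class OptionPtr {
    T* p_;
public:
    OptionPtr() : p_(nullptr) {}              // none
    OptionPtr(NonNull<T> p) : p_(p.get()) {}  // some
    bool has_value() const { return p_ != nullptr; }
    T& value() const { return *p_; }  // the type forces a check at the use-site
};

static_assert(sizeof(OptionPtr<int>) == sizeof(int*),
              "the optional pointer costs nothing over a raw pointer");
```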

      • Could you point to a compiler that statically eliminates uninitialized variables? It is either not Turing-complete or it can be used to solve the halting problem, like this:

        var aVariable
        if(does_stop(collatz_sequence)) aVariable = 0
        print aVariable

        Of course, a compiler could always refuse to compile code like this (javac does, for example), but that would be a false positive.

      • Disallowing uninitialized variable access isn’t the same as statically proving that there are no uninitialized variable accesses in code that has no restrictions on variable declaration or use. You’re allowed, when designing languages, to impose restrictions on what code can be written in order to preserve properties you care about. The simplest version of this would be to simply require that all variable declarations include an initializer (a la Haskell) – you couldn’t possibly read an uninitialized value that way.

        You can relax that with data flow analysis, but at some point you’re going to tell the programmer to either A) shuffle their code around to avoid code where the compiler can’t tell that a variable is initialized, or B) explicitly give the variable a “default” value. Option A is perfectly fine, because code that’s too complicated for the compiler to understand is generally too complicated for humans to understand too, and option B is a lot better than taking the approach that e.g. Go does (i.e. you opt-in to default values only where needed, instead of having that mess everywhere).

        To put it more concretely: let’s say you have some code where you can’t figure out how to initialize a pointer to something non-null due to complicated initialization logic or whatever. Well, in that case just make it a nullable pointer (even though you really do intend for the pointer to be non-nullable after the initialization is done) and take the hit of checking the pointer at the use-site. Obviously you’d make sure you have enough tools in the language so that these spurious nullable pointers are as few as possible (and in practice they are exceedingly rare), but even in the extreme corner cases you never end up with a null dereference crash at runtime.

      • (to zielmicha)

        It’s conservative, of course. But it works very well in practice.

        Also, just because we can’t statically eliminate all null dereferences doesn’t mean we shouldn’t try to eliminate most of them. The vast majority of pointers in Rust code tend to be @T, not Option.

  5. This has bothered me for a long time too. Some languages set maximum performance as a goal, and then try to maximize expressivity and safety within those constraints (C++). Other languages set maximum expressivity (Ruby) and/or safety and abstraction (Haskell) as the goal, and then see how much performance they can get out of it (to varying degrees of success). Whereas the big corporate-backed middle languages like Java and C# seem to have no particular goal except to not be too different. (So maybe I like languages that try to be Pareto-optimal in some way, but I’m not sure if that’s completely accurate.)

    And yeah, Rust is a breath of fresh air.

    WRT GC, have you seen the paper “A unified theory of garbage collection”[1]? The idea is that tracing and reference counting GC are duals of each other, and as you make compromises in one of them to mitigate some of its drawbacks, it increasingly takes on the characteristics of the other – deferred reference counting is a great example of this.


    • Thanks, I know about Erlang. While it does a lot of things right (no shared mutable state, thread-local GC), it’s really nowhere close to beating e.g. C++ for performance code, which is kinda where I’m going with this.

  6. What is high performance computing, exactly? From my perspective, if the humans using your system are happy with its performance, then chances are you’re wasting valuable time worrying about nothing. Speed only matters when it actually matters to customers, not to a developer.

    • Well my own perspective is video games. There’s no way I could write a AAA video game engine in Erlang. At 60Hz everything is performance sensitive.

      There are plenty of domains where performance isn’t that important, and the right approach is to do precisely what you say – get “enough” performance and then don’t worry about it. But then there are domains where performance is a critical aspect that ties almost directly to sales (e.g. better graphics sells), and those are the domains I’m concerned about.
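      The 60Hz number translates into a hard budget, which some back-of-the-envelope arithmetic makes concrete (a sketch; the figures are just the standard frame-rate math):

      ```rust
      // At 60 Hz each frame gets 1000/60 ~= 16.7 ms for *everything*:
      // simulation, rendering, and any GC work that happens to run.
      fn frame_budget_ms() -> f64 {
          1000.0 / 60.0
      }

      // How many whole frames a GC pause of `pause_ms` swallows.
      fn frames_dropped(pause_ms: f64) -> u32 {
          (pause_ms / frame_budget_ms()).floor() as u32
      }

      fn main() {
          // A "low latency" 100 ms pause drops six consecutive frames,
          // which players see as a very visible hitch.
          assert_eq!(frames_dropped(100.0), 6);
      }
      ```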

  7. 1) It’s funny: the designer of the Nimrod language seems to have reached many similar conclusions.
    2) About reference counting GC vs tracing GC, I’m not sure: one of the most advanced GCs, C4, is a tracing GC, so I would argue that this is the benchmark to be compared against.

    • I had never heard of Nimrod. Scanning the front page of the web site, it looks very interesting! I shall give this a go! Thanks!

      C4 looks interesting, but it appears to have a read barrier, which scares me. Not modifying mutator code appeals to me because it means best-case performance should be comparable with non-GC mutator code. I’d much rather win back performance for the GC itself by making some careful design decisions in the language than need that kind of heavy lifting in the GC algorithm. For example, what I mentioned before: if you’ve effectively split your heap into a hundred or more smaller heaps (by using thread-local GC, and having lightweight threads so that each “component” is its own thread), you may now be able to use the most efficient stop-the-world collector available, but run the collectors independently (and even concurrently, for more parallelism). The fact that each one only operates on a tiny portion of the heap, concurrently with mutator execution for the rest of the program, means that in aggregate it’s “effectively” concurrent.

  8. “It may also be the worst kind of blog post – too specific to be interesting for most people, and written by a non-expert in the area so that people who would be interested think it’s either obvious or stupid.”

    Not at all! As someone who is interested in Rust but only knows bits and pieces about it this post was very interesting to me. Thanks for writing it! 🙂

  9. Just a little note: the academic community has been investigating alternatives to full-blown GC for decades now, e.g., ML Kit for Standard ML in the early 1990s. For some reason, it’s taken Rust, 20 years later, to raise awareness of these ideas in a wider community.

  10. I wish we could stick with reference counting in Rust, since the potential performance advantages are numerous, as you mention. There are a couple of big issues though:

    1. Cycles. Even the “Down for the Count” paper used backup GC/cycle collection to deal with them. Once you have backup GC or cycle collection you have the same sort of latency issues that you deal with in stop-the-thread GC—the pauses just happen less often (or they may not—it depends on the number of object cycles). Firefox probably has the biggest deployment of Bacon cycle collection in industry and the pauses are a thorn in our side; the work to get the pause times under control has been awesome, but it’s mostly consisted of adding a whole bunch of application-specific hacks to remove large object subgraphs that we know can’t contain cycles from consideration by the collector.

    2. Temporary read-only parallel access to large object graphs. Sadly, this is the situation in parallel web page layout, where there isn’t any alternative to automatic memory management, as the DOM and layout trees are complex doubly-linked graphs which content JavaScript can unpredictably alter at will. All browser engines use either reference counting without cycle collection (everything but Firefox) or reference counting with cycle collection (Firefox) for this. We have ideas as to how to use compiler-enforced purity to temporarily allow read-only shared access for @ graphs, but for RC this only really works if the reference counts are thread-safe—whereas with tracing garbage collection we can just root the nodes from the parent thread and we’re fine.

    • I’m not at all familiar with Firefox’s cycle collector, so I’m not sure which optimizations it does (or even can do). There are three main ones I can think of:
      1. Type-based cycle elimination. If the type graph has no cycles, neither can the values.
      2. Mutability-based cycle elimination. A (strict) purely functional language can have no cycles. Rust isn’t purely functional, but it’s closer than most languages (it encourages immutability). You’d hope this would discourage cycles.
      3. You’re trying to find cyclic *garbage*, not just cycles. This means only objects that have had at least one deref operation need to be considered by the cycle collector.

      How much of the manual work you’re doing in Firefox would be unnecessary in Rust (e.g. due to 1 and 2)?

      Also, I imagine a simple “acyclic” pragma would be easy to add. I’m not sure if this is more manual work than dealing with a GC.

      For the parallel access I see your point. I’m not sure how best to solve it. Maybe if you did a deferred scheme like I discussed you could insert memory fences only once you discover that (at least one) thread-local ZCB is full, since that’s the only point where you will take irrevocable action based on the ref count. That would at least make the common ref-count ops cheap.

      • Optimization (3) is done by all cycle collectors (it’s called the purple buffer).

        Optimization (1) is not really true as you state it, unfortunately, because it doesn’t account for existential types. For us, existential types appear in closures and objects. These types can potentially refer to anything; closures can close over anything, after all. Closures are unfortunately pretty common even in functional code. This is probably the #1 source of cycles in practice, actually—most of the leaks in rustc are right now due to data structures that contain closures that close over the data structure itself.
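        In modern Rust the same trap is easy to reproduce with `Rc` and a closure that captures a strong handle back to its owner (a sketch; the `Widget` type and field names are invented for illustration):

        ```rust
        use std::cell::RefCell;
        use std::rc::Rc;

        // A widget holding a callback. If the callback captures a strong
        // handle back to the widget, reference counting alone can never
        // free either of them.
        struct Widget {
            on_click: RefCell<Option<Box<dyn Fn()>>>,
        }

        fn make_cycle() -> Rc<Widget> {
            let w = Rc::new(Widget { on_click: RefCell::new(None) });
            let w2 = Rc::clone(&w);
            // The closure owns an Rc<Widget> and the widget owns the
            // closure: a cycle the type graph can't see, because the
            // closure's captures are hidden behind an opaque type.
            *w.on_click.borrow_mut() = Some(Box::new(move || {
                let _ = &w2; // captured strong reference
            }));
            w
        }

        fn main() {
            let w = make_cycle();
            // Two strong counts: ours, plus the one inside the closure.
            assert_eq!(Rc::strong_count(&w), 2);
        }
        ```

        (Today’s standard-library answer is to capture a `Weak` handle instead, which is the programmer’s responsibility rather than the collector’s.)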

        As for optimization (2), it’s true that we have fewer cycles than many other languages (like JavaScript for example). But they do come up, and we’ve had several instances of folks running servers with a lot of leaks because of them. 😦

        But I do think that even when we have garbage collection, we should still support reference counting via “smart pointer” objects in the standard library for acyclic data. I think this would be similar to your proposed “acyclic” pragma. Rust is about giving the programmer low-level control, and there’s no reason to not support this when we have the type system machinery to make it memory-safe. In that case, it’s your responsibility to make sure there are no cycles.

  11. Pingback: Garbage Collection | Java

  12. C# is hardly suitable for creating low-power, CPU-demanding (video & audio) mobile apps. The core functionality must be implemented in C++ (or another deterministic language). And after reading this,
    I don’t feel like Microsoft can revamp the .NET and C# performance story. So I expect a new framework from Microsoft in 2 years.

    >>…Let me know which in the comments!
    Liked your post very much! Also, a topic for a future post: could you write an “AAA video game engine” without C++ templates and meta-programming facilities in general?

  13. Pingback: On GC in Games (response to Jeff and Casey) | A Random Walk Through Geek-Space

  14. These days I am considering the D language as a replacement for C++ and was curious about its GC; your article gives insight into all the widespread techniques and also new ideas – great one, thanks!
    I particularly like the point you made about how silly it is to have references as a rule, and then complicate the garbage collector because of it. An example of tooling gone wrong.

  15. If you have thread-local heaps, how do you do async programming? Async programming is an absolute requirement for large scale network services. It’s an absolute requirement for battery efficient applications too. While a thread is active and waiting on a synchronous operation, your application will keep the CPU awake. With async callbacks, the CPU can go into an idle state until the pending operation completes and wakes it up. Then the callback will happen on a different thread but needs access to data from the first thread.

    • You also have an exchange heap type thing where you can store messages that get transferred to other threads (without copying). You’d have static checks that you can’t access the message after it’s sent.
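      Modern Rust channels behave exactly this way: sending moves ownership of the message (no copy of the heap data), and the compiler statically rejects any later access from the sender. A minimal sketch:

      ```rust
      use std::sync::mpsc;
      use std::thread;

      // Send a heap-allocated message to another thread by moving
      // ownership; the sender cannot touch it afterwards, and the
      // check is done at compile time.
      fn send_across_threads(msg: Vec<i32>) -> Vec<i32> {
          let (tx, rx) = mpsc::channel();
          let handle = thread::spawn(move || rx.recv().unwrap());
          tx.send(msg).unwrap();
          // `msg` has been moved into the channel; any use of it here
          // would be a compile error -- the static check wanted above.
          handle.join().unwrap()
      }

      fn main() {
          assert_eq!(send_across_threads(vec![1, 2, 3]), vec![1, 2, 3]);
      }
      ```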

  16. Hi I’ve seen several articles related to memory management on your blog, they are really interesting.

    In your last message you rightly point out that in the case of zero-copy communication, the sender of a message must not access the message after it is sent. If you want to verify that property statically, you will certainly put a lot of constraints either on the kind of programs you can write or on the shape of the messages you can send. Constraints such as unique aliasing of messages (Rust), tree-shaped object-graph messages (Kilim), immutable messages (Erlang) or exchange heaps (Singularity).

    I explored these issues during my PhD thesis and proposed a strategy that enforces isolation (the memory safety property) at runtime, using an actor-based concurrent programming model where you can send arbitrary messages (as long as you own their content) and don’t need unique pointers, memory copies or static checking. (It’s in English.)

  17. To not even see a mention of generational garbage collection seems like a bit of an oversight to me. Java actually has several Garbage Collectors, people just don’t seem to know about them. High-performance, business critical, low-latency applications can be built in Java if you use the CMS or G1 garbage collector over the Parallel or Serial Collectors. The JVM is actually damn good at handling these types of applications, and it’s a shame that nobody seems to notice this.

    • I haven’t seen a single generational collector that has low enough latency to even approach solving this problem. I know GC folks seem to think <100ms is "real time" or whatever, but that's not what people who make interactive applications think. It needs to be <1ms pause times *all the time*, and even G1 doesn't do that.

      Concurrent helps, but has a lot of cost typically (and it still occasionally stops the world). I really think incremental with per-thread heaps is the sweet spot here.

  18. C4 (& Azul’s Vega VM) are regularly used for very high performance applications. HFT firms are some of their major customers!

    I suspect you are letting past biases cloud your judgement of read barriers. Modern-day CPUs spend almost all their time in pipeline stalls and cache misses. Judging by the C4 white paper, the barrier is unlikely to trigger these, so the performance impact will be negligible (you’d need to contact the people at Azul for hard numbers, which I’m pretty sure they have).

    Also, you’ve largely ignored the benefits of object relocation (which RC collectors generally can’t do). Relocation means that dead data never needs to be looked at (which is why mark-and-sweep generally loses to mark-and-copy). It means that data gets compacted, reducing TLB misses. Finally, it means that allocation can be just bumping a pointer.
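    The bump-allocation point is easy to see in a sketch: with a compacted heap, allocating is one pointer increment plus a capacity check (a toy arena for illustration, not a real allocator):

    ```rust
    // Toy bump allocator over a fixed arena: allocation is a single
    // pointer increment plus a bounds check -- the payoff of running
    // a compacting collector.
    struct BumpArena {
        buf: Vec<u8>,
        next: usize,
    }

    impl BumpArena {
        fn new(capacity: usize) -> BumpArena {
            BumpArena { buf: vec![0; capacity], next: 0 }
        }

        // Returns the offset of a fresh `size`-byte block, or None if
        // the arena is exhausted (a real collector would GC here).
        fn alloc(&mut self, size: usize) -> Option<usize> {
            if self.next + size > self.buf.len() {
                return None;
            }
            let offset = self.next;
            self.next += size;
            Some(offset)
        }
    }

    fn main() {
        let mut arena = BumpArena::new(64);
        assert_eq!(arena.alloc(16), Some(0));
        assert_eq!(arena.alloc(16), Some(16));
        assert_eq!(arena.alloc(64), None); // out of space
    }
    ```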
