Robin Hood Hashing should be your default Hash Table implementation


There’s a neat variation on open-addressing-based hash tables called Robin Hood hashing. This technique isn’t very well known, but it makes a huge practical difference: it improves both performance and space utilization compared to other “standard” hash tables (e.g. chaining), and it eliminates the major downside of other open addressing hash tables.

Here are the benefits, summarized:

  • High load factors can be used without seriously affecting performance. 0.9 is perfectly reasonable as a default (0.95 or higher works too; it mainly adds a bit to the insertion cost).
  • No linked lists or other extra pointers. This reduces cache misses and storage overhead. Your underlying structure can be a simple flat array since it’s just open addressing under the hood.
  • Lookup and insertion logic is fast. Again, no linked lists to traverse or other complications, just a linear probe sequence with a few checks per element.
  • Unlike other open addressing schemes, looking for non-existent elements is still fast.

For a simple benchmark where I inserted 100k English words into a map, then deleted 10% of them, and then looked them all up again, the timings for my simple Robin Hood hash table were 23%, 66% and 25% lower for insertions, deletions and lookups respectively, compared to the VS 2012 std::unordered_map – while using 30% less memory overall.

It’s all about the variance

So the basic idea is to take normal open addressing, but use one clever trick to drastically reduce the variance of the expected average and maximum probe lengths. We’ll see later why reducing variance is so important. To give you an idea of how Robin Hood hashing improves things, the probe length variance for an RH table at a load factor of 0.9 is 0.98, whereas for a normal open addressing scheme it’s 16.2. At a load factor of 0.99 it’s 1.87 and 194 respectively.

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way elements that were inserted early and thus “lucked out” on their probe lengths, will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name – the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and having that probe count fitting within one cache line.
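
To put a rough number on that: with 32-bit cached hash values stored contiguously (as in the implementation described below), a typical 64-byte cache line covers 16 consecutive probe positions, so probe lengths of 1 and 3 will usually touch the very same line – only unusually long probe sequences pay for extra fetches.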

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) – it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early-out after 1 probe, but there are also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.

Furthermore, the expected value of the longest probe sequence approaches about 6 for a load of 0.9 (it’s not actually constant, but grows very, very slowly – see this paper), which is pretty reasonable.

What about failed lookups?

So now you’re probably thinking that this sounds good, but in the back of your mind you know that normal open addressing tables do pretty well too, but have a major downside in that searching for non-existing elements is troublesome to handle. To allay your fears I’ll first tell you a very simple (but totally practical) solution that is enabled by the low variance, and then we’ll see the even better version later.

First a recap. The usual problem is this: since the search algorithm terminates when you find an “empty” slot in the underlying array, it can take a very long time to determine that an element doesn’t exist in the array when the table grows full.

How does Robin Hood hashing solve this? Well the simplest solution is to exploit the fact that the expected longest probe count is low (~6). Just modify the standard search algorithm to ignore empty slots (rather than terminate the search) and keep going until you’ve probed longer than the known maximum probe length for the whole table. This maximum probe length will be small, with a very high probability, even for very loaded tables.

This solution would obviously be inefficient for the standard open addressing scheme, since the expected longest probe count is much higher (e.g. at a load of 0.9 and ~64K elems it is about 70).
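
As a rough sketch (this variant isn’t in the post’s code), such a lookup needs nothing beyond the helpers used by the snippets further down, plus a hypothetical table-wide max_probe_length that insertions would keep up to date:

// Sketch only: a lookup that skips over empty slots and gives up once it has
// probed further than any element in the table ever has. 'max_probe_length'
// is a hypothetical member that insert_helper would maintain; the other
// helpers are the same ones used by the snippets below.
int lookup_index_capped(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    for (int dist = 0; dist <= max_probe_length; ++dist)
    {
        // Empty slots (stored hash 0) simply fail this comparison, so we
        // keep probing instead of terminating early.
        if (elem_hash(pos) == hash && buffer[pos].key == key)
            return pos;
        pos = (pos + 1) & mask;
    }
    return -1; // probed longer than the table-wide maximum: not present
}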

The details

In order to know what the probe count of an existing element is (which is key to the algorithm) we could just re-hash the element to re-compute its “desired” slot index and then subtract this from its actual location. A faster way is to simply cache the hash value for each key.

Storing the hash value is generally a sensible strategy anyway, because it means you have a fast early-out for comparisons, so we take that approach here. In other words, the hash table elements consist of a 32 bit hash value, the key, and the value. For my implementation I stored the hash values in a separate array in order to get more hash probes per cache line (at the expense of a second mandatory cache miss to actually fetch the key). This gave a small speed-up for my test where the key size was large (sizeof(std::string)), but there’s a #define in the code to put the hash value alongside its key.

In order to indicate that a slot is unused, we modify the hash function to never return 0, and use a stored hash value of 0 to mean “uninitialized slot”.
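
The snippets below also rely on a handful of small helpers that aren’t spelled out in the post. A plausible sketch of them – assuming the cached hash values live in a separate hashes array, and that the capacity is a power of two with mask == capacity - 1 – might look like this:

// Plausible sketches of the helpers used below (not the post's exact code).
// Assumes 'hashes' is the separate array of cached 32-bit hash values,
// 'capacity' is a power of two, and 'mask' == capacity - 1.

uint32_t hash_key(const Key& key) const
{
    uint32_t h = static_cast<uint32_t>(std::hash<Key>()(key));
    h &= 0x7fffffff;   // clear the top bit; it is reserved as the tombstone flag
    h |= (h == 0);     // never return 0; a stored 0 means "uninitialized slot"
    return h;
}

uint32_t& elem_hash(int ix)           { return hashes[ix]; }         // plus a const overload
static bool is_deleted(uint32_t hash) { return (hash >> 31) != 0; }  // tombstone bit set?

int desired_pos(uint32_t hash) const  { return hash & mask; }

// How far slot 'pos' is from the slot this hash actually wants (with wraparound).
int probe_distance(uint32_t hash, int pos) const
{
    return (pos + capacity - desired_pos(hash)) & mask;
}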

First, let’s look at the insertion algorithm, since this is where the actual Robin Hood trick comes in.

void insert_helper(uint32_t hash, Key&& key, Value&& val)
{
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {			
        if(elem_hash(pos) == 0)
        {			
            construct(pos, hash, move(key), move(val));
            return;
        }

        // If the existing elem has probed less than us, 
        // then swap places with existing
        // elem, and keep going to find another slot for that elem.
        int existing_dist = probe_distance(elem_hash(pos), pos);
        if (existing_dist < dist)
        {	
            if(is_deleted(elem_hash(pos)))
            {
                construct(pos, hash, move(key), move(val));
                return;
            }			
            swap(hash, elem_hash(pos));
            swap(key, buffer[pos].key);
            swap(val, buffer[pos].value);
            dist = existing_dist;
        }

        pos = (pos+1) & mask;
        ++dist;			
    }
}

The algorithm is pretty simple. We loop until we find an uninitialized slot (hash value == 0). If along the way we hit an existing element whose probe distance is less than our current probe count (‘dist’), we swap places with it and continue.
Note: using move semantics here matters (e.g. for efficient swapping).
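
For context, a public insert() wrapper might look something like this (hypothetical – the post doesn’t show one; resize_threshold and grow() are assumed names):

// Hypothetical public wrapper (not in the post's code). It illustrates why
// insert_helper takes rvalue references: the key and value are moved in once
// and then repeatedly swapped along as the element gets "robin-hooded" down
// the table. 'resize_threshold' and 'grow()' are assumed rehashing helpers.
void insert(Key key, Value val)
{
    if (++num_elems > resize_threshold)
        grow();
    insert_helper(hash_key(key), std::move(key), std::move(val));
}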

In order to delete an element, we follow the same strategy as for normal open addressing and mark the slot as a tombstone. Tombstones are treated specially in the insert algorithm. We overwrite the tombstone only when we would’ve wanted to swap anyway (see the check for is_deleted above).

We must mark the tombstone using a dedicated bit rather than reserving another special hash value (like the 0 we use for uninitialized slots), because we still need the tombstone’s hash in order to know its probe count.

bool erase(const Key& key)
{
    const uint32_t hash = hash_key(key);
    const int ix = lookup_index(key);

    if (ix == -1) return false;

    buffer[ix].~elem();
    elem_hash(ix) |= 0x80000000; // mark as deleted
    --num_elems;
    return true;
}

This is all pretty straightforward. We first find the element, then we call its destructor, and finally set the “tombstone” bit.

And lastly, the lookup algorithm:

int lookup_index(const Key& key) const
{
    const uint32_t hash = hash_key(key);
    int pos = desired_pos(hash);
    int dist = 0;
    for(;;)
    {							
        if (elem_hash(pos) == 0) 
            return -1;
        else if (dist > probe_distance(elem_hash(pos), pos)) 
            return -1;
        else if (elem_hash(pos) == hash && buffer[pos].key == key) 
            return pos;				

        pos = (pos+1) & mask;
        ++dist;
    }
}

This just finds the desired position of the element using the hash, then probes linearly from there on in. There are two conditions to signify that the element doesn’t exist, and finally one check to see if we’ve found the element.

The first exit condition checks for a completely uninitialized slot. If the element we are looking for had hit an uninitialized slot during its insertion probe, it would simply have been placed there. So the existence of an uninitialized slot along our probe sequence means the element cannot be in the table.

The second condition is more interesting. This is our replacement for simply checking the probe distance against a table-wide maximum probe count. We know that when two elements contend for a slot during insertion, the one with the longer probe count gets to keep it. So if we’re looking for an element that exists, we should never encounter an existing element with a shorter probe count than our current count (if that had happened, there would’ve been a swap during insertion!).

In other words, we early out as soon as our current probe count exceeds the probe count of the stored element. Since the average probe count for stored elements is 2.55, we can exit pretty quickly for non-existent elements (much earlier than stopping after a table-wide maximum probe count).

This second condition is why we need to maintain the hash value even for tombstones. Imagine what would happen if we simply marked the slot as uninitialized when it was deleted – the next insertion that came across it would simply occupy it, thereby getting an unfairly low probe count and, more importantly, breaking the second condition above. Instead, we flag deleted elements as tombstones, and only reuse the slot in the insertion algorithm if a swap would’ve happened anyway.

Finally, the last condition simply compares the hashes and the keys and returns the found element.

Here’s the full code for this blog post: code. I should note that I wrote this specific implementation for this blog post, so it hasn’t been extensively used, tested or optimized. It’s quite likely that there are bugs.

27 thoughts on “Robin Hood Hashing should be your default Hash Table implementation”

  1. Hi,

    I haven’t encountered Robin Hood hashtables before, so I found this post intriguing. It led me to write my own implementation to make sure I understood the principles correctly. But while doing that I found a problem which I can also reproduce with your C++ implementation.

    The problem concerns the use of tombstones, and is triggered when entries are repeatedly erased and inserted. In erase(), an entry is marked as a tombstone, and retains its probe-count. In insert_helper(), new insertions only replace tombstones with lower probe-counts. So tombstones can force insertions to have higher probe-counts than they would if they could replace any tombstone. Thus repeated cycles of inserts and erases can gradually fill the table with high probe-count entries.

    I added a test program to your github gist to demonstrate the problem: See https://gist.github.com/dpw/6121563#file-test-cpp . It repeatedly fills a table with 100 entries, and then deletes them all, printing the result of average_probe_count() as it goes. It shows that the average probe count steadily rises. After 10000 inserts/erases, I see that the average probe count has risen to about 93!

    I think this problem can be addressed by using the technique you mention of maintaining the maximum probe count, as that would allow an insertion to replace any tombstone. But I wonder if there are other fixes without the downsides of the maximum probe count approach.

    • Hmm, Thanks! Good catch. Maybe I’ve misunderstood the paper here – I’ll take a look when I get a chance, but I’m pretty sure the idea was to just treat them like any other element to maintain variance etc…

      The good news is that the maximum probe count should be small so it’s a perfectly viable option to fix by reverting to that simpler strategy (you only really pay the cost for non-existent lookups). It still kind of bugs me though!

      EDIT: you could probably fix it by just getting rid of the tombstone bit and making the slot empty, then treating this new empty slot as a probe count of zero and moving the ‘hole’ up until it hits another slot with probe count zero (an empty slot, or an element that lucked out on its insertion and hasn’t been moved yet). This would maintain the probe count invariant required for lookups, but would require a bit of a scan on each delete.
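
      A rough sketch of that backward-shifting erase (hypothetical code, reusing the helpers from the post, and taking the element’s index directly) could look like:

      // Hypothetical tombstone-free erase for the EDIT above: shift the "hole"
      // forward until it meets an empty slot or an element that already sits
      // in its desired position (probe distance 0).
      void erase_backward_shift(int ix)
      {
          buffer[ix].~elem();
          for (;;)
          {
              const int next = (ix + 1) & mask;
              const uint32_t next_hash = elem_hash(next);
              if (next_hash == 0 || probe_distance(next_hash, next) == 0)
              {
                  elem_hash(ix) = 0;   // the hole becomes a genuinely empty slot
                  --num_elems;
                  return;
              }
              // Move the next element one step back towards its desired slot.
              construct(ix, next_hash, std::move(buffer[next].key), std::move(buffer[next].value));
              buffer[next].~elem();
              ix = next;
          }
      }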

      EDIT2: I believe the problem is line 147 in insert_helper where we swap places with the existing element if the probe distance is *less* than the current distance. This will cause the probe distance to grow unbounded. If we change it to <= instead, it will replace deleted slots without having to "out probe" them first. I believe this would maintain the maximum probe distance (it would never reduce it though; for that you'd have to shuffle elements down every time you deleted something, I think). The downside is that it will do a few extra (but harmless) swaps with non-tombstoned slots – if the extra swaps end up being a performance issue, you could try only modifying the comparison for deleted slots to avoid it (you could even do this with no branches – shift the tombstone bit down to get a number that's either 0 or 1, then subtract this from the existing element's probe distance… make sure you have a signed int here… this will in effect turn the subsequent < check into a <= for deleted slots).

      • Taking EDIT2 first: Changing the inequality on line 147 reduces the severity of the problem, but does not eliminate it. My test program still gets an average probe count of about 30, which seems excessive.

        EDIT: This is an interesting idea, and seems to work. But I guess some research would be needed to understand how long that scan is.

        Finally, I bit the bullet and started reading Celis’ thesis myself. When he discusses deletions on page 55, he notes the growth in the average probe count (PSL in his terminology), but states that the variance remains low. My experiments agree with this. I was at first surprised that he does not seem to consider the growth in the average probe count to be a problem, but as I read more of the thesis it started to make sense: In Chapter 4 he proposes mean-centered searching, such as the “smart searching” heuristic on page 38, which starts the search around the entry corresponding to the average probe count. That’s a neat trick, but I don’t get how smart searching can be efficient for unsuccessful searches, though he clearly asserts it is so in the table at the top of page 64. I guess I need to read it more thoroughly, but that’s enough for today.

        If it does work, it’s quite magical! Though I can see why people may have been put off by its subtlety.

      • Of course, you could batch up the first idea rather than scanning on every delete. If you do it every N deletions it’s still O(1) amortized.

        So, do a single scan through the array and compact it – i.e. move elements down by swapping with the tombstone if their probe distance is > 0. Convert tombstone to an empty slot when it can’t be swapped (because it’s next to an element that has a 0 probe distance). You could maybe trigger this more cleverly (e.g. track the sum of probe distances for deleted items and detect a pathological case that way).

        This is probably not equivalent to rehashing, not quite sure, so maybe you want to just do that instead. It seems kind of “morally” fine to me to do a rehash every N deletions, just like you do a rehash every N*max_load_factor inserts. 🙂

        BTW, I believe I read another paper from a few years ago that showed that mean centered searching wasn’t faster, but can’t find it right now.

  2. I have implemented Robin Hood hashing and I have run into the same problems you guys are describing: the mean PSL is increasing. I have experimented with test cases that have different patterns of deletions and insertions, and for some of these test cases, the variance is also increasing. Overall, it appears that Robin Hood hashing is doing worse than basic linear probing. I have blogged about it here: http://codecapsule.com/2013/11/11/robin-hood-hashing/

    From those results, I think that Robin Hood hashing is not viable for any serious project. But maybe I missed something or maybe I’m not looking at it the right way… Anyway, whenever you have a moment, I’d be really happy if you could take a look at the results I’ve gotten and maybe drop a comment. I’d be very interested in hearing what you think about it.

    Cheers!

    • Did you see my followup post?

      We already re-hash the table when we run up against the max load factor, so doing it after a set number of deletions (since the last re-hash) seems like a perfectly reasonable practical solution.

      There may also be an incremental solution (see the post). It keeps the mean PSL down, at least, though I didn’t plot the variance.

      EDIT: Oh, I just remembered that I actually did plot variance, and the incremental fixup does keep it down too.

  3. The follow-up post is interesting, although in Celis’s thesis, the mean and variance remain low after numerous iterations of repeated insertions and deletions, and this without any rebuilding.

    In the plots you’ve mentioned, have you been able to reproduce analytically the same bounds for the mean and variance as the theoretical bounds presented in Celis’s thesis?

    • It’s been a while since I read it, but I thought it mentioned that the mean PSL increased after deletions? Anyway, it takes an absurd number of deletions before it becomes a real issue (assuming you tweak the condition used when trading places, as mentioned in the second post).

      Also, if you compare with a linear probing table you should pick a higher load factor, like 90%. That’s when you run into big issues with absurd probe lengths (but RH hashing stays reasonable).

      I’ve done no work to analyze the analytic bounds, but experimentally I couldn’t get it into a pathological state after the incremental fixup on deletes was implemented.

  4. How quick is it relative to std::unordered_map with a pool allocator like boost::pool_allocator? Perhaps the std table is slower because it is calling malloc/free on each insert/delete (assuming it is using chaining, I don’t have a windows machine to see how it works).

    • Well the allocator only really comes into play on insertions. It seems unlikely that it would help for lookups (cache lines are only so large and in general insertions won’t happen in “collision-order” so each cell in a linked list bucket will mostly be a different cache line). You could probably improve things by storing 2-3 items per cell, which would waste less cache line space per linked list cell.

      Either way, I’m not sure why I’d bother. I think it would be hard to beat a single array packed to 95% for simplicity and compactness, and given that this is feasible with RH hashing I’ll stick to that.

  5. The allocator comes into play for inserting and erasing.

    Saying ‘why bother?’ is a cop-out 🙂 If you are going to claim that people should use robin-hood hashing because it is faster and uses less space than std::unordered_map, then at least give unordered_map a fair assessment, i.e. with a decent pool allocator (boost::pool_allocator with a null mutex will do) and the same hash function as whatever your RHR uses (although I see you are using std::hash which is not so great, you will be able to get a considerable improvement if you use a better hash function, in particular for integers with Thomas Wang’s functions). The pool should give unordered_map a big improvement in speed and memory usage. Note that I’m not trying to rain on your parade here; I’m just trying to point out that you might be surprised how well unordered_map can perform if used with a pool and good hash function.

    • It is a fair comparison. It’s using the same hash with the same allocator. You’re suggesting that I skew the comparison in favor of std::unordered_map. Feel free to try that if you want (you have the code), but at best you’ll improve insertions, and only at the cost of increased external fragmentation (i.e. memory use).

      • No, as I already said it will improve both inserting and erasing for a table using chaining. free() is not cheap :-). Erasing will actually see the biggest improvement. And yes I have timed this, the unordered_map in libc++ (clang) sees a big improvement in erase times.

        Using a pool might use more memory but it will not cause external fragmentation. A pool can reduce memory usage if malloc suffers internal fragmentation (this is what I had in mind in my earlier comment) with the chunk size being used, which will be a very real possibility in the case of a hash table.

        I am suggesting the test be skewed in favour of unordered_map, but what is wrong with that? If I was going to use unordered_map and speed was important I’d use a pool, so to me comparing the speed of another table to unordered_map without a pool just doesn’t seem fair to me, fair enough if you disagree though.

      • Feel free to post benchmarks. I doubt it’ll get faster than not doing any allocation at all (even a pool will see a cache miss per allocation, in real-world scenarios where the pool is cold in the cache upon insertion). Having to allocate memory is unlikely to be faster than not having to allocate memory. You’re suggesting approaches to making std::unordered_map “less slower”, but why bother spending all that extra effort and memory when there’s a faster and more compact alternative?

        Using a pool will absolutely add additional external fragmentation. Every time you introduce another allocator you increase external fragmentation (because allocator A can’t in general reuse free memory from allocator B).

        Feel free to bias the benchmark to make unordered_map look better if you like. I don’t doubt you can make it *better* (by increasing memory usage… it’s already a lot more bloated, and with a separate allocator on top you’re not helping things), but I do seriously doubt it’ll beat the RH-table (especially if you clear the cache between insertions to make sure you get a cache miss per allocation like you would in most real world scenarios).

  6. Also unordered_map probably isn’t storing the hash value in each node and comparing it before the key, so that would be another thing to control for.

      • Because it’s something that could confuse the results of a benchmark. I’m looking at this from the perspective of wanting to objectively compare only the method of managing the hash table, so I would remove the influence of any ‘tricks’ like this from the experiment.

        On ‘why bother spending all that extra effort and memory when there’s a faster and more compact alternative’. Well, that is actually my point, why bother with the effort of writing a RH hash table if unordered_map can be so trivially improved? (plugging in an allocator is hardly extra effort).

        On external fragmentation I think that is a slightly unrealistic criticism (unless you are on an embedded system I suppose). I see what you are saying now but the pool would have to be huge before that is a problem, and I still would not really describe that as external fragmentation.

        I’d be happy to do some benchmarks, do you have a more complete implementation of RHR? My benchmark requires erase(const_iterator) and insert() with a key that already exists in the table. Also I will state now that I cannot share my benchmark code, as it is based on a real world application.

      • Ok, first of all we’re talking about what should be your *default* hash table. If anyone creates a separate allocator per hash table usage they should lose write-access to the source repository, because that’s a terrible idea! So for comparing which is the better *default* hash table you absolutely should *not* involve custom allocators or any other tricks which only make sense if they’re used sparingly.

        For external fragmentation, it’s actually the other way around. It’s when you have tons of small pools that the unused memory for each pool becomes a major issue (because they’re a larger percentage of the overall memory). If you only do this for one or two large hash tables that are used a lot, then the extra few pages of memory per pool are unlikely to be a big deal.

        I maintain that the current setup is fair and objective, and reject all your claims to the contrary. What you’re doing is trying to figure out if there are ways of making std::unordered_map faster if you’re already stuck with it, which is fine and useful, but you’re *not* making the comparison more fair (quite the opposite), nor are you answering the question of which should be your *default*.

  7. Well, if someone replaced the well tested, bug-free std::unordered_map implementation in a project I was in charge of with a RHR implementation that offered no or little speed improvement over unordered_map with a trivial template parameter change to use a pool for the cases where you have a table on the hot path, then I’d probably revoke his commit privileges too 😉 (and I never said using a pool should be the default).

    External fragmentation refers to having lots of small chunks that together add up to a lot of memory, but are individually too small to satisfy incoming requests, i.e. something which occurs in a general purpose allocator that services arbitrary chunk sizes. Having a bunch of fixed size pools that can’t share each others resources is similar, but not the same and IMHO is a misuse of the term.

    I’m not `trying to figure out ways of making unordered_map faster if I’m stuck with it’. I posted my original comment because I was interested in seeing if RHR was faster even if the benchmark controlled for unordered_map’s inefficient calling of malloc/free. It was a simple suggestion and I don’t see why you are being so obstinate about something so trivial. If you don’t want to benchmark it or provide benchmarkable code (the gist code blows up after a certain number of insert/erases), then fine, I’m done.

    • Let’s not lose sight of reality here, okay? So remember that the main reason for using a hash table is typically that you care about lookup speed, and your suggestions are not doing anything about that.

      Also, if your argument is about code inertia and not performance then say so. I can fully understand that lots of code bases can’t just switch to a new hash table out of inertia and risk aversion.

      You’re suggesting something that’s a terrible idea to apply widely, and the best you can hope for is better insertions and deletions (and even then it’s pretty unlikely to beat RH since it’ll still have more cache misses which are *very* expensive), but lookups will still be significantly slower which tends to be a pretty important property for hash tables.

      If you feel that it’s somehow important to spend time benchmarking this tiny facet of std::unordered_map then feel free to do so, but don’t insist that my comparison isn’t fair unless I compare it in highly specialized scenarios with custom allocators in order to make things look less bad for std::unordered_map. How about *you* spend *your* free time teasing out the details of some irrelevant scenario instead of rudely demanding that *I* do it?

      External fragmentation refers to fragmentation that’s external to the allocated blocks, that’s all it means. It has nothing to do with the size of the free blocks, it simply means that there’s free memory that isn’t available for use and that memory is external to allocated blocks. In this case it’s because you’d end up with a bunch of different allocators, each of which has free memory, but most of it isn’t available. You end up with “pockets” of free memory scattered throughout your heap (one for each allocator) instead of being able to coalesce it as you go (which is often possible if you have one allocator).

      I’m really not sure what you’re hoping to accomplish here. My tests show that RH maps have *substantial* improvement across the board (speed in all three major operations, as well as memory usage). You suggested (but didn’t demonstrate) that you could narrow that gap for insertions and deletions in *some* circumstances. By how much? You don’t know, but that doesn’t stop you from calling my comparisons unfair.

      Also, a meta point: if you want to be a jackass, do it somewhere else. This is my blog not a public forum, and I’ve tolerated quite a lot of rudeness from you already. I’m under no obligation to put up with you implying intellectual dishonesty on my part or calling me obstinate. If you want to discuss things here please be polite. This is your final warning.

  8. Ok, calling me rude and a jackass is below the belt; My initial comments were perfectly polite, it was *you* who began responding rudely and I returned the bad behaviour. If you don’t like people being rude to *you*, then don’t be rude to *them*. You need to stop being so defensive about trivial things. I would actually be *happy* if RH was quicker than a pool and commented because I figured I could get a quick answer on that, seeing as you already had a benchmark set up.

    • Come on now, that’s just silly revisionism (bad idea when the comments are still around for anyone to see…).

      I responded to your first post (which was indeed polite) by explaining why your suggestion would only have limited effectiveness and why it didn’t seem like a fruitful area of investigation. You responded by calling it a cop-out and accusing my comparison of being unfair. That’s your *second* post where you couldn’t stay polite and constructive. Despite that I kept answering your questions with more detail on why it wasn’t worth the time to investigate your pet-theory, and you ended up calling me obstinate as thanks for my continued engagement.

      As a clue as to which one of us has a problem here, maybe see the discussions that happened above (with two separate people who managed to stay polite), the first of which led to a productive discussion and a followup post with an improvement to the deletion algorithm.

      And while I hate to reward bad behavior I did do a quick benchmark just in case someone else reads this. In no scenario do deletions or lookups get even close to the RH table. The best scenario for unordered_map is if you all but eliminate the allocation overhead for unordered_map (by making sure the allocations are sequential in memory), where it beats no-tricks RH by 8% on insertions only (I did not bother changing the RH table to a custom allocator). However, as predicted, the moment you actually do any deletions and try to reuse old blocks from the pool the advantage disappears pretty much immediately (due to the extra cache misses, since free blocks are now scattered around in different cache lines) – in fact it approaches the non-pool performance for insertions at an alarming rate (indicating that cache misses dominate the extra logic of the more sophisticated allocator; deletions stay faster with the pool, though).

  9. If unordered_map can be easily improved simply by adding a new general-purpose allocator, then it’s reasonable to add that as a “third option” in the benchmark. But only if that allocator exists and is available for download; from the perspective of a typical C++ coder that has never written an STL allocator, writing a faster allocator isn’t better than just downloading this RH hashtable and using that. And if the allocator isn’t general-purpose, I would judge the RH hashtable, which IS general-purpose, to be superior.

    Both of your definitions of external fragmentation sound right to me. But further up the thread I have the impression Sebastian thinks that unused memory in a hashtable counts as external fragmentation, which is not correct (“Every time you introduce another allocator you increase external fragmentation (because allocator A can’t in general reuse free memory from allocator B)” – er, I think unused memory owned by allocator B should count as internal fragmentation.)

    I don’t think that Derp was rude, but I do think he was unreasonable in not accepting the cached hashcode and every bit as “obstinate” as Sebastian. RH uses less memory even though it caches the hashcode–what is there to complain about? And hashtables that cache the hashcode are not uncommon or unreasonable (e.g. .NET’s standard Dictionary does so).

    Plus, hashcode caching can be optional — I would suggest turning it off with a template parameter. Then you could automatically turn off caching for integers or pointers which can be hashed quickly (is there even a need to hash integers in the first place?). Once upon a time I made a hashtable declared like this:

    // a set of T; hash_map is derived from hash_set<std::pair, …>
    template<class T, class Traits = hash_traits<T>,
             bool CompactMode = boost::mpl::or_<boost::is_pointer<T>,
                 boost::mpl::bool_<Math::num_traits<T>::is_numeric> >::value>
    class hash_set : protected Traits {

    • Any unused memory that can’t be used for new allocations is fragmentation. Internal fragmentation is unused memory *inside* an existing allocation (e.g. due to allocators with restricted block sizes, like it may round up to nearest power of two or something), and external fragmentation is everything else. An allocator will not allocate memory from the underlying OS one element at a time (or else, what’s the point?), they allocate big blocks at a time. Thus each allocator has a bunch of free memory just sitting there unused until an actual allocation comes in to claim it. This would be fine if there was only one allocator, but if you have one allocator per collection (or at least per key-value-pair size) you could end up with dozens or hundreds of allocators, each of which having a bit of unused memory that you can’t actually access if you try to just allocate a plain old object. Thus, looking at the whole heap, there are all these multi-KB pockets of unused (and unallocated) space that you’re not actually allowed to touch to service an arbitrary memory allocation, which is pretty much the definition of external fragmentation.

      This isn’t academic btw, I’ve seen a bunch of programs that accumulate allocators because everyone thinks *their* subsystem is a special snowflake that needs to have its own allocator because it can allocate more compactly/efficiently, but they neglect to count the free memory stored in their allocator as fragmentation (since it’s unavailable to service arbitrary allocations) when viewed from a whole-heap perspective. So you end up with big bloated applications where 40% of the memory is actually sitting unused in various allocators.

      You can of course play semantic tricks. E.g. allocate the maximum number of linked list nodes that you will ever need up front (even though this can depend on the number of collisions etc., if your table stores the first element directly in the main array). That way it’s no longer external fragmentation (because the memory *is* used by the application code), but you’ve just renamed it hash table overhead, not actually gained anything.

      It’s potentially possible to skip the cached hash values, but it wouldn’t be easy. You need a bit over one bit per element to tag deleted items as well as unused items (which you could do with 2 bits per elem in a bit field), but more importantly you need to know the probe distance for deleted items to handle insertions correctly. If you don’t store the hash value, you’d have to keep the key around for re-hashing so you can compute the probe distance for a deleted item. For lots of deleted items (and complicated keys, e.g. strings which have their own allocations) this would add its own memory overhead. Thus, I do think that storing the cached hash values is the best option. It behaves consistently without surprises, and helps cut down on the number of hashes too.

  10. I notice now that at the very beginning Derp mentioned boost::pool_allocator. But that isn’t something you can just plug in, it’s a wrapper for a user-defined allocator, so plugging it in is not “trivial” as you still have to find/write a memory manager.

  11. Did something very similar years ago for a ticker-plant application. Basically, any time we retrieved an entry from overflow chain, we swapped it with the head — the idea being that entries that are accessed frequently would “bubble up”.

    Not sure I would do it quite the same way today (what with NUMA etc.) but at the time this was a simple and effective technique.

    We called it a “non-deterministic hash” as a tribute to our regular lunch spot 🙂

