Speed improvements using hash tables

I wrote the Forth Lisp Python Continuum (Flpc)'s self-hosted compiler in stages. When I completed the parser and gave it larger and larger pieces of its own source code, it ran too slowly. I tried many things to speed it up; one that helped was using hash tables.

They helped speed up dictionaries [names]. A dictionary supports two operations: set(key, value), which associates value with key, and get(key), which returns the value last associated with key.

An example of a dictionary:

{"key1": 3,
 "key2": "foo",
 "key3": []}

Runnable sample code from this post is in the more familiar Python instead of Flpc. In Python, the two operations above are usually written __setitem__(self, key, value) and __getitem__(self, key). However, we will avoid using too many language features so the resulting prototype can be translated more directly. In particular, we'll pretend Python lists are fixed-length and don't have conveniences like index and enumerate already implemented.
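
For comparison, here's what those two operations look like with Python's built-in dictionary (just to pin down the interface we're about to reimplement):

d = {"key1": 3, "key2": "foo", "key3": []}
d["key2"] = "bar"   # sugar for d.__setitem__("key2", "bar"); our set
print(d["key2"])    # sugar for d.__getitem__("key2"); our get. Prints: bar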

Naive dictionary implementation

Before hash tables, dictionaries in Flpc were implemented as two arrays keys and values so that keys[i] is associated with values[i] for each index i (i.e., get(keys[i]) returns values[i]).

get naively performed a linear search through all of keys.

def array_index(array, key):
    # Linear search through the whole array.
    for i in range(len(array)):
        if array[i] == key:
            return i
    raise ValueError(f"Value {key} not found.")

def get(self, key):
    return self.values[array_index(self.keys, key)]

and set would check if the key is already present, overwriting its value if it is, and otherwise writing to the first Empty slot.

def set(self, key, value):
    try:
        i = array_index(self.keys, key)
    except ValueError:
        i = array_index(self.keys, Empty)
    self.keys[i] = key
    self.values[i] = value

where Empty is the initial value stored in every slot of keys, representing the absence of a key at that index [resizable]. Get the source code here.
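
Assembled into one runnable file, the naive version looks something like this (the class name NaiveDict and the fixed capacity of 16 are illustrative choices, not taken from the Flpc source):

Empty = object()  # sentinel marking an unused slot

def array_index(array, key):
    for i in range(len(array)):
        if array[i] == key:
            return i
    raise ValueError(f"Value {key} not found.")

class NaiveDict:
    def __init__(self, capacity=16):
        # Pretend these are fixed-length arrays.
        self.keys = [Empty] * capacity
        self.values = [Empty] * capacity

    def get(self, key):
        return self.values[array_index(self.keys, key)]

    def set(self, key, value):
        try:
            i = array_index(self.keys, key)
        except ValueError:
            i = array_index(self.keys, Empty)
        self.keys[i] = key
        self.values[i] = value

d = NaiveDict()
d.set("key1", 3)
d.set("key1", 4)      # overwrites the existing slot for "key1"
print(d.get("key1"))  # 4

Both operations scan linearly, so each call costs time proportional to the number of keys stored.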

Hash tables

Now let's implement hash tables. The specific variant we implement is "hash tables with open addressing and linear probing".

When looking for the index in the keys array, instead of searching from the beginning of the array, we'll start at some position inside it and wrap around when we reach the end.

def array_index(array, key, i):
    # Probe forward from position i, wrapping around at the end.
    # Assumes the array never gets completely full; otherwise a search
    # for a missing key would loop forever.
    while True:
        if array[i] == key:
            return i
        elif array[i] == Empty:
            raise ValueError(f"Value {key} is not in the dictionary")
        i = (i + 1) % len(array)

We'll make both get and set start their search at a location that depends only on the key.

def get(self, key):
    return self.values[array_index(self.keys, key, start(key) % len(self.keys))]

def set(self, key, value):
    try:
        i = array_index(self.keys, key, start(key) % len(self.keys))
    except ValueError:
        i = array_index(self.keys, Empty, start(key) % len(self.keys))
    self.keys[i] = key
    self.values[i] = value

That's it! We've implemented hash tables. Get the source code here.

But we still need to define start (source here), understand why this works, and see why it's fast.

Basic idea

Hash tables make a time-space trade-off. We'll pick a large keys (and thus values) array, much larger than the number of keys we'll actually store in them. In the ideal situation, start spreads the keys around this array enough that every key we actually store in the hash table begins its search at a different index. That is, all the values start(key) % len(self.keys) are distinct.

Then the search always succeeds in one step and returns start(key) % len(self.keys) as the index. This is because keys[start(key) % len(self.keys)] is either Empty (before the first time key is set to a value) or equal to key (afterwards).

One step is the best we could do. At the cost of more memory, we've made get and set run at the same speed as array indexing even though the keys we're setting aren't contiguous.
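
For instance, here's the ideal case as a toy (the start function is hard-coded to send our three keys to distinct slots; real code computes it, as we'll see below):

Empty = object()
keys = [Empty] * 8
values = [Empty] * 8
start = {"key1": 0, "key2": 3, "key3": 6}.get  # all starting slots distinct

def set_(key, value):
    i = start(key) % len(keys)  # slot i is Empty or already holds key
    keys[i] = key
    values[i] = value

def get_(key):
    return values[start(key) % len(keys)]  # one step, like array indexing

set_("key2", "foo")
print(get_("key2"))  # foo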

I won't say too much about how we pick start but the idea is to choose something that spreads out the starting positions as much as possible.

Why it works

In a less ideal world, we can get start(key1) % len(keys) == start(key2) % len(keys). Call this index index. In that case, the first of the two keys written to keys will be at index and the other at index + 1 [cascade]. The further away a key ends up from its starting index, the more time it will take to search for it.

When we call set(our_key, our_value), we find the first Empty spot

      i |  index index+1 index+2 index+3
keys[i] |   key1    key2   Empty     ???

and write our key

      i |  index index+1 index+2 index+3
keys[i] |   key1    key2 our_key     ???

A slot of keys is never rewritten once it is filled, so on any later search for our_key, we'll end up at the same index (index + 2 in the example) again.
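
To watch the cascade happen, here's a small self-contained simulation (illustrative only) where start sends every key to the same index:

Empty = object()
keys = [Empty] * 8
start = lambda key: 3  # pathological: every key starts at index 3

def probe_insert(key):
    i = start(key) % len(keys)
    while keys[i] is not Empty and keys[i] != key:
        i = (i + 1) % len(keys)  # move forward, wrapping around
    keys[i] = key
    return i

print(probe_insert("key1"))     # 3
print(probe_insert("key2"))     # 4: index 3 was taken
print(probe_insert("our_key"))  # 5: the first Empty slot after 3 and 4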

How to make it fast

We'll pick start to be a hash function: something that maps strings to an index. There's a lot to say about hash functions that we won't say here; instead we'll just pick a known good function that's easy to implement.

def djb2_hash(key):
    h = 5381
    for c in key:
        h = ((h << 5) + h + ord(c)) & 0xffffffff  # h = h*33 + c, kept to 32 bits
    return h

start = djb2_hash

There's quite a rabbit hole to go down about this particular function djb2 (but I won't do that here).
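
Putting all the pieces of this section together into one runnable file (the class name HashTable and the capacity of 16 are my illustrative choices):

Empty = object()

def djb2_hash(key):
    h = 5381
    for c in key:
        h = ((h << 5) + h + ord(c)) & 0xffffffff
    return h

start = djb2_hash

def array_index(array, key, i):
    while True:
        if array[i] == key:
            return i
        elif array[i] == Empty:
            raise ValueError(f"Value {key} is not in the dictionary")
        i = (i + 1) % len(array)

class HashTable:
    def __init__(self, capacity=16):
        self.keys = [Empty] * capacity
        self.values = [Empty] * capacity

    def get(self, key):
        return self.values[array_index(self.keys, key, start(key) % len(self.keys))]

    def set(self, key, value):
        try:
            i = array_index(self.keys, key, start(key) % len(self.keys))
        except ValueError:
            i = array_index(self.keys, Empty, start(key) % len(self.keys))
        self.keys[i] = key
        self.values[i] = value

d = HashTable()
d.set("key1", 3)
d.set("key2", "foo")
print(d.get("key1"), d.get("key2"))  # 3 foo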

Since the goal of the hash function is to distribute starting points evenly in our array, let's see that this works (or at least has a good chance of working) if our hash function chooses starting positions completely randomly.

What follows is a slightly wrong analysis to get an idea of how much time it will take. Or you can instead read a correct one [linear_probing].

So suppose our hash table has s slots, there are already k keys in it, and we try to insert a new (k+1)st key.

Since the (k+1)st key has a random starting position, the probability of picking one of the k slots with a key already in it is k/s. If we're unlucky and collide with a key, then the probability that we hit another key when moving forward is (k-1)/(s-1). This is because there are s-1 slots excluding our starting slot and k-1 keys in those slots.

This is actually not true because, while there are k-1 keys amongst s-1 slots, they're not all equally likely to be filled! It would be true if we looked at a random slot start(key, i) instead of start(key) + i for the ith slot to examine.

With that aside aside, let's continue. If we're again unlucky and see a filled slot, then the probability that we hit another key when moving forward is (k-2)/(s-2). And we can keep going.

(k-i)/(s-i) is at most k/s, so the probability that we see at least t filled slots in our search is at most (k/s)^t. The expected number of filled slots we see (before eventually finding an empty slot) is the sum of these values [sum]. That's at most the infinite sum, a geometric series with value (k/s)/(1 - k/s) = 1/(1 - k/s) - 1 [geometric], or 1/(1 - k/s) total slots examined if we now count the final empty slot in which the key is inserted.

That means if we have twice as many slots as keys, we expect to look at 1/(1-1/2) = 2 slots on average!

The actual expected number of slots examined for linear probing is 1/(1 - k/s)^2 [linear_probing], which for a half-full table means we'll look at 4 slots. That's still not that bad, or we could just use more slots.
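
As a sanity check, here's a small simulation (my addition, not from the Flpc source) that builds half-full tables with random starting positions and measures how many slots one more insertion examines; the average lands between the simplified estimate of 2 and the bound of 4 above:

import random

def average_probes(s=512, k=256, trials=1000):
    total = 0
    for _ in range(trials):
        # Fill k of the s slots using random starts and linear probing.
        occupied = [False] * s
        for _ in range(k):
            i = random.randrange(s)
            while occupied[i]:
                i = (i + 1) % s
            occupied[i] = True
        # Count slots examined when inserting one more key.
        i = random.randrange(s)
        probes = 1
        while occupied[i]:
            probes += 1
            i = (i + 1) % s
        total += probes
    return total / trials

print(average_probes())  # typically around 2.5 for k/s = 1/2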

Deletion

You'll notice that key deletion is conspicuously missing from this entire post, which keeps things simpler. What would happen if we just set a key's slot back to Empty to delete it?

def delete(self, key):  # "del" is a reserved word in Python, so we rename
    i = array_index(self.keys, key, start(key) % len(self.keys))
    self.keys[i] = Empty
    self.values[i] = Empty

This would break the property we relied on earlier: a slot would be rewritten after being filled. Any key whose search had probed past that slot would now stop at the new Empty slot and wrongly appear to be missing.
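
Here's a self-contained demonstration of why that breaks (again using a pathological start so that two keys collide; all names are illustrative):

Empty = object()
keys = [Empty] * 8
values = [Empty] * 8
start = lambda key: 3  # force "key1" and "key2" to the same starting slot

def array_index(array, key, i):
    while True:
        if array[i] == key:
            return i
        elif array[i] == Empty:
            raise ValueError(f"Value {key} is not in the dictionary")
        i = (i + 1) % len(array)

def set_(key, value):
    try:
        i = array_index(keys, key, start(key) % len(keys))
    except ValueError:
        i = array_index(keys, Empty, start(key) % len(keys))
    keys[i] = key
    values[i] = value

set_("key1", 1)   # lands at index 3
set_("key2", 2)   # collides, lands at index 4
keys[3] = Empty   # naively delete "key1"
values[3] = Empty
try:
    array_index(keys, "key2", start("key2") % len(keys))
except ValueError:
    print('"key2" can no longer be found!')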

Flpc

Being a bootstrapping-centric implementation, Flpc faces another chicken or egg problem for dictionaries: the implementation shown in this post needs objects and two arrays. But objects need attribute lookup, which is implemented using a dictionary. And really, we need a dictionary from function names to either the memory location where the function is stored or a primitive for the bytecode interpreter.

Even the naive implementation needs arrays of some sort! To resolve this, the function names dictionary is hard-coded into a function's body.

def names_get(key):
    if key == "string1":
        return value1
    if key == "string2":
        return value2
    if key == "string3":
        return value3
    ...

This does not invalidate what we said before: indeed, we needed an array of some sort. It just happens to be the same arrays we use for functions. Flpc functions are represented in memory as arrays of integers, each encoding a memory location, a primitive, or a (reference to a) constant. (See the Forth tutorial for an in-depth description of the model. Other details vary slightly.) But they do not offer individual getter and setter methods. Instead, memory.get and memory.set let us read and write absolute positions in memory (so offsets have to be calculated by hand).

We still need a setter for function names lookup. When compiled, the in-memory representation of the getter names.get looks something like

pick1 push: "string1" == push: value1 return_if
pick1 push: "string2" == push: value2 return_if
pick1 push: "string3" == push: value3 return_if
...
end_of_function

Above we're showing what the integer values in memory represent rather than the integers themselves. return_if takes a boolean and a value, and returns the value if the boolean is true (and discards both if the boolean is false).

So to set a new value, we can append a few more entries at the end of names.get's body, such as pick1 push: "string4" == push: value4 return_if and move the call to end_of_function afterwards (see functions.add in boot.flpc for the actual implementation).

This gets us the naive dictionary for function names. To mimic the hashtable implemented earlier in this post, we need to be able to start at an if-statement in the middle of the function body and we have to wrap around at the end.

In Flpc's bootstrapping sequence, we'll instead use the naive dictionaries to implement everything up to objects and hashtable objects, and then rebind names.get to the body of a new function hashtable.get(key names) that uses these hashtables. This resolves the chicken or egg problem: the naive dictionary-as-a-bunch-of-if-statements came first. Unfortunately, this means attribute lookups still use naive dictionaries of some sort. They look something like this.

hashtable.attrib <- fun[name receiver searcher]:
    return_if(string_equal(name "get")      hashtable.get)
    return_if(string_equal(name "set")      hashtable.set)
    return_if(string_equal(name "instance") hashtable.instance)
    return_if(string_equal(name "print")    hashtable.print)
    return_if(string_equal(name "len")      memory.get(receiver))
    return_if(string_equal(name "keys")     memory.get(receiver + 1))
    return_if(string_equal(name "values")   memory.get(receiver + 2))
    return_if(string_equal(name "type")     "hashtable")
    return(instance_attrib(name receiver searcher))

There's still one remaining challenge: we have to insert the existing key-value pairs into the new hash table names. Fortunately, each if-statement occupies a fixed length of 7 cells. So we can iterate through this part of memory and pull out the (already compiled!) keys and values that sit at fixed offsets. This all happens in stage1b2.flpc.

convert_names <- fun[]:
  end = functions.end()
  index = names.get + 5
  cond = not(index > end)
  repeat_if:
    drop1(`cond)
    names . set(memory.get(index) memory.get(index + 3))
    index = `index + 7
    cond = not(index > end)

index and end contain indices into memory (that is, addresses). memory.get(index) is a string ("string1", "string2" and so on) and memory.get(index + 3) contains the corresponding value (value1, value2 and so on).
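
In Python terms, the loop does something like the following (a model of the memory layout only; the offsets 5, 3 and 7 come from the Flpc code above, while the function signature is my invention):

def convert_names(memory, names_get_addr, end, names):
    # The first key string sits 5 cells past the start of names.get's
    # body; each if-statement is 7 cells long, with its value 3 cells
    # after its key.
    index = names_get_addr + 5
    while index <= end:
        key = memory[index]        # "string1", "string2", ...
        value = memory[index + 3]  # value1, value2, ...
        names.set(key, value)
        index += 7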

Timing

I timed some of the precompiled files with and without hash tables, using time. time is not always the best tool but it's good enough here to get an idea. I had looked at other results before even considering speeding up names.get in the first place.

For flpc-partial.f, naive names.get

real 457.95
user 295.37
sys 0.10

Hash table names.get

real 69.22
user 67.41
sys 0.04

For flpc-gen.f, naive names.get

real 33.37
user 32.49
sys 0.90

Hash table names.get

real 11.29
user 8.45
sys 2.83

A bit underwhelming. All it did was give us a ~4x speed-up. But we may well have moved the bottleneck away from function name lookup. Since total runtime is dominated by the slowest part (and occasionally parts), even this modest speed-up tells us that function name lookup was one of the slower parts.

Attribute dictionaries

We now want to make attribute lookup (such as hashtable.attrib from above) use hash tables. Unfortunately, unlike names.get, each if-statement's body does not take up the same number of cells in memory. For example

return_if(string_equal(name "get")      hashtable.get)

takes up less than

return_if(string_equal(name "keys")     memory.get(receiver + 1))

because memory.get(receiver + 1) is computed in the function's body itself. We could potentially avoid this by wrapping every value in a function

return_if(string_equal(name "get")      hashtable.get_())
return_if(string_equal(name "keys")     hashtable.keys())

and hashtable.get_() would just return hashtable.get. But instead we used an ugly hack: jump tables that jump into the old hashtable.attrib (from our new definition of hashtable.attrib). Maybe this will be replaced with something better later.


Posted on May 16, 2020
