Implementing Hash Maps

Introduction
Algorithm
Implementation
Conclusion

Introduction

A hashmap is a type of key-value store that uses a hash function to index.

Algorithm

A hashmap is fairly simple, but in essence you have an array of 2-tuples with the first containing a key and the second containing a value. To insert into the array you first hash the key and then take the modulus of the key with respect to the length of the array.

Two items being added to a hashmap.

However, as you might imagine, there can occasionally be collisions, like this.

A third item being added to a hash map but experiencing a hash collision.

To solve this you can either resize the array, shove the collided items in a sub-map ¹ 1 Which will over time cause the hash map to have \(O(log(n))\) search times., or shift them over in some constant direction. Here we choose to resize the array, which is an \(O(n)\) operation.

All three itmes being added again to the newly resized has hmap.

To find items in the array you first find the index of the key just like you did for insertion and then you compare the key to the key in the 2-tuple you stored at that index. If it is the same, then you return the value from the tuple.

The value of foo being obtained from the hash map.

Implementation

Hash Function

This function simply multiplies the list of numbers together with a pad, creating a somewhat random number. It is the exact same as the one in my Implementing Bloom Filters tutorial.

(defn hash-obj [obj]
  (let [pad 3484115242153581271N]
    (->> (str obj)
         (map int)
         (apply (fn [x & xs] (cons (mod (*' x pad) Long/MAX_VALUE) xs)))
         (reduce #(mod (*' %1 %2 pad) Long/MAX_VALUE))
         long)))

The Hashmap Constructor

To define a hash map we first have to define a simple object for it and a constructor that creates an array of nils of the relevant size.

(defrecord HashMap [arr size])

(defn new-hashmap [size] (->HashMap (vec (repeat size nil)) size))

Adding Items

Now that we can get stuff from the hashmap, let's consider adding stuff to it as well. Here we add a function that takes the map, a key, and a value, and then adds that to the hash map. Should a collision occur it resizes.

(defn add-hash
  [{:keys [arr size] :as map} key val]
  (if (nil? (nth arr (mod (hash-obj key) size)))
    (assoc-in map [:arr (mod (hash-obj key) size)] [key val])
    (let [new-hashmap (new-hashmap (* 2 size))]
      (->> arr
           (filter identity)
           (cons [key val])
           (reduce (fn [new-map [key val]] (add-hash new-map key val)) new-hashmap)))))

Here we also see the sole weakness of the hash map, either resizing has to occur or you have to accept degraded performance once a certain size has been reached. Resizing is, while deterministic, not idempotent and it is absolutely possible that it could go on infinitely.² 2 This is because (ignoring for a moment that our hash function is terrible and probably not properly random) the probability of a collision should be equal to the fraction of the map that is filled. This probability is never exactly 0 for an insertion (and of course resizing itself involves insertions), leading to possible dramatic growth in some cases. You also will not be able to fill all the indices in the hash map, meaning that each component will inevitably have only some set of the items.

To demonstrate the resizing ability, consider the execution of the following.

First we define a new map.

(def hm (new-hashmap 5))

Then we start adding opinions to it about different programming languages.

(def hm-with-opinions
  (-> hm
      (add-hash "Clojure" "Is extremely awesome.")
      (add-hash "JavaScript" "Kind of sucks.")))

Which gives us this, and as you can see the two items have been added at their respective indices.

#hm-demo.HashMap{:arr
                 [["Clojure" "Is extremely awesome."]
                  nil
                  nil
                  ["Javascript" "Kind of sucks."]
                  nil],
                 :size 5}

Then we just insert my opinion on PHP.

(add-hash hm-with-opinions "PHP" "Oh god no, please spare me from this hell.")

And now, as you can see, after adding the last element the hasmap automatically resizes, doubling in size.

{:arr
 [["PHP" "Oh god no, please spare me from this hell."]
  nil
  nil
  nil
  nil
  ["Clojure" "Is extremely awesome."]
  nil
  nil
  ["JavaScript" "Kind of sucks."]
  nil],
 :size 10}

Getting Items

The get-hash function simply takes the hahmap and a key. Should the key correspond with a value in the map it returns true, else nil or the value provided in not-found, which replicates the functionality of the native get function.

(defn get-hash
  ([map key] (get-hash map key nil))
  ([{:keys [arr size]} key not-found]
   (let [idx  (mod (hash-obj key) size)
         item (nth arr idx )]
     (if (or (nil? item)
             (not= key (first item)))
       not-found
       (second item)))))

Conclusion

Hash maps are a great way to store key-value pairs with a constant lookup time. However, due to their limited size-efficiency they can have somewhat non-deterministic insertion times, which can be painful if you are working with very large maps.