Most Java developers use Map implementations, especially HashMap. HashMap is a simple but powerful way to store and retrieve data. But how many developers know how a HashMap works internally? A few days ago, I read a large part of the source code of java.util.HashMap (in Java 7 and then Java 8) to get a deep understanding of this fundamental data structure. In this post, I will explain the implementation of java.util.HashMap, describe the new features added in Java 8, and discuss performance, memory, and some known issues you may run into when using a HashMap.
Internal storage
The Java HashMap class implements the Map<K, V> interface. The main methods of this interface are:
V put(K key, V value)V get(Object key)V remove(Object key)Boolean containsKey(Object key)
HashMap uses an inner class Entry<K, V> to store data. This inner class is a simple key-value pair with two extra pieces of data:
A reference to another entry, so that a HashMap can chain entries together like a singly linked list.
A hash value that represents the key. Storing this value avoids recomputing the hash of the key every time the HashMap needs it.
Here is a part of the code for Entry<K, V> under Java 7:
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;
    …
}

HashMap stores data in multiple singly linked lists (sometimes called buckets or bins). All lists are registered in an array of entries (Entry<K, V>[] array), and the default length of this internal array is 16.
The following figure describes the internal storage of a HashMap instance: an array of nullable entries, where each entry can point to another entry, forming linked lists.
All keys with the same hash value are placed in the same linked list (bucket), but keys with different hash values can also end up in the same bucket.
When the user calls put(K key, V value) or get(Object key), the program computes the index of the bucket in which the entry should be. It then iterates over the corresponding list looking for an Entry with the given key (using the key's equals() method).

In the case of get(), the program returns the value of the corresponding Entry, if such an entry exists.

In the case of put(K key, V value), if an Entry with the same key already exists, the program replaces its value with the new one; otherwise it creates a new Entry (from the key and value arguments) at the head of the singly linked list.
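As a rough illustration, here is what the lookup just described might look like. This is a simplified sketch, not the actual JDK code; table, hash() and indexFor() stand for HashMap's internal members, which are shown below:

// Simplified sketch of the lookup described above (not the real JDK code).
// 'table' is the internal Entry array; hash() and indexFor() are shown later.
V get(Object key) {
    int h = hash(key.hashCode());       // rehash the key's hash code
    int i = indexFor(h, table.length);  // compute the bucket index
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        // compare the stored hash first, then the key itself with equals()
        if (e.hash == h && (e.key == key || key.equals(e.key))) {
            return e.value;
        }
    }
    return null;  // no entry with this key
}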
The index of the bucket (linked list) is generated in three steps:
First get the hash code of the key.
The program rehashes the hash code to defend against poor hash functions on keys, which could otherwise put all the data on the same index (bucket) of the internal array.
The program takes the rehashed hash code and masks it with the array length minus 1. This operation ensures that the index can never be bigger than the size of the array. You can think of it as a computationally optimized modulo function.
Here is the source code for generating the index:
// the "rehash" function in JAVA 7 that takes the hashcode of the keystatic int hash(int h) {h ^= (h >>> 20) ^ (h >>> 12);return h ^ (h >>> 7) ^ (h >>> 4);}// the "rehash" function in JAVA 8 that directly takes the keystatic final int hash(Object key) {int h;return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);}// the function that returns the index from the rehashed hashstatic int indexFor(int h, int length) {return h & (length-1);}To work more efficiently, the size of the inner array must be a power of 2. Let's see why:
Assume the array length is 17; the mask value would then be 16 (length - 1). The binary representation of 16 is 0…010000, so for any hash value H, "H & 16" can only yield 16 or 0. This means an array of length 17 would only ever use two buckets: the one at index 0 and the one at index 16, which is very inefficient. But if the length of the array is a power of 2, for example 16, the index formula becomes "H & 15". The binary representation of 15 is 0…001111, so the formula can produce any index from 0 to 15, and the array of length 16 is fully usable. For example:
If H = 952, its binary representation is 0..01110111000, and the corresponding index is 0…01000 = 8
If H = 1576, its binary representation is 0..011000101000, and the corresponding index is 0…01000 = 8
If H = 12356146, its binary representation is 0..0101111001000101000110010, and the corresponding index is 0…00010 = 2
If H = 59843, its binary representation is 0..01110100111000011, its corresponding index is 0…00011 = 3
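You can verify these examples with a few lines of Java; this quick check applies the same "H & (length - 1)" formula with an array length of 16:

public class IndexCheck {
    public static void main(String[] args) {
        int[] hashes = {952, 1576, 12356146, 59843};
        for (int h : hashes) {
            // same formula as indexFor(h, 16): mask with length - 1
            System.out.println(h + " -> bucket " + (h & 15));
        }
        // prints buckets 8, 8, 2 and 3
    }
}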
This mechanism is transparent to the developer: if you choose a HashMap with an initial capacity of 37, the Map automatically selects the next power of 2 after 37 (64) as the length of its internal array.
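This rounding can be done with a few bit operations. The following sketch is modeled on the tableSizeFor() routine in the Java 8 source (Java 7 achieves the same result differently); the version here omits the JDK's maximum-capacity clamp:

// Rounds a requested capacity up to the next power of 2
// (modeled on tableSizeFor() in the Java 8 HashMap source).
static int tableSizeFor(int cap) {
    int n = cap - 1;   // so that exact powers of 2 stay unchanged
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;     // all bits below the highest set bit are now 1
    return (n < 0) ? 1 : n + 1;
}
// tableSizeFor(37) == 64, tableSizeFor(16) == 16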
Automatic resizing
After computing the index, the get(), put() or remove() methods access the corresponding linked list to see whether an Entry for the given key already exists. Unmodified, this mechanism could cause performance problems, because these methods have to iterate over the entire list to find out. Assume the internal array keeps its default length of 16 and you need to store 2,000,000 records. In the best case, each linked list holds 125,000 Entry objects (2,000,000 / 16), so every call to get(), remove() or put() requires 125,000 iterations. To avoid this, HashMap grows its internal array so that the linked lists stay short.
When you create a HashMap, you can specify an initial capacity and a loadFactor through the following constructor:
public HashMap(int initialCapacity, float loadFactor)
If you do not specify arguments, the default initialCapacity is 16 and the default loadFactor is 0.75. initialCapacity represents the initial length of the internal array of linked lists.
When you add a new key-value pair with put(…), the method decides whether the internal array needs to grow. To do so, the Map stores two values:
The size of the Map: it represents the number of entries in the HashMap, and is updated every time an entry is added or removed.
A threshold: it equals internal array length * loadFactor, and it is refreshed every time the internal array is resized.
Before adding a new Entry, put(…) checks whether the size is greater than the threshold. If it is, it creates a new array twice the length of the current one. Since the size of the array has changed, the indexing function (which returns "hash of the key & (array length - 1)") changes too. Doubling the array creates twice as many buckets (linked lists), and all existing Entry objects are redistributed across them. The goal of resizing is to shorten the linked lists so that the cost of put(), remove() and get() stays small. All entries whose keys have the same hash value end up in the same bucket after resizing; however, two entries with different hash values that shared a bucket before the resize are not guaranteed to share one afterwards.
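A heavily simplified sketch of this resize mechanism, using the Java 7 field names described above (the real code also handles null keys, conditional rehashing and other edge cases):

// Heavily simplified sketch of the Java 7 resize mechanism described above.
void resize(int newCapacity) {
    Entry<K,V>[] newTable = new Entry[newCapacity];  // twice the old length
    for (Entry<K,V> e : table) {                     // walk every old bucket
        while (e != null) {
            Entry<K,V> next = e.next;
            int i = indexFor(e.hash, newCapacity);   // recompute the index
            e.next = newTable[i];                    // insert at the head
            newTable[i] = e;                         // of the new bucket
            e = next;
        }
    }
    table = newTable;
    threshold = (int) (newCapacity * loadFactor);    // refresh the threshold
}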
This image shows the internal array before and after a resize. Before the resize, getting Entry E requires the Map to iterate over a linked list of 5 elements. After the resize, the same get() only has to traverse a linked list of 2 elements, so it runs more than twice as fast.
Thread safety
If you are already familiar with HashMaps, you know they are not thread safe, but why? Suppose, for example, that you have a writer thread that only inserts new data into the Map and a reader thread that only reads data from it: why shouldn't that work?
Because during an automatic resize, if a thread tries to put or get an object, the Map may use the old index values and fail to find the new bucket in which the Entry now lives.
The worst case is when two threads insert data at the same time and both put() calls trigger a resize of the array simultaneously. Since both threads modify the linked lists at the same time, one of the lists may end up with an internal loop. If you then try to get a value from a list that contains a loop, get() never terminates.
Hashtable offers a thread-safe implementation that prevents these situations. However, since all of its CRUD methods are synchronized, it can be very slow. For example, if thread 1 calls get(key1), thread 2 calls get(key2) and thread 3 calls get(key3), only one thread at a time can retrieve its value, even though all three could have read the data concurrently.
Since Java 5, we have a better thread-safe HashMap: ConcurrentHashMap. Only the buckets are synchronized, so multiple threads can call get(), remove() or put() concurrently as long as they do not touch the same bucket or trigger a resize of the internal array. In a multithreaded application, this is the better choice.
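For example, the following sketch lets a writer and a reader work on the same map without any external lock (using only the standard java.util.concurrent API):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentExample {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> map = new ConcurrentHashMap<>();

        // A writer and a reader can run concurrently without corrupting the map.
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1_000; i++) map.put("key" + i, i);
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1_000; i++) map.get("key" + i); // may be null if not yet written
        });

        writer.start(); reader.start();
        writer.join(); reader.join();
        System.out.println("size = " + map.size()); // always 1000 here
    }
}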
Key immutability
Why are Strings and Integers good keys for a HashMap? Mostly because they are immutable! If you create your own key class and do not make it immutable, you may lose data inside the HashMap.
Let's look at the following use cases:
You have a key whose internal value is "1".
You insert an object into the HashMap using this key.
HashMap generates hash values from the hash code of the key (i.e. "1").
The Map stores this hash value in the newly created Entry.
You change the internal value of the key to "2".
The hash value of the key has changed, but the HashMap doesn't know it (the old hash value is stored in the Entry).
You try to get the corresponding object by the modified key.
The Map computes the hash value of the modified key (i.e. from "2") to find the linked list (bucket) in which the Entry should be.
Case 1: since you modified the key, the Map looks for the Entry in the wrong bucket and doesn't find it.
Case 2: you are lucky, and the modified key maps to the same bucket as the old key. The Map then traverses the linked list to find an Entry with the same key. But before calling equals(), the Map first compares the hash values. Since the modified key no longer produces the old hash value (the one stored in the Entry), the Map has no way to find the corresponding Entry in the linked list.
Here is a Java example: we insert two key-value pairs into a Map, then I modify the first key and try to get both objects. Only the second object is returned; the first one is "lost" in the HashMap:
import java.util.HashMap;
import java.util.Map;

public class MutableKeyTest {

    public static void main(String[] args) {

        class MyKey {
            Integer i;

            public void setI(Integer i) {
                this.i = i;
            }

            public MyKey(Integer i) {
                this.i = i;
            }

            @Override
            public int hashCode() {
                return i;
            }

            @Override
            public boolean equals(Object obj) {
                if (obj instanceof MyKey) {
                    return i.equals(((MyKey) obj).i);
                } else
                    return false;
            }
        }

        Map<MyKey, String> myMap = new HashMap<>();
        MyKey key1 = new MyKey(1);
        MyKey key2 = new MyKey(2);

        myMap.put(key1, "test " + 1);
        myMap.put(key2, "test " + 2);

        // modifying key1
        key1.setI(3);

        String test1 = myMap.get(key1);
        String test2 = myMap.get(key2);

        System.out.println("test1= " + test1 + " test2=" + test2);
    }
}

The output of this code is "test1= null test2=test 2". As expected, the Map is unable to retrieve the String associated with key 1 once the key has been modified.
Improvements in Java 8
In Java 8, the internal implementation of HashMap was heavily modified. Indeed, the Java 7 implementation takes about 1,000 lines of code, while Java 8 takes about 2,000. Most of what I described earlier remains true in Java 8, except for the linked lists of Entry objects. In Java 8, the internal array still exists, but it now stores Node objects. A Node contains the same information as an Entry and is likewise chained into a linked list:
Here is part of the code that implements Node in Java 8:
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    …
}

So what is the big difference from Java 7? Well, a Node can be extended into a TreeNode. A TreeNode is a red-black tree node that stores more information, so that an element can be added, removed or found in O(log(n)) time. The following shows all the data a TreeNode holds:
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    final int hash;            // inherited from Node<K,V>
    final K key;               // inherited from Node<K,V>
    V value;                   // inherited from Node<K,V>
    Node<K,V> next;            // inherited from Node<K,V>
    Entry<K,V> before, after;  // inherited from LinkedHashMap.Entry<K,V>
    TreeNode<K,V> parent;
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;
    boolean red;
    …
}

Red-black trees are self-balancing binary search trees: their internal mechanisms guarantee that the depth of the tree is always in log(n), no matter how many nodes are added or removed. The main benefit of these trees shows up when many entries in the internal table share the same index (bucket): searching a tree then costs O(log(n)), whereas the same operation on a linked list costs O(n).
As you can see, the tree stores more data per node than the linked list. By inheritance, the internal table can contain both Node (linked list) and TreeNode (red-black tree) buckets. Oracle decided to use both data structures according to the following rules (see the constants just after this list):
- If, for a given index (bucket) in the internal table, there are more than 8 nodes, the linked list is transformed into a red-black tree.
- If, for a given index (bucket) in the internal table, there are fewer than 6 nodes, the red-black tree is transformed back into a linked list.
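These rules correspond to constants that you can find in the Java 8 source of java.util.HashMap:

// Constants from the Java 8 java.util.HashMap source
static final int TREEIFY_THRESHOLD = 8;     // list -> tree above this bucket size
static final int UNTREEIFY_THRESHOLD = 6;   // tree -> list below this bucket size
static final int MIN_TREEIFY_CAPACITY = 64; // minimum table size before treeifying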
This image shows an internal array of a Java 8 HashMap that contains both a tree (bucket 0) and linked lists (buckets 1, 2 and 3). Bucket 0 is a tree because it holds more than 8 nodes.
Memory overhead
JAVA 7
The use of a HashMap has a cost in terms of memory. In Java 7, HashMap wraps key-value pairs in Entry objects, and an Entry contains:
A reference to the next entry
A precomputed hash value (an int)
A reference to the key
A reference to the value
Additionally, a Java 7 HashMap uses an internal array of Entry references. If a Java 7 HashMap contains N elements and its internal array has a capacity of CAPACITY, the additional memory cost is approximately:
sizeOf(integer) * N + sizeOf(reference) * (3*N + CAPACITY)
Where:
The size of an integer is 4 bytes
The size of the reference depends on the JVM, operating system, and processor, but is usually 4 bytes.
This means that the total memory overhead is usually 16 * N + 4 * CAPACITY bytes.
Note: after an automatic resize of the Map, CAPACITY is the smallest power of 2 greater than N.
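As a worked example (following the note above): with N = 1,000,000 entries just after a resize, CAPACITY is 2^20 = 1,048,576, so the overhead is roughly 16 * 1,000,000 + 4 * 1,048,576 ≈ 20 MB, not counting the memory used by the keys and values themselves.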
Note: since Java 7, HashMap is lazily initialized. This means that even if you specify a size, the internal array (which costs 4 * CAPACITY bytes) is not allocated in memory until the first call to put().
JAVA 8
With the Java 8 implementation, computing the memory usage becomes a bit more complicated, because a Node may hold the same data as an Entry or, if it is a TreeNode, six additional references and a boolean.
If all nodes are just Nodes, then the memory consumed by Java 8 HashMap is the same as that consumed by Java 7 HashMap.
If all nodes are TreeNode, then the memory consumed by Java 8 HashMap becomes:
N * sizeOf(integer) + N * sizeOf(boolean) + sizeOf(reference)* (9*N+CAPACITY)
In most standard JVMs, the result of the above formula is 44 * N + 4 * CAPACITY bytes.
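Taking the same worked example of N = 1,000,000 entries with CAPACITY = 2^20: roughly 44 MB + 4 MB ≈ 48 MB if every node were a TreeNode, versus about 20 MB in the all-Node case. In practice only overcrowded buckets are turned into trees, so real usage sits close to the lower figure.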
Performance issues
Skewed HashMap vs. well balanced HashMap
In the best case, get() and put() have O(1) complexity. But if you don't pay attention to the hash function of your keys, your put() and get() calls can become very slow. Good performance of put() and get() depends on the data being spread across the different indexes of the internal array (the buckets). If the hash function of your keys is ill-designed, you get a skewed distribution (no matter how big the internal array is). The put() and get() calls that use the biggest linked lists become slow, because they have to iterate over entire lists. In the worst case (if most of the data ends up in the same bucket), the time complexity becomes O(n).
Here is a visual example. The first image shows a skewed HashMap, and the second a well balanced one.
With this skewed HashMap, the get() and put() operations on bucket 0 are costly: getting entry K takes 6 iterations.
With this well balanced HashMap, getting entry K takes only 3 iterations. Both HashMaps store the same amount of data and have internal arrays of the same size. The only difference is the hash function of the keys, which distributes the entries across the buckets.
Here is an extreme example in Java, where I use a hash function that puts all the data in the same bucket, and then add 2,000,000 entries:
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class Test {

    public static void main(String[] args) {

        class MyKey {
            Integer i;

            public MyKey(Integer i) {
                this.i = i;
            }

            @Override
            public int hashCode() {
                return 1;
            }

            @Override
            public boolean equals(Object obj) {
                … // equality on the wrapped Integer, as in the MutableKeyTest example
            }
        }

        Date begin = new Date();
        Map<MyKey, String> myMap = new HashMap<>(2_500_000, 1);
        for (int i = 0; i < 2_000_000; i++) {
            myMap.put(new MyKey(i), "test " + i);
        }

        Date end = new Date();
        System.out.println("Duration (ms) " + (end.getTime() - begin.getTime()));
    }
}

My machine is a Core i5-2500K @ 3.6 GHz, and with Java 8u40 it takes more than 45 minutes to run (I stopped the process after 45 minutes). Now, if I run the same code but with the following hash function:
@Override
public int hashCode() {
    int key = 2097152 - 1;
    return key + 2097152 * i;
}

it finishes in 46 seconds, which is much better! This hash function spreads the data across the buckets better than the previous one, so the put() calls are faster. And if I run the same code with the following hash function, which gives an even better distribution:
@Override
public int hashCode() {
    return i;
}

it now takes only 2 seconds!
I hope you realize how important the hash function is. If the same tests were run on Java 7, the first and second cases would be even worse (since the complexity of put() is O(n) in Java 7 versus O(log(n)) in Java 8).
When using a HashMap, you need a hash function for your keys that spreads the keys across as many buckets as possible. To do so, you need to avoid hash collisions. String objects are good keys because they have a good hash function, and Integer objects are good too because their hash code is their own value.
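For your own key classes, a reasonable pattern is to make the key immutable and derive the hash from every field used in equals(). Here is a hypothetical sketch (Objects.hash has been part of the JDK since Java 7):

import java.util.Objects;

// A hypothetical immutable key whose hash is spread over all of its fields
public final class PersonKey {
    private final String name;
    private final int birthYear;

    public PersonKey(String name, int birthYear) {
        this.name = name;
        this.birthYear = birthYear;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PersonKey)) return false;
        PersonKey other = (PersonKey) o;
        return birthYear == other.birthYear && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, birthYear); // combines both fields
    }
}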
Resizing overhead
If you need to store a large amount of data, you should create your HashMap with an initial capacity close to the expected number of entries.
If you don't, the Map takes the default capacity of 16 with a loadFactor of 0.75. The first 11 calls to put() are fast, but the 12th (16 * 0.75) recreates a new internal array of length 32 (with its associated linked lists/trees). The 13th through 23rd puts are fast again, but the 24th (32 * 0.75) once more recreates a costly new representation that doubles the size of the internal array. Further resizes then occur at the 48th, 96th, 192nd … call to put(). With small amounts of data, rebuilding the internal array is fast, but with large amounts it can take from seconds to minutes. By specifying your expected size at creation time, you avoid these costly resize operations.
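A minimal sketch of this advice (expectedSize is a hypothetical name; 0.75f matches the default loadFactor):

import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedSize = 2_000_000;
        // Choose the capacity so that expectedSize stays below capacity * loadFactor:
        // no resize is ever triggered while filling the map.
        Map<Integer, String> map = new HashMap<>((int) (expectedSize / 0.75f) + 1);
        for (int i = 0; i < expectedSize; i++) {
            map.put(i, "value " + i);
        }
        System.out.println("entries: " + map.size());
    }
}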
But there is a trade-off: if you set the array size to something huge, such as 2^28, while you only use 2^26 of its buckets, you waste a lot of memory (about 2^30 bytes in this case, since each slot of the oversized array still holds a 4-byte reference).
Conclusion
For simple use cases, you don't really need to know how HashMap works, since you won't notice the difference between O(1), O(n) and O(log(n)) operations. But it is always better to understand the mechanism behind such a heavily used data structure. Besides, this is a typical interview question for Java developer positions.
For large volumes of data, it becomes really important to know how HashMap works and how important the hash function of your keys is.
I hope this article can help you have a deep understanding of the implementation of HashMap.