Storage structure
First, HashMap is based on a hash table: internally it holds an array. When an element is to be stored, the hash value of its key is computed first, and the element's index in the array is derived from that hash value. If there is no element at that position yet, the new element is placed there directly. If there is already an element there (call it A), the new element is linked in front of A and then placed into the array slot. So the array in HashMap actually stores the first node of each linked list. Below is a picture from Baidu Encyclopedia:
As the figure above shows, each element is an Entry object that holds the element's key and value, plus a pointer to the next object. All keys with the same hash value (that is, the colliding keys) are strung together in a linked list; this is the separate-chaining ("zipper") method.
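To see this chaining from the outside, here is a small runnable demo. The FixedHashKey class is invented for illustration: its hashCode() is constant, so every key lands in the same bucket, yet both values stay retrievable because equals() distinguishes the nodes on the chain.

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Hypothetical key whose hashCode is fixed, forcing every
    // instance into the same bucket of the hash table.
    static final class FixedHashKey {
        final String name;
        FixedHashKey(String name) { this.name = name; }
        @Override public int hashCode() { return 42; } // deliberate collision
        @Override public boolean equals(Object o) {
            return o instanceof FixedHashKey
                    && name.equals(((FixedHashKey) o).name);
        }
    }

    public static void main(String[] args) {
        Map<FixedHashKey, Integer> map = new HashMap<>();
        map.put(new FixedHashKey("a"), 1); // first node in the bucket
        map.put(new FixedHashKey("b"), 2); // chained in the same bucket
        // Both values are found: equals() walks the chain to the right node.
        System.out.println(map.get(new FixedHashKey("a"))); // 1
        System.out.println(map.get(new FixedHashKey("b"))); // 2
    }
}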
Internal variables
//Default initial capacity
static final int DEFAULT_INITIAL_CAPACITY = 16;

//Maximum capacity
static final int MAXIMUM_CAPACITY = 1 << 30;

//Default load factor
static final float DEFAULT_LOAD_FACTOR = 0.75f;

//Hash table
transient Entry<K,V>[] table;

//Number of key-value pairs
transient int size;

//The threshold for resizing
int threshold;

//The load factor of the hash table
final float loadFactor;
Among the variables above, capacity refers to the length of the hash table, that is, the size of table; the default is 16. The load factor loadFactor measures how "full" the hash table is allowed to get, and the JDK documentation puts it this way:
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
Roughly: the load factor measures how full the hash table is allowed to get before it is expanded. When the number of key-value pairs in the hash table exceeds the product of the current capacity and the load factor, the hash table is rehashed (that is, its internal data structures are rebuilt) and its capacity becomes roughly twice the original.
As the variable definitions above show, the default load factor DEFAULT_LOAD_FACTOR is 0.75. The larger this value, the higher the space utilization, but queries (including get and put) become slower. Once the load factor is understood, threshold is easy: it is simply capacity × load factor.
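As a quick worked example using only the defaults quoted above: with capacity 16 and load factor 0.75 the threshold is 12, so (per the addEntry() logic shown later) the table is doubled around the insertion of the 13th entry.

public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;                 // DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;          // DEFAULT_LOAD_FACTOR
        // threshold = capacity * loadFactor
        System.out.println((int) (capacity * loadFactor)); // 12
    }
}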
Constructor
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);

    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity) // the smallest power of 2 not less than the specified capacity
        capacity <<= 1;

    this.loadFactor = loadFactor;
    threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    table = new Entry[capacity]; // Allocate space for the hash table
    useAltHashing = sun.misc.VM.isBooted() &&
            (capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
    init();
}

There are several constructors, and they all eventually call the one above. It takes two parameters: the initial capacity and the load factor. The arguments are validated first, and an exception is thrown if either is illegal. What matters is the capacity calculation that follows: it computes the smallest power of 2 greater than or equal to initialCapacity. The point is to make the capacity at least the specified initial capacity while keeping it a power of 2, and that power-of-2 requirement is the key issue here. The reason lies in how hash values are mapped to bucket indices. Let's first look at the hash method in HashMap:
final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }
    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length - 1);
}

The hash() method recomputes the hash value of the key using a series of shifts and XORs. The derivation of these particular bit operations is not obvious, but their purpose is stated in the comment: spreading the influence of the high-order bits into the low-order bits bounds the number of collisions when the bucket index is taken from the low bits alone.
The indexFor() method below it computes the element's index in the hash table from the hash value. In a general hash table you would take the hash value modulo the table length. When length (that is, capacity) is a power of 2, h & (length - 1) has exactly the same effect. Moreover, a power of 2 is even, so length - 1 is odd and its lowest binary bit is 1; the lowest bit of h & (length - 1) can then be either 1 or 0, so the keys are spread evenly across both even and odd indices. If length were odd, length - 1 would be even with a lowest bit of 0, so the lowest bit of h & (length - 1) could only be 0: every resulting index would be even, and half of the table would be wasted. This is why the capacity in HashMap must be a power of 2; the defaults DEFAULT_INITIAL_CAPACITY = 16 and MAXIMUM_CAPACITY = 1 << 30 both satisfy this.
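Both claims are easy to verify with a short sketch: the rounding loop below is the same one the constructor uses, and for a power-of-two length the bitmask agrees with the modulo (for non-negative h).

public class PowerOfTwoDemo {
    public static void main(String[] args) {
        // Same rounding logic as the constructor: smallest power of 2 >= 13 is 16.
        int initialCapacity = 13;
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        System.out.println(capacity); // 16

        // For a power-of-two length, h & (length - 1) == h % length for h >= 0,
        // so the cheap bitmask replaces an integer division.
        int length = 16;
        for (int h = 0; h < 1000; h++) {
            if ((h & (length - 1)) != h % length)
                throw new AssertionError("mismatch at h=" + h);
        }
        System.out.println("mask and modulo agree for length " + length);
    }
}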
Entry Object
Each key-value pair in HashMap is encapsulated in an Entry object; Entry is an inner class of HashMap. Let's take a look at its implementation:
static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;

    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }

    public final K getKey() { return key; }

    public final V getValue() { return value; }

    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    public final boolean equals(Object o) {
        if (!(o instanceof Map.Entry))
            return false;
        Map.Entry e = (Map.Entry) o;
        Object k1 = getKey();
        Object k2 = e.getKey();
        if (k1 == k2 || (k1 != null && k1.equals(k2))) {
            Object v1 = getValue();
            Object v2 = e.getValue();
            if (v1 == v2 || (v1 != null && v1.equals(v2)))
                return true;
        }
        return false;
    }

    public final int hashCode() {
        return (key == null ? 0 : key.hashCode()) ^
               (value == null ? 0 : value.hashCode());
    }

    public final String toString() {
        return getKey() + "=" + getValue();
    }

    void recordAccess(HashMap<K,V> m) { }
}

The implementation of this class is simple and easy to understand. Methods such as getKey() and getValue() are provided for callers. To determine equality, it is necessary that both key and value are equal.
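These Entry objects are also what you see from the outside when iterating a map; a quick usage example (the key=value output comes from the toString() shown above):

import java.util.HashMap;
import java.util.Map;

public class EntryDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        // Each element of entrySet() is one Map.Entry (a HashMap.Entry internally).
        for (Map.Entry<String, Integer> e : map.entrySet()) {
            System.out.println(e); // prints "a=1", "b=2" (iteration order not guaranteed)
        }
    }
}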
put operation
You have to put before you can get, so let's look at the put() method first:
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

This method first checks whether the key is null; if so, it calls putForNullKey(), which means HashMap allows null keys (and null values as well). Otherwise it computes the hash value, derives the table index, and walks the corresponding linked list to see whether an equal key already exists. If it does, the value is simply updated and the old value returned; otherwise addEntry() is called to insert a new entry.
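The "update and return the old value" branch is easy to observe from the outside:

import java.util.HashMap;

public class PutReturnDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        System.out.println(map.put("k", 1)); // null: no previous mapping for "k"
        System.out.println(map.put("k", 2)); // 1: same key, so the old value is returned
        System.out.println(map.get("k"));    // 2
    }
}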
Take a look at the putForNullKey() method:
private V putForNullKey(V value) {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0);
    return null;
}

As you can see, the null key always lives in bucket 0: if it is already present, its value is updated; otherwise addEntry() is called to insert it.
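A short check confirms this null-key behavior (and that values may be null too):

import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();
        map.put(null, "first");            // the null key always lives in bucket 0
        map.put(null, "second");           // same null key: the value is updated
        map.put("k", null);                // null values are allowed as well
        System.out.println(map.get(null)); // second
        System.out.println(map.size());    // 2
    }
}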
The following is the implementation of the addEntry() method:
void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}

void createEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    size++;
}

It first decides whether the table needs to be resized (resizing recomputes every element's index and copies the elements over), recalculating the array index if it did resize, and finally inserts the element in createEntry() using head insertion: the new entry becomes the head of the bucket's linked list.
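The resize() method itself is not quoted in this article. As a rough sketch of what it has to do (not the actual JDK source; it reuses the table, loadFactor, threshold, and indexFor() members shown earlier): allocate a table twice as large, then relink every entry into the bucket its hash maps to under the new length.

// Simplified sketch of resizing, NOT the real JDK code:
void resizeSketch(int newCapacity) {
    Entry<K,V>[] newTable = new Entry[newCapacity];
    for (Entry<K,V> e : table) {                   // walk every old bucket
        while (e != null) {                        // walk its chain
            Entry<K,V> next = e.next;
            int i = indexFor(e.hash, newCapacity); // index under the new length
            e.next = newTable[i];                  // head insertion, as in createEntry()
            newTable[i] = e;
            e = next;
        }
    }
    table = newTable;
    threshold = (int) (newCapacity * loadFactor);  // new resize threshold
}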
get operation
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);
    return null == entry ? null : entry.getValue();
}

private V getForNullKey() {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null)
            return e.value;
    }
    return null;
}

final Entry<K,V> getEntry(Object key) {
    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}

This is simpler than put(): check whether the key is null, locate the bucket, and traverse its linked list.
Performance optimization
HashMap is an efficient, general-purpose data structure found everywhere in Java programs. First, some basics. As you probably know, HashMap uses the key's hashCode() and equals() methods to distribute values across buckets. The number of buckets is usually slightly larger than the number of records in the map, so that each bucket holds only a few values (ideally one). When looking up by key, we can locate a bucket very quickly (by taking hashCode() modulo the number of buckets) and then find the object we are looking for in constant time.
You probably know all of this already. You may also know that hash collisions can have a catastrophic impact on HashMap's performance. If multiple hashCode() values fall into the same bucket, the values are stored in a linked list; in the worst case, all keys map to the same bucket and the HashMap degenerates into a linked list, with lookup time going from O(1) to O(n). Let's first test the performance of HashMap in Java 7 and Java 8 under normal conditions. To fully control the behavior of the hashCode() method, we define a Key class as follows:
class Key implements Comparable<Key> {

    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}

The implementation of the Key class is fairly standard: it overrides equals() and provides a reasonable hashCode() method. To avoid excessive GC, the immutable Key objects are cached rather than created anew for every request:
public class Keys {

    public static final int MAX_KEY = 10_000_000;
    private static final Key[] KEYS_CACHE = new Key[MAX_KEY];

    static {
        for (int i = 0; i < MAX_KEY; ++i) {
            KEYS_CACHE[i] = new Key(i);
        }
    }

    public static Key of(int value) {
        return KEYS_CACHE[value];
    }
}

Now we can start testing. Our benchmark creates HashMaps of different sizes (in powers of 10, from 1 to 1 million) filled with consecutive Key values. In the test we also look values up by key and measure how long that takes for each map size:
import java.util.HashMap;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class MapBenchmark extends SimpleBenchmark {

    private HashMap<Key, Integer> map;

    @Param
    private int mapSize;

    @Override
    protected void setUp() throws Exception {
        map = new HashMap<>(mapSize);
        for (int i = 0; i < mapSize; ++i) {
            map.put(Keys.of(i), i);
        }
    }

    public void timeMapGet(int reps) {
        for (int i = 0; i < reps; i++) {
            map.get(Keys.of(i % mapSize));
        }
    }
}

Interestingly, on this simple HashMap.get() benchmark, Java 8 is 20% faster than Java 7. The overall performance is also remarkable: even with a million records in the HashMap, a single lookup takes less than 10 nanoseconds, which is about 20 CPU cycles on my machine. Quite impressive! But this is not what we set out to measure.
Suppose we have a badly behaved key that always returns the same hash value. This is the worst-case scenario, in which you should not be using a HashMap at all:
class Key implements Comparable<Key> {

    //...

    @Override
    public int hashCode() {
        return 0;
    }
}

The Java 7 results are as expected: as the HashMap grows, the cost of get() grows with it. Since all records sit in one very long linked list in the same bucket, finding a record requires traversing half the list on average. Hence the O(n) time complexity visible in the figure.
Java 8, however, performs much better! The curve is logarithmic, so its performance is orders of magnitude better: even in this worst case of severe hash collisions, the same benchmark runs in O(log n) on JDK 8. Looking at the JDK 8 curve on its own makes this clearer, especially on a log-linear scale:
Where does this huge performance improvement come from, given that big-O notation is used in both cases (big O describes an asymptotic upper bound)? The optimization is described in JEP 180. If a bucket holds too many records (currently TREEIFY_THRESHOLD = 8), HashMap dynamically replaces its linked list with a tree, giving O(log n) instead of O(n). How does it work? Entries whose keys collide are at first simply appended to a linked list and can only be found by traversal. Once the threshold is exceeded, however, HashMap upgrades the list to a binary tree (a red-black tree in the actual implementation), using the hash value as the branching variable: of two unequal hash values mapping to the same bucket, the larger one goes into the right subtree. If the hash values are equal, HashMap prefers the keys to implement the Comparable interface so that they can be ordered. Implementing Comparable is not required of HashMap keys, but it is of course best if they do; if they don't, you should not expect any performance improvement under severe hash collisions.
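To make the ordering rule concrete, here is an illustrative sketch of the comparison a treeified bin has to perform when placing a key. This is simplified pseudocode in Java, not the real java.util.HashMap source, which uses a full red-black tree and an additional tie-breaking step.

// Illustrative only: negative means "go left", positive means "go right".
@SuppressWarnings({"rawtypes", "unchecked"})
static int binCompare(int h1, Object k1, int h2, Object k2) {
    if (h1 != h2)
        return Integer.compare(h1, h2);          // unequal hashes: order by hash value
    if (k1 instanceof Comparable && k1.getClass() == k2.getClass())
        return ((Comparable) k1).compareTo(k2);  // equal hashes: fall back to Comparable
    return 0;  // no usable order: the real code applies a final tie-breaking rule
}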
What good is this improvement? Consider a malicious program that knows which hashing algorithm we use: it can send a large number of requests crafted to cause severe hash collisions, and then keep accessing those keys to significantly degrade the server's performance, resulting in a denial-of-service (DoS) attack. The leap from O(n) to O(log n) in JDK 8 effectively mitigates such attacks, and it also makes HashMap's performance slightly more predictable. I hope this improvement will finally convince your boss to agree to an upgrade to JDK 8.
The environment used for the test is: Intel Core i7-3635QM @ 2.4 GHz, 8GB memory, SSD hard disk, using default JVM parameters, running on a 64-bit Windows 8.1 system.
Summary
The basic implementation of HashMap is as analyzed above. To summarize: the table is an array of buckets whose length is always a power of 2, so the bucket index can be computed with the cheap mask h & (length - 1); collisions are resolved by chaining, with new entries inserted at the head of the list (in JDK 7); the null key lives in bucket 0; the table is doubled once the number of entries exceeds capacity × load factor (0.75 by default); and since JDK 8, buckets that grow past TREEIFY_THRESHOLD = 8 are converted to trees, bounding worst-case lookups at O(log n).