HashMap Overview
HashMap is a non-synchronized, hash table based implementation of the Map interface. It provides all of the optional map operations and permits null values and a null key. This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.
HashMap's data structure
In the Java language there are two basic structures: the array and the simulated pointer (reference). All other data structures can be built from these two, and HashMap is no exception. HashMap is essentially an "array of linked lists", that is, a hash table combining an array with linked lists. In addition, JDK 1.8 adds the red-black tree to this structure: when a linked list in a bucket grows longer than 8 nodes, it is converted into a red-black tree.
As the figure above shows, HashMap in Java uses separate chaining (the chained-address method), which is simply a combination of an array and linked lists. Each array element holds a linked list; when data is hashed, the array index is computed and the entry is placed on the linked list at that index.
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;      // used to locate the position in the array (the bucket index)
    final K key;
    V value;
    Node<K,V> next;      // the next Node in the linked list

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}

Node is a static inner class of HashMap that implements the Map.Entry interface; it is essentially a mapping (a key-value pair).
Sometimes two keys map to the same position, which means a hash collision has occurred. The more uniformly the hash algorithm distributes its results, the lower the probability of collisions and the higher the map's access efficiency.
There is a very important field in the HashMap class, which is the Node[] table, that is, the hash bucket array. It is obviously an array of Nodes.
If the hash bucket array is large, even a poor hash algorithm will spread elements out; if the hash bucket array is small, even a good hash algorithm will produce many collisions. So the space cost has to be weighed against the time cost: the size of the hash bucket array is chosen based on the actual situation, and on top of that the hash algorithm is designed to reduce collisions. How, then, do we keep the probability of hash collisions low while keeping the hash bucket array (the Node[] table) small? The answer is a good hash algorithm plus a good resizing mechanism.
Before looking at the hash and resize process, we need to understand several fields of HashMap. They are initialized by HashMap's default constructor, as the following source shows:
int threshold;       // the maximum number of key-value pairs the map can hold before resizing
float loadFactor;    // the load factor
int modCount;
int size;
First, the Node[] table is initialized with length 16 by default; loadFactor is the load factor (default 0.75); and threshold is the maximum number of Nodes (key-value pairs) the HashMap can hold before it must grow, where threshold = length * loadFactor. In other words, once the array length is fixed, a larger load factor means more key-value pairs can be held.
From the definition of the load factor, threshold is the maximum number of elements allowed for a given loadFactor and length (array length). When this number is exceeded, the map is resized, and the new capacity is twice the previous one. The default load factor of 0.75 is a balanced trade-off between space and time efficiency; it is recommended not to change it except in special circumstances. If memory is plentiful and lookup speed matters most, the load factor can be lowered; conversely, if memory is tight and lookup speed matters less, it can be raised, even above 1.
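For intuition, here is a small runnable sketch using the default values (capacity 16, load factor 0.75); the class name and the two extra maps are made up purely for illustration:

import java.util.HashMap;
import java.util.Map;

public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;          // default table length
        float loadFactor = 0.75f;   // default load factor
        int threshold = (int) (capacity * loadFactor); // 16 * 0.75 = 12
        System.out.println("threshold = " + threshold); // prints 12

        // Inserting the 13th mapping makes size > threshold,
        // so the table is resized to twice its previous capacity (16 -> 32).

        // A custom load factor can also be passed to the constructor,
        // trading memory for lookup speed as described above.
        Map<String, Integer> sparse = new HashMap<>(16, 0.5f);
        Map<String, Integer> dense  = new HashMap<>(16, 1.0f);
    }
}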
The size field is easy to understand: it is the number of key-value pairs actually stored in the HashMap. Note the difference between it, the length of the table, and threshold, the maximum number of key-value pairs the table can hold before resizing. The modCount field records the number of changes to HashMap's internal structure and is mainly used for the fail-fast behaviour of iterators. To be clear, a change to the internal structure means a structural change, such as putting a new key-value pair; overwriting the value of an existing key is not a structural change.
In HashMap, the length of the hash bucket array table must be a power of 2 (and therefore, beyond 2 itself, a composite number). This is an unconventional design; the conventional design is to make the bucket count a prime number, since, relatively speaking, primes cause fewer collisions than composites (for a proof see //www.VeVB.COM/article/100911.htm). Hashtable, for instance, initializes its bucket count to 11, an application of the prime-sized design (although Hashtable cannot guarantee the count stays prime after resizing). HashMap adopts the unconventional power-of-2 design mainly to optimize the modulo operation and resizing; at the same time, to reduce collisions, HashMap lets the high bits of the hash participate in the computation when locating the bucket index.
There is still a problem: even with a reasonable load factor and hash algorithm, it is inevitable that some chains become too long, and once a chain is too long it seriously hurts HashMap's performance. Therefore, in JDK 1.8 the data structure was further optimized by introducing the red-black tree: when a chain becomes too long (longer than 8 by default) it is converted into a red-black tree, and the red-black tree's fast insertion, deletion, modification and lookup are used to improve HashMap's performance.
Determine the index position of the hash bucket array
Code implementation:
// Method 1:
static final int hash(Object key) {   // jdk1.8 & jdk1.7
    int h;
    // Step 1: h = key.hashCode(), take the hashCode value
    // Step 2: h ^ (h >>> 16), let the high bits participate in the computation
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// Method 2:
static int indexFor(int h, int length) {   // jdk1.7 source; jdk1.8 has no such method but the principle is the same
    return h & (length - 1);   // Step 3: the modulo operation
}

The hash algorithm here essentially consists of three steps: taking the key's hashCode value, the high-bit operation, and the modulo operation.
For any given object, as long as its hashCode() return value is the same, the hash code computed by method one is always the same. The first idea that comes to mind is to take the hash value modulo the array length, so that elements are distributed fairly evenly; however, the modulo operation is relatively expensive. HashMap does it differently: method two is used to compute which index of the table array the object should be stored at.
This method is very clever. It obtains the storage position of the object through h & (table.length - 1), and the length of HashMap's underlying array is always a power of 2, which is an optimization HashMap makes for speed. When length is a power of 2, h & (length - 1) is equivalent to taking the value modulo length, i.e. h % length, but & is more efficient than %.
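A quick check of this equivalence for non-negative hash values (the class name and loop bound are arbitrary; it is also worth noting that for negative values Java's % can return a negative remainder while the mask form cannot, which is one more reason the & form is used):

public class ModEquivalenceDemo {
    public static void main(String[] args) {
        int length = 16;   // a power of two
        for (int h = 0; h < 1000; h++) {
            // For non-negative h and a power-of-two length, the two expressions agree.
            if ((h & (length - 1)) != (h % length))
                throw new AssertionError("mismatch at h = " + h);
        }
        System.out.println("h & (length - 1) == h % length for all tested h");
    }
}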
In the JDK 1.8 implementation, the high-bit operation is optimized by XORing the high 16 bits of hashCode() with the low 16 bits: (h = k.hashCode()) ^ (h >>> 16). This is mainly a trade-off between speed, utility, and quality: it ensures that even when the table array is relatively small, both the high and low bits take part in the hash computation, without adding too much overhead.
Here is an example, where n is the length of the table.
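The three steps can also be traced in code. The sketch below uses the same perturbation as method one; the key "name" and the table length 16 are arbitrary example values, not taken from the figure:

public class IndexDemo {
    // Same perturbation as method one: XOR the high 16 bits into the low 16 bits.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        String key = "name";   // an arbitrary example key
        int n = 16;            // table length, always a power of 2
        int h = hash(key);
        int index = h & (n - 1);   // equivalent to h % n because n is a power of 2
        System.out.println("hash  = " + Integer.toBinaryString(h));
        System.out.println("index = " + index);   // a value in [0, 15]
    }
}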
HashMap put method implementation
The general idea of the put function is:
1. Hash the key's hashCode() and compute the index with (n - 1) & hash;
2. If there is no collision, put the node directly into the bucket;
3. If there is a collision, append the node to the bucket's linked list;
4. If the chain becomes too long (longer than TREEIFY_THRESHOLD, 8 by default), convert the list into a red-black tree;
5. If the key already exists in the bucket, overwrite its value;
6. If the map now holds more than threshold entries (i.e. it is full), resize.
The specific code is implemented as follows:
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

/**
 * Method for generating the hash
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // Judge whether the table is empty; if so, create a new table array via resize() and record its length
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // Compute the array index i from the key's hash; if table[i] == null, simply create a new node there
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else { // the bucket already contains a node
        Node<K,V> e; K k;
        // Judge whether the first element of table[i] has the same key; if so, its value will be overwritten below
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // Judge whether table[i] is a TreeNode, i.e. whether the bucket is a red-black tree;
        // if so, insert the key-value pair directly into the tree
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        // Otherwise the bucket is a linked list
        else {
            // Traverse table[i]; if the chain length reaches TREEIFY_THRESHOLD (default 8),
            // convert the list into a red-black tree and insert there, otherwise insert into the list.
            // If the key is found during the traversal, its value is simply overwritten.
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // The key already exists: overwrite its value and return the old one
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // After a successful insertion, judge whether the actual number of key-value pairs exceeds threshold; if so, resize
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

HashMap get method implementation
The idea is as follows:
1. Check the first node in the bucket; if it matches the key, hit directly;
2. If there is a collision, find the corresponding entry via key.equals(k):
   - if the bucket is a tree, search the tree via key.equals(k), in O(log n);
   - if the bucket is a linked list, search the list via key.equals(k), in O(n).
public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // The first node is always checked: direct hit
        if (first.hash == hash &&
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // Missed on the first node: walk the rest of the bucket
        if ((e = first.next) != null) {
            // Get from the red-black tree
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            // Get from the linked list
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

Capacity expansion mechanism
Resizing means recomputing the capacity. As elements are continually added to a HashMap and the internal array can no longer hold them, the array has to be enlarged so that more elements can be stored. Of course, arrays in Java cannot grow automatically; the approach is to replace the existing small array with a new, larger one, just as when carrying water with a small bucket: to carry more water, you switch to a bigger bucket.
Let us analyse the source code of resize. Because JDK 1.8 incorporates the red-black tree, its version is more complicated, so to make it easier to understand we will use the JDK 1.7 code first; it is simpler but essentially the same, and the specific differences are discussed later.
void resize(int newCapacity) {   // pass in the new capacity
    Entry[] oldTable = table;    // reference to the Entry array before resizing
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {   // if the size before resizing already reached the maximum (2^30)
        threshold = Integer.MAX_VALUE;       // set the threshold to Integer.MAX_VALUE (2^31 - 1) so it never resizes again
        return;
    }
    Entry[] newTable = new Entry[newCapacity];     // initialize a new Entry array
    transfer(newTable);                            // !! transfer the data into the new Entry array
    table = newTable;                              // point HashMap's table field at the new Entry array
    threshold = (int)(newCapacity * loadFactor);   // recompute the threshold
}

Here an array with a larger capacity replaces the existing small-capacity array, and the transfer() method copies the elements of the original Entry array into the new Entry array.
void transfer(Entry[] newTable) {
    Entry[] src = table;                   // src references the old Entry array
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) { // iterate over the old Entry array
        Entry<K,V> e = src[j];             // take each element of the old Entry array
        if (e != null) {
            src[j] = null;   // release the old array's reference (after the loop the old array no longer references any object)
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);   // !! recompute each element's position in the new array
                e.next = newTable[i];                    // mark [1]
                newTable[i] = e;                         // put the element into the array
                e = next;                                // move to the next element on the Entry chain
            } while (e != null);
        }
    }
}

Assigning newTable[i] to e.next (mark [1]) is head insertion into a singly linked list: a new element at the same index is always placed at the head of the chain, so an element that was put on an index first will end up at the tail of the Entry chain (if hash conflicts occur). This differs from JDK 1.8, as explained in detail below. Elements that were on the same Entry chain in the old array may end up at different positions in the new array after their indexes are recomputed.
Here is an example to illustrate the resize process. Assume our hash algorithm is simply key mod the table size (i.e. the array length). The hash bucket array has size table = 2, the keys are 3, 7 and 5, and the put order is 5, 7, 3. After mod 2 they all collide in table[1]. Assume further that the load factor loadFactor = 1, i.e. the map resizes when the number of key-value pairs exceeds the table size. The next steps are the process of resizing the hash bucket array to 4 and then rehashing all the Nodes.
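The indexes in this example can be checked with a few lines of code; this is a sketch of the simplified key-mod-length hash used above, not HashMap's real hash:

public class ResizeTraceDemo {
    public static void main(String[] args) {
        int[] keys = {5, 7, 3};   // put order is 5, 7, 3

        // Before resizing: table length 2, index = key % 2
        for (int key : keys)
            System.out.println("old index of " + key + " = " + (key % 2)); // all 1, so they collide in table[1]

        // After resizing: table length 4, index = key % 4
        for (int key : keys)
            System.out.println("new index of " + key + " = " + (key % 4)); // 5 -> 1, 7 -> 3, 3 -> 3
    }
}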
Now let us look at the optimizations made in JDK 1.8. Observe that we always expand by a power of two (the new length is twice the old one), so after resizing an element is either at its original index or at the original index plus the old capacity. The figure below makes this clear: n is the table length, figure (a) shows how the indexes of two keys, key1 and key2, are determined before resizing, and figure (b) shows their indexes after resizing, where hash1 is the result of the hash and high-bit operation for key1.
After the elements' hashes are reconsidered, since n has doubled, the mask n - 1 has one more significant bit at the high end (shown in red in the figure), so the new index changes as follows:
Therefore, when we resize a HashMap we do not need to recompute the hash as the JDK 1.7 implementation does; we only need to look at whether the newly significant bit of the original hash value is 1 or 0. If it is 0, the index is unchanged; if it is 1, the index becomes "original index + oldCap" (see the resize diagram from 16 to 32 below).
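The same bit test can be checked directly. A minimal sketch with oldCap = 16 as in the diagram; the hash value is an arbitrary example:

public class ResizeBitDemo {
    public static void main(String[] args) {
        int oldCap = 16;              // old table length
        int newCap = oldCap << 1;     // 32 after doubling

        int hash = 0b0101_0101;       // an arbitrary example hash value
        int oldIndex = hash & (oldCap - 1);

        // The bit that decides the new position is exactly the oldCap bit.
        int newIndex = ((hash & oldCap) == 0)
                ? oldIndex            // bit is 0: index unchanged
                : oldIndex + oldCap;  // bit is 1: original index + oldCap

        // Agrees with recomputing the index against the new mask.
        System.out.println(newIndex == (hash & (newCap - 1)));  // prints true
    }
}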
This design is indeed very clever: it saves the time of recomputing hash values, and since the newly significant bit can be regarded as randomly 0 or 1, the resize process spreads the previously colliding nodes evenly across the new buckets. This is a new optimization in JDK 1.8. One difference worth noting: in JDK 1.7, when old chains are migrated during rehashing, the elements that land on the same index in the new table end up in reverse order, whereas, as can be seen from the figure above, JDK 1.8 does not reverse them. Interested readers can study the JDK 1.8 resize source code, which is very well written, as follows:
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // If the maximum size has been reached, stop growing and just let collisions happen
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // If the maximum is not exceeded, expand to twice the original size
        if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
            oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // Compute the new resize threshold
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // Move each bucket into the new buckets
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        // original index
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        // original index + oldCap
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    // put the "original index" chain into the bucket
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    // put the "original index + oldCap" chain into the bucket
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

Summary
We can now answer the questions raised at the beginning to deepen our understanding of HashMap:
1. When would you use a HashMap? What are its characteristics?
It is an implementation based on the Map interface that stores key-value pairs, can accept a null key and null values, and is not synchronized. HashMap stores Entry(hash, key, value, next) objects.
2. Do you know how HashMap works?
HashMap works by hashing: objects are stored and retrieved through put and get. When storing an object, we pass the key and value to put; it calls the key's hashCode, computes the hash to obtain the bucket location, and stores the entry there. HashMap automatically adjusts its capacity based on the current occupancy (if it exceeds loadFactor, it resizes to twice the original size). When retrieving an object, we pass the key to get, which calls hashCode to compute the hash, locates the bucket, and then uses equals() to identify the right key-value pair. If a collision occurs, HashMap organizes the colliding elements in a linked list; in Java 8, if the number of colliding elements in a bucket exceeds a limit (8 by default), a red-black tree replaces the linked list to speed things up.
3. Do you know the principles of get and put? What are the functions of equals() and hashCode()?
The key's hashCode() is hashed and the index is computed with (n - 1) & hash to obtain the bucket location. If a collision occurs, the key.equals() method is used to search the linked list or tree for the corresponding node.
4. Do you know the implementation of hash? Why do I need to do this?
In the Java 8 implementation it is done by XORing the high 16 bits of hashCode() with the low 16 bits: (h = k.hashCode()) ^ (h >>> 16). This is mainly a trade-off between speed, utility, and quality: it ensures that even when the number of buckets n is relatively small, both the high and low bits participate in the hash computation, without adding too much overhead.
5. What if the size of HashMap exceeds the capacity defined by the load factor?
If the limit defined by the load factor (default 0.75) is exceeded, the map is resized to twice the original length and the hash method is applied again to place the entries in the new bucket array.
The Java collections cheat sheet describes HashMap as follows:
The hash bucket array is implemented as an Entry[] array, and the key's hash value is used to obtain the array index.
When inserting an element, if two keys fall into the same bucket (for example, hash values 1 and 17 both give index 1 after mod 16), the Entries use a next field to form a singly linked list, and the Entry entering the bucket has its next point at the Entry currently in the bucket.
When looking up a key whose hash value is 17, the corresponding bucket is located first, then the linked list in that bucket is traversed and the keys are compared one by one.
When the number of Entries reaches 75% of the number of buckets (many articles say when 75% of the buckets are in use, but that is not what the code does), the bucket array is doubled and every existing Entry is redistributed, so it is best to supply an estimated size up front.
The bitwise operation hash & (arrayLength - 1) is faster than a modulo, so the array size is always a power of 2; if you give an initial value such as 17, it is rounded up to 32. The default initial size used when the first element is put in is 16.
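The rounding up to the next power of two is done with a bit-smearing trick; the sketch below shows the idea (JDK 8's tableSizeFor works essentially this way, but the real version additionally clamps the result to MAXIMUM_CAPACITY):

public class TableSizeDemo {
    // Returns the smallest power of two >= cap (the real HashMap caps this at 2^30).
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;   // smear the highest set bit downwards
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(17)); // 32
        System.out.println(tableSizeFor(16)); // 16
    }
}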
iterator() traverses along the hash bucket array, so the order appears random.
6. What happens when the hashcode of two objects is the same?
Because their hashcodes are the same, their bucket position is the same and a "collision" occurs. HashMap uses a linked list to store these objects, so the Entry (the Map.Entry object containing the key-value pair) is stored in the bucket's linked list.
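A small demonstration of a real collision: "Aa" and "BB" are a well-known pair of strings with the same hashCode, yet both mappings remain usable because equals() distinguishes them within the bucket:

import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" produce the same hashCode in Java (2112).
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true

        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);   // collides with "Aa" and is chained in the same bucket

        // equals() disambiguates the entries within the bucket.
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
    }
}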
7. If the hashcode of two keys is the same, how do you get the value object?
After locating the bucket, the key's equals() method is called to find the correct node in the linked list (or tree) and thus the value object being looked up. Therefore, when designing a key type for HashMap, making it an immutable object declared final with proper equals() and hashCode() methods reduces collisions and improves efficiency. Immutability makes it possible to cache the hashcode of a key, which speeds up retrieval; wrapper classes such as String and Integer are therefore excellent choices for keys.
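A minimal sketch of such a key type (the class PointKey and its fields are made up purely for illustration): a final class with final fields and equals()/hashCode() overridden together:

import java.util.Objects;

// A hypothetical immutable key: final class, final fields, equals/hashCode overridden consistently.
public final class PointKey {
    private final int x;
    private final int y;

    public PointKey(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointKey)) return false;
        PointKey other = (PointKey) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);   // stable because x and y never change
    }
}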
8. What if the size of HashMap exceeds the capacity defined by the load factor?
The default load factor is 0.75. That is, when the map has filled 75% of its buckets' capacity, then, like other collection classes (such as ArrayList), a bucket array twice the size of the original is created, the map is resized, and the original objects are placed into the new bucket array. This process is called rehashing because the hash method is called again to find the new bucket locations.
9. Do you understand the problems with resizing a HashMap?
There is indeed a race condition when resizing a HashMap: if two threads both discover that the HashMap needs resizing, they will try to resize it at the same time. During the resize, the order of the elements stored in a linked list is reversed, because when moving to the new bucket position HashMap places elements at the head of the list rather than at the tail (to avoid traversing to the tail). If a race occurs, a cycle can form in the linked list, causing an infinite loop. Therefore, in a concurrent environment we use ConcurrentHashMap instead of HashMap.
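A minimal sketch of the safer alternative in concurrent code (the thread count and key range are arbitrary example values):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentPutDemo {
    public static void main(String[] args) throws InterruptedException {
        // ConcurrentHashMap resizes safely under concurrent puts,
        // so the infinite-loop risk of a shared HashMap does not apply.
        Map<Integer, Integer> map = new ConcurrentHashMap<>();

        Runnable writer = () -> {
            for (int i = 0; i < 10_000; i++) map.put(i, i);
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start(); t2.start();
        t1.join(); t2.join();

        System.out.println(map.size()); // 10000, both threads wrote the same keys
    }
}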
10. Why are wrapper classes like String and Integer suitable as keys?
Because String is immutable and final, and it overrides equals() and hashCode(); the other wrapper classes share these characteristics. Immutability is necessary because, to compute hashCode(), the key's value must not change: if a key returned different hashcodes at put time and at get time, you could not find the object in the HashMap. Immutability also brings other advantages such as thread safety; if simply declaring a field final is enough to guarantee the hashCode stays unchanged, then do so. Since equals() and hashCode() are used when retrieving objects, it is very important to override both correctly: if unequal objects return different hashcodes, the chance of collision is smaller, which improves HashMap's performance.
Thank you for reading; I hope this article helps you.