We have already analyzed ArrayList and LinkedList. We know that ArrayList is backed by an array while LinkedList is backed by a linked list, and each has its own strengths and weaknesses: ArrayList outperforms LinkedList at random access by index, while LinkedList outperforms ArrayList at insertion and removal. The HashMap introduced in this article combines the advantages of both. Its underlying storage is a hash table, and if hash conflicts are ignored, the time complexity of HashMap's insert, delete, update, and lookup operations can reach an astonishing O(1). Let's first look at the structure of the hash table it is built on.
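Before diving into the internals, here is a minimal usage sketch of that constant-time behavior (the keys and values are made up for the example):

import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<String, Integer>();
        ages.put("Tom", 23);                 // insert: O(1) on average
        ages.put("Jerry", 25);
        ages.put("Tom", 24);                 // update: O(1) on average, replaces the old value
        System.out.println(ages.get("Tom")); // lookup: O(1) on average, prints 24
        ages.remove("Jerry");                // delete: O(1) on average
    }
}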
As can be seen from the figure above, a hash table is a structure that combines an array with linked lists. Of course, the figure shows a poor example: a good hash function should distribute elements across the array as evenly as possible, reducing hash conflicts and keeping the linked lists short. The longer a linked list is, the more nodes must be traversed during a lookup and the worse the hash table performs. Next, let's look at some of HashMap's member variables.
    //Default initial capacity
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;
    //Default maximum capacity
    static final int MAXIMUM_CAPACITY = 1 << 30;
    //Default load factor, i.e. how full the hash table is allowed to get
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    //Empty hash table
    static final Entry<?,?>[] EMPTY_TABLE = {};
    //The actual hash table
    transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;
    //HashMap size, that is, the number of key-value pairs stored in the HashMap
    transient int size;
    //The key-value pair threshold, used to decide whether the hash table needs to grow
    int threshold;
    //The load factor
    final float loadFactor;
    //The modification count, used by the fail-fast mechanism
    transient int modCount;
    //The default threshold for switching to the alternative hash
    static final int ALTERNATIVE_HASHING_THRESHOLD_DEFAULT = Integer.MAX_VALUE;
    //A random hash seed that helps reduce the number of hash collisions
    transient int hashSeed = 0;

As can be seen from the member variables, the default initial capacity of HashMap is 16 and the default load factor is 0.75. threshold is the maximum number of key-value pairs the collection can hold before growing; by default it is initial capacity * load factor, that is, 16 * 0.75 = 12. When the number of key-value pairs is about to exceed the threshold, the hash table is considered saturated: continuing to add elements would increase hash conflicts and degrade HashMap's performance, so the automatic resizing mechanism is triggered to keep HashMap performing well. We can also see that the hash table is really an Entry array, and each Entry stored in the array is the head node of a singly linked list. Entry is a static inner class of HashMap; let's take a look at its member variables.
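A quick sketch of how the default threshold falls out of those constants (the class and variable names here are mine, not HashMap's):

public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 1 << 4;         // DEFAULT_INITIAL_CAPACITY = 16
        float loadFactor = 0.75f;      // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12: adding a 13th mapping can trigger a resize
    }
}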
    static class Entry<K,V> implements Map.Entry<K,V> {
        final K key;        //Key
        V value;            //Value
        Entry<K,V> next;    //Reference to the next Entry
        int hash;           //Hash code
        ...                 //Omit the following code
    }

An Entry instance is a key-value pair containing a key and a value. Each Entry instance holds a reference to the next Entry instance, and to avoid repeated calculation it also caches the key's hash code. It can be said that the Entry array is the core of HashMap: every operation works against this array. Since the HashMap source code is rather long, it is impossible to cover all of its methods comprehensively, so we will focus only on the key points. Next, we will be problem-oriented and explore HashMap's internal mechanisms in depth through the following questions.
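To make the bucket structure concrete, here is a minimal, self-contained sketch of a singly linked bucket node and a lookup walk over it (a simplified imitation for illustration, not the JDK's code):

public class BucketDemo {
    //A stripped-down stand-in for HashMap.Entry
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next; //next node in the same bucket
        Node(K key, V value, Node<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    //Walk one bucket's list looking for a key, as HashMap does after locating the slot
    static <K, V> V find(Node<K, V> head, K key) {
        for (Node<K, V> e = head; e != null; e = e.next) {
            if (key.equals(e.key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        //Two colliding entries chained into one bucket (new node becomes the head)
        Node<String, Integer> bucket = new Node<String, Integer>("a", 1, null);
        bucket = new Node<String, Integer>("b", 2, bucket);
        System.out.println(find(bucket, "a")); //1
    }
}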
1. What operations does HashMap do during construction?
    //Constructor that takes an initial capacity and a load factor
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        }
        //If the initial capacity is greater than the maximum capacity, cap it at the maximum
        if (initialCapacity > MAXIMUM_CAPACITY) {
            initialCapacity = MAXIMUM_CAPACITY;
        }
        //If the load factor is not positive or is not a number (NaN), throw an exception
        if (loadFactor <= 0 || Float.isNaN(loadFactor)) {
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
        }
        //Set the load factor
        this.loadFactor = loadFactor;
        //The threshold is set to the initial capacity
        threshold = initialCapacity;
        init();
    }

    void init() {}

All constructors end up calling this one. In this constructor we see that, apart from validating the parameters, it does only two things: it stores the given load factor, and it sets the threshold to the given initial capacity. The init method is empty and does nothing. Note that at this point no Entry array has been created from the given initial capacity. So when is the array created? Keep reading.
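For completeness, here is how those constructor parameters are typically supplied in practice (the capacity value below is arbitrary):

import java.util.HashMap;
import java.util.Map;

public class CtorDemo {
    public static void main(String[] args) {
        //Ask for room for ~100 mappings; note the table itself is not allocated yet
        Map<String, String> m = new HashMap<String, String>(100, 0.75f);
        m.put("k", "v");              //the first put is what actually inflates the table
        System.out.println(m.size()); //1
    }
}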
2. What operations will be performed when HashMap adds key-value pairs?
    //Put a key-value pair into the HashMap
    public V put(K key, V value) {
        //Initialize the hash table if it is still the empty table
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null) {
            return putForNullKey(value);
        }
        //Calculate the hash code of the key
        int hash = hash(key);
        //Locate the slot in the hash table from the hash code
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            //If the key already exists, replace its value and return the old value
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        //If the key does not exist yet, add a new Entry to the HashMap
        addEntry(hash, key, value, i);
        //Return null
        return null;
    }

You can see that when adding a key-value pair, put first checks whether the hash table is still the empty table and initializes it if so. It then calls the hash function to compute the hash code of the given key, locates the corresponding slot of the Entry array from that hash code, and traverses the singly linked list in that slot. If the given key already exists, its value is replaced; otherwise a new Entry is created and added to the hash table.
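The return-value behavior described above is easy to observe directly; a small sketch:

import java.util.HashMap;
import java.util.Map;

public class PutDemo {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        System.out.println(m.put("x", 1)); //null: the key was absent
        System.out.println(m.put("x", 2)); //1: the old value is returned on replacement
        System.out.println(m.get("x"));    //2
    }
}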
3. How is the hash table initialized?
    //Initialize the hash table; the capacity may grow here, because the given size may not be a power of 2
    private void inflateTable(int toSize) {
        //The capacity of the hash table must be a power of 2
        int capacity = roundUpToPowerOf2(toSize);
        //Set the threshold; this is normally capacity * loadFactor
        threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
        //Create a new hash table with the computed capacity
        table = new Entry[capacity];
        //Initialize the hash seed
        initHashSeedAsNeeded(capacity);
    }

As we saw above, constructing a HashMap does not create the Entry array; instead, the put operation checks whether the current hash table is the empty table and, if so, calls the inflateTable method shown above to initialize it. You can see that the capacity of the Entry array is recalculated inside this method: the initial size passed to the constructor may not be a power of 2, so it must be rounded up to one before the new Entry array is created with that capacity. When the hash table is initialized, the threshold is also recomputed, normally as capacity * loadFactor. In addition, the hash seed (hashSeed) is initialized at this point. The hashSeed is used to improve the hash function; it defaults to 0, meaning the alternative hash algorithm is not used, but you can configure it yourself for better hashing. This is discussed in detail below.
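Here is a sketch of one way to round a requested size up to a power of two, in the spirit of roundUpToPowerOf2 (my own standalone helper, shown under the assumption that the JDK uses an equivalent bit trick):

public class PowerOfTwoDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    //Round a requested size up to the next power of two, capped at MAXIMUM_CAPACITY
    static int roundUpToPowerOf2(int number) {
        if (number >= MAXIMUM_CAPACITY) {
            return MAXIMUM_CAPACITY;
        }
        return (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOf2(12)); //16
        System.out.println(roundUpToPowerOf2(16)); //16: already a power of two
        System.out.println(roundUpToPowerOf2(17)); //32
    }
}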
4. When does HashMap determine whether it needs to expand capacity and how it expands capacity?
    //Add an Entry, first deciding whether the table needs to grow
    void addEntry(int hash, K key, V value, int bucketIndex) {
        //If the size of the HashMap has reached the threshold and the target slot is not empty
        if ((size >= threshold) && (null != table[bucketIndex])) {
            //The threshold has been reached and a collision is about to occur, so grow the table
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }
        //Reaching here means no resize was needed; create the new Entry
        createEntry(hash, key, value, bucketIndex);
    }

    //Grow the hash table
    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        //If the maximum capacity is already in use, all we can do is raise the threshold
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }
        //Otherwise, grow the table
        Entry[] newTable = new Entry[newCapacity];
        //Migrate the old hash table into the new one
        transfer(newTable, initHashSeedAsNeeded(newCapacity));
        //Swap in the new hash table
        table = newTable;
        //Update the threshold
        threshold = (int) Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

When the put method adds a key-value pair whose key is not yet in the collection, it calls the addEntry method to create a new Entry. As the code above shows, before creating the new Entry, addEntry first checks whether the number of elements has reached the threshold; if it has (and the target slot is occupied), resize is called with twice the current table length. Inside resize, a new Entry array twice the size of the old hash table is created, and all elements of the old table are migrated into it. A rehash may happen during this migration, depending on the value returned by the initHashSeedAsNeeded method. Once the migration is complete, the current hash table is replaced with the new one, and finally the threshold is recomputed from the new capacity.
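A worked numeric sketch of the growth sequence under the defaults (my own illustration of the formulas above, ignoring the MAXIMUM_CAPACITY cap):

public class ResizeMathDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f;
        int capacity = 16; //default initial capacity
        //Show how capacity and threshold evolve across successive doublings
        for (int i = 0; i < 4; i++) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity=" + capacity + " threshold=" + threshold);
            capacity *= 2; //what resize(2 * table.length) does
        }
        //Prints capacity/threshold pairs: 16/12, 32/24, 64/48, 128/96
    }
}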
5. Why must the size of the Entry array be a power of 2?
    //Return the array index corresponding to the hash code
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

The indexFor method computes the array index from the hash code. We can see that it uses the bitwise AND (&) operator, which compares the two operands bit by bit: a result bit is 1 only if both corresponding bits are 1, and 0 otherwise. AND is often used to mask off the high bits of an operand, for example: 01011010 & 00001111 = 00001010. Back in the code, let's see what h & (length - 1) does.
The length passed in is the length of the Entry array. Array subscripts start at 0, so the maximum valid index is length - 1. If length is a power of 2, then the binary representation of length - 1 is all 1s in its low bits. In that case, h & (length - 1) masks off the high bits of h and keeps only its low bits as the array index. This is why the size of the Entry array is required to be a power of 2: it makes this cheap bit trick valid for computing the index.
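A concrete sketch of the masking trick for a table of length 16 (the hash values are chosen arbitrarily):

public class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1); //the same bit trick as HashMap
    }

    public static void main(String[] args) {
        int length = 16; //power of 2, so length - 1 = 0b1111
        System.out.println(indexFor(0b01011010, length)); //0b1010 = 10
        System.out.println(indexFor(0b11110011, length)); //0b0011 = 3
        //Equivalent to h % length for non-negative h, without the cost of a division
        System.out.println(0b01011010 % length);          //10
    }
}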
6. How does hash function calculate the Hash code?
    //The function that generates the hash code
    final int hash(Object k) {
        int h = hashSeed;
        //If the hash seed is set and the key is a String, use a different hash algorithm
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h ^= k.hashCode();
        //The perturbation function
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

The last two lines of the hash method are what really compute the hash value; this part of the algorithm is called the perturbation function. The job of the perturbation function is to mix the bits together: you can see that four unsigned right shifts are used here, whose purpose is to fold the high bits of h into the low bits and increase the randomness of the low bits. As we saw above, the array index is determined only by the low bits of the hash code. The key's hash code comes from its hashCode method, and a bad hashCode implementation may produce hash codes whose low bits repeat heavily. To map hash codes onto the array relatively uniformly, the perturbation function mixes the characteristics of the high bits into the low bits, increasing the randomness of the low bits, spreading the hashes out more evenly, and thereby improving performance. The figure below gives an example to help understand this.
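To see why the perturbation matters, consider two hash codes that differ only in their high bits: without mixing they land in the same bucket, while the perturbed versions are spread apart (the sample values are arbitrary, and hashSeed is assumed to be 0):

public class PerturbDemo {
    //The same two-line mixing step as the hash method above
    static int perturb(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int length = 16;
        int h1 = 0x10000001;
        int h2 = 0x20000001; //differs from h1 only in the high bits
        //Without perturbation both map to index 1: a guaranteed collision
        System.out.println((h1 & (length - 1)) + " " + (h2 & (length - 1)));
        //After perturbation the high-bit difference reaches the low bits
        System.out.println((perturb(h1) & (length - 1)) + " "
                + (perturb(h2) & (length - 1)));
    }
}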
7. What's going on with the replacement hash?
We saw that the hash method first assigns hashSeed to h. This hashSeed is a hash seed, a random value used to improve the hash function. It defaults to 0, which means the alternative hash algorithm is not used by default. So when is hashSeed used? First, alternative hashing must be enabled by setting the jdk.map.althashing.threshold system property. This property defaults to -1, and when it is -1 the threshold for switching to alternative hashing is Integer.MAX_VALUE, which effectively means alternative hashing may never be used. Of course, you can set this threshold lower, so that once the hash table's capacity reaches it, a random hashSeed is generated, increasing the randomness of the hash function. Why use alternative hashing at all? When the capacity reaches the threshold you set, the hash table is relatively saturated and the likelihood of hash conflicts rises sharply. Switching to a more random hash function at that point spreads newly added elements more randomly across the hash table.
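As a rough sketch, enabling it looks like this (the threshold value 512 is an arbitrary choice for illustration, and the behavior described is specific to JDK 1.7):

public class AltHashingDemo {
    public static void main(String[] args) {
        //Typically enabled on the command line before the JVM boots:
        //  java -Djdk.map.althashing.threshold=512 AltHashingDemo
        //Once a table's capacity reaches 512, a random hashSeed is generated
        //and String keys go through the alternative hash.
        //Prints null unless the property was set:
        System.out.println(System.getProperty("jdk.map.althashing.threshold"));
    }
}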
Note: all of the above analysis is based on JDK 1.7. The implementation changes significantly between versions, so readers should take note.