This article is a set of learning notes on the consistent hashing algorithm, together with a Java implementation of it.
The consistent hashing algorithm was proposed at MIT in 1997. Its design goal was to solve the hot spot problem on the Internet, and its original intent was very similar to that of CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm used by CARP, allowing DHTs to be truly practical in P2P environments.
Consistent hashing defines four properties that a hash algorithm should satisfy in a dynamically changing cache environment: balance, monotonicity, spread (dispersion), and load.
Balance means that hash results can be distributed across all caches as evenly as possible, so that all of the cache space is utilized. Many hash algorithms satisfy this condition.
Monotonicity means that if some content has already been mapped to a cache by hashing, and new caches are then added to the system, the hash result should ensure that this content is remapped either to its original cache or to one of the new caches, but never to a different cache in the old cache set.
Simple hash algorithms often fail to satisfy monotonicity; take the simplest linear hash as an example:
x → (ax + b) mod P
In the formula above, P represents the total number of caches. It is not difficult to see that when the number of caches changes (from P1 to P2), all of the original hash results change, so the requirement of monotonicity is not met.
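As a rough illustration (a sketch of the point above, not code from the original article), the following counts how many of 100 keys land on a different cache when the cache count grows from 4 to 5 under a simple modular hash:

public class ModuloRemap {
    public static void main(String[] args) {
        int moved = 0;
        for (int key = 0; key < 100; key++) {
            if (key % 4 != key % 5) { // cache count grows from 4 to 5
                moved++;
            }
        }
        System.out.println(moved + " of 100 keys were remapped"); // prints 80
    }
}

Only keys whose values agree modulo both 4 and 5 stay put, so the vast majority of keys move to a different cache.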
A change in hash results means that when the cache space changes, all mapping relationships in the system must be updated. In a P2P system, a cache change is equivalent to a peer joining or leaving the system, which happens frequently and would therefore bring enormous computation and transmission load. Monotonicity requires that the hash algorithm avoid this situation.
In a distributed environment, a terminal may not see all of the caches, only some of them. When a terminal wants to map content onto a cache through hashing, the range of caches seen by different terminals may differ, so their hash results are inconsistent, and in the end the same content is mapped into different caches by different terminals. This situation should clearly be avoided, because it causes the same content to be stored in multiple caches and reduces the storage efficiency of the system. Spread (dispersion) is defined as the severity of this situation; a good hash algorithm should avoid such inconsistencies as far as possible, that is, keep dispersion to a minimum.
The load requirement looks at the dispersion problem from the other side. Since different terminals may map the same content to different caches, a particular cache may likewise have different content mapped to it by different users. Like dispersion, this situation should be avoided, so a good hash algorithm should keep each cache's load as low as possible.
On the surface, consistent hashing targets the problem of distributed caching, but if you think of each cache as a peer in a P2P system and the mapped content as the various shared resources (data, files, media streams, and so on), you will find that the two are in fact descriptions of the same problem.
In the consistent hashing algorithm, each node (corresponding to a peer in the P2P system) is assigned a random ID. When mapping content to a node, the content's keyword is hashed to obtain a key value. Consistent hashing requires that key values and node IDs lie in the same value range. In the simplest case, key values and IDs are one-dimensional, for example the set of integers from 0000 to 9999.
When storing content by key value, the content is stored on the node whose ID is closest to the key value. For example, if the key value is 1001 and the system has nodes with IDs 1000, 1010, and 1100, the content is mapped to the node with ID 1000.
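A minimal sketch of this closest-ID rule, assuming plain integer IDs held in a TreeMap (the names here are illustrative, not from the original article):

import java.util.TreeMap;

public class ClosestNode {
    static Integer closest(TreeMap<Integer, String> nodes, int key) {
        Integer lo = nodes.floorKey(key);   // largest node ID <= key
        Integer hi = nodes.ceilingKey(key); // smallest node ID >= key
        if (lo == null) return hi;
        if (hi == null) return lo;
        return (key - lo <= hi - key) ? lo : hi; // pick whichever ID is nearer
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> nodes = new TreeMap<>();
        nodes.put(1000, "node-1000");
        nodes.put(1010, "node-1010");
        nodes.put(1100, "node-1100");
        // Key 1001 is closest to ID 1000, matching the example above.
        System.out.println(nodes.get(closest(nodes, 1001)));
    }
}

Note that the complete example at the end of this article uses the related successor convention (the first node clockwise from the key) rather than the strictly nearest node.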
To build the routes required for queries, consistent hashing requires each node to store the location information (IP address) of its uplink node (the node whose ID is the smallest of those larger than its own) and its downlink node (the node whose ID is the largest of those smaller than its own). When a node needs to find content, it can decide from the content's key value whether to send the query request to its uplink or downlink node. If a node receiving the query finds that it holds the requested target, it returns a confirmation directly to the node that initiated the query; if the key does not fall within its own range, it forwards the request to its uplink or downlink node.
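A hypothetical sketch of this forwarding rule (RingNode and lookup are illustrative names, not from the original article); each hop moves the query toward the node whose ID is closest to the key:

// Nodes sit on a line of increasing IDs; a real DHT closes the line into a
// ring and handles wraparound, which this sketch omits for brevity.
class RingNode {
    final int id;
    RingNode uplink;   // neighbour with the next larger ID
    RingNode downlink; // neighbour with the next smaller ID

    RingNode(int id) {
        this.id = id;
    }

    // Forward the query hop by hop until no neighbour is closer to the key;
    // in the worst case this takes O(N) hops, as discussed below.
    RingNode lookup(int key) {
        if (uplink != null && distance(uplink.id, key) < distance(id, key)) {
            return uplink.lookup(key);   // key lies above us: forward upward
        }
        if (downlink != null && distance(downlink.id, key) < distance(id, key)) {
            return downlink.lookup(key); // key lies below us: forward downward
        }
        return this;                     // we are the node closest to the key
    }

    private static int distance(int nodeId, int key) {
        return Math.abs(nodeId - key);
    }
}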
In order to maintain this routing information, adjacent nodes must update their routing tables promptly when a node joins or leaves the system. This requires each node to store not only the location of its directly connected downlink node, but also information about indirect downlink nodes up to a certain depth (n hops), and to maintain this node list dynamically. When a node leaves the system, its uplink node tries to connect directly to the nearest downlink node; once the connection succeeds, it obtains the downlink node list from its new downlink node and updates its own list. Similarly, when a new node joins the system, it first finds its downlink node according to its own ID and obtains the downlink node list from it, then asks its uplink node to update its downlink node list, thereby restoring the routing relationships.
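Continuing the hypothetical RingNode sketch above, a join is then essentially a doubly linked insert between the newcomer's downlink and its old uplink:

class RingJoin {
    static void join(RingNode newcomer, RingNode downlink) {
        RingNode uplink = downlink.uplink; // known from the maintained node list
        newcomer.downlink = downlink;
        newcomer.uplink = uplink;
        downlink.uplink = newcomer;        // downlink's new uplink is the newcomer
        if (uplink != null) {
            uplink.downlink = newcomer;    // uplink now points down at the newcomer
        }
    }
}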
Discussion
Consistent hashing essentially solves the most critical problem in a P2P environment: how to distribute storage and routing across a dynamic network topology. Each node only needs to maintain information about a small number of neighboring nodes, and when a node joins or leaves the system, only a small number of related nodes take part in maintaining the topology. All of this made consistent hashing the first practical DHT algorithm.
However, the routing algorithm of consistent hashing still has shortcomings. During a query, the message must pass through O(N) hops (O(N) denotes proportionality to N, where N is the total number of nodes in the system) before it reaches the node holding the target. It is not hard to imagine that when the system is very large, with node counts possibly exceeding one million, such query efficiency is clearly insufficient for practical use. Seen from another angle, even if users could tolerate long delays, the large number of messages generated during queries would place unnecessary load on the network.
package herotrix;

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

// Hash abstraction the ring relies on. (The original article assumed this
// interface; it is added here so the example compiles.)
interface HashFunction {
    int hash(Object key);
}

public class ConsistentHash<T> {

    // Hash algorithm
    private final HashFunction hashFunction;
    // Number of virtual nodes per physical node
    private final int numberOfReplicas;
    // The hash ring: virtual-node hash -> physical node
    private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
                          Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
            add(node);
        }
    }

    // Place numberOfReplicas virtual nodes for this physical node on the ring.
    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hash(node.toString() + i), node);
        }
    }

    // Remove all of this node's virtual nodes from the ring.
    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hash(node.toString() + i));
        }
    }

    // Key lookup algorithm.
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        // Calculate the key's hash value.
        int hash = hashFunction.hash(key);
        // If no virtual node sits exactly at this hash, take the next one
        // clockwise (tailMap); wrap around to the first entry if necessary.
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}

The above is the entire content of these learning notes on the consistent hashing algorithm in Java; I hope it is helpful. A short usage sketch follows.
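As a quick usage sketch (the FNV-1a lambda below is an assumption; the original article does not show a concrete HashFunction implementation):

package herotrix; // same package as the class above

import java.util.Arrays;

public class ConsistentHashDemo {
    public static void main(String[] args) {
        // 32-bit FNV-1a: a simple, well-mixed hash (an assumed choice; any
        // HashFunction would do here).
        HashFunction fnv = key -> {
            int h = 0x811c9dc5;
            for (byte b : key.toString().getBytes()) {
                h ^= b;
                h *= 0x01000193;
            }
            return h & 0x7fffffff; // keep the value non-negative for readability
        };
        // Three hypothetical cache nodes, each with 100 virtual nodes.
        ConsistentHash<String> ring = new ConsistentHash<>(
                fnv, 100, Arrays.asList("cache-A", "cache-B", "cache-C"));
        String node = ring.get("some-content-key");
        System.out.println("stored on " + node);
        // Removing a node only remaps the keys that node owned.
        ring.remove(node);
        System.out.println("now stored on " + ring.get("some-content-key"));
    }
}

With many virtual nodes per physical node, keys spread fairly evenly across the ring, and removing one node only remaps the keys that node owned, which is exactly the monotonicity property discussed earlier.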