Why does a hashtable bucket take a prime number?
Suppose we have a hash function
H(c) = c % N;
When N is composite, the simplest case is N = 2^n; for example, take N = 2^3 = 8. Then
H(11100(binary)) = H(28) = 4
H(10100(binary)) = H(20) = 4
Here every bit of c above the 3rd (counting from the right) "fails": no matter what value those higher bits take, H(c) is unchanged. In the example, 28 and 20 differ only in the 4th bit, yet they land in the same bucket. Those bits of c do not participate in the computation of H(c) at all, so H(c) cannot fully reflect the characteristics of c, which increases the chance of collision.
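As a quick check, here is a minimal Python sketch using the two keys from the example above. With N = 2^3, taking the remainder keeps only the lowest 3 bits, so keys that differ only in higher bits must collide:

```python
# With N = 2^3 = 8, c % N keeps only the lowest 3 bits of c.
N = 8
c1 = 0b11100  # 28
c2 = 0b10100  # 20, differs from c1 only in the 4th bit

print(c1 % N)  # 4
print(c2 % N)  # 4 -- same bucket: the 4th bit never influenced the result
```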
With other composite numbers, some bits of c are similarly "disabled" to varying degrees, which causes collisions in many common applications.
Taking a prime number, however, essentially ensures that every bit of c participates in the computation of H(c), thereby reducing the chance of collision in common applications.
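Reusing the same two keys with a nearby prime modulus (N = 7, chosen here just for illustration) shows the higher bits influencing the result again:

```python
# Same keys as before, but with a prime modulus N = 7:
# the bits above the 3rd now affect the bucket.
N = 7
print(28 % N)  # 0
print(20 % N)  # 6 -- different buckets this time
```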
(Personal opinion: sometimes a non-prime modulus is not much worse in practice... but a prime is undoubtedly safer...)
The above is my understanding
To add to this: in common applications, data items are often similar to one another, and that is exactly when a prime modulus helps. For example, the data being stored may be in a compressed form, such as a table describing the current search state; without a prime modulus, the probability of collision there is relatively high.
If the keys were uniformly distributed random integers, then any sufficiently large modulus would perform about the same, but that situation rarely arises in practice.
What you describe is a special situation. When a large prime N is chosen as the modulus, information can only be "lost" in a single digit of the key's base-N representation. Given how computers represent data, base-N digits rarely carry any special structure, whereas the commonly used base-2^n digits do, so collisions are avoided.
In fact, I tested this with some large moduli, storing an adjacency matrix compressed into binary. When the modulus is large enough, even a composite number can perform very close to a prime, but for some composite numbers (several dozen of them) efficiency drops severely, so primes are the safer choice.
You might as well run your own experiments. Don't choose random integers; instead, take keys from common applications and test both primes and composites, measuring mainly the average load factor. The conclusion you reach may well match mine: composites are fine most of the time, but a few composites perform surprisingly poorly, while almost all primes give good results.
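A sketch of such an experiment, under assumed conditions: the keys here are multiples of 8 (a pattern like aligned addresses or packed bit-fields), and the "load" measured is the maximum bucket occupancy rather than the average load factor, since it makes the contrast starkest:

```python
from collections import Counter

def max_load(keys, n):
    """Maximum bucket occupancy when hashing keys into n buckets via key % n."""
    counts = Counter(k % n for k in keys)
    return max(counts.values())

# Patterned keys, as in common applications: 1000 multiples of 8.
keys = [8 * i for i in range(1000)]

# Composite modulus sharing a factor with the key pattern: only 128 of the
# 1024 buckets are ever used, so some buckets pile up 8 entries.
print(max_load(keys, 1024))  # 8

# Prime modulus of similar size: gcd(8, 1021) = 1, and with fewer keys
# than buckets every key lands in its own bucket.
print(max_load(keys, 1021))  # 1
```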
I personally think that, in a more general sense, not taking a prime carries some risk. The danger arises when a composite m = x*y is chosen as the bucket count and the hash keys happen to be correlated with the divisor x. In the worst case, every key is a multiple of x, and then the hash values can only be the multiples of x (0, x, 2x, ..., (y-1)x): just y distinct buckets instead of all m. If the bucket count is prime, this problem cannot occur.
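This worst case is easy to demonstrate. A minimal sketch with assumed values x = 4, y = 25 (so m = 100) and keys that are all multiples of x:

```python
# Composite bucket count m = x * y, with every key a multiple of x.
x, y = 4, 25
m = x * y  # 100 buckets
keys = [x * t for t in range(1000)]

# Only the multiples of x are reachable: y = 25 buckets out of m = 100.
print(len({k % m for k in keys}))  # 25

# A prime bucket count of similar size leaves every bucket reachable.
p = 101
print(len({k % p for k in keys}))  # 101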
Thank you for reading, I hope it can help you. Thank you for your support for this site!