The principle of the Bloom filter is very simple: it is to hash a string into an integer key, and then select a very long bit sequence, which starts with 0, and change 0 at this position to 1 in the key; next time a string comes in, the value key after hash, and if the value on this bit is also 1, it means that the string exists.
If you follow the above method, it will be no different from the hash algorithm, and there are still duplications of the hash algorithm.
The Bloom filter hash a string into multiple keys, so I'll just follow the book.
First create a 1.6 billion binary constant, and then set all 1.6 billion binary bits to zero. For each string, 8 different random generators (F1, F2,..., F8) are used to generate 8 information fingerprints (f1, f2,..., f8). Then, a random number generator G is used to map these eight information fingerprints to 8 natural numbers g1, g2,..., g8 in 1 to 1.6 billion. Now change all the binary bits in these 8 positions to 1. This way a Bloom filter is built.
So how to detect whether a string already exists?
Now use 8 random number generators (F1, F2,..., F8) to generate 8 information fingerprints s1, s2,..., s8 for this string, and then correspond these 8 information fingerprints to the 8 binary bits of the Bloom filter, namely T1, T2,..., T8. If the string exists, then obviously the binary bits corresponding to T1, T2,..., T8 should be 1. This is how to determine whether a string already exists.
In fact, the Bloom filter is an extension of the hash algorithm. Since it is essentially a hash, there will definitely be shortcomings. In other words, there will definitely be misjudgments. A string has not appeared, but the Bloom filter judgment has appeared. Although the possibility is very small, it does exist.
So how to reduce this probability? First of all, it can be imagined that if 8 information fingerprints are extended to 16 errors, the probability will definitely be reduced, but it should also be considered that in this way, the number of strings that a Bloom filter can store is also reduced by 1 times; in addition, select a good hash function, and there are many types of hash methods for strings, including very good hash functions.
Bronze filter is mainly used to filter malicious URLs. All malicious URLs are built on a Bronze filter, and then the user is accessed by the URL. If it is in a malicious URL, the user will be notified. In this way, we can also set a whitelist for some URLs that often have errors in judgment, and then match the URLs that are judged to exist and the URLs in the whitelist. If they are in the whitelist, they will be released. Of course, this whitelist cannot be too big, nor is it too big, and the probability of a Bloom filter error is very small. Interested readers can check the error rate of the Bloom filter.
The following is the source code of the Java version of Bloom filter:
import java.util.BitSet; /** * * @author xkey */ public class BloomFilter { private static final int DEFAULT_SIZE = 2 << 24;//Bit length of the Bloom filter private static final int[] seeds = {3,5,7, 11, 13, 31, 37, 61};//The prime number here can be selected to reduce the error rate very well private static BitSet bits = new BitSet(DEFAULT_SIZE); private static SimpleHash[] func = new SimpleHash[seeds.length]; public static void addValue(String value) { for(SimpleHash f : func)// hash the string value into 8 or more integers, and then change to 1 on the bits of these integers bits.set(f.hash(value),true); } public static void add(String value) { if(value != null) addValue(value); } public static boolean contains(String value) { if(value == null) return false; boolean ret = true; for(SimpleHash f : func)//In fact, there is no need to run all of them here. Just ret==false once, then the string will not be included. ret = ret && bits.get(f.hash(value)); return ret; } public static void main(String[] args) { String value = "www.VeVB.COM"; for (int i = 0; i < seeds.length; i++) { func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]); } add(value); System.out.println(contains(value)); } } class SimpleHash {//This thing is equivalent to the structure in C++ private int cap; private int seed; public SimpleHash(int cap, int seed) { this.cap = cap; this.seed = seed; } public int hash(String value) {//Stand hash, it is very important to select a good hash function int result = 0; int len = value.length(); for (int i = 0; i < len; i++) { result = seed * result + value.charAt(i); } return (cap - 1) & result; } } Summary: Bloom filter is an innovation in hashing algorithms, and it also consumes very little space and has a low error rate. In short, this innovative idea is worth learning and is a use of data type like bit.
The Java implementation method of Bloom Filter is all the content I have shared with you. I hope you can give you a reference and I hope you can support Wulin.com more.