Huffman encoding introduction
Huffman encoding deals with the coding pairing problem of characters and binary corresponding to characters, which is divided into encoding and decoding, with the purpose of compressing the binary data length corresponding to characters. We know that when characters are stored and transferred, they are binary (the computer only knows 0/1), so there is a mapping relationship between characters and binary. Characters belong to the character set (charset). Characters need to be stored and transmitted in binary through encode. When displayed, they need to be decoded back to characters. The character set and encoding methods are one-to-many relationships (Unicode can be encoded with UTF-8, UTF-16, etc.). After understanding the character set, encoding and decoding, the problem of garbled code flying all over the sky can be easily solved. Taking the lowercase a of the English letter as an example, in ASCII encoding, the decimal is 97 and the binary is 01100001. Each character in ASCII is encoded with 8 Bits (1Byte). If there are 1000 characters to be transmitted, then 8000 Bits must be transmitted. The problem is that the frequency of using the letter e in English is 12.702%, while z is 0.074%. The former is more than 100 times that of the latter, but it does use binary with the same digits. It can be done better, the method is variable-length encoding. The guiding principle is to encode the higher frequency with shorter digits, and those with low frequency with longer digits. The Huffman encoding algorithm deals with such problems.
Huffman encoding Java implementation
The data structures mainly used by the Huffman encoding algorithm are full binary trees and priority queues. The latter uses Java.util.PriorityQueue, and the former implements itself (all are internal classes). The code is as follows:
static class Tree { private Node root; public Node getRoot() { return root; } public void setRoot(Node root) { this.root = root; } } static class Node implements Comparable<Node> { private String chars = ""; private int frequency = 0; private Node parent; private Node leftNode; private Node rightNode; @Override public int compareTo(Node n) { return frequency - n.frequency; } public boolean isLeaf() { return chars.length() == 1; } public boolean isRoot() { return parent == null; } public boolean isLeftChild() { return parent != null && this == parent.leftNode; } public int getFrequence() { return frequency; } public void setFrequence(int frequency) { this.frequency = frequency; } public String getChars() { return chars; } public void setChars(String chars) { this.chars = chars; } public Node getParent() { return parent; } public void setParent(Node parent) { this.parent = parent; } public Node getLeftNode() { return leftNode; } public void setLeftNode(Node leftNode) { this.leftNode = leftNode; } public Node getRightNode() { return rightNode; } public void setRightNode(Node rightNode) { this.rightNode = rightNode; } } Statistics
Since the encoding table needs to be arranged according to the frequency, then of course, you must obtain the statistical information of the frequency. I implemented a method to deal with such a problem. If there is already statistics, then convert to Map<Character,Integer>. If the information you get is a percentage, multiply by 100 or 1000, or 10000. It can always be converted to an integer. For example, 12.702% multiplied by 1000 is 12702, and Huffman encoding only cares about size issues. The statistical method is implemented as follows:
public static Map<Character, Integer> statistics(char[] charArray) { Map<Character, Integer> map = new HashMap<Character, Integer>(); for (charc c : charArray) { Character character = new Character(c); if (map.containsKey(character)) { map.put(character, map.get(character) + 1); } else { map.put(character, 1); } } return map; } Building a tree
Building a tree is a core step in the Huffman encoding algorithm. The idea is to hang all characters on a leaf node of a completely binary tree, and the frequency of the left node of any non-page child node does not appear more than the right node. The algorithm converts statistics into Nodes and stores them in a priority queue. Each time two nodes with the minimum frequency pop up from the queue, a new parent Node (non-leaf node) is constructed. The sum of the two nodes whose character content just pops up, and the frequency is also their sum. The first pop-up is used as the left child node, and the next one is used as the right child node, and the newly constructed parent node is placed in the queue. Repeat the above actions N-1 times, N is the number of different characters (the number in the queue is reduced by 1 each time). End the above steps, there is a node left in the queue, and the root node of the tree pops up. The code is as follows:
private static Tree buildTree(Map<Character, Integer> statistics, List<Node> leaves) { Character[] keys = statistics.keySet().toArray(new Character[0]); PriorityQueue<Node> priorityQueue = new PriorityQueue<Node>(); for (Character character : keys) { Node node = new Node(); node.chars = character.toString(); node.frequency = statistics.get(character); priorityQueue.add(node); leaves.add(node); } int size = priorityQueue.size(); for (int i = 1; i <= size - 1; i++) { Node node1 = priorityQueue.poll(); Node node2 = priorityQueue.poll(); Node sumNode = new Node(); sumNode.chars = node1.chars + node2.chars; sumNode.frequency = node1.fraquence + node2.fraquence; sumNode.leftNode = node1; sumNode.rightNode = node2; node1.parent = sumNode; node2.parent = sumNode; priorityQueue.add(sumNode); } Tree tree = new Tree(); tree.root = priorityQueue.poll(); return tree; } coding
The corresponding encoding of a character is to search up from the leaf node where the character is located. If the character node is the left node of the parent node, add 0 before the encoded character, otherwise if it is the right node, add 1 until the root node. As long as the mapping relationship between the characters and binary code is obtained, the encoding is very simple. The code is as follows:
public static String encode(String originalStr, Map<Character, Integer> statistics) { if (originalStr == null || originalStr.equals("")) { return ""; } char[] charArray = originalStr.toCharArray(); List<Node> leafNodes = new ArrayList<Node>(); buildTree(statistics, leafNodes); Map<Character, String> encodInfo = buildEncodingInfo(leafNodes); StringBuffer buffer = new StringBuffer(); for (char c : charArray) { Character character = new Character(c); buffer.append(encodInfo.get(character)); } return buffer.toString(); } private static Map<Character, String> buildEncodingInfo(List<Node> leafNodes) { Map<Character, String> codewords = new HashMap<Character, String>(); for (Node leafNodes: leafNodes) { Character character = new Character(leafNode.getChars().charAt(0)); String codeword = ""; Node currentNode = leafNode; do { if (currentNode.isLeftChild()) { codeword = "0" + codeword; } else { codeword = "1" + codeword; } currentNode = currentNode.parent; } while (currentNode.parent != null); codewords.put(character, codeword); } return codewords; } decoding
Because the Huffman encoding algorithm can ensure that any binary code will not be the prefix of another code, decoding is very simple. Take out each bit of the binary in sequence, search down from the tree root, 1 to the right, 0 to the left, reach the leaf node (hit), and return to the root node and continue repeating the above actions. The code is as follows:
public static String decode(String binaryStr, Map<Character, Integer> statistics) { if (binaryStr == null || binaryStr.equals("")) { return ""; } char[] binaryCharArray = binaryStr.toCharArray(); LinkedList<Character> binaryList = new LinkedList<Character>(); int size = binaryCharArray.length; for (int i = 0; i < size; i++) { binaryList.addLast(new Character(binaryCharArray[i])); } List<Node> leafNodes = new ArrayList<Node>(); Tree tree = buildTree(statistics, leafNodes); StringBuffer buffer = new StringBuffer(); while (binaryList.size() > 0) { Node node = tree.root; do { Character c = binaryList.removeFirst(); if (c.charValue() == '0') { node = node.leftNode; } else { node = node.rightNode; } } while (!node.isLeaf()); buffer.append(node.chars); } return buffer.toString(); } Test and comparison
The following tests on the correctness of Huffman encoding (encoded first, then decoded, including Chinese), and comparison of Huffman encoding with common character encoding binary strings. The code is as follows:
public static void main(String[] args) { String oriStr = "Huffman codes compress data very effectively: savings of 20% to 90% are typical, " + "depending on the characteristics of the data being compressed. China's rise"; Map<Character, Integer> statistics = statistics(oriStr.toCharArray()); String encodedBinariStr = encode(oriStr, statistics); String decodedStr = decode(encodedBinariStr, statistics); System.out.println("Original string: " + oriStr); System.out.println("Huffman encoed binary string: " + encodedBinariStr); System.out.println("decoded string from binariy string: " + decodedStr); System.out.println("binary string of UTF-8: " + getStringOfByte(oriStr, Charset.forName("UTF-8"))); System.out.println("binary string of UTF-16: " + getStringOfByte(oriStr, Charset.forName("UTF-16"))); System.out.println("binary string of US-ASCII: " + getStringOfByte(oriStr, Charset.forName("US-ASCII"))); System.out.println("binary string of GB2312: " + getStringOfByte(oriStr, Charset.forName("GB2312"))); } public static String getStringOfByte(String str, Charset charset) { if (str == null || str.equals("")) { return ""; } byte[] byteArray = str.getBytes(charset); int size = byteArray.length; StringBuffer buffer = new StringBuffer(); for (int i = 0; i < size; i++) { byte temp = byteArray[i]; buffer.append(getStringOfByte(temp)); } return buffer.toString(); } public static String getStringOfByte(byte b) { StringBuffer buffer = new StringBuffer(); for (int i = 7; i >= 0; i--) { byte temp = (byte) ((b >> i) & 0x1); buffer.append(String.valueOf(temp)); } return buffer.toString(); }The above is all the content of this article. I hope it will be helpful to everyone's learning and I hope everyone will support Wulin.com more.