This article describes the bidirectional matching word segmentation algorithm implemented by Java. Share it for your reference, as follows:
Several popular word segmentation algorithms are: word segmentation method based on string matching, word segmentation method based on understanding, and word segmentation method based on statistics. This article uses the string matching method.
Forward maximum matching participle:
This algorithm is implemented based on word participle dictionary, which performs segmentation matching from the left side of the string. If the dictionary exists, it returns the divided word and cuts the word from the previous string, and loops to cut until the string size is 0.
For example: str = "We are all students from the School of Information Engineering, Northwest A&F University.", (Suppose we define the maximum cutting length max = 8, that is, eight Chinese characters)
1. Define the word participle length len = max, take out the len characters from the left word = "We are all northwest agriculture and forestry" and match word in the dictionary;
2. If there is no such word in the dictionary, then remove the last character and assign it to word, and the len value is reduced by 1;
3. Repeat step 2 until the word is found in the dictionary or len = 1, exit the loop;
4. Cut off the divided words from str and repeat steps 1, 2, and 3 until str is decomposed.
Reverse maximum matching participle:
Like the forward word segmentation algorithm, it simply starts from the right side of the string until the length of the string is 0. I won't go into details here.
Two-way matching participle:
This method is to make ambiguity participle for segmenting ambiguity on the basis of forward participle and reverse participle. Propose the implementation of the "snake-eating algorithm". The string to perform word segmentation is food. There are 2 gluttonous snakes, one eats from left to right; the other eats from right to left. As long as the left and right participle results are ambiguous, the two snakes will bite it. The string that the snake eats becomes participle. If there is always ambiguity between left and right participles, the two snakes will keep eating. When two snakes eat the intersect between the strings, there will definitely be no ambiguity. At this time, the participle in the greedy snake belly on the left, there is no ambiguity in the middle, and the participle in the greedy snake belly on the right, three of which are the final participle.
This article is to divide the entire paragraph into words. First, the paragraph is divided into sentences based on punctuation marks, and then each sentence is output to divide the word.
package cn.nwsuaf.spilt;import java.io.BufferedReader;import java.io.FileReader;import java.io.IOException;import java.util.ArrayList;import java.util.HashMap;import java.util.List;import java.util.Map;import java.util.Scanner;/** * Implementation based on the forward and reverse bidirectional maximum matching word segmentation algorithm* @author Liu Yonglang* */public class WordSpilt { /** * Storage word segmentation dictionary*/ private Map<String, Integer> map = new HashMap<String, Integer>(); /** * The maximum word-cut length is five Chinese characters*/ private int MAX_LENGTH = 5; /** * Read the word participle dictionary in the construction method and store it in the map* * @throws IOException */ public WordSpilt() throws IOException { BufferedReader br = new BufferedReader(new FileReader("src/dict.txt")); String line = null; while ((line = br.readLine()) != null) { line = line.trim(); map.put(line, 0); } br.close(); } /** * Set the maximum word cut length* * @param max * Maximum word cut length*/ public void setMaxLength(int max) { this.MAX_LENGTH = max; } /** * Get the current maximum word cut length, default is 5 (5 Chinese characters) * * @return Current maximum word cut length*/ public int getMaxLength() { return this.MAX_LENGTH; } /** * Maximum matching word cut algorithm* * @param spiltStr * String to be split* @param leftToRight * The slicing direction, true is from left to right, false is the string divided from right to left * @return */ public List<String> spilt(String spiltStr, boolean leftToRight) { // If the slicing string is empty, return empty if (spiltStr.isEmpty()) return null; // Store the positive matching split string List<String> leftWords = new ArrayList<String>(); // Store the negative matching split string List<String> rightWords = new ArrayList<String>(); // String for slicing String word = null; // Take the length of the word, initialize to the maximum value int wordLength = MAX_LENGTH; // is in the current position of the string int position = 0; // The length of the string has been processed int length = 0; // Remove the extra spaces in the string spiltStr = spiltStr.trim().replaceAll("//s+", ""); // When the string to be sliced is not sliced, loop segmentation while (length < spiltStr.length()) { // If the length of the string that has not been sliced is less than the maximum value, let the word length equal to the length of the word itself if (spiltStr.length() - length < MAX_LENGTH) wordLength = spiltStr.length() - length; // Otherwise, take the default value else wordLength = MAX_LENGTH; // If it is a forward maximum match, start cutting from the position of spiltStr if (leftToRight) { position = length; word = spiltStr.substring(position, position + wordLength); } // If it is an inverse maximum match, start cutting from the end of spiltStr else { position = spiltStr.length() - length; word = spiltStr.substring(position - wordLength, position); } // Start cutting the string of the specified length from the current position // word = spiltStr.substring(position, position + wordLength); // If there is no string cut in the word participle dictionary, discard a character while (!map.containsKey(word)) { // If it is a single word, exit the loop if (word.length() == 1) { // If it is a letter or number, divide the continuous letters or numbers together if (word.matches("[a-zA-z0-9]")) { // If it is a forward match, directly loop to add up the subsequent continuous characters if (leftToRight) { for (int i = spiltStr.indexOf(word, position) + 1; i < spiltStr .length(); i++) { if ((spiltStr.charAt(i) >= '0' && spiltStr.charAt(i) <= '9') || (spiltStr.charAt(i) >= 'A' && spiltStr .charAt(i) <= 'Z') || (spiltStr.charAt(i) >= 'a' && spiltStr .charAt(i) <= 'z')) { word += spiltStr.charAt(i); } else break; } } else { // If it is a reverse matching, add up and flip the continuous numbers and alphabetical characters before the current position for (int i = spiltStr.indexOf(word, position - 1) - 1; i >= 0; i--) { if ((spiltStr.charAt(i) >= '0' && spiltStr.charAt(i) <= '9') || (spiltStr.charAt(i) >= 'A' && spiltStr.charAt(i) <= 'Z') || (spiltStr.charAt(i) >= 'a' && spiltStr.charAt(i) <= 'z')) { word += spiltStr.charAt(i); if (i == 0) { StringBuffer sb = new StringBuffer(word); word = sb.reverse().toString(); } } else { // Flip operation StringBuffer sb = new StringBuffer(word); word = sb.reverse().toString(); break; } } } } break; } // If it is a forward maximum match, discard the last character if (leftToRight) word = word.substring(0, word.length() - 1); // Otherwise discard the first character else word = word.substring(1); } // Save the split string to the specified table if (leftToRight) leftWords.add(word); else rightWords.add(word); // Processed strings add length += word.length(); } // If it is an inverse maximum match, adjust the string in the table to forward if (!leftToRight) { for (int i = rightWords.size() - 1; i >= 0; i--) { leftWords.add(rightWords.get(i)); } } // Return the slicing result return leftWords; } /** * Determine whether the two sets are equal* * @param list1 * Set 1 * @param list2 * Set 2 * @return Return true if equal, otherwise false */ public boolean isEqual(List<String> list1, List<String> list2) { if (list1.isEmpty() && list2.isEmpty()) return false; if (list1.size() != list2.size()) return false; for (int i = 0; i < list1.size(); i++) { if (!list1.get(i).equals(list2.get(i))) return false; } return true; } /** * Discriminant participle ambiguity function* * @param inputStr * String to be divided* @return Partition result*/ public List<String> resultWord(String inputStr) { // Partition result List<String> result = new ArrayList<String>(); // "Left Snake" participle result List<String> resultLeft = new ArrayList<String>(); // "Medium Snake" (divergent part) participle result List<String> resultMiddle = new ArrayList<String>(); // "Right Snake" participle result List<String> resultRight = new ArrayList<String>(); // Forward maximum matching word segmentation result List<String> left = new ArrayList<String>(); // Inverse maximum matching word segmentation result List<String> right = new ArrayList<String>(); left = spilt(inputStr, true); /*System.out.println("Forward participle result: "); for (String string : left) { System.out.print(string + "/"); } System.out.println("/n Inverse participle result: "); */ right = spilt(inputStr, false); /*for (String string : right) { System.out.print(string + "/"); } System.out.println("/nBoth-way word participle result: ");*/ // // Determine whether the word participle splicing at both ends has intersected in the middle of the input string. As long as there is no intersection, it keeps looping while (left.get(0).length() + right.get(right.size() - 1).length() < inputStr .length()) { // If the results of the forward and reverse word participle are equal, then the forward result will jump out of the loop if (isEqual(left, right)) { resultMiddle = left; break; } // If the results of the forward and reverse word participle are different, then the one with the smaller number of participles is taken, and there is no need to loop if (left.size() != right.size()) { resultMiddle = left.size() < right.size() ? left : right; break; } // If the above conditions are not met, then implement the "snake" algorithm // Let "left greedy snake" eat the first word of the forward word participle result resultLeft.add(left.get(0)); // Let "right greedy snake" eat the last word of the reverse word participle result resultRight.add(right.get(right.size() - 1)); // Remove the words eaten by "snake" inputStr = inputStr.substring(left.get(0).length()); inputStr = inputStr.substring(0, inputStr.length() - right.get(right.size() - 1).length()); // Clean up the previous positive and reverse word segmentation results to prevent interference left.clear(); right.clear(); // Start the participle for uneating strings left = spilt(inputStr, true); right = spilt(inputStr, false); } // The end of the loop means that either the participle is not ambiguous, or the "greedy snake" eats from both ends to the intersection // If it is the word participle at the intersection, the following judgment must be made: // If the middle intersection overlaps: // The length of the first participle + the length of the last participle in the reverse direction > Enter the total length of the string, directly take the positive if (left.get(0).length() + right.get(right.size() - 1).length() > inputStr .length()) resultMiddle = left; // If the middle intersection intersects, it just takes the food and there is no overlap: // The first participle in the forward direction + the length of the last participle in the reverse direction = Enter the total length of the string, then the forward and reverse direction can be spelled if (left.get(0).length() + right.get(right.size() - 1).length() == inputStr .length()) { resultMiddle.add(left.get(0)); resultMiddle.add(right.get(right.size() - 1)); } // Add unambiguous participle result to the final result for (String string : resultLeft) { result.add(string); } for (String string : resultMiddle) { result.add(string); } // The participle stored in "right greedy snake" should be adjusted to forward for (int i = resultRight.size() - 1; i >= 0; i--) { result.add(resultRight.get(i)); } return result; } /** * Divide a paragraph into several sentences and perform word segmentation separately * * @param inputStr * A paragraph to be divided* @return Participle result of this paragraph*/ public List<String> resultSpilt(String inputStr) { // Used to store the final word segmentation result List<String> result = new ArrayList<String>(); // If punctuation is encountered, it will be divided into several sentences String regex = "[,.;!?]"; String[] st = inputStr.split(regex); // Add the participle result of each sentence to the final participle result for (String stri : st) { List<String> list = resultWord(stri); result.addAll(list); } return result; } public static void main(String[] args) { // example: Come and see if the house price is expensive? Table Tennis Auction Scanner input = new Scanner(System.in); String str = input.nextLine(); WordSpilt wordSpilt = null; try { wordSpilt = new WordSpilt(); } catch (IOException e) { e.printStackTrace(); } List<String> list = wordSpilt.resultWord(str); for (String string : list) { System.out.print(string + "/"); } }}The src/dict.txt is a dictionary file, and the settings are as follows:
Welcome to Wulin.com script download script tutorial material download
The operation results are as follows:
For more information about Java algorithms, readers who are interested in this site can view the topics: "Java Data Structure and Algorithm Tutorial", "Summary of Java Operation DOM Node Tips", "Summary of Java File and Directory Operation Tips" and "Summary of Java Cache Operation Tips"
I hope this article will be helpful to everyone's Java programming.