Discovering new words and hot words in Java with the Nagao algorithm

Author: Eve Cole  Update Time: 2025-03-01 19:16:01

The Nagao algorithm is used to count the frequency of every substring. From these counts we then compute, for each string, its word frequency, its number of left and right neighbors, its left and right entropy, and its mutual information (internal cohesion).

Terminology:

Nagao algorithm: a fast algorithm for counting the frequency of every substring in a text. For details of the algorithm, see http://www.doc88.com/p-664123446503.html
Word frequency: the number of times the string appears in the document. The more often it appears, the more important it is likely to be.
Number of left and right neighbors: the number of distinct characters that appear immediately to the left and to the right of the string in the document. The more left and right neighbors a string has, the more likely it is to be a word.
Left and right entropy: the entropy of the distribution of the characters that appear immediately to the left and to the right of the string. It is similar to the neighbor count above, but it also reflects how evenly the occurrences are spread over those neighbors.
Mutual information: split the string into a left part and a right part at each possible position, and divide the probability of the whole string by the product of the probabilities of the two parts occurring independently. Take the minimum of this ratio over all split positions. The larger this value, the more cohesive the string and the more likely it is to be a word.
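As an illustration of the last two definitions, here is a minimal sketch that computes neighbor entropy and the minimum mutual-information ratio from precomputed counts. The class name CohesionSketch, the method names, and the sample counts are made up for this example and are not part of the article's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
public class CohesionSketch {

    // Entropy (natural log) of a neighbor distribution given as character -> count.
    static double entropy(Map<Character, Integer> neighborCounts) {
        int total = 0;
        for (int c : neighborCounts.values()) total += c;
        if (total == 0) return 0;
        double h = 0;
        for (int c : neighborCounts.values()) {
            double p = (double) c / total;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Minimum over all split positions of P(word) / (P(left) * P(right)).
    // tf must contain a count for the word and for every left and right part.
    static double minMutualInformation(String word, Map<String, Integer> tf, double totalChars) {
        if (word.length() <= 1) return 0;
        double pWord = tf.get(word) / totalChars;
        double min = Double.MAX_VALUE;
        for (int pos = 1; pos < word.length(); pos++) {
            double pLeft = tf.get(word.substring(0, pos)) / totalChars;
            double pRight = tf.get(word.substring(pos)) / totalChars;
            min = Math.min(min, pWord / (pLeft * pRight));
        }
        return min;
    }

    public static void main(String[] args) {
        Map<Character, Integer> neighbors = new HashMap<>();
        neighbors.put('a', 2);
        neighbors.put('b', 2);
        System.out.println(entropy(neighbors)); // two equally likely neighbors: ln 2, about 0.693

        Map<String, Integer> tf = new HashMap<>();
        tf.put("ab", 10);
        tf.put("a", 20);
        tf.put("b", 25);
        System.out.println(minMutualInformation("ab", tf, 1000)); // about 20
    }
}
```

A ratio well above 1 means the two parts occur together far more often than they would if they were independent.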

The algorithm proceeds as follows:

1. Read the input file line by line and split each line into strings at non-Chinese characters ([^\u4E00-\u9FA5]+) and at the stop characters "的很了么呢是嘛个都也比还这于不与才上用就好在和对挺去后没说".
The code is as follows:
String[] phrases = line.split("[^\u4E00-\u9FA5]+|[" + stopwords + "]");
The stop characters can be modified.
2. Take every suffix of each split string and add it to the right PTable; take every suffix of the reversed string and add it to the left PTable.
3. Sort each PTable and compute its LTable. For each entry of the sorted PTable, the LTable records how many leading characters that entry shares with the entry before it.
4. Traverse each PTable together with its LTable to obtain the frequency and the left and right neighbors of every substring.
5. From the frequency and the left and right neighbors of each string, output its word frequency, number of left and right neighbors, left and right entropy, and mutual information.
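Steps 2 to 4 can be sketched on a tiny input as follows. This is a simplified illustration that handles only the right PTable and counts frequencies without collecting neighbors. The class name PTableSketch and its methods are hypothetical, and Latin letters are used instead of Chinese text purely for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, for illustration only.
public class PTableSketch {

    // Length of the common prefix of two strings.
    static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Count every substring of length <= maxLen using a sorted suffix table.
    static Map<String, Integer> countSubstrings(List<String> phrases, int maxLen) {
        // Step 2: collect all suffixes of every phrase.
        List<String> pTable = new ArrayList<>();
        for (String phrase : phrases)
            for (int i = 0; i < phrase.length(); i++)
                pTable.add(phrase.substring(i));

        // Step 3: sort, then record the prefix length shared with the previous entry.
        Collections.sort(pTable);
        int[] lTable = new int[pTable.size()];
        for (int i = 1; i < pTable.size(); i++)
            lTable[i] = commonPrefix(pTable.get(i - 1), pTable.get(i));

        // Step 4: a prefix of length len is new at row p only if len > lTable[p];
        // it recurs in the following rows for as long as lTable[q] >= len.
        Map<String, Integer> tf = new TreeMap<>();
        for (int p = 0; p < pTable.size(); p++) {
            String suffix = pTable.get(p);
            for (int len = lTable[p] + 1; len <= maxLen && len <= suffix.length(); len++) {
                int count = 1;
                for (int q = p + 1; q < lTable.length && lTable[q] >= len; q++) count++;
                tf.put(suffix.substring(0, len), count);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Sorted PTable: ab, abab, b, bab; LTable: 0, 2, 0, 1
        System.out.println(countSubstrings(Arrays.asList("abab"), 3));
        // prints {a=2, ab=2, aba=1, b=2, ba=1, bab=1}
    }
}
```

Because equal prefixes sit next to each other after sorting, each distinct substring is counted once by scanning forward over a contiguous run of rows, so no hash lookup per occurrence is needed.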

1. NagaoAlgorithm.java

package com.algo.word;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NagaoAlgorithm {

    // maximum length of the substrings that are counted
    private int N;

    private List<String> leftPTable;
    private int[] leftLTable;
    private List<String> rightPTable;
    private int[] rightLTable;
    // total number of characters added to the PTable
    private double wordNumber;

    private Map<String, TFNeighbor> wordTFNeighbor;

    // stop characters used as extra split points; modify as needed
    private final static String stopwords = "的很了么呢是嘛个都也比还这于不与才上用就好在和对挺去后没说";

    private NagaoAlgorithm() {
        // default N = 5
        N = 5;
        leftPTable = new ArrayList<String>();
        rightPTable = new ArrayList<String>();
        wordTFNeighbor = new HashMap<String, TFNeighbor>();
    }

    // reverse phrase
    private String reverse(String phrase) {
        StringBuilder reversePhrase = new StringBuilder();
        for (int i = phrase.length() - 1; i >= 0; i--)
            reversePhrase.append(phrase.charAt(i));
        return reversePhrase.toString();
    }

    // co-prefix length of s1 and s2
    private int coPrefixLength(String s1, String s2) {
        int coPrefixLength = 0;
        for (int i = 0; i < Math.min(s1.length(), s2.length()); i++) {
            if (s1.charAt(i) == s2.charAt(i)) coPrefixLength++;
            else break;
        }
        return coPrefixLength;
    }

    // add substrings of line to PTable
    private void addToPTable(String line) {
        // split line at runs of non-Chinese characters and at stop characters
        String[] phrases = line.split("[^\u4E00-\u9FA5]+|[" + stopwords + "]");
        for (String phrase : phrases) {
            for (int i = 0; i < phrase.length(); i++)
                rightPTable.add(phrase.substring(i));
            String reversePhrase = reverse(phrase);
            for (int i = 0; i < reversePhrase.length(); i++)
                leftPTable.add(reversePhrase.substring(i));
            wordNumber += phrase.length();
        }
    }

    // count LTable
    private void countLTable() {
        Collections.sort(rightPTable);
        rightLTable = new int[rightPTable.size()];
        for (int i = 1; i < rightPTable.size(); i++)
            rightLTable[i] = coPrefixLength(rightPTable.get(i - 1), rightPTable.get(i));

        Collections.sort(leftPTable);
        leftLTable = new int[leftPTable.size()];
        for (int i = 1; i < leftPTable.size(); i++)
            leftLTable[i] = coPrefixLength(leftPTable.get(i - 1), leftPTable.get(i));

        System.out.println("Info: [Nagao Algorithm Step 2]: having sorted PTable and counted left and right LTable");
    }

    // according to PTable and LTable, count statistical result: TF, neighbor distribution
    private void countTFNeighbor() {
        // get TF and right neighbor
        for (int pIndex = 0; pIndex < rightPTable.size(); pIndex++) {
            String phrase = rightPTable.get(pIndex);
            for (int length = 1 + rightLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
                String word = phrase.substring(0, length);
                TFNeighbor tfNeighbor = new TFNeighbor();
                tfNeighbor.incrementTF();
                if (phrase.length() > length)
                    tfNeighbor.addToRightNeighbor(phrase.charAt(length));
                for (int lIndex = pIndex + 1; lIndex < rightLTable.length; lIndex++) {
                    if (rightLTable[lIndex] >= length) {
                        tfNeighbor.incrementTF();
                        String coPhrase = rightPTable.get(lIndex);
                        if (coPhrase.length() > length)
                            tfNeighbor.addToRightNeighbor(coPhrase.charAt(length));
                    } else break;
                }
                wordTFNeighbor.put(word, tfNeighbor);
            }
        }
        // get left neighbor
        for (int pIndex = 0; pIndex < leftPTable.size(); pIndex++) {
            String phrase = leftPTable.get(pIndex);
            for (int length = 1 + leftLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
                String word = reverse(phrase.substring(0, length));
                TFNeighbor tfNeighbor = wordTFNeighbor.get(word);
                if (phrase.length() > length)
                    tfNeighbor.addToLeftNeighbor(phrase.charAt(length));
                for (int lIndex = pIndex + 1; lIndex < leftLTable.length; lIndex++) {
                    if (leftLTable[lIndex] >= length) {
                        String coPhrase = leftPTable.get(lIndex);
                        if (coPhrase.length() > length)
                            tfNeighbor.addToLeftNeighbor(coPhrase.charAt(length));
                    } else break;
                }
            }
        }
        System.out.println("Info: [Nagao Algorithm Step 3]: having counted TF and neighbor");
    }

    // according to wordTFNeighbor, count MI of word
    private double countMI(String word) {
        if (word.length() <= 1) return 0;
        double coProbability = wordTFNeighbor.get(word).getTF() / wordNumber;
        List<Double> mi = new ArrayList<Double>(word.length());
        for (int pos = 1; pos < word.length(); pos++) {
            String leftPart = word.substring(0, pos);
            String rightPart = word.substring(pos);
            double leftProbability = wordTFNeighbor.get(leftPart).getTF() / wordNumber;
            double rightProbability = wordTFNeighbor.get(rightPart).getTF() / wordNumber;
            mi.add(coProbability / (leftProbability * rightProbability));
        }
        return Collections.min(mi);
    }

    // save TF, (left and right) neighbor number, neighbor entropy, mutual information
    private void saveTFNeighborInfoMI(String out, String stopList, String[] threshold) {
        try {
            // read stop words file
            Set<String> stopWords = new HashSet<String>();
            try (BufferedReader br = new BufferedReader(new FileReader(stopList))) {
                String line;
                while ((line = br.readLine()) != null) {
                    if (line.length() > 1)
                        stopWords.add(line);
                }
            }
            // output words TF, neighbor info, MI
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(out))) {
                for (Map.Entry<String, TFNeighbor> entry : wordTFNeighbor.entrySet()) {
                    if (entry.getKey().length() <= 1 || stopWords.contains(entry.getKey())) continue;
                    TFNeighbor tfNeighbor = entry.getValue();

                    int tf = tfNeighbor.getTF();
                    int leftNeighborNumber = tfNeighbor.getLeftNeighborNumber();
                    int rightNeighborNumber = tfNeighbor.getRightNeighborNumber();
                    double mi = countMI(entry.getKey());
                    if (tf > Integer.parseInt(threshold[0])
                            && leftNeighborNumber > Integer.parseInt(threshold[1])
                            && rightNeighborNumber > Integer.parseInt(threshold[2])
                            && mi > Integer.parseInt(threshold[3])) {
                        StringBuilder sb = new StringBuilder();
                        sb.append(entry.getKey());
                        sb.append(",").append(tf);
                        sb.append(",").append(leftNeighborNumber);
                        sb.append(",").append(rightNeighborNumber);
                        sb.append(",").append(tfNeighbor.getLeftNeighborEntropy());
                        sb.append(",").append(tfNeighbor.getRightNeighborEntropy());
                        sb.append(",").append(mi).append("\n");
                        bw.write(sb.toString());
                    }
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        System.out.println("Info: [Nagao Algorithm Step 4]: having saved to file");
    }

    // apply Nagao algorithm to input files with the defaults N = 5 and filter "20,3,3,5"
    public static void applyNagao(String[] inputs, String out, String stopList) {
        applyNagao(inputs, out, stopList, 5, "20,3,3,5");
    }

    // apply Nagao algorithm to input files with an explicit N and filter
    public static void applyNagao(String[] inputs, String out, String stopList, int n, String filter) {
        NagaoAlgorithm nagao = new NagaoAlgorithm();
        nagao.setN(n);
        String[] threshold = filter.split(",");
        if (threshold.length != 4) {
            System.out.println("ERROR: filter must have 4 numbers, separated with ','");
            return;
        }
        // step 1: add phrases to PTable
        for (String in : inputs) {
            try (BufferedReader br = new BufferedReader(new FileReader(in))) {
                String line;
                while ((line = br.readLine()) != null) {
                    nagao.addToPTable(line);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        System.out.println("Info: [Nagao Algorithm Step 1]: having added all left and right substrings to PTable");
        // step 2: sort PTable and count LTable
        nagao.countLTable();
        // step 3: count TF and neighbor
        nagao.countTFNeighbor();
        // step 4: save TF, neighbor info and MI
        nagao.saveTFNeighborInfoMI(out, stopList, threshold);
    }

    private void setN(int n) {
        N = n;
    }

    public static void main(String[] args) {
        String[] ins = {"E://test/ganfen.txt"};
        applyNagao(ins, "E://test/out.txt", "E://test/stoplist.txt");
    }
}

2. TFNeighbor.java

package com.algo.word;

import java.util.HashMap;
import java.util.Map;

public class TFNeighbor {

    private int tf;
    private Map<Character, Integer> leftNeighbor;
    private Map<Character, Integer> rightNeighbor;

    TFNeighbor() {
        leftNeighbor = new HashMap<Character, Integer>();
        rightNeighbor = new HashMap<Character, Integer>();
    }

    // add word to leftNeighbor
    public void addToLeftNeighbor(char word) {
        // leftNeighbor.put(word, 1 + leftNeighbor.getOrDefault(word, 0));
        Integer number = leftNeighbor.get(word);
        leftNeighbor.put(word, number == null ? 1 : 1 + number);
    }

    // add word to rightNeighbor
    public void addToRightNeighbor(char word) {
        // rightNeighbor.put(word, 1 + rightNeighbor.getOrDefault(word, 0));
        Integer number = rightNeighbor.get(word);
        rightNeighbor.put(word, number == null ? 1 : 1 + number);
    }

    // increment tf
    public void incrementTF() {
        tf++;
    }

    public int getLeftNeighborNumber() {
        return leftNeighbor.size();
    }

    public int getRightNeighborNumber() {
        return rightNeighbor.size();
    }

    // entropy of the left neighbor distribution (natural log)
    public double getLeftNeighborEntropy() {
        double entropy = 0;
        int sum = 0;
        for (int number : leftNeighbor.values()) {
            entropy += number * Math.log(number);
            sum += number;
        }
        if (sum == 0) return 0;
        return Math.log(sum) - entropy / sum;
    }

    // entropy of the right neighbor distribution (natural log)
    public double getRightNeighborEntropy() {
        double entropy = 0;
        int sum = 0;
        for (int number : rightNeighbor.values()) {
            entropy += number * Math.log(number);
            sum += number;
        }
        if (sum == 0) return 0;
        return Math.log(sum) - entropy / sum;
    }

    public int getTF() {
        return tf;
    }
}
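
The entropy methods in TFNeighbor return log(sum) - entropy/sum rather than summing -p log p directly. The two forms are equal, as the short derivation below shows. The symbols n_i (the count of neighbor i) and S (the total count) are introduced here for illustration only.

```latex
\begin{aligned}
H &= -\sum_i \frac{n_i}{S}\,\ln\frac{n_i}{S}, \qquad S = \sum_i n_i \\
  &= -\frac{1}{S}\sum_i n_i\left(\ln n_i - \ln S\right) \\
  &= \ln S - \frac{1}{S}\sum_i n_i \ln n_i
\end{aligned}
```

The last line matches the code: the loop accumulates the sum of n_i ln n_i in entropy and S in sum, so only the raw counts are needed and no per-neighbor probability has to be computed.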

3. Main.java

package com.algo.word;

public class Main {

    public static void main(String[] args) {

        // if 3 arguments, first argument is input files split with ','
        // second argument is output file
        // output 7 columns split with ',', like below:
        // word, term frequency, left neighbor number, right neighbor number, left neighbor entropy, right neighbor entropy, mutual information
        // third argument is stop words list
        if (args.length == 3)
            NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2]);

        // if 5 arguments, fourth argument is the n-gram parameter N
        // fifth argument is the threshold of output words, default is "20,3,3,5"
        // output TF > 20 && (left | right) neighbor number > 3 && MI > 5
        else if (args.length == 5)
            NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2], Integer.parseInt(args[3]), args[4]);
    }
}
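
Once the output file has been written, one simple way to pick hot words is to rank the rows by term frequency. The sketch below assumes the seven-column, comma-separated format described in the comments of Main.java. The class name HotWordRanking and the sample rows are invented for illustration and are not real results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing helper, for illustration only.
public class HotWordRanking {

    // Return the words of the given output rows, ordered by term frequency, highest first.
    static List<String> rankByFrequency(List<String> rows) {
        List<String[]> parsed = new ArrayList<>();
        for (String row : rows) {
            String[] cols = row.split(",");
            if (cols.length == 7) parsed.add(cols); // skip malformed rows
        }
        // column 1 holds the term frequency
        parsed.sort(Comparator.comparingInt((String[] c) -> Integer.parseInt(c[1])).reversed());
        List<String> words = new ArrayList<>();
        for (String[] cols : parsed) words.add(cols[0]);
        return words;
    }

    public static void main(String[] args) {
        // made-up rows: word, TF, left neighbors, right neighbors, left entropy, right entropy, MI
        List<String> rows = Arrays.asList(
                "alpha,25,4,5,1.2,1.4,8.0",
                "beta,90,6,7,1.5,1.6,12.5",
                "gamma,40,5,4,1.1,1.3,6.2");
        System.out.println(rankByFrequency(rows)); // prints [beta, gamma, alpha]
    }
}
```

Sorting on a different column, such as mutual information, only requires changing the comparator.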

That is all for this article. I hope you find it useful.