The traditional Trie is simple to implement, but the space it takes up is really unacceptable. Especially when the character set is not limited to the 26 English characters, the exploded space is simply unacceptable.
Double array Trie is a space-optimized Trie tree. The principle will not be discussed in this article. Please refer to An Efficient Implementation of Trie Structures. This program is also written with reference to this paper.
Regarding several details not mentioned in the paper and implementations inconsistent with the paper:
1. For inserting strings, if a string is a substring of another string, I will also use the terminator as an edge to generate a new node. I will set the Base of the new node to 0.
Therefore, there are two situations at the end of a string: one is that the Base value is negative and the remaining characters (maybe only one terminator) are stored in the Tail array; the other is that the Base is 0.
Therefore, you should consider these two situations when making inquiries.
2. For the first type of conflict (Case 3 in the paper), it may be necessary to take out part of the string in Tail and put it in the index as an edge. The paper uses the method of shifting the tail string to the left. My method directly modifies the Base value instead of moving the tail string.
The following is the code implemented in java, which can handle the insertion of the same string, the insertion of substrings, etc.
Copy the code code as follows:
/*
* Name: Double Array Trie
* Author: Yaguang Ding
* Mail: [email protected]
* Blog: blog.csdn.net/dingyaguang117
* Date: 2012/5/21
* Note: a word ends may be either of these two case:
* 1. Base[cur_p] == pos ( pos<0 and Tail[-pos] == 'END_CHAR' )
* 2. Check[Base[cur_p] + Code('END_CHAR')] == cur_p
*/
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;
public class DoubleArrayTrie {
final char END_CHAR = '/0';
final int DEFAULT_LEN = 1024;
int Base[] = new int [DEFAULT_LEN];
int Check[] = new int [DEFAULT_LEN];
char Tail[] = new char [DEFAULT_LEN];
int Pos = 1;
Map<Character,Integer> CharMap = new HashMap<Character,Integer>();
ArrayList<Character> CharList = new ArrayList<Character>();
public DoubleArrayTrie()
{
Base[1] = 1;
CharMap.put(END_CHAR,1);
CharList.add(END_CHAR);
CharList.add(END_CHAR);
for(int i=0;i<26;++i)
{
CharMap.put((char)('a'+i),CharMap.size()+1);
CharList.add((char)('a'+i));
}
}
private void Extend_Array()
{
Base = Arrays.copyOf(Base, Base.length*2);
Check = Arrays.copyOf(Check, Check.length*2);
}
private void Extend_Tail()
{
Tail = Arrays.copyOf(Tail, Tail.length*2);
}
private int GetCharCode(char c)
{
if (!CharMap.containsKey(c))
{
CharMap.put(c,CharMap.size()+1);
CharList.add(c);
}
return CharMap.get(c);
}
private int CopyToTailArray(String s,int p)
{
int _Pos = Pos;
while(s.length()-p+1 > Tail.length-Pos)
{
Extend_Tail();
}
for(int i=p; i<s.length();++i)
{
Tail[_Pos] = s.charAt(i);
_Pos++;
}
return _Pos;
}
private int x_check(Integer []set)
{
for(int i=1; ; ++i)
{
boolean flag = true;
for(int j=0;j<set.length;++j)
{
int cur_p = i+set[j];
if(cur_p>= Base.length) Extend_Array();
if(Base[cur_p]!= 0 || Check[cur_p]!= 0)
{
flag = false;
break;
}
}
if (flag) return i;
}
}
private ArrayList<Integer> GetChildList(int p)
{
ArrayList<Integer> ret = new ArrayList<Integer>();
for(int i=1; i<=CharMap.size();++i)
{
if(Base[p]+i >= Check.length) break;
if(Check[Base[p]+i] == p)
{
ret.add(i);
}
}
return ret;
}
private boolean TailContainString(int start,String s2)
{
for(int i=0;i<s2.length();++i)
{
if(s2.charAt(i) != Tail[i+start]) return false;
}
return true;
}
private boolean TailMatchString(int start,String s2)
{
s2 += END_CHAR;
for(int i=0;i<s2.length();++i)
{
if(s2.charAt(i) != Tail[i+start]) return false;
}
return true;
}
public void Insert(String s) throws Exception
{
s += END_CHAR;
int pre_p = 1;
int cur_p;
for(int i=0; i<s.length(); ++i)
{
//Get status location
cur_p = Base[pre_p]+GetCharCode(s.charAt(i));
//If the length exceeds the existing length, expand the array
if (cur_p >= Base.length) Extend_Array();
//idle state
if(Base[cur_p] == 0 && Check[cur_p] == 0)
{
Base[cur_p] = -Pos;
Check[cur_p] = pre_p;
Pos = CopyToTailArray(s,i+1);
break;
}else
//existing status
if(Base[cur_p] > 0 && Check[cur_p] == pre_p)
{
pre_p = cur_p;
continue;
}else
//Conflict 1: When Base[cur_p] is less than 0, it encounters a string that is compressed and stored in Tail.
if(Base[cur_p] < 0 && Check[cur_p] == pre_p)
{
int head = -Base[cur_p];
if(s.charAt(i+1)== END_CHAR && Tail[head]==END_CHAR)//Insert repeated string
{
break;
}
//In the case of common letters, because the last judgment has excluded the terminator, so both of them must be not terminators.
if (Tail[head] == s.charAt(i+1))
{
int avail_base = x_check(new Integer[]{GetCharCode(s.charAt(i+1))});
Base[cur_p] = avail_base;
Check[avail_base+GetCharCode(s.charAt(i+1))] = cur_p;
Base[avail_base+GetCharCode(s.charAt(i+1))] = -(head+1);
pre_p = cur_p;
continue;
}
else
{
//If the two letters are not the same, one of them may be the terminator.
int avail_base;
avail_base = x_check(new Integer[]{GetCharCode(s.charAt(i+1)),GetCharCode(Tail[head])});
Base[cur_p] = avail_base;
Check[avail_base+GetCharCode(Tail[head])] = cur_p;
Check[avail_base+GetCharCode(s.charAt(i+1))] = cur_p;
//Tail is END_FLAG
if(Tail[head] == END_CHAR)
Base[avail_base+GetCharCode(Tail[head])] = 0;
else
Base[avail_base+GetCharCode(Tail[head])] = -(head+1);
if(s.charAt(i+1) == END_CHAR)
Base[avail_base+GetCharCode(s.charAt(i+1))] = 0;
else
Base[avail_base+GetCharCode(s.charAt(i+1))] = -Pos;
Pos = CopyToTailArray(s,i+2);
break;
}
}else
//Conflict 2: The current node is already occupied and the pre base needs to be adjusted.
if(Check[cur_p] != pre_p)
{
ArrayList<Integer> list1 = GetChildList(pre_p);
int toBeAdjust;
ArrayList<Integer> list = null;
if(true)
{
toBeAdjust = pre_p;
list = list1;
}
int origin_base = Base[toBeAdjust];
list.add(GetCharCode(s.charAt(i)));
int avail_base = x_check((Integer[])list.toArray(new Integer[list.size()]));
list.remove(list.size()-1);
Base[toBeAdjust] = avail_base;
for(int j=0; j<list.size(); ++j)
{
//BUG
int tmp1 = origin_base + list.get(j);
int tmp2 = avail_base + list.get(j);
Base[tmp2] = Base[tmp1];
Check[tmp2] = Check[tmp1];
//There is a follow-up
if(Base[tmp1] > 0)
{
ArrayList<Integer> subsequence = GetChildList(tmp1);
for(int k=0; k<subsequence.size(); ++k)
{
Check[Base[tmp1]+subsequence.get(k)] = tmp2;
}
}
Base[tmp1] = 0;
Check[tmp1] = 0;
}
//Update new cur_p
cur_p = Base[pre_p]+GetCharCode(s.charAt(i));
if(s.charAt(i) == END_CHAR)
Base[cur_p] = 0;
else
Base[cur_p] = -Pos;
Check[cur_p] = pre_p;
Pos = CopyToTailArray(s,i+1);
break;
}
}
}
public boolean Exists(String word)
{
int pre_p = 1;
int cur_p = 0;
for(int i=0;i<word.length();++i)
{
cur_p = Base[pre_p]+GetCharCode(word.charAt(i));
if(Check[cur_p] != pre_p) return false;
if(Base[cur_p] < 0)
{
if(TailMatchString(-Base[cur_p],word.substring(i+1)))
return true;
return false;
}
pre_p = cur_p;
}
if(Check[Base[cur_p]+GetCharCode(END_CHAR)] == cur_p)
return true;
return false;
}
//Internal function, returns the last Base index of the matching word,
classFindStruct
{
int p;
String prefix="";
}
private FindStruct Find(String word)
{
int pre_p = 1;
int cur_p = 0;
FindStruct fs = new FindStruct();
for(int i=0;i<word.length();++i)
{
// BUG
fs.prefix += word.charAt(i);
cur_p = Base[pre_p]+GetCharCode(word.charAt(i));
if(Check[cur_p] != pre_p)
{
fs.p = -1;
return fs;
}
if(Base[cur_p] < 0)
{
if(TailContainString(-Base[cur_p],word.substring(i+1)))
{
fs.p = cur_p;
return fs;
}
fs.p = -1;
return fs;
}
pre_p = cur_p;
}
fs.p = cur_p;
return fs;
}
public ArrayList<String> GetAllChildWord(int index)
{
ArrayList<String> result = new ArrayList<String>();
if(Base[index] == 0)
{
result.add("");
return result;
}
if(Base[index] < 0)
{
String r="";
for(int i=-Base[index];Tail[i]!=END_CHAR;++i)
{
r+= Tail[i];
}
result.add(r);
return result;
}
for(int i=1;i<=CharMap.size();++i)
{
if(Check[Base[index]+i] == index)
{
for(String s:GetAllChildWord(Base[index]+i))
{
result.add(CharList.get(i)+s);
}
//result.addAll(GetAllChildWord(Base[index]+i));
}
}
return result;
}
public ArrayList<String> FindAllWords(String word)
{
ArrayList<String> result = new ArrayList<String>();
String prefix = "";
FindStruct fs = Find(word);
int p = fs.p;
if (p == -1) return result;
if(Base[p]<0)
{
String r="";
for(int i=-Base[p];Tail[i]!=END_CHAR;++i)
{
r+= Tail[i];
}
result.add(fs.prefix+r);
return result;
}
if(Base[p] > 0)
{
ArrayList<String> r = GetAllChildWord(p);
for(int i=0;i<r.size();++i)
{
r.set(i, fs.prefix+r.get(i));
}
return r;
}
return result;
}
}
Test code:
Copy the code code as follows:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Scanner;
import javax.xml.crypto.Data;
public class Main {
public static void main(String[] args) throws Exception {
ArrayList<String> words = new ArrayList<String>();
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("E:/Rabbit's Experimental Learning Center [In-class]/ACM Competition/ACM 4th School Competition/E Command Prompt/words3.dic")));
String s;
int num = 0;
while((s=reader.readLine()) != null)
{
words.add(s);
num++;
}
DoubleArrayTrie dat = new DoubleArrayTrie();
for(String word: words)
{
dat.Insert(word);
}
System.out.println(dat.Base.length);
System.out.println(dat.Tail.length);
Scanner sc = new Scanner(System.in);
while(sc.hasNext())
{
String word = sc.next();
System.out.println(dat.Exists(word));
System.out.println(dat.FindAllWords(word));
}
}
}
The following are the test results. It takes about 20 seconds to construct a DAT of 6W English words.
When I grow the array, the length increases to 2 times each time, initially 1024
The length of the Base and Check arrays is 131072
The length of Tail is 262144
TTT1