Example code for sac array of Java suffix array

Author：Eve Cole Update Time：2025-09-01 12:00:04

Please Baidu for some basic concepts of suffix arrays. Simply put, the suffix array is a collection of all suffix sizes of a string. Then we can achieve various needs based on some properties of the suffix array.

 public class MySuffixArrayTest { public char[] suffix;//original string public int n;//string length public int[] rank;// Ranking of Suffix[i] in all suffix public int[] sa;// Suffix[SA[1]] < Suffix[SA[2]] … < Suffix[SA[Len]], that is, the suffix with ranking i is Suffix[SA[i]] // (It is an inverse operation with Rank) public int[] height;// Indicates Suffix[SA[i]] and Suffix[SA[i - 1]], that is, the longest public prefix of two adjacent suffixes, public int[] h;// is equal to Height[Rank[i]], that is, the longest public prefix of the suffix[i] and the suffix of its previous suffix public int[] ws;// count sorting auxiliary array public int[] y;// second keyword rank array public int[] x;// rank auxiliary array }

The following explanations take the string "aabaaaab" as an example. Let's first show the results. Please refer to this result for understanding and analysis (I copied someone else's picture of this result. Please subscript 1 by default, because my array starts with subscript 0)

suffix: The original string array assumes that the original string is "aabaaaab", then the corresponding value of this array should be {'a','a','b','a','a','a','b'}
n: string length here n is 8
rank: The ranking array of the suffix array is equivalent to the ranking corresponding to the i-th suffix. For example, rank[0] refers to the ranking of the suffix "aabaaaab" rank[1] refers to the ranking of the suffix "abaaaab"
sa: This is an array that is inverse to the rank array. Does the x-node store the suffix? Or to give an example to illustrate that sa[0] refers to the first-ranked suffix array, that is, 3. That is, the corresponding rank[3] of the array is 0. Please be sure to understand the formula sa[rank[i]]=i. If you understand the relationship between sa and rank, you should also understand it.
height: height[i] is the length of the largest common prefix of the sa[i] suffix array and the sa[i-1] suffix array height[1] refers to the second and first largest common prefixes sa[1] and sa[0] that is, the largest common prefixes of "aaab" and "aaaab" naturally see height[1]=3 at a glance
h: h[i] refers to the i-th suffix and the largest public prefix of the previous one h[0] refers to the first suffix array, namely "aabaaaab" and the largest public prefix of the previous one, namely "aab", that is, height[rank[0]]=height[3]=3 This is a bit difficult to understand. You can not understand for the time being and continue reading.
ws: nothing to say, count sorting auxiliary array
y: The second keyword sort is the sa array with the second keyword sorted equivalent to the second keyword
x: You can understand it as a backup of rank array. It initially uses rank array backup, and then records the rank array after each loop

First, let’s look at the code for sa array. I will explain the function of the code one by one and attach the total code to the following

 rank = new int[n]; sa = new int[n]; ws = new int[255]; y = new int[n]; x = new int[n]; // Loop the original string to convert the int value into the rank array for (int i = 0; i < n; i++) { rank[i] = (int) suffix[i]; }

The function of the above code is to initialize the array and perform the first counting and sorting. The first loop is to assign the initial value to the rank array. After execution, the corresponding value of the rank array is {97, 97, 98, 97, 97, 97, 97, 98}. You should see that the initial value of the rank array is the ascii code corresponding to the letter.

The next three cycles are the first counting sorting. If you don’t understand counting sorting, please Baidu. Let me talk about the process of these three cycles

 for (int i = 0; i < n; i++) { ws[rank[i]]++; x[i] = rank[i]; }for (int i = 1; i < ws.length; i++) { ws[i] += ws[i - 1]; }

What these two loops do is count all occurrence values and backup the rank array to the x array. After the first loop is run, ws[97]=6,ws[98]=2, and after the second loop is run, ws[97]=6,ws[98]=8

 for (int i = n - 1; i >= 0; i--) { sa[--ws[rank[i]]] = i; }

The above paragraph is the specific code for counting and sorting to find the sa array. Everyone must have misunderstood the first time they read it. Why did they find the sa? I was also confused for the first time, but please be patient and understand this code carefully. Do you still remember the formula mentioned above sa[rank[i]]=i For example, for the suffix "b", we ask his sa, that is, sa[rank[7]]=sa[98]=7. Obviously, sa[98] does not exist, but we have recorded the number of times 98 appears in the ws array, so ws[98] should be the corresponding ranking of "b". Please don't forget to subtract 1 to become sa[--ws[rank[i]]] = i. As for why you need to traverse from back to front, you need to understand it carefully here, otherwise you will definitely be completely blinded by the way you sort it according to the second keyword. How do you sort it if there are two rank values that are the same? It must appear first in front of the sa array. If you think about this loop and the changes in the ws array value, you will understand that the order of the for loop actually represents the order of arrangement when the rank value is the same. The traversal from back to front means that the suffix ranking is also lower when the rank value is the same.

The above is just the first counting sorting, which is equivalent to only comparing the first letter of each suffix array to find a sa. The corresponding result is as shown in the figure below.

 // Loop combination sort for (int j = 1, p = 0; j <= n; j = j << 1) { // If you need to fill it, add the sorting array first yp = 0; for (int i = n - j; i < n; i++) { y[p++] = i; } // Range the second keyword according to the first keyword sa for (int i = 0; i < n; i++) { if (sa[i] >= j) { y[p++] = sa[i] - j; } } // Sorting the two keywords for (int i = 0; i < ws.length; i++) { ws[i] = 0; } for (int i : x) { ws[i]++; } for (int i : x) { ws[i]++; } for (int i = 1; i < ws.length; i++) { ws[i] += ws[i - 1]; } for (int i = n - 1; i >= 0; i--) { sa[--ws[x[y[i]]] = y[i]; y[i] = 0; } // Calculate the rank array from sa int xb[] = new int[n];// x array backup for (int i = 0; i < n; i++) { xb[i] = x[i]; } int number = 1; x[sa[0]] = 1; for (int i = 1; i < n; i++) { if (xb[sa[i]] != xb[sa[i - 1]]) { x[sa[i]] = ++number; } else if (sa[i] + j >= n && sa[i - 1] + j >= n) { x[sa[i]] = number; } else if (sa[i] + j < n && sa[i - 1] + j >= n) { x[sa[i]] = ++number; } else if (xb[sa[i] + j] != xb[sa[i - 1] + j]) { x[sa[i]] = ++number; } else { x[sa[i]] = number; } if (number >= n) break; } }

This is the most difficult piece of code to understand when finding the sa array. First of all, you need to understand the idea of the multiplication algorithm. After the first counting order, do we already know the sorting of the first initial letter of all suffix arrays? Since we know the sorting of the first initial letter is equivalent to the order of his second letter (note the difference between sorting and order. The sorting is that we know which one he is fixed in. The order is that we only know the order in which he appears, but we don’t know which one he is specifically ranked). This is of course, because they are originally from a string, and for each suffix, it can also be used as the suffix for his previous suffix. Speaking of it, for example, for "baaaab" the order of its first letter corresponds to the second keyword order of "abaaaab". With the order of the first keyword and the sort of the second keyword, we can find the combined sort of the two keywords. According to the result of the combination sort, we can still use the previous idea. After the first combination sort of "baaaab", we sort out the order of the first two letters "ba", so he can also use the order of the second keyword of "aabaaaab". The logic of the entire sort is referenced below

Then we will analyze the code in segments

 for (int i = n - j; i < n; i++) { y[p++] = i; } //Select the second keyword according to the first keyword sa for (int i = 0; i < n; i++) { if (sa[i] >= j) { y[p++] = sa[i] - j; } }

The above code is to find the sa, that is, the y array of the second keyword, with the initial value of p being 0, and the first loop is to rank the suffix that needs to be filled in the front of the array.

You need to understand the logic of the second loop in combination with the previous logic diagram. We traverse the sorting result of the first keyword sa. If(sa[i] >=j) determines whether the suffix can be used as the second keyword for other suffixes. Taking the first loop j=1 as an example, when sa[i]=0 represents the suffix array "aabaaaab", obviously it cannot be used as the second keyword for other suffixes. For the second keyword that can be used as other suffixes, the order of his sa is the corresponding second keyword. sa[i] - j finds the suffix of his as the second keyword and puts it in the y array, and p++. You need to understand here slowly.

 // Merge the sort of two keywords for (int i = 0; i < ws.length; i++) { ws[i] = 0; } for (int i : x) { ws[i]++; } for (int i = 1; i < ws.length; i++) { ws[i] += ws[i - 1]; } for (int i = n - 1; i >= 0; i--) { sa[--ws[x[y[i]]]] = y[i]; y[i] = 0; }

The above is to find the combination sorting based on the first keyword sorting sa and the second keyword sorting y. This code is quite obscure. We can first not understand the code, but understand an idea. For the sorting of two keywords, the actual rules are similar to the sorting of two numbers. For example, 11 and 12 compare the size, 10 bits are the first keyword, and single bits are the second keyword. After comparing 10 bits, we find 11=12, and then compare the single bits, we know that 11<12. If the 10 bits are the same, the order of the single bits is the size order. I said the first time I count sorting above that the order of count sorting for loop actually represents the order of arrangement when the rank values are the same. So how do we find the order after the two keywords are merged in one count sorting? Let me tell you my understanding. One count sort actually contains two sorts, one is the sort of numerical values, and the other is the sort of occurrence order. The rules are equivalent to the previous example of comparison of 11 and 12. The sort of numerical values is 10 bits, and the sort of occurrence order is one bits. At this point, we have an idea. The sorting of values is sorted by the first keyword, and the sorting of occurrences is sorted by the second keyword, so that we can count and sort at one time to find the sorting after the two keywords are combined. The above code is the implementation of this idea. The x array is the rank array of the first keyword, and we count it.

 for (int i = n - 1; i >= 0; i--) { sa[--ws[x[y[i]]]] = y[i]; y[i] = 0; }

This loop is the implementation of all the above ideas. We traverse the second keyword array y from the back. For y[i], we calculate the count ranking of its first keyword. This count ranking is the ranking of y[i], and the final count is reduced by 1. The merged keyword sort was successfully found.

I believe that if you understand all the above codes, you will definitely be amazed. I was also excited when I thought about this code over and over again, and I was simply convinced. This is the charm of algorithms.

With the sa array, we can find the rank array. This is not difficult, so we won't explain it. All codes for finding sa are attached below.

 public static void main(String[] args) { String str = "aabaaaab"; MySuffixArrayTest arrayTest = new MySuffixArrayTest(str.toString()); arrayTest.initSa();// Find sa array} public void initSa() { rank = new int[n]; sa = new int[n]; ws = new int[255]; y = new int[n]; x = new int[n]; // Loop the original string to convert the int value into the rank array for (int i = 0; i < n; i++) { rank[i] = (int) suffix[i]; } // First count sort for (int i = 0; i < n; i++) { ws[rank[i]]++; x[i] = rank[i]; } for (int i = 1; i < ws.length; i++) { ws[i] += ws[i - 1]; } for (int i = n - 1; i >= 0; i--) { sa[--ws[rank[i]]] = i; } // Loop combination sort for (int j = 1, p = 0; j <= n; j = j << 1) { // If you need to fill, add the sorted array first yp = 0; for (int i = n - j; i < n; i++) { y[p++] = i; } // Range the second keyword according to the first keyword sa for (int i = 0; i < n; i++) { if (sa[i] >= j) { y[p++] = sa[i] - j; } } // Sorting the two keywords for (int i = 0; i < ws.length; i++) { ws[i] = 0; } for (int i : x) { ws[i]++; } for (int i = 1; i < ws.length; i++) { ws[i] += ws[i - 1]; } for (int i = n - 1; i >= 0; i--) { sa[--ws[x[y[i]]] = y[i]; y[i] = 0; } // Calculate the rank array based on sa int xb[] = new int[n];// x array backup for (int i = 0; i < n; i++) { xb[i] = x[i]; } int number = 1; x[sa[0]] = 1; for (int i = 1; i < n; i++) { if (xb[sa[i]] != xb[sa[i - 1]]) { x[sa[i]] = ++number; } else if (sa[i] + j >= n && sa[i - 1] + j >= n) { x[sa[i]] = number; } else if (sa[i] + j < n && sa[i - 1] + j >= n) { x[sa[i]] = ++number; } else if (xb[sa[i] + j] != xb[sa[i - 1] + j]) { x[sa[i]] = ++number; } else { x[sa[i]] = number; } if (number >= n) break; } } } }

Summarize

The above is the example code for sac arrays of Java suffix arrays introduced to you. I hope it will be helpful to you. If you have any questions, please leave me a message and the editor will reply to you in time. Thank you very much for your support to Wulin.com website!