__Applications__

In this section, we present algorithms for several string problems using suﬃx trees and suﬃx arrays. While the same run-time bounds can be achieved for many interesting applications with either a suﬃx tree or a suﬃx array, there are others which involve a space vs. time trade oﬀ. Even in cases where the same run-time bound can be achieved, it is often easier to design the algorithm ﬁrst for a suﬃx tree, and then think if the implementation can be done using a suﬃx array. For this reason, we largely concentrate on suﬃx trees. The reader interested in reading more on applications of suﬃx arrays is referred to [1, 2].

Given a pattern *P *and a text *T *, the pattern matching problem is to ﬁnd all occurrences of *P *in *T *. Let *|**P **| *= *m *and *|**T **| *= *n*. Typically, *n >> m*. Moreover, *T *remains ﬁxed in many applications and the query is repeated for many diﬀerent patterns. For example, *T *could be a text document and *P *could represent a word search. Or, *T *could be an entire database of DNA sequences and *P *denote a substring of a query sequence for homology (similarity) search. Thus, it is beneﬁcial to preprocess the text *T *so that queries can be answered as eﬃciently as possible.

**P****att****ern Matching using Suﬃx Trees**

The pattern matching problem can be solved in optimal *O*(*m *+ *k*) time using *S**T *(*T *), where *k *is the number of occurrences of *P *in *T *. Suppose *P *occurs in *T *starting from position *i*. Then, *P *is a preﬁx of *suf **f**i *in *T *. It follows that *P *matches the path from root to leaf labeled *i *in *S**T *. This property results in the following simple algorithm: Start from the root of *S**T *and follow the path matching characters in *P *, until *P *is completely matched or a mismatch occurs. If *P *is not fully matched, it does not occur in *T *. Otherwise, each leaf in the subtree below the matching position gives an occurrence of *P *. The positions can be enumerated by traversing the subtree in time proportional to the size of the subtree. As the number of leaves in the subtree is *k*, this takes *O*(*k*) time. If only one occurrence is of interest, the suﬃx tree can be preprocessed in *O*(*n*) time such that each internal node contains the label of one of the leaves in its subtree. Thus, the problem of whether *P *occurs in *T *or the problem of ﬁnding one occurrence can be answered in *O*(*m*) time.

**P****att****ern Matching using Suﬃx Arrays**

Consider the problem of pattern matching when the suﬃx array of the text, *S**A*(*T *), is available. As before, we need to ﬁnd all the suﬃxes that have *P *as a preﬁx. As *S**A *is a lexicographically sorted order of the suﬃxes of *T *, all such suﬃxes will appear in consecutive positions in it. The sorted order in *S**A *allows easy identiﬁcation of these suﬃxes using binary search. Using a binary search, ﬁnd the smallest index *i *in SA such that *suf f**S**A*[*i*] contains *P *as a preﬁx, or determine that no such suﬃx is present. If no suﬃx is found, *P *does not occur in *T *. Otherwise, ﬁnd the largest index *j*(*≥ **i*) such that *suf f**S**A*[*j*] contains *P *as a preﬁx. All the elements in the range *S**A*[*i..j*] give the starting positions of the occurrences of *P *in *T *.

A binary search in *S**A *takes *O*(log *n*) comparisons. In each comparison, *P *is compared with a suﬃx to determine their lexicographic order. This requires comparing at most *|**P **| *= *m *characters. Thus, the run-time of this algorithm is *O*(*m *log *n*). Note that while this run-time is inferior to the run-time using suﬃx trees, the space required by this algorithm is only *n *words for *S**A *apart from the space required to store the string. Note that the *L**cp *array is not required. Assuming 4 bytes per suﬃx array entry and one byte per character in the string, the total space required is only 5*n *bytes.

The run-time can be improved to *O*(*m *+ log *n*), by using slightly more space and keeping track of appropriate *lcp *information. Consider an iteration of the binary search. Let *S**A*[*L..R*] denote the range in the suﬃx array where the binary search is focused. To be- gin with, *L *= 1 and *R *= *n*. At the beginning of an iteration, the pattern *P *is known to be lexicographically greater than or equal to *suf f**S**A*[*L*] and lexicographically smaller than or equal to *suf f**S**A*[*R*]. Let *M *= *I **L*+*R **l*. During the iteration, a lexicographic com-

parison between *P *and *suf **f**S**A*[*M *] is made. Depending on the result, the search range is narrowed to either *S**A*[*L..M *] or *S**A*[*M**..**R*]. Assume that *l *= *|**l**cp*(*P**, suf f**S**A*[*L*])*| *and *r *= *|**l**cp*(*P**, suf f**S**A*[*R*])*| *are known at the beginning of the iteration. Also, assume that *|**l**cp*(*suf f**S**A*[*L*]*, suf f**S**A*[*M *])*| *and *|**l**cp*(*suf f**S**A*[*M *]*, suf f**S**A*[*R*])*| *are known. From these val- ues, we wish to determine *|**l**cp*(*P**, suf f**S**A*[*M *])*| *for use in next iteration, and consequently determine the relative lexicographic order between *P *and *suf f**S**A*[*M *]. As *S**A *is a lexico- graphically sorted array, *P *and *suf f**S**A*[*M *] must agree on at least *min*(*l**, r*) characters. If *l** *and *r *are equal, then comparison between *P *and *suf f**S**A*[*M *] is done by starting from the (*l *+ 1)*t**h *character. If *l *and *r *are unequal, consider the case when *l** **> r*.

Similarly, the case when *r** **> l *can be handled such that comparisons between *P *and *suf f**S**A*[*M *], if at all needed, start from (*r *+ 1)*t**h *character. To start the execution of the algorithm, *l**cp*(*P**, suf f**S**A*[1]) and *l**cp*(*P**, suf f**S**A*[*n*]) are computed directly using at most 2*|**P **|** *character comparisons. This ensures *|**l**cp*(*P**, suf f**S**A*[*L*])*| *and *|**l**cp*(*P**, suf f**S**A*[*R*])*| *are known at the beginning of the ﬁrst iteration. This property is maintained for each iteration as *L *or *R *is shifted to *M *but *|**l**cp*(*P**, suf **f**S**A*[*M *])*| *is computed. For now, assume that the required *|**l**cp*(*suf f**S**A*[*L*]*, suf f**S**A*[*M *])*| *and *|**l**cp*(*suf f**S**A*[*R*]*, suf f**S**A*[*M *])*| *values are available.

**LEMMA 29.4 **The total number of character comparisons made by the algorithm is *O*(*m *+ log *n*).

**Pr****o****o****f **The algorithm makes at most 2*m *comparisons in determining the longest common preﬁxes between *P *and *suf f**S**A*[1] and between *P *and *suf f**S**A*[*n*]. Classify the comparisons made in each iteration to determine the longest common preﬁx between *P *and *suf f**S**A*[*M *] into *successful *and *failed *comparisons. A comparison is considered successful if it contributes the longest common preﬁx. There is at most one failed comparison per iteration, for a total of at most log *n *such comparisons over all iterations. As for successful comparisons, note that the comparisons start with (*max*(*l**, r*) + 1)*t**h *character of *P *, and each successful comparison increases the value of *max*(*l**, r*) for next iteration. Thus, each character of *P *is involved only once in a successful comparison. The total number of character comparisons is at most 3*m *+ log *n *= *O*(*m *+ log *n*).

It remains to be described how the *|**l**cp*(*suf f**S**A*[*L*]*, suf f**S**A*[*M *])*| *and *|**l**cp*(*suf f**S**A*[*R*]*, suf f**S**A*[*M *])*|*

values required in each iteration are computed. Suppose the *Lcp *array of *T *is known. For

The *l**cp *of two suﬃxes can be computed in time proportional to the distance between them in the suﬃx array. In order to ﬁnd the *l**cp *values required by the algorithm in constant time, consider the binary tree corresponding to all possible search intervals used by any execution of the binary search algorithm. The root of the tree denotes the interval [1*..**n*].

If [*i..j*] (*j **− **i **≥ *2) is the interval at an internal node of the tree, its left child is given by [*i..**I **i*+*j **l*] and its right child is given by [*I **i*+*j **l**..**j*]. The execution of the binary search tree algorithm can be visualized as traversing a path in the binary tree from root to a leaf. If *l**cp *value for each interval in the tree is precomputed and recorded, any required *l**cp *value during the execution of the algorithm can be retrieved in constant time. The leaf level in the binary tree consists of intervals of the type [*i..i *+ 1]. The *lcp *values for these *n **− *1

intervals is already given by the *Lcp *array. The *lcp *value corresponding to an interval at an internal node is given by the smaller of the *lcp *values at the children. Using a bottom-up traversal, the *lcp *values can be computed in *O*(*n*) time. In addition to the *Lcp *array, *n **− *2 additional *lcp *values are required to be stored. Assuming approximately 1 byte per *lcp *value, the algorithm requires approximately 2*n *bytes of additional space. As usual, *lcp *values larger than or equal to 255, if any, are stored in an exceptions list and the size of such list should be very small in practical applications.

Thus, pattern matching can be solved in *O*(*m *log *n*) time using 5*n *bytes of space, or in *O*(*m *+ log *n*) time using 7*n *bytes of space. Abouelhoda *et al. *[2] reduce this time further to *O*(*m*) time by mimicking the suﬃx tree algorithm on a suﬃx array with some auxiliary information. Using clever implementation techniques, the space is reduced to approximately 6*n *bytes. An interesting feature of their algorithm is that it can be used in other applications based on a top-down traversal of suﬃx tree.

**L****ongest Common Substrings**

Consider the problem of ﬁnding a longest substring common to two given strings *s*1 of size *m *and *s*2 of size *n*. To solve this problem, ﬁrst construct the *G**S**T *of strings *s*1 and *s*2. A longest substring common to *s*1 and *s*2 will be the path-label of an internal node with the greatest string depth in the suﬃx tree which has leaves labeled with suﬃxes from both the strings. Using a traversal of the *G**S**T *, record the string-depth of each node, and mark each node if it has suﬃxes from both the strings. Find the largest string-depth of any marked node. Each marked internal node at that depth gives a longest common substring. The total run-time of this algorithm is *O*(*m *+ *n*).

The problem can also be solved by using the suﬃx tree of one of the strings and suﬃx links. Without loss of generality, suppose the suﬃx tree of *s*2 is given. For each position *i** *in *s*1, we ﬁnd the largest substring of *s*1 starting at that position that is also a substring of *s*2. For position 1, this is directly computed by matching *suf **f*1 of *s*1 starting from the root of the suﬃx tree until no more matches are possible. To determine the longest substring match from position 2, simply walk up to the ﬁrst internal node, follow the suﬃx link, and walk down as done in McCreight’s algorithm. A similar proof shows that this algorithm runs in *O*(*m *+ *n*) time.

Now consider solving the longest common substring problem using the *G**S**A *and *Lcp *array for strings *s*1 and *s*2. First, consider a one string variant of this problem *− *that of computing the longest repeat in a string. This is given by the string depth of the deepest internal node in the corresponding suﬃx tree. All children of such a node must be leaves.

Any consecutive pair of such leaves have the longest repeat as their longest common preﬁx.

Thus, each largest value in the *Lcp *array reveals a longest repeat in the string. The number of occurrences of a repeat is one more than the number of consecutive occurrences of the corresponding largest value in the *Lcp *array. Thus, all distinct longest repeats, and the number and positions of their occurrences can be determined by a linear scan of the *Lcp *array.

To solve the longest common substring problem, let *v *denote an internal node with the greatest string depth that contains a suﬃx from each of the strings. Because such a pair of suﬃxes need not be consecutive in the suﬃx array, it might appear that one has to look at nonconsecutive entries in the *Lcp *array. However, the subtree of any internal node that is a child of *v *can only consist of suﬃxes from one of the strings. Thus, there will be two consecutive suﬃxes in the subtree under *v*, one from each string. Therefore, it is enough to look at consecutive entries in the *G**S**A*. In a linear scan of the *G**S**A *and *Lcp *arrays, ﬁnd the largest *lcp *value that corresponds to two consecutive suﬃxes, one from each string. This gives the length of a longest common substring. The starting positions of the suﬃxes reveals the positions in the strings where the longest common substring occurs. The algorithm runs in *O*(*m *+ *n*) time.

Compression of text data is useful for data transmission and for compact storage. A simple, not necessarily optimal, data compression method is the Ziv-Lempel compression [24, 25]. In this method, the text to be compressed is considered a large string, and a compact representation is obtained by identifying repeats in the string. A simple algorithm following this strategy is as follows: Let *T *denote the text to be compressed and let *|**T **| *= *n*. At some stage during the execution of the compression algorithm, suppose that the string *T *[1*..**i **− *1] is already compressed. The compression is extended by ﬁnding the length *l**i *of a largest preﬁx of *suf **f**i *that is a substring of *T *[1*..**i **− *1]. Two cases arise:

The algorithm is initiated by setting *T *[1] to be the compressed representation of *T *[1*.**.*1], and continuing the iterations until the entire string is compressed. For example, executing the above algorithm on the string *mississippi *yields the compressed string *mis*(3*, *1)(2*, *3)(2*, *1)*p** *(9*, *1)(2*, *1). The decompression method for such a compressed string is immediate.

Suﬃx trees can be used to carry out the compression in *O*(*n*) time [20]. They can be used in obtaining *l**i*, the length of the longest preﬁx of *suf **f**i *that is a substring of the portion of the string already seen, *T *[1*..**i **− *1]. If *j *is the starting position of such a substring, then *T *[*j..**j *+ *l**i **− *1] = *T *[*i..i *+ *l**i **− *1] and *i **≥ **j *+ *l**i*. It follows that *|**l**cp*(*suf **f**j **, suf f**i*)*|**≥ **l**i*. Let *v *= *l**ca*(*i, j*), where *i *and *j *are leaves corresponding to *suf **f**i *and *suf **f**j *, respectively. It follows that *T *[*i..i *+ *l**i **− *1] is a preﬁx of *path*–*l**abel*(*v*). Consider the unique path from the root of *S**T *(*T *) that matches *T *[*i..i *+ *l**i **− *1]. Node *v *is an internal node in the subtree below, and hence *j *is a leaf in the subtree below. Thus, *l**i *is the largest number of characters along the path *T *[*i..n*] such that *∃ *leaf *j *in the subtree below with *j *+ *l**i **≤ **i*. Note that any *j *in the subtree below that satisﬁes the property *j *+ *l**i **≤ **i *is acceptable. If such a *j *exists, the smallest leaf number in the subtree below certainly satisﬁes this property, and hence can be chosen as the starting position *j*.

This strategy results in the following algorithm for ﬁnding *l**i*: First, build the suﬃx tree of *T *. Using an appropriate linear time tree traversal method, record the string depth of each node and mark each internal node with the smallest leaf label in its subtree. Let *min*(*v*) denote the smallest leaf label under internal node *v*. To ﬁnd *l**i*, walk along the path *T *[*i..n*] to identify two consecutive internal nodes *u *and *v *such that *min*(*u*) + *string*–*d**e**pth*(*u*) *< I *and *min*(*v*) + *string*–*d**e**pth*(*v*) *≥ **i*. If *min*(*v*) + *string*–*d**e**pth*(*u*) *> **i*, then set *l**i *= *string **d**e**pth*(*u*) and set the starting position to be *min*(*u*). Otherwise, set *l**i *= *i **− **min*(*v*) and set the starting position to be *min*(*v*).

To obtain *O*(*n*) run-time, it is enough to ﬁnd *l**i *in *O*(*l**i *) time as the next *l**i *characters of the string are compressed into an *O*(1) space representation of an already seen substring.

Therefore, it is enough to traverse the path matching *T *[*i..n*] using individual character comparisons. However, as the path is guaranteed to exist, it can be traversed in *O*(1) time per edge, irrespective of the length of the edge label.

Given a set of strings *S *= *{**s*1*, s*2*,…**, s**k **} *of total length *N *, the string containment problem is to identify each string that is a substring of some other string. An example application could be that the strings represent DNA sequence fragments, and we wish to remove redun- dancy. This problem can be easily solved using suﬃx trees in *O*(*N *) time. First, construct the *G**S**T *(*S*) in *O*(*N *) time. To ﬁnd if a string *s**i *is contained in another, locate the leaf labeled (*s**i**, *1). If the label of the edge connecting the leaf to its parent is labeled with the string $, *s**i *is contained in another string. Otherwise, it is not. This can be determined in *O*(1) time per string.

Suppose we are given a set of strings *S *= *{**s*1*, s*2*,..**. , s**k **} *of total length *N *. The suﬃx- preﬁx overlap problem is to identify, for each pair of strings (*s**i**, s**j *), the longest suﬃx of *s**i** *that is a preﬁx of *s**j *. Suﬃx-preﬁx overlaps are useful in algorithms for ﬁnding the shortest common superstring of a given set of strings. They are also useful in applications such as genome assembly where signiﬁcant suﬃx-preﬁx overlaps between pairs of fragments are used to assemble fragments into much larger sequences.

The suﬃx-preﬁx overlap problem can be solved using *G**S**T *(*S*) in optimal *O*(*N *+*k*2) time.

Consider the longest suﬃx *α *of *s**i *that is a preﬁx of *s**j *. In *G**S**T *(*S*), *α *is an initial part of the path from the root to leaf labeled (*j**, *1) that culminates in an internal node. A leaf that corresponds to a suﬃx from *s**i *should be a child of the internal node, with the edge label $. Moreover, it must be the deepest internal node on the path from root to leaf (*j**, *1)

that has a suﬃx from *s**i *attached in this way. The length of the corresponding suﬃx-preﬁx overlap is given by the string depth of the internal node.

Let *M *be a *k **× **k *output matrix such that *M *[*i, j*] should contain the length of the longest suﬃx of *s**i *that overlaps a preﬁx of *s**j *. The matrix is computed using a depth ﬁrst search (DFS) traversal of *G**S**T *(*S*). The *G**S**T *is preprocessed to record the string depth of every node. During the DFS traversal, *k *stacks *A*1*, A*2*,..**. , A**k *are maintained, one for each string. The top of the stack *A**i *contains the string depth of the deepest node along the current DFS path that is connected with edge label $ to a leaf corresponding to a suﬃx from *s**i*. If no such node exists, the top of the stack contains zero. Each stack *A**i *is initialized by pushing zero onto an empty stack, and is maintained during the DFS as follows: When the DFS traversal visits a node *v *from its parent, check to see if *v *is attached to a leaf with edge label $. If so, for each *i *such that string *s**i *contributes a suﬃx labeling the leaf, *push string*–*d**e**pth*(*v*) on to stack *A**i*. The string depth of the current node can be easily maintained during the DFS traversal. When the DFS traversal leaves the node *v *to return back to its parent, again identify each *i *that has the above property and *pop *the top element from the corresponding stack *A**i*.

The output matrix *M *is built one column at a time. When the DFS traversal reaches a leaf labeled (*j**, *1), the top of stack *A**i *contains the longest suﬃx of *s**i *that matches a preﬁx of *s**j *. Thus, column *j *of matrix *M *is obtained by setting *M *[*i, j*] to the top element of stack *S**i*. To analyze the run-time of the algorithm, note that each *push *(similarly, *pop*) operation on a stack corresponds to a distinct suﬃx of one of the input strings. Thus, the total number of *push *and *pop *operations is bounded by *O*(*N *). The matrix *M *is ﬁlled in *O*(1) time per element, taking *O*(*k*2) time. Hence, all suﬃx-preﬁx overlaps can be identiﬁed in optimal *O*(*N *+ *k*2) time.