SIAM J. APPL. MATH. Vol. 42, No. 4, August 1982

© 1982 Society for Industrial and Applied Mathematics 0036-1399/82/4204-0004 $01.00/0

LONG COMMON SUBSEQUENCES AND THE PROXIMITY

OF TWO RANDOM STRINGS*

J. MICHAEL STEELE†

Abstract. Let $(X_1, X_2, \ldots, X_n)$ and $(X'_1, X'_2, \ldots, X'_n)$ be two strings from an alphabet $\mathcal{A}$, and let $L_n$ denote their longest common subsequence. The probabilistic behavior of $L_n$ is studied under various probability models for the $X$'s and $X'$'s.

1. Introduction. Long molecules such as proteins and nucleic acids can be thought of schematically as sequences from a finite alphabet $\mathcal{A}$. From an evolutionary point of view it is natural to compare molecules by considering their common ancestors, and in schematic terms this reduces to the problem of considering the longest common subsequence of two given sequences.

Sankoff (1972) gave an efficient algorithm for calculating the length of the longest common subsequence. Subsequently, Sankoff and Cedergren (1973), and Sankoff, Cedergren, and Lapalme (1976) considered a number of empirical cases and conducted some Monte Carlo investigations. The first formal probabilistic analysis of the problem of long common subsequences was initiated in Chvátal and Sankoff (1975). To describe their work we first introduce some notation.
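In modern terms, the length computation is the standard dynamic program over prefixes. The following is a minimal sketch (the function name and loop structure are ours, not Sankoff's original formulation):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y via the
    standard O(len(x)*len(y)) prefix dynamic program."""
    m = len(y)
    prev = [0] * (m + 1)          # optimal values for the previous prefix of x
    for xi in x:
        curr = [0] * (m + 1)
        for j, yj in enumerate(y, start=1):
            # either extend a match, or inherit the best shorter solution
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev[m]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` returns 4, corresponding to the common subsequence "GTAB".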

By $X_i$ and $X'_i$, $1 \le i < \infty$, we denote two sequences of independent and identically distributed random variables with values in $\mathcal{A}$. The random variable of main interest is

$L_n := \max \{k : X_{i_1} = X'_{j_1},\ X_{i_2} = X'_{j_2},\ \ldots,\ X_{i_k} = X'_{j_k}$, where $1 \le i_1 < i_2 < \cdots < i_k \le n$ and $1 \le j_1 < j_2 < \cdots < j_k \le n\}$.

In words, $L_n$ is the largest cardinality of any subsequence common to the sequences $(X_1, X_2, \ldots, X_n)$ and $(X'_1, X'_2, \ldots, X'_n)$.

Under the assumption that $|\mathcal{A}| = k$ and that $X_i$ and $X'_i$ are both uniform on $\mathcal{A}$, Chvátal and Sankoff proved the existence of the limit of the means,

(1.1)  $\lim_{n \to \infty} E L_n / n = c_k$.

Among other results, Chvátal and Sankoff obtained upper and lower bounds on $c_k$. These authors proved no results for $\mathrm{Var}\, L_n$, but on the basis of a Monte Carlo study they were led to conjecture $\mathrm{Var}\, L_n = o(n^{2/3})$.
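A Monte Carlo study of this kind is easy to reproduce in outline. The sketch below (sample sizes, seed, and function names are our arbitrary choices) estimates $E L_n / n$ for two independent strings uniform over a finite alphabet; for a two-letter alphabet the estimate falls near the Chvátal-Sankoff bounds for $c_2$:

```python
import random

def lcs_length(x, y):
    """Standard dynamic program for the LCS length."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

def estimate_ratio(n, alphabet_size, trials, seed=0):
    """Monte Carlo estimate of E[L_n]/n for two independent strings,
    each uniform over an alphabet of `alphabet_size` letters."""
    rng = random.Random(seed)
    draw = lambda: [rng.randrange(alphabet_size) for _ in range(n)]
    return sum(lcs_length(draw(), draw()) for _ in range(trials)) / (trials * n)
```

With $n$ of a few hundred and a few dozen trials, the binary-alphabet estimate comes out near 0.8.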

Deken (1979) was able to sharpen the bounds on $c_k$, and also noted that, as a consequence of Kingman's subadditive ergodic theorem (Kingman (1973)), one actually has

(1.2)  $\lim_{n \to \infty} L_n / n = c$  a.s.,

where $c$ depends on the distributions of the processes $\{(X_i, X'_i) : 1 \le i < \infty\}$.

This result naturally entails $\mathrm{Var}\, L_n = o(n^2)$, but no further progress was made on the variance problem.

* Received by the editors November 28, 1980. This work was supported in part by the Office of Naval Research under contract N00014-76-C-0475.

† Department of Statistics, Stanford University, Stanford, California 94305.


The present article takes up several aspects of the study of $L_n$. In the second section, as an elementary application of an inequality of Efron and Stein (1981), it is proved that $\mathrm{Var}\, L_n = O(n)$. This makes only modest progress on the Chvátal-Sankoff conjecture that $\mathrm{Var}\, L_n = o(n^{2/3})$, but it still serves to supplement (1.2) with a rate of convergence result.

The third section takes up the question of the behavior of $L_n$ under more general assumptions than independence. A simple complement of Kingman's subadditive ergodic theorem (Kingman (1973)) is derived and then applied to $L_n$. The coupling method which is used here (or the Radon-Nikodym method which is sketched) may likely be of use in many other problems where subadditivity is available but stationarity is absent.

The fourth section branches out from the explicit analysis of $L_n$. It addresses the question of whether there exist statistics which are more tractable than $L_n$, but which still reasonably measure the genetic proximity of long molecules. The principal new candidate is $T_n$, the total number of common subsequences. Here one can compute $ET_n$ exactly, but we note $T_n$ has other drawbacks to its analysis.

The final section makes brief comment on some open problems and related literature.

2. A variance bound. Let $S(v_1, v_2, \ldots, v_{n-1})$ denote any real-valued function of $n-1$ vectors $v_i \in \mathbb{R}^d$, and suppose $V_i$, $1 \le i < \infty$, is any sequence of independent, identically distributed random vectors in $\mathbb{R}^d$. We then define new random variables $S_i = S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n)$ for $1 \le i \le n$, and we further set $\bar S = (1/n) \sum_{i=1}^{n} S_i$. Tukey's jackknife estimate for the variance of $S$ is $\sum_{i=1}^{n} (S_i - \bar S)^2$, and Efron and Stein (1981) have proved the very useful inequality,

(2.1)  $\mathrm{Var}(\bar S) \le E \sum_{i=1}^{n} (S_i - \bar S)^2$.

The main point of this section is to show that (2.1) leads to the bound

(2.2)  $\mathrm{Var}\, L_n = O(n)$,

under the general assumption that the $V_i = (X_i, X'_i)$ are independent and identically distributed. In fact, one can prove the following result.

THEOREM 1. For each $n$, suppose there is defined a function $S(x_1, x_2, \ldots, x_{n-1})$ from $(\mathbb{R}^d)^{n-1}$ to $\mathbb{R}$. Suppose also that $V_i$, $1 \le i < \infty$, is any sequence of independent random vectors in $\mathbb{R}^d$, and for $1 \le i \le n$, $1 \le n < \infty$, set

(2.3)  $S_{i,n} = S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n)$.

If $E(S_{i,n} - S_{i',n})^2$ is bounded for all $1 \le i < i' \le n$ and $1 \le n < \infty$, then

(2.4)  $\mathrm{Var}\, S(V_1, V_2, \ldots, V_{n-1}) = O(n)$.

Proof. Let the bound on $E(S_{i,n} - S_{i',n})^2$ be $B$. Fix $n$, define $\bar S = (1/n) \sum_{j=1}^{n} S_{j,n}$, and let

(2.5)  $D_n = S(V_1, V_2, \ldots, V_{n-1}) - \bar S = \frac{1}{n} \sum_{i=1}^{n} \bigl( S(V_1, V_2, \ldots, V_{n-1}) - S(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n) \bigr)$.

By Schwarz' inequality,

(2.6)  $\mathrm{Var}\, S(V_1, V_2, \ldots, V_{n-1}) \le \mathrm{Var}(\bar S) + \mathrm{Var}\, D_n + 2 (\mathrm{Var}\, \bar S)^{1/2} (\mathrm{Var}\, D_n)^{1/2}$,


and

(2.7)  $\mathrm{Var}\, D_n \le E D_n^2 \le B$.

Since $E(S_{i,n} - S_{i',n})^2 \le B$ one also has $E(S_{i,n} - \bar S)^2 \le B$. So inequalities (2.1), (2.6), and (2.7) entail

(2.8)  $\mathrm{Var}\, S(V_1, V_2, \ldots, V_{n-1}) \le nB + B + 2(nB)^{1/2} B^{1/2} = B(n^{1/2} + 1)^2$.

This completes the proof of the theorem with a very specific form of the $O(n)$ term. ∎

Returning to $L_n$ we note that, for $V_i = (X_i, X'_i)$ and $\mathcal{A} = \{1, 2, \ldots\}$, Theorem 1 is applicable to $S(V_1, V_2, \ldots, V_n) = L_n(V_1, V_2, \ldots, V_n) \equiv L_n$. Since

(2.9)  $0 \le L(V_1, V_2, \ldots, V_n) - L(V_1, V_2, \ldots, V_{i-1}, V_{i+1}, \ldots, V_n) \le 1$,

it is trivial that (2.8) can be taken with B = 1. In summary we have the following bound.
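The one-coordinate sensitivity behind this choice of $B$ can be probed numerically. The sketch below (strings, alphabet, and seed are our arbitrary choices) checks the elementary one-string fact that deleting a single symbol from one of the two sequences changes the LCS length by at most 1:

```python
import random

def lcs_length(x, y):
    """Standard dynamic program for the LCS length."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0] * (len(y) + 1)
        for j, yj in enumerate(y, start=1):
            curr[j] = prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1])
        prev = curr
    return prev[-1]

def max_single_deletion_change(x, y):
    """Largest drop in LCS length over all single-symbol deletions
    from the first string; deletion can never increase the length."""
    full = lcs_length(x, y)
    return max(full - lcs_length(x[:i] + x[i + 1:], y) for i in range(len(x)))

rng = random.Random(7)
for _ in range(25):
    x = [rng.randrange(3) for _ in range(14)]
    y = [rng.randrange(3) for _ in range(14)]
    assert 0 <= max_single_deletion_change(x, y) <= 1
```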

COROLLARY 1. If the $(X_i, X'_i)$ are i.i.d. with values in $\mathcal{A} \times \mathcal{A}$ then

(2.10)  $\mathrm{Var}\, L_n \le (n^{1/2} + 1)^2$.

By the usual Borel-Cantelli and subsequence arguments together with (2.9) and (2.10) one can prove a rate result.

COROLLARY 2. We have for all $\varepsilon > 0$ that

(2.11)  $L_n - EL_n = o(n^{3/4 + \varepsilon})$ with probability one.

Since the techniques for proving (2.11) are well known and since the result is not the best possible, there is no reason to include the proof. This is nevertheless the first rate result available on $L_n$, since such rates cannot be obtained in general from the subadditive ergodic theorem (cf. Hammersley (1974, p. 670)).

3. Nonstationary sequences. By Deken's observation we know Kingman's theorem implies that $L_n/n$ converges almost surely under the assumption that the $V_i$, $1 \le i < \infty$, form a stationary sequence. The point of this section is to give a very simple illustration of how Kingman's theorem can also be used for nonstationary processes. Naturally, one must appeal to some underlying asymptotic stationarity, but the resulting class of results seems useful enough to merit recording. In particular, one should compare the present result to the "substationary" subadditive ergodic theorem of Abid (1978). That result apparently does not suffice for the application to $L_n$ given here, and it is considerably more complicated.

By a subadditive sequence of functions on $E$ we denote a sequence $h_n : E^n \to \mathbb{R}$ which satisfies

(3.1)  $h_{m+n}(e_1, e_2, \ldots, e_{m+n}) \le h_m(e_1, e_2, \ldots, e_m) + h_n(e_{m+1}, e_{m+2}, \ldots, e_{m+n})$.

As an example, we note that if $E = \mathcal{A} \times \mathcal{A}$ and $e_i = (a_i, a'_i)$, then letting $h_n(e_1, e_2, \ldots, e_n)$ denote the length of the longest common subsequence of $(a_1, a_2, \ldots, a_n)$ and $(a'_1, a'_2, \ldots, a'_n)$ one has (3.1). Because of the applications we have in view, we will also focus on monotone subadditive functions, i.e., those functions which satisfy (3.1) as well as

(3.2)  $h_{n-m}(x_{m+1}, x_{m+2}, \ldots, x_n) \le h_n(x_1, x_2, \ldots, x_n)$ for all $m \le n$ and all $\{x_1, x_2, \ldots, x_n\}$.

We will say that a stochastic process $\{X_i\}_{i=1}^{\infty}$ on the discrete state space $E$ has a stationary ergodic coupling if there is a stationary ergodic process $\{X'_i\}_{i=1}^{\infty}$ on the same probability space such that $Z_i = (X_i, X'_i)$ is a coupling, i.e., such that the stopping time $\tau = \min \{i : X_i = X'_i\}$ is finite with probability one, and $X_i = X'_i$ for all $i \ge \tau$.

It is well known that couplings are a convenient and powerful way of expressing the asymptotic properties of stochastic processes (see, e.g., Griffeath (1978)). The next result illustrates this ease of application.

THEOREM 2. Suppose that $h_n$ is a positive and monotone sequence of subadditive functions on $E$. If $\{X_i\}_{i=1}^{\infty}$ is a stochastic process with state space $E$ for which there is a stationary ergodic coupling, then

(3.3)  $\lim_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n = c$  a.s.

for some constant $c$.

Proof. Let $\{X'_i\}_{i=1}^{\infty}$ denote the stationary ergodic process to which $\{X_i\}_{i=1}^{\infty}$ may be coupled, and let $\tau$ be the coupling time, i.e., $\tau = \min \{i : X_i = X'_i\}$. The doubly indexed process $Y_{s,t} = h_{t-s}(X'_{s+1}, X'_{s+2}, \ldots, X'_t)$ is easily checked to have the properties:

(3.4a)  $Y_{s,u} \le Y_{s,t} + Y_{t,u}$ whenever $s < t < u$.

(3.4b)  The joint distributions of the shifted process $\{Y_{s+1,t+1}\}$ are the same as those of the unshifted process $\{Y_{s,t}\}$.

(3.4c)  The expectations $g_t = E Y_{0,t}$ satisfy $g_t \ge -At$ for some constant $A$ and for all $t$.

The properties (3.4a)-(3.4c) are exactly the hypotheses of Kingman's theorem (Kingman (1973)), so by its conclusion we have

(3.5)  $\lim_{n \to \infty} Y_{0,n}/n = \lim_{n \to \infty} h_n(X'_1, X'_2, \ldots, X'_n)/n = c$  a.s.

Here, to conclude that the limit is indeed a constant we have made use of the fact that Kingman's theorem assures that the limit is shift invariant, and we have assumed that $\{X'_i\}_{i=1}^{\infty}$ is ergodic.

Now we have by (3.1), (3.2), and the definition of $\tau$ that

(3.6)  $h_n(X_1, X_2, \ldots, X_n) \le h_\tau(X_1, X_2, \ldots, X_\tau) + h_{n-\tau}(X_{\tau+1}, X_{\tau+2}, \ldots, X_n) = h_\tau(X_1, X_2, \ldots, X_\tau) + h_{n-\tau}(X'_{\tau+1}, X'_{\tau+2}, \ldots, X'_n) \le h_\tau(X_1, X_2, \ldots, X_\tau) + h_n(X'_1, X'_2, \ldots, X'_n)$.

Since $\tau < \infty$ with probability one, (3.5) and (3.6) yield

(3.7)  $\limsup_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n \le c$.

To handle the limit infimum we need only consider the analogous inequality with the variables reversed, i.e.,

$h_n(X'_1, X'_2, \ldots, X'_n) \le h_\tau(X'_1, X'_2, \ldots, X'_\tau) + h_n(X_1, X_2, \ldots, X_n)$,

and we obtain

$c \le \liminf_{n \to \infty} h_n(X_1, X_2, \ldots, X_n)/n$

to complete the proof. ∎

COROLLARY 1. If $V_i$, $1 \le i < \infty$, is an irreducible, aperiodic, positive recurrent Markov chain with state space $\mathcal{A} \times \mathcal{A}$, then no matter what the initial distribution $\pi(v) = P(V_1 = v)$, one has with probability one

$\lim_{n \to \infty} L_n(V_1, V_2, \ldots, V_n)/n = c$

for some constant $c$.

Proof. To prove the corollary one only has to exhibit an appropriate coupling; and, in this case, the existence of such a coupling is well known (see, e.g., Hoel, Port, and Stone (1972)).

One can also prove the above corollary without recourse to coupling; one can use an absolute continuity argument. Under the hypotheses of the corollary there is a stationary measure $\pi'$. Moreover, the initial measure $\pi$ is absolutely continuous with respect to $\pi'$ (since by the irreducibility and positive recurrence $\pi'(a_1, a_2) > 0$ for all $(a_1, a_2) \in \mathcal{A} \times \mathcal{A}$). If $\{V'_i : 1 \le i < \infty\}$ is the process with initial distribution $\pi'$ and the same transition function as $\{V_i : 1 \le i < \infty\}$, it is further true that the measure $\mathscr{P}$ for the infinite process $\{V_i : 1 \le i < \infty\}$ is absolutely continuous with respect to the measure $\mathscr{P}'$ for $\{V'_i : 1 \le i < \infty\}$. Since $L(V'_1, V'_2, \ldots, V'_n)$ satisfies the hypotheses of Kingman's subadditive ergodic theorem, $\{\omega : \lim_{n \to \infty} L(V'_1, V'_2, \ldots, V'_n)/n = c\}$ is a set of $\mathscr{P}'$ measure one. By the absolute continuity $\mathscr{P} \ll \mathscr{P}'$, the set $\{\omega : \lim_{n \to \infty} L(V_1, V_2, \ldots, V_n)/n = c\}$ has $\mathscr{P}$ measure one. This is precisely the conclusion of the corollary. ∎

4. Alternative statistics. The random variable $L(V_1, V_2, \ldots, V_n)$ certainly is an interesting measure of genetic proximity, but it appears to be hard to handle. In such a situation it is natural to look for suitable alternatives.

To introduce one such alternative, let $(X_1, X_2, \ldots, X_n)$ and $(X'_1, X'_2, \ldots, X'_n)$ denote two sequences of values from $\mathcal{A}$. By $A$, $B$ we denote subsets of $\{1, 2, \ldots, n\}$, say $A = \{i_1, i_2, \ldots, i_k\}$ and $B = \{j_1, j_2, \ldots, j_k\}$ if $|A| = |B| = k$. Next we set

(4.1)  $\rho(A, B) = \begin{cases} 1 & \text{if } X_{i_1} = X'_{j_1},\ X_{i_2} = X'_{j_2},\ \ldots,\ X_{i_k} = X'_{j_k}, \\ 0 & \text{otherwise.} \end{cases}$

The statistic of interest in this section is

(4.2)  $T_n = \sum_{A,B} \rho(A, B)$,

where the sum is over all pairs of subsets of $\{1, 2, \ldots, n\}$ and it is understood that $\rho(A, B)$ is taken to be zero if the cardinalities of $A$ and $B$ differ, i.e., $|A| \ne |B|$.

If the $X_i$, $1 \le i < \infty$, and the $X'_i$, $1 \le i < \infty$, are all independent, and $P(X_i = a_\ell) = p_\ell$, $P(X'_i = a_\ell) = p'_\ell$ for all $i$, $\ell$, it is easy to see that

(4.3)  $E T_n = \sum_{k=0}^{n} \binom{n}{k}^2 \Bigl( \sum_{\ell} p_\ell p'_\ell \Bigr)^k$.
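Formula (4.3) can be checked directly for small $n$: since each pair $(A, B)$ of equal-size index sets matches with probability $q^{|A|}$, where $q = \sum_\ell p_\ell p'_\ell$, summing that probability over all pairs must reproduce the closed form. A sketch (the values of $n$ and $q$ are arbitrary test choices):

```python
from itertools import combinations
from math import comb, isclose

def expected_T_closed(n, q):
    """Closed form (4.3) with q = sum_l p_l * p'_l."""
    return sum(comb(n, k) ** 2 * q ** k for k in range(n + 1))

def expected_T_brute(n, q):
    """E T_n by summing P(rho(A, B) = 1) = q**|A| over every pair of
    equal-cardinality subsets A, B of {1, ..., n}."""
    idx = range(n)
    return sum(
        q ** k
        for k in range(n + 1)
        for _ in combinations(idx, k)
        for _ in combinations(idx, k)
    )

assert isclose(expected_T_closed(5, 0.25), expected_T_brute(5, 0.25))
```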

This explicit formula is quite a contrast to the mystery surrounding $EL_n$ under similar hypotheses. A number of qualitative properties of $ET_n$ are also evident from (4.3). In particular, if we set $p_\ell = p'_\ell$ for $1 \le \ell \le |\mathcal{A}| = a$ and take $\mathcal{A}$ finite, then


(4.4)  $\phi(p) = E_p T_n = \sum_{k=0}^{n} \binom{n}{k}^2 \Bigl( \sum_{\ell=1}^{a} p_\ell^2 \Bigr)^k$

is easily checked to be a Schur-convex function, i.e., $\phi(p) \le \phi(p')$ whenever $p$ is majorized by $p'$. (For an elaboration of this terminology see Hardy, Littlewood and Pólya (1951), and for an elaboration of the many consequences of Schur convexity see the treatise by Marshall and Olkin (1979).)
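The Schur convexity of (4.4) is easy to observe numerically: along a chain of distributions ordered by majorization, the expectation increases with the concentration of $p$. The distributions below are our own test values:

```python
from math import comb

def phi(n, p):
    """E_p T_n from (4.4) with p' = p; increasing in q = sum_l p_l**2."""
    q = sum(x * x for x in p)
    return sum(comb(n, k) ** 2 * q ** k for k in range(n + 1))

# (1/3, 1/3, 1/3) is majorized by (1/2, 1/3, 1/6), which in turn is
# majorized by (1, 0, 0); Schur convexity predicts increasing values.
chain = [(1/3, 1/3, 1/3), (1/2, 1/3, 1/6), (1.0, 0.0, 0.0)]
values = [phi(8, p) for p in chain]
assert values[0] < values[1] < values[2]
```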


Despite the mathematical simplicity of $T_n$, as evidenced by (4.3) and (4.4), it provides only a partial surrogate for $L_n$. In the first place, $T_n$ tends to be very large, and there is no efficient algorithm for finding $T_n$. Thus, from a computational viewpoint, $L_n$ is a superior statistic. Also, as yet, there is no information at all about the variance of $T_n$ or its limit properties.

5. Open problems. The main open problems concern the expectations

(5.1)  $\phi_n(p) = E L_n$

under the hypotheses of independence and identical distribution as applied in (4.4).

For one explicit conjecture, it seems inevitable that $\phi_n(p)$ is Schur convex (just as $E_p T_n$ was proved to be). Perhaps it would be easier to consider the limit,

(5.2)  $\psi(p) = \lim_{n \to \infty} \phi_n(p)/n$.

Again, it must be true that $\psi(p)$ is Schur convex, but so far even this has not been proved.

The older problems concern the numerical values of $\psi(p)$. Perhaps progress can be made on this problem by taking a more algorithmic point of view. Is there an efficient algorithm for computing the approximate value of $\psi(p)$ or $\phi_n(p)$ with a guaranteed error bound?

Given the results of § 2, it is very interesting to see if one can improve (2.10) to show $\mathrm{Var}\, L_n = o(n)$. This would be the first really nontrivial step toward the Chvátal-Sankoff conjecture, and it would seem to require some genuinely new combinatorial insight to settle the point one way or the other.

Finally, the main scientific problem is to find a replacement for $L_n$ which still has a genetic justification. The null distributions of $L_n$ seem likely to be always out of reach, and major progress will be made when $L_n$ finds a suitable substitute. The statistic $T_n$ is a reasonable first choice, but it leads to its own problems. For example, what is the order of the growth of $\mathrm{Var}\, T_n$?

In the search for surrogates for $L_n$ it may be critical to consider the variety of problems to which it has been applied. In addition to the application to molecule comparisons noted previously, there is a natural application in communications. In particular, Bradley and Bradley (1978) have applied $L_n$ in the study of bird songs.

There are also a variety of potential uses in computer science, and for an introduction there it seems useful to refer to the papers of Aho, Hirschberg and Ullman (1976), Okuda, Tanaka and Kasai (1976), Selkow (1977) and Wagner and Fischer (1974). In at least some of these papers in which $L_n$ has been used, it seems there must exist a more tractable substitute.

Acknowledgment. The observation that absolute continuity provides a second proof of Corollary 1 in § 3 is due to Steve Lalley, who kindly commented on an earlier draft of this article.

REFERENCES

A. V. AHO, D. S. HIRSCHBERG AND J. D. ULLMAN (1976), Bounds on the complexity of the longest common subsequence problem, J. Assoc. Comput. Mach., 23, pp. 1-12.

M. ABID (1978), Un théorème ergodique pour des processus sous-additifs et sur-stationnaires, C. R. Acad. Sci. Paris Sér. A, 287, pp. 149-152.

D. W. BRADLEY AND R. A. BRADLEY (1978), Application of sequence comparison to the study of bird songs, Tech. Rep., Dept. of Data Processing, California State Univ., Long Beach, CA.


V. CHVÁTAL AND D. SANKOFF (1975), Longest common subsequences of two random sequences, J. Appl. Probab., 12, pp. 306-315.

J. G. DEKEN (1979), Some limit results for longest common subsequences, Discrete Math., 26, pp. 17-31.

B. EFRON AND C. STEIN (1981), The jackknife estimate of variance, Ann. Statist., 9, pp. 586-596.

D. GRIFFEATH (1978), Coupling methods for Markov processes, in Studies in Probability and Ergodic Theory, Advances in Mathematics Supplementary Studies, Vol. 2, G.-C. Rota, ed., Academic Press, New York.

J. M. HAMMERSLEY (1974), Postulates for subadditive processes, Ann. Probab., 2, pp. 652-680.

G. H. HARDY, J. E. LITTLEWOOD AND G. PÓLYA (1951), Inequalities, Cambridge University Press, Cambridge.

P. G. HOEL, S. C. PORT AND C. J. STONE (1972), Introduction to Stochastic Processes, Houghton Mifflin, Palo Alto, CA.

J. F. C. KINGMAN (1973), Subadditive ergodic theory, Ann. Probab., 1, pp. 883-909.

A. W. MARSHALL AND I. OLKIN (1979), Inequalities: Theory of Majorization and Its Applications, Academic Press, New York.

T. OKUDA, E. TANAKA AND T. KASAI (1976), A method for correction of garbled words based on the Levenshtein metric, IEEE Trans. Comput., C-25, pp. 172-177.

D. SANKOFF (1972), Matching sequences under deletion/insertion constraints, Proc. Nat. Acad. Sci. U.S.A., 69, pp. 4-6.

D. SANKOFF AND R. J. CEDERGREN (1973), A test for nucleotide sequence homology, J. Molecular Biol., 77, pp. 159-164.

D. SANKOFF, R. J. CEDERGREN AND G. LAPALME (1976), Frequency of insertion-deletion, transversion, and transition in evolution of 5S ribosomal RNA, J. Molecular Evolution, 7, pp. 133-149.

S. M. SELKOW (1977), The tree-to-tree editing problem, Inform. Processing Letters, 6, pp. 184-186.

R. A. WAGNER AND M. J. FISCHER (1974), The string-to-string correction problem, J. Assoc. Comput. Mach., 21, pp. 168-173.