Using multitype branching models to analyze bacterial pathogenicity

We apply multitype continuous time Markov branching models to study pathogenicity in E. coli, a bacterium belonging to the genus Escherichia. First, we examine brie(cid:29)y the properties of multitype branching processes and we also survey some fundamental limit theorems regarding the behavior of such models under various conditions. These theorems are then applied to discrete, state dependent models in order to analyze pathogenicity in a published clinical data set consisting of 251 strains of E. coli. We use well established methods, incorporating maximum likelihood techniques, to estimate speciation rates as well as the rates of transition between di(cid:27)erent states of the models. From the analysis, we not only derive new results, but we also verify some preexisting notions about virulent behavior in bacterial strains.

1. Introduction In this paper, we explore the possibility of utilizing the theory of branching processes into analyzing pathogenicity in bacterial strains.For that purpose, we rst review fundamental properties of multitype, continuous time Markov branching processes as well as their behavior in the long time limit.Then, we apply multitype branching models to examine virulence in E. coli strains, and perform an in depth analysis on the limits of proportions of E. coli in dierent states of the models.The strains used in this study were isolated from human hosts, and obtained from a previously published data set (by Bartoszek et al. [4]) of pathogenic and nonpathogenic E. coli bacteria.
In recent years, considerable research has been conducted regarding the use of branching processes to explore various biological phenomena.A two type Markov branching model, coined as the `binary state speciation and extinction' (BiSSE) model, was proposed by Maddison et al. [12] to assess the impact of binary characters on rates of diversicationthe dierence between speciation and extinction rates.The parameters of the BiSSE model describe speciation and extinction in the two types, as well as the transitions that take place from one type to the other.This model was recently used by Bartoszek et al. [4] in order to estimate parameters from genetic data of E. coli populations, and predict pathogenicity of various virulence factors (VFs)agents that enable bacteria to replicate and spread within the host by damaging and eluding its defences [6].Bartoszek et al. used a modied version of the BiSSE model, in which the extinction rates were assumed to be zero.Another Markov branching model, named as the `multistate speciation and extinction' (MuSSE) model, was later introduced by FitzJohn [7] as an extension of the BiSSE model to binary traits with more than two states.
In this paper, we apply the MuSSE model with zero extinction rates to estimate state dependent speciation and transition rates for a real, known collection of pathogenic and nonpathogenic strains of E. coli bacteria (obtained from [4]) that reside in the human digestive and urinary tracts.The bacterial strains are subdivided into four categories depending on whether or not they carry a VF in the gut and bladder of the human host.Lately, a number of researchers have successfully used the discrete state MuSSE model for estimation of various parameters.For instance, Sachs et al. (2013) used the MuSSE model to estimate trait-dependent diversication and transition rates amongst states consisting of free-living, mutualistic, parasitic, and dual lifestyle bacteria in the Proteobacteria phylum.Using this model, they inferred that proteobacterial mutualist lineages arise from freeliving and parasitic ancestors, but rarely transition back to a parasitic or freeliving status.Pirie et al. [14] implemented the MuSSE framework to compare diversication rates of 800 species in the plant genus Erica, endemic to ve geographical regionsPalearctic, Tropical African, Madagascan, Drakensberg and Cape.Arbuckle et al. [1] used the BiSSE and MuSSE models to show that speciation and extinction rates vary across defensive traits in amphibians.
With this study, we ask the following questions: (a) Is it possible to utilize state dependent branching processes to analyze pathogenic behavior in bacteria?(b) How do maximum likelihood methods behave when estimating parameters (such as speciation rates and transition rates between states) of multitype models?(c) Based only on a nite sample, do the estimated parameters provide reasonable information on the almost sure limits of the proportion of bacterial strains, and if so, can they be used further to obtain plausible condence regions for these limits?
To answer these questions, we proceed by rst giving a concise survey of ntype branching processes.More specically, we recall fundamental theorems regarding the long time behavior of branching models.These limit theorems are obtained from earlier works of Athreya and Ney [2] and Janson [8].In order to thoroughly explain the mathematical background, which is pertinent to understanding the forthcoming biological application, we introduce some technical notation as well.For multitype branching models, a matrix known as the mean ospring matrixwhose entries consist of the net growth and transition rates of the processis of central importance; the limit behavior of the process can be completely characterized by this matrix through its eigenvalues and corresponding eigenvectors.In the coming sections, we evaluate the mean ospring matrix for various submodels of the multitype branching process, and present interesting results associated with the largest eigenvalue of these matrices.After reviewing the general characteristics of multitype branching processes, we apply 4type branching submodels to a known clinical data set of virulent and nonvirulent E. coli strains, and, with the help of the MuSSE model, obtain rates of speciation and transition among the 4 states of the models.Using the aforementioned limit theorems, we also perform an in depth analysis on the limits of proportions of E. coli strains in dierent states.Finally, we compare the results obtained from various models and draw some useful inferences regarding pathogenic behavior in virulent bacteria.

General multitype branching processes
We consider an ntype continuous time Markov branching process X(t), given by the column vector X(t) = X 1 (t), . . ., X n (t) , t ≥ 0, where each X i (t), i = 1, . . ., n, represents the number of typei particles at time t.The lifetime of each type is assumed to be exponentially distributed with intensity a i , i = 1, . . ., n.We introduce a vector a whose components comprise of a i , i.e., a = (a 1 , . . ., a n ).With each type i, we also associate j = (j 1 , . . ., j n ), a vector of nonnegative integer coordinates.Then, the ospring distribution of the n types is specied by the coordinates of p(j), where p(j) = p (1) (j), . . ., p (n) (j) and j p (i) (j) = 1, for all i = 1, . . ., n.Here, p (i) (j) = p (i) (j 1 , . . ., j n ) gives the probability that a typei particle creates j 1 type1 ospring, j 2 type2 ospring, . .., j n typen ospring [3].Letting s = (s 1 , . . ., s n ), the generating function is recognized as f (s) = f (1) (s), . . ., f (n) (s) , where, for each i = 1, . . ., n, determines the distribution of the number of various types of ospring produced by a typei particle.Further, we consider the matrix A = {a ik : i, k = 1, . . ., n}, where The mean matrix of the branching process is given by . Following [8], we identify the mean ospring matrix A as where T in the superscript denotes the matrix transpose.We let γ be the largest positive eigenvalue of A, and, let u and v be the left and right normalized column eigenvectors, respectively, of A, corresponding to γ.Thus, u A = γu and Av = γv.We now recall some limit results for multitype branching processes that will be used in later sections for the analysis of branching models.These results, numbered 1, 2 and 3 here, are stated formally in Appendix A of the paper as Theorems A.3, A.4 and A.5, respectively.
1.The fundamental limit result for multitype branching processes signies the existence of a nonnegative random variable W such that e −γt X(t) . This implies that the number of particles increases to innity at a speed e γt , with the distribution of the types specied by a random multiple of the right eigenvector.2. For a real nonnegative number N , let T N be the rst time when the total size of the branching process reaches a given level N > 0.Then, letting C be the sum of the components of v, and under additional assumptions (see Appendix A), the asymptotic distribution of types at T N is given by the right normalized eigenvector v in the sense that Under further assumptions on the size of the eigenvalues of the ospring matrix A, as N → ∞, the centered and normalized type distribution has a Gaussian limit distribution in the sense that , where N (0, Σ b ) is a multivariate normal distribution and Σ b is the covariance matrix stated explicitly in Appendix A [8].

A 4type branching model Consider a 4type, continuous time
Markov branching process, X(t) = X 1 (t), . . ., X 4 (t) , t ≥ 0, where each X i (t), i = 1, . . ., 4, gives the number of typei particles at time t.Let λ i represent the speciation rate of typei particles, i = 1, . . ., 4. Further, q 12 and q 21 are the rates of transition from type1 → 2 and type2 → 1, respectively, and similarly, q 34 and q 43 the transition rates from type3 → 4 and type 4 → 3, respectively.Finally, m 32 is assumed to be the transition rate from type3 to type2.The branching rates of X(t) are thus given as where the initial state X(0) can be either (0, 0, 1, 0) or (0, 0, 0, 1) , since type 3 and type4 are the dominating types (see Def. A.1). Figure 1 summarizes the parameters used in this branching process.The motivation behind choosing this particular model is that the results derived here will be applied in Section 4 to obtain results and draw inferences from a real data set of E. coli bacterial strains.Note that a 4type branching model would generally consist of 20 parameters: 4 speciation parameters, 4 extinction parameters and 12 parameters representing transition rates between states.Here we have assumed nonextinction, and we have also set 7 (out of 12) transition parameters to zero.
Using the above information, we are now able to nd the generating functions: (s 1 , s 2 , s 3 , s 4 ) = (λ 3 s 2 3 + q 34 s 4 + m 32 s 2 )/a 3 , f (4) (s 1 , s 2 , s 3 , s 4 ) = (λ 4 s 2 4 + q 43 s 3 )/a 4 .The mean ospring matrix A for this process is dened as where the eigenvalues of A, obtained using Maple 18.00 [13], are We have It can be seen that γ 1 ≥ 0 and γ 3 ≥ 0 for any set of parameters.Also, γ 1 > 0 if at least one of δ 1 and δ 2 is strictly positive, or if both are negative, then one is strictly negative, and a similar result holds for γ 3 .Further, we also require the eigenvectors of A to be used later in the application of limit theorems A.3A.5.Thus, using again Maple 18.00 [13], the left column eigenvectors of A, after substantial manual simplication, are found as and similarly, the right column eigenvectors are computed as where q 43 (m 32 +q 34 )+q 12 q 21 +q 43 m 32 .It should be noted that for certain parameter settings, where one of q 21 , q 43 or m 32 vanish, the above expressions do not apply.Instead, alternate formulae have to be found on a case by case basis.
Now let γ = max{γ 1 , γ 3 } be the largest positive eigenvalue, and let v = (ν 1 , ν 2 , ν 3 , ν 4 ) be the normalized form of the corresponding right eigenvector.Then, from Thm. A.3 (and as described in limit result 1), there exists a nonnegative random variable W , such that as t → ∞, Moreover, let ν 1 + ν 2 + ν 3 + ν 4 = C > 0 and recall that T N is the rst time when the total number of species reaches a level N .Using limit result 2 (or Thm.A.4), we have that as N → ∞, In order to apply the limit result 3 (see also Thm.A.5 in Appendix A), we need more information on the variance characteristics of the branching process.For that purpose, consider rst the column vectors, ξ i , i = 1, . . ., 4, given as with probability λ 4 /a 4 (0, 0, 1, −1) -"q 43 /a 4 .
Next, dene a matrix B as Hence, using Eq. ( 7) from Appendix A, the matrix Σ I is specied by where column vectors ûi and vi are determined as Finally using Eq. ( 6), the covariance matrix Σ b is given by where Under the condition that γ is greater than two times the second largest eigenvalue, and using the central limit theorem A.5, we have that as N → ∞.
In the following section, we apply the above 4type model to a clinical data set of 251 bacterial strains, and obtain insightful results concerning pathogenicity in E. coli bacteria.We will take N = 251, so that T N is the rst time when the total size of the branching process reaches 251, and hence we can obtain the proportion of strains in various states of the model.

Application of the branching model to E. coli strains data
From [4], we obtain an E. coli data set of N = 251 strains, which forms the tips of a given phylogenetic tree (see Fig. 1 and Fig. 3 in [4]).The tree is xed and describes the genealogical structure of the strains.The tree tips are grouped into 4 categories: pathogenic and nonpathogenic E. coli found in the human gastrointestinal tract, and, pathogenic and nonpathogenic E. coli found in the human urinary tract.Whether a bacterial strain is pathogenic or not in any environment, depends on whether or not it is positive for carrying a certain VF (such as toxins, invasins, hemolysins, etc.).Hence, the strains are divided into the following 4 states where U 0 : negative for VF in the urinary tract, U 1 : positive for VF in the urinary tract, K 1 : positive for VF in the digestive tract, K 0 : negative for VF in the digestive tract.
Virulence Factor (VF) From the data given in [4], we choose to analyze 9 VFs.These VFs along with the number of strains, N i , in each state i, i = 1, . . ., 4, are given in Table 1.We model this data set by applying a 4type, continuous time Markov branching process X(t) = (U 0 , U 1 , K 1 , K 0 ) and branching rates as given in Eq. ( 1). Figure 1 gives a diagrammatic representation of the parameters used in the model.In the gure, λ i , i = 1, . . ., 4, represent speciation rates of strains in various states, q 12 , q 21 , q 34 , q 43 are the transition rates from pathogenic to nonpathogenic states and vice versa, and nally, m 32 represents migration from state K 1 to U 1 .As in [4], extinction rates are set to zero for all 4 states, and, pathogenic and nonpathogenic strains are allowed to transition back and forth in the same environment (urinary tract or intestine).It is well known that E. coli travel from the gastrointestinal tract to the bladder in the human host, and cause urinary tract infections ( [5], [10]).Hence, we assume that E. coli migrate from states K 1 to U 1 .
obtain estimates for all parameters.This is achieved by making use of the `make.musse()'function in the R package diversitree [7], which allows for maximum likelihood estimation of the models' parameters.During the parameter estimation analysis of each VF, the initial state for the process is determined by the relative probability of observing type3 and type4 strains, i.e., 0, 0, N 3 /(N 3 + N 4 ), N 4 /(N 3 + N 4 ) .To increase the estimation power of the MuSSE framework, we try out various models in which dierent constraints are applied on the parameters (the extinction rates already being set to zero), as given in Table 2. From the table, it can be seen that parameters are either set to be zero, or, pairs of parameters are constrained to be equal.This is done by utilizing the `constrain()' function in the diversitree package.The most suitable constraint on the parameters is then chosen using the Bayesian information criterion or BIC [17], that is, for each VF separately, we choose that combination of constrained (or freely varying) parameter estimates which give the lowest BIC. VF

Results
The parameter values obtained as a result of the analysis are given in Table 3.We conclude that for 6 out of 9 VFs, whenever the parameters are constrained in some manner, the model is a better t to the given data set, in contrast to when all the parameters are allowed to vary freely.From the parameter values in Table 3, we have that: (a) λ 2 > λ 1 for 8 out of 9 VFs, and λ 3 > λ 4 for all VFs.Thus, virulent strains of E. coli, in both urinary and digestive environments, speciate at a higher rate than nonvirulent strains.
(b) q 21 ≥ q 12 and q 34 ≥ q 43 for all VFs.Thus, E.coli bacteria, in both digestive and urinary tracts, lose their pathogenicity at a higher rate as compared to gaining it.
Here, it is important to notice that the matrix Σ b has rank 3, hence, for the construction of the subsequent condence regions, we have removed the counts for the rst type (alternatively, we can choose to remove any one of the four types) in Eq. ( 4).From the theory developed in Chapters 4. strains.For a signicance level α = 0.01, the observed generalized squared distance D 2 , dened by Eq. ( 4) and ( 5), is given as D 2 ≤ χ 2 3 (0.01) = 11.345.The center of the condence ellipsoids is represented by (N 2 , N 3 , N 4 )/N , while the half-lengths of the axes are denoted by a, b, and c. of [9], at a given signicance level α, a 100(1 − α)% joint condence region for pwhich can be thought of as the mean of a multidimensional normal distributionis dened by ellipsoids such that, where χ 2 3 (α) is the upper αlevel quantile of the χ 2 distribution with 3 degrees of freedom.The quantity D 2 represents the square of the generalized distance from the centre of the condence ellipsoid to a constant density surface.For various VFs, the observed D 2 value is calculated using Eq. ( 4), and is given in the second column of Table 5.It can be seen that for astA and iroN, we are unable to nd D 2 , since the conditions of Thm.A.5 are not met for these two VFs; the second largest eigenvalue γ 1 is greater than γ 3 /2, i.e., γ 3 /2 < γ 1 < γ 3 .In this case, weak convergence is shown but to a random variable, whose distribution is not characterized, only its existence (Corollary 3.18 in [8]).
The axes of the condence ellipsoids and their relative lengths are determined using the eigenvalues and corresponding eigenvectors of the positive denite matrix Σ b [9].Since Σ b is a 3 × 3 matrix with the rst type removed, we can nd a simultaneous condence ellipsoid for p i when i = 2, 3 and 4. Let  5.
Since Thm.A.5 is an asymptotic result, and our sample size consists of only 251 data points, we must conrm that Thm.A.5 has been successfully implemented while nding the condence regions in the aforementioned analysis.To achieve this, we check how the observed proportions (N i /N ) behave for the observed number of strains (N = 251) by making use of simulated trees.For the estimated model parameters, given in Table 3, we obtain the clades evolution using the `tree.musse()'function of the diversitree package in R. For various VFs, excluding astA and iroN, we simulate 10000 trees, each with 251 tips.To ensure that the root of all simulated trees is a dominating type (assumption (F5) in Appendix A), take the prior distribution to be concentrated on states 3 and 4, with probabilities equaling the observed relative proportions of types3 and 4. For this, we use the `sample()' function of R which randomly selects either of the two states with desired probabilities.Out of the 10000 simulations, we take into consideration in the subsequent analysis, only those trees in which the observed tips have at least one obser-  vation of type3 or 4.This is due to the essential nonextinction assumption of Thm.A.5 (cf.Def.A.2).For each VF, the number of simulations that we use out of 10000 is denoted by N sim , and shown in Table 6.From the 251 tip counts of each simulated tree, we calculate the proportions of types, and then obtain the D 2 value given in Eq. ( 4), i.e., the squared generalized distance from the simulated proportions (with the rst coordinate removed) which lie on the surface of an ellipsoid, to the ellipsoid's centre (at v/C).Using the χ 2 distribution with 3 degrees of freedom, we obtain from Eq. ( 5), how `far in the tail' the observed D 2 values lie.
The results are shown in the form of histograms in Figure 3.We conclude that the observed proportions are close to the ellipsoid's centre, i.e., the quantile corresponds to a level greater than a cuto value of α = 0.01, and thus for the given number of strains, N = 251, we are indeed close to the asymptotic regime.From the histograms, we obtain the fraction of simulations, F sim , which give a value of the generalized square distance greater than the observed value of D 2 , as well as the fraction of simulations, f sim , for which the generalized squared distance is greater than the critical value, χ 2 3 (0.01) = 11.345.For each VF, these values are shown in Table 6.

Two variations of the branching model
We now model the same E. coli data set [4] by applying another 4type, Markov branching process with branching rates as shown in Figure 4.All parameters of speciation and transition are the same as in the original process X(t), except for the migration rate, m 32 , which is replaced by m 42 in this case.Since E. coli are assumed to travel from the intestinal to the urinary tract and cause infections, here we assume that bacteria migrate from the nonvirulent state K 0 to the virulent state U 1 .The mean ospring matrix, Â, Figure 4: Diagrammatic representation of the branching process Y (t).The four states are similar to the ones in the branching model X(t).The parameters λ i , i = 1, . . ., 4, represent speciation rates in each state, q 12 , q 21 , q 34 and q 43 are the transition rates between states, m 42 represents migration from state K 0 to U 1 .
for this process is given as Using Maple 18.00 [13], the left and right column eigenvectors of Â are obtained as , and S 1 are dened previously in Section 3. Similar to the rst model X(t), the package diversitree [7] in R is applied to each VF and estimates of all parameters are obtained, as shown in Table 7.We apply constraints given in    9: Results of the condence regions' analysis for p corresponding to the process Y (t).Nsim represents that number of simulated trees (out of 10000) in which the observed tips have at least a single observation of the dominating type3 or 4. Fsim is the fraction of simulations for which the generalized squared distance is greater than the observed D 2 value.fsim is that fraction of simulations for which the generalized squared distance is greater than χ 2 3 (0.01) = 11.345.
For each VF, γ3 is found to be the largest eigenvalue, with corresponding normalized eigenvector, say v.We again apply Thm.A.4 and using the estimated parameters, obtain the limiting values p = (p 1 , . . ., p4 ) = v/C, where C is the sum of coordinates of v.The values of p are given in Table 8.From Tables 7 and 8, we infer that: (a) λ 2 > λ 1 for all VFs and λ 3 < λ 4 for 6 out of 9 VFs, (b) q 21 > q 12 for 8 out of 9 VFs and q 34 > q 43 for 5 out of 9 VFs, (c) the probability, p2 + p3 , of prevalence of VFs in E. coli strains varies with each VFfor instance, the VF mG has the maximum probability of being maintained.An analysis of the condence regions for p is also carried out, similar to the one for p in Section 4. Figure 5 and Table 9 give the results for this analysis.
We apply yet another 4type branching process Z(t) = (U 0 , U 1 , K 1 , K 0 ) to the same data set.The branching rates are as shown in Figure 6.This time we include migration from both virulent and nonvirulent states in the gastrointestinal tract to the virulent state of the urinary tract, using parameters m 32 and m 42 .The mean ospring matrix Ā is given as The eigenvalues of Ā are Figure 6: Branching rates for the model Z(t).λ i , i = 1, . . ., 4, represent speciation rates, q 12 , q 21 , q 34 and q 43 are the rates of transition between states, m 32 and m 42 are migration rates from state K 1 to U 1 and K 0 to U 1 , respectively.The left and right column eigenvectors of Ā are obtained as and respectively, where with d 4 = λ 4 − q 43 − m 42 and δ 1 , δ 2 , δ 3 , G 1 , S 1 , Ĝ4 as dened in earlier sections.Model analysis is carried out in R, and as before, the initial state is determined by the relative probability of observing type3 and type4 strains.We use the same constraints given in Table 2, except that both m 32 and m 42 are allowed to vary.In addition, we use 21 more constraints in which the constraints from Table 2 are repeated using m 32 = m 42 .The best constraints are chosen according to BIC, and the parameter values obtained are given in Table 10.From the parameters, we compute the eigenvalues and corresponding eigenvectors of Ā. γ3 is found to be the largest eigenvalue for all VFs.Moreover, p = (p 1 , . . ., p4 ) , denoting the limits of proportions of E. coli  ) in which at least one observation of the dominating type3 or 4 was obtained, Fsim gives that fraction of simulations for which the generalized squared distance is greater than the observed value of D 2 , and fsim is that fraction of simulations for which the generalized squared distance is greater than χ 2 3 (0.01) = 11.345.
The sum p2 + p3 gives the probability of prevalence of various VFs in E. coli strains.An analysis regarding the condence regions for the limits of strain proportions was also performed, similar to the one for the previous models.The results are displayed in Figure 7 and Table 12.

Conclusions
We now list some concise but useful inferences drawn from the mathematical analysis of the three models.Here, we would also like to state that in a future work, we plan to carry out a rigorous analysis of a larger data set of E. coli strains, which would not only add to the current results, but also lead to more comprehensive and solid biological conclusions.
1. E. coli strains carrying a VF speciate at faster rates as compared to strains which do not carry a VF, in urinary tracts of the human host (Tables 3, 7  and 10).This result remains fairly constant in all three models.A similar result was also obtained in [4], where it was shown that speciation rates were higher for virulent E. coli strains as compared to nonvirulent strains that were isolated from human urine samples.The same result, however, does not hold for strains in the digestive tract; according to the process X(t) and Z(t), most virulent bacterial strains speciate faster than nonvirulent strains, but the opposite is true under the model Y (t).
2. E. coli bacteria lose their virulence at a higher rate as compared to gaining it, in both urinary and digestive tracts of the human hosts.Under all three models, this behavior is exhibited for majority of the VFs (Tables 3, 7 and  10).This is consistent with the fact that bacteria maintain their virulence only if conditions are favorable for host invasion or colonization, otherwise, they lose their pathogenicity, since the expression of VFs is costly to maintain and tends to decrease bacterial tness [4,11].3. Pathogenic and nonpathogenic bacteria in the gut migrate to the urinary tract at dierent rates.This result is inferred from the analysis of the third branching model Z(t) with two migration rates (Table 10).For 5 out of 9 VFs, pathogenic bacteria are found to migrate faster than the nonpathogenic ones.However, compared separately, from Tables 3 and 7, m 32 > m 42 for most VFs (8 out of 9). 4. The probability of maintaining virulence in bacterial strains (p 2 +p 3 , p2 + p3 and p2 + p3 in the models X(t), Y (t) and Z(t), respectively) varies with the VF under considerationsee Tables 4, 8 and 11.However, under all three branching models, it is consistently seen that the VF mG has the highest chance of prevailing in virulent bacterial strains.
5. The condence region analysis of the limits of proportions of E. coli strains shows that Thm.A.5 is more successfully implemented to the data set under the rst two models, X(t) and Y (t), as compared to the third model, Z(t).From gures 3, 5, and 7 we conclude that for Thm.A.5, the branching process X(t) is a better t for analyzing strains carrying astA, mG, fyuA, hly1, iutA, and sat, while Y (t) is suitable for cnf1, iroN and papC.
A. Appendix In this section, we state the limit theorems that are applied in this paper for the analysis of bacterial strains data.We rst list some important denitions and assumptions that relate to ntype branching processes.
Definition A.1 A typep is said to be dominating if it is possible to obtain every other type, say typeq (q = 1, . . ., n, and q = p), in a branching process that starts with a single typep particle.The set of all dominating types is called the dominating class [8].Definition A.2 A branching process becomes essentially extinct if there are no particles from the dominating class at some time instance [8].

Assumptions:
(F1) Following a branching event involving a typep particle, the number of typep particles either increases, or decreases by at most one, while the number of typeq particles, q = 1, . . ., n, q = p, always increases.
(F2) Following a branching event involving a typep particle, all changes in the number of typeq particles, q = 1, . . ., n, have nite means and variances, that is, the process never explodes.(F3) The largest eigenvalue, γ, of the mean ospring matrix A is positive, i.e., the multitype branching process is supercritical, hence, there exists a positive probability of nonextinction.(F4) The largest eigenvalue of A is simple, i.e., the algebraic multiplicity of γ is one.(F5) There exists a dominating typep, such that the multitype branching process starts with at least one particle of typep.(F6) The largest eigenvalue, γ, belongs to the dominating class.−→ ∞ as t → ∞ (Lemma 3.14 in [8]).We now state two more limit theorems, which are applied in the main sections of the paper with the special case b = (1, 1, . . ., 1) ∈ R n .Let ∆ be the set of all eigenvalues, γ i , of A. Of course, the largest eigenvalue γ also belongs to ∆.There exist projection matrices P γ i , such that γ i ∈∆ P γ i = I, where I is the identity matrix.Let P be the projection matrix onto the sum of the eigenspaces corresponding to γ i < γ/2, that is, P = γ i ∈∆ 1 P γ i , where ∆ 1 = {γ i ∈ ∆ : γ i < γ/2}.Also, let ξ i = (ξ i1 , . . ., ξ in ) be a random column vector with integer coordinates, denoting the change in population if a branching event occurs at a particle

Figure 1 :
Figure 1: Diagrammatic representation of the speciation and transition parameters in the 4type branching model X(t).
), we have for N = 251, (N 1 , . . ., N 4 ) N a.s.−→ p = (p 1 , . . ., p 4 ) , where p = (p 1 , . . ., p 4 ) := v/C is the limit of the proportions of E. coli strains.For all VFs, the values of N i /N and p i , i = 1, . . ., 4, are given in β, ζ, and η be the positive eigenvalues, and β, ζ, and η, the corresponding right eigenvectors of Σ b .Then, the axes of the condence ellipsoids centered at X(T N )/N , are given as ±a β , ±b ζ , and ± c η, with a = D β/N , b = D ζ/N , and c = D η/N being the half length of the three axes.For various VFs, the axes lengths can be computed, as shown in Table

2 D 2 = 6 . 589 Figure 3 :
Figure 3: For various VFs, histograms showing the distribution of generalized squared distances for simulated trees.The values less than χ 2 3 (0.01) = 11.345 are represented by white bars, and those greater than χ 2 3 (0.01) are given in black.Arrows in each histogram represent the observed D 2 value as obtained in Table5.

Figure 5 :
Figure 5: Distribution of generalized squared distances for those VFs for which assumptions of Thm.A.5 hold under the model Y (t).The values less than χ 2 3 (0.01) = 11.345 are represented by white bars, and those greater than 11.345 are given in black.Arrows in each histogram represent the observed D 2 value.VF

Figure 7 :
Figure 7: For the process Z(t), these histograms give the distribution of the generalized squared distances (obtained from Eq.(4)) of 10000 simulated trees for those VFs for which Thm. A.5 is satised.The values less than χ 2 3 (α) = 11.345, for α = 0.01, are represented by white bars, while the remaining are given in black.Arrowed lines on each histogram represent the observed D 2 value for each VF.

Theorem A. 3 (
Theorem 1 in[2]) For the branching process X(t), lim t→∞ e −γt X(t) = W v exists with probability 1, where W is a nonnegative random variable, γ is the largest eigenvalue of the mean ospring matrix A, and v is the normalized right eigenvector of A corresponding to γ.Let b ∈ R n be a xed column vector and let N ≥ 0. Dene T b (N ) = min{t ≥ 0 : b • X(t) ≥ N } to be the rst time when b • X(t) exceeds N .Also let b • v > 0.Then, conditioned on essential nonextinction, T b (N ) < ∞ for all N ≥ 0 and b • X(t) a.s.

Table 1 :
A list of various VFs and the number of strains, N [4] in each of the four states obtained from[4].Note that N 1 + . . .+ N 4 = 251 for each VF.

Table 3 :
Parameter values for all VFs in the branching model X(t).

Table 4 .
The sum, p 2 + p 3 , gives the probability of maintaining each VF in the E. coli strains.From Table4, we infer that the probability of maintaining VFs varies in the strains; it depends on the VF under consideration.Figure2compares the value p 2 + p 3 , with the sum N 2 /N + N 3 /N .It can be seen that E. coli strains carrying VFs mG, fyuA and iutA have a higher probability of being virulent as compared to strains carrying cnf1, hly1 and sat.Condence regions We now apply Thm.A.5 to construct condence regions for p the limit of proportions of E. coli strains.Using the expression in Eq. (3), let VF

Table 4 :
Limit values, p i , for the the branching model X(t), and the proportion of strains, N i /N , in each state, i = 1, . . ., 4. Here, N = 251, N i is the number of strains in each state i, and p i = ν i /C, where ν i (i = 1, .., 4) are the components of the normalized eigenvector v and C = Figure 2: A bar chart comparing the probabilities, p 2 + p 3 , of maintaining VFs in bacterial strains, with the sum, (N 2 + N 3 )/N , of strain proportions.The values used in this gure are obtained from Table 4. p 2 + p 3 is represented by grey bars and (N 2 + N 3 )/N by black bars.

Table 5 :
Specication of condence ellipsoids for limits of proportions of E. coli

Table 5 .
VF N sim F sim f sim

Table 6 :
Values obtained from simulating 10000 trees: N sim denotes the total number of simulations, out of 10000, that are actually used in the analysis, F sim is that fraction of simulations for which the generalized square distance is greater than the observed D 2 value (given in Table5), and f sim is that fraction of simulations for which the generalized square distance is greater than χ 2 3 (0.01) = 11.345.

Table 2 ,
except that m 32 is now replaced by m 42 in the table.The best constraint on the parameters is again chosen using the BIC.

Table 7 :
Parameter values for the branching process Y (t).

Table 10 :
Parameter estimates for the branching model Z(t).

Table 11 :
Limits of proportions of E. coli strains for the branching process Z(t).

Table 12 :
For the model Z(t), Nsim represents the number of simulations (out of 10000