PM

PMCov

A maximum likelihood implementation of the covarion model

Christine Keribin and Marie-Anne Poursat

Université Paris-Sud 11
Laboratoire de Mathématiques
Equipe "Probabilité, Statistiques et Modélisation" UMR 8628

The nucleotide sequences
The topology
The computation: model types, parameters
Several computations on the same sequences and topology
Several computations with different topologies
Several computations with different sequence sets

Output file

Introduction

The basic process in the evolution of DNA sequences is the substitution of nucleotides (A, T, C, G) with the time along the phylogenic tree. Many programs or packages propose to infer phylogenetic models on different types of models. PMCov will compute the maximum likelihood estimator (MLE) for the covarion model, given the fact the topology of tree is known

The covarion model is characterized by two processes, each of them being modelled by a continuous time Markov chain:

A switching process, of instantaneous rate matrix S, allowing the sites to be turned OFF (no more nucleotide substitution) or ON (nucleotide substitutions can occur)

ON OFF

S= ( -s1 s1 ) ON

( s2 -s2 ) OFF

A nucleotide substitution, of instantaneous rate matrix Q, that only runs when S is ON

		T	C	A	G
	(	-(k1*pc+pa+pg)	k1*pc	pa	pg	) T
Q=	(	k1*pt	-(k1*pt+pa+pg)	pa	pg	) C
	(	pt	pa	-(pc+pa+k2*pg)	k2*pg	) A
	(	pt	pa	k2*pa	-(pc+pa+k2*pa)	) G

where (pt,pc,pa,pg) = π is the stationary process.
Q hence defined is the Tamura-Nei (1993) model; k1=k2 defines the Hasegawa, Kishino and Yano (HKY84) model; k1=k2=1 and pa=pc=pg=.25 is the Jukes-Cantor (1969) model.

The global covarion process can then be represented by the instantaneous rate matrix W (8x8) as follows:

		ON	OFF
W=	(	Q/t-s1s_onI4	s1s_onI4	) ON
	(	s2s_onI4	-s2s_onI4	) OFF

where I4 is the identity matrix of dimension 4, t =- trace (diag(π) * Q) is the normalization of Q and s_on=s2/(s1+s2) is stationary probability of the switch process to be ON.

Hence, trace (diag( π_W) * W)=-s_on*(1+ν), with ν=2*s1*s_on is the average number of switchs. Define now P(T)=exp(T*W/s_on); then T represents the number of substitions of nucleotides, i.e. the branch length for nucleotide substitution (and therefore, ν*T is the average number of switchs, i.e. the branch length for switch substitution).

Requirements

PMCov is a program written in C++. It should be easily run on WINDOWS or UNIX system or workstation. This paper describes the use of PMCov on a WINDOWS machine. The application requires an amount of memory proportional to the size of each simulated sequence data set.

The executable can be download here.

Running PMCov

To run PMCov, type

PMCov [FileName]

where [FileName]is the name of the input file containing the definition of the computation (see below the description of this file). The results are written in the same directory in an output file named "PM_FileName_YYYY_Month_DD.txt" . Some summary file can also be produced, see below. The errors are output on the standard output and are displayed on the screen.

Input file format

You have to described three types of information (and in the following order) to define a computation

the nucleotide sequences
the topology
the computation

A comment line always begins with //.

The nucleotide sequences

They must always be described at the beginning of the file, either by directly writing them in the file (keyword DNA), or by defining the file containing this definition (keyword DNAFILE)

//description of nucleotide sequence: 5 is the number of taxa and 100 is an upper bound of the sequence length.
//No blank in names. Name maximum length is 30.
DNA 5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

//sequences can also be written as follows, but always one line for one sequence:
DNA 5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

//example using an auxiliary file: in this case the file contains the above description, except the DNA keyword.
DNAFILE SG4JC5000s200.seq

When you use the DNAFILE keyword, the auxiliary file can contains several set of sequences of same length and with the same taxa number. In this case, the computation will be done for each set successively. The DNAFILE keyword triggers the creation of an output file named "PM_FileName_sumup_YYYY_Month_DD.txt", that outputs for each dataset and each computation, the model used, the maximum likelihood, and the values of the MLE.

The topology

The tree format is the same as used by PHYLIP (also called the ‘Newick’ format). This is a nested set of bifurcations defined by brackets and commas.

//example: keyword TOPOLOGY on one line. Definition of the topology on the next line TOPOLOGY ((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)

In this example, the tree is unrooted, as there are three branches at the higher level. Each node has three neighbors (except than the root that can also have two neighbors, in case of rooted tree).
Notice that the order in which the taxa are defined in the DNA part of the file doest not need to match the order in which these taxa are used in the TOPOLOGY definition.

The computation

Once defined the sequences and the topology, you have to declare the computation to run on these data. The computation is defined on two lines, one for selecting the model type, one for defining the input parameters.

Model types

model	covarion	keyword	model parameters
Jukes-Cantor (1969)	no	`4JC`	`no`
HKY (1984)	no	`4HKY`	`k1`
TN(1993)	no	`4TN93`	`k1, k2`
HKY (1984)	yes	`8HKY`	`s1, s2, k1`
TN(1993)	yes	`8TN93`	`s1, s2, k1, k2`

Parameters

To define one computation, they must be on a same line, using ";" as separator.

`k1=1;k2=2;`	to set an initial value to a model parameter (k1, k2, s1, s2)
`BRANCH;`	to optimize on all the branches
`ONEBRANCH=2;`	to optimize on one branch (the number is the number of the branch when running the branch tree in the left-descending order)
`PARAM;`	to optimize on all the parameters:
`PARAM=2;`	to optimize on one parameter (the number is the number of the parameter in the list defined above: parameter 1 is s1 for 8HKY and k1 for 4HKY)
`ITERLAW;`	to optimize the stationary law
`BRANCHMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1;`	to output the likelihood map according to one branch length (`NBSTEP:` number of computations, `INITSTEP`: value of the branch length for the first computation, `LSTEP`: increment length)
`PARAMMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1;`	to output the likelihood map according to one parameter (`NBSTEP:` number of computations, `INITSTEP`: value of the parameter for the first computation, `LSTEP`: increment length)
`NGLOBALITER=20;`	to define the maximum number of iterations for the global relaxation optimization process (20 is the default)
`NITER= 20;`	to define the maximum number of iterations for one optimization inside the relaxation optimization process (20 is the default)
`BRANCH;FIXBRANCH=3;`	to optimize the branch length except branch 3 .
`PARAM;FIXPARAM=2;`	to optimize the parameters except the parameter 2 (the number is the number of the parameter in the list defined above: parameter 1 is s1 for 8HKY and k1 for 4HKY)
`LIKPERSITE`	to output the likelihood for each site . The number of invariant sites is also output.

Several computations on the same sequences and topology

To perform several computation on the same set of sequences and with the same topology, it is sufficient to repeat computation definitions after the topology definition

// example of several computation on the same topology and sequences DNA 5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG TOPOLOGY ((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1) 4JC BRANCH; 4HKY k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30 8HKY k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1;

This file defines three computations

search for the branch length MLE in a Jukes-Cantor model without covarion effect
search for the branch length and parameter MLE in a HKY model without covarion effect
outputs the parameter map for the s2 parameter from s2=1 to s2=1+20*0.1

Several computations with different topologies

In the same ways, you can run in a same execution computations with different topologies

// example: 3 computations on the first topology and one on the second DNA 5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG TOPOLOGY ((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1) 4JC BRANCH; 4HKY k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30 8HKY k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1; TOPOLOGY (((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3):0.2,Taxon4:0.25,Taxon5:0.25) 8HKY k1=4.36364;s1=1;s2=1;BRANCH;PARAM;

Several computations with different sequence sets

Moreover in all these cases, when the sequences are defined with an auxiliary file (DNAFILE keyword), all the computations are run on all the sequences set of the auxiliary file. This file is as follows for two set of sequences:

5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG 5 100 Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

Output File

The output file recalls the parameter values used for the computation and outputs the result of the optimization or likelihood map. In case of an optimization, the gradient and approximate Hessian are written. This can be very useful to detect whether a local maximum has been found.

All the likelihood values are log-likelihood, except with the LIKPERSITE keyword.

When the sequences are defined using an auxiliary file (DNAFILE keyword), the creation of an output file named "PM_FileName_sumup_YYYY_Month_DD.txt" is triggered, containing a sumup for each dataset and each computation: model used, maximum likelihood value, MLE, gradient and approximate Hessian values for each involved parameter . This file is created eventhough there is only one defined set .

		ON	OFF
S=	(	-s1	s1	) ON
	(	s2	-s2	) OFF