PMCov
A maximum likelihood implementation of the covarion model
Christine Keribin and Marie-Anne Poursat
Université Paris-Sud 11
Laboratoire de Mathématiques
Equipe "Probabilité, Statistiques et Modélisation" UMR
8628
The nucleotide sequences
The topology
The computation: model types, parameters
Several computations on the same sequences and topology
Several computations with different topologies
Several computations with different sequence sets
The basic process in the evolution of DNA sequences is the substitution of nucleotides (A, T, C, G) with the time along the phylogenic tree. Many programs or packages propose to infer phylogenetic models on different types of models. PMCov will compute the maximum likelihood estimator (MLE) for the covarion model, given the fact the topology of tree is known
The covarion model is characterized by two processes, each of them being modelled by a continuous time Markov chain:
ON | OFF | |||
S= | ( | -s1 | s1 | ) ON |
( | s2 | -s2 | ) OFF |
T | C | A | G | |||
( | -(k1*pc+pa+pg) | k1*pc | pa | pg | ) T | |
Q= | ( | k1*pt | -(k1*pt+pa+pg) | pa | pg | ) C |
( | pt | pa | -(pc+pa+k2*pg) | k2*pg | ) A | |
( | pt | pa | k2*pa | -(pc+pa+k2*pa) | ) G |
The global covarion process can then be represented by the instantaneous rate matrix W (8x8) as follows:
ON | OFF | |||
W= | ( | Q/t-s1*son*I4 | s1*son*I4 | ) ON |
( | s2*son*I4 | -s2*son*I4 | ) OFF |
where I4 is the identity matrix of dimension 4, t =- trace (diag(π) * Q) is the normalization of Q and son=s2/(s1+s2) is stationary probability of the switch process to be ON.
Hence, trace (diag( πW) * W)=-son*(1+ν), with ν=2*s1*son is the average number of switchs. Define now P(T)=exp(T*W/son); then T represents the number of substitions of nucleotides, i.e. the branch length for nucleotide substitution (and therefore, ν*T is the average number of switchs, i.e. the branch length for switch substitution).
PMCov is a program written in C++. It should be easily run on WINDOWS or UNIX system or workstation. This paper describes the use of PMCov on a WINDOWS machine. The application requires an amount of memory proportional to the size of each simulated sequence data set.
The executable can be download here.
To run PMCov, type
PMCov [FileName]
where [FileName]
is the name of the input file containing the
definition of the computation (see below the description of this
file). The
results are written in the same directory in an output file named
"PM_FileName_YYYY_Month_DD.txt" . Some summary file can also be produced, see
below. The errors are output on the standard output and are displayed on the
screen.
You have to described three types of information (and in the following order) to define a computation
A comment line always begins with //.
They must always be described at the beginning of the file, either by
directly writing them in the file (keyword DNA
), or by defining the file
containing this definition (keyword DNAFILE
)
//description of nucleotide sequence: 5 is the number of taxa and 100
is an upper bound of the sequence length.
//No blank in names. Name maximum length is 30.
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
//sequences can also be written as follows, but always one line for one
sequence:
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
//example using an auxiliary file: in this case the file contains the above
description, except the DNA
keyword.
DNAFILE SG4JC5000s200.seq
When you use the DNAFILE
keyword, the auxiliary file can
contains several set of sequences of same length and with the same taxa number.
In this case, the computation will be done for each set successively. The
DNAFILE
keyword triggers the creation of an output file named
"PM_FileName_sumup_YYYY_Month_DD.txt", that outputs for each dataset and
each computation, the model used, the maximum likelihood, and the values of the
MLE.
The tree format is the same as used by PHYLIP (also called the ‘Newick’ format). This is a nested set of bifurcations defined by brackets and commas.
//example: keyword TOPOLOGY on one line. Definition of the topology on
the next line
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)
In this example, the tree is unrooted, as there are three branches at the
higher level. Each node has three neighbors (except than the root that can also
have two neighbors, in case of rooted tree).
Notice that the order in which the taxa are defined in the DNA part of the file
doest not need to match the order in which these taxa are used in the TOPOLOGY
definition.
Once defined the sequences and the topology, you have to declare the computation to run on these data. The computation is defined on two lines, one for selecting the model type, one for defining the input parameters.
model | covarion | keyword | model parameters |
Jukes-Cantor (1969) | no | 4JC |
no |
HKY (1984) | no | 4HKY |
k1 |
TN(1993) | no | 4TN93 |
k1, k2 |
HKY (1984) | yes | 8HKY |
s1, s2, k1 |
TN(1993) | yes | 8TN93 |
s1, s2, k1, k2 |
To define one computation, they must be on a same line, using ";" as separator.
k1=1;k2=2; |
to set an initial value to a model parameter (k1, k2, s1, s2) |
BRANCH; |
to optimize on all the branches |
ONEBRANCH=2; |
to optimize on one branch (the number is the number of the branch when running the branch tree in the left-descending order) |
PARAM; |
to optimize on all the parameters: |
PARAM=2; |
to optimize on one parameter (the number is the number of the parameter in the list defined above: parameter 1 is s1 for 8HKY and k1 for 4HKY) |
ITERLAW; |
to optimize the stationary law |
BRANCHMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1; |
to output the likelihood map according to one branch length
(NBSTEP: number of computations, INITSTEP : value
of the branch length for the first computation, LSTEP :
increment length) |
PARAMMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1; |
to output the likelihood map according to one parameter (NBSTEP:
number of computations, INITSTEP : value of the parameter for
the first computation, LSTEP : increment length) |
NGLOBALITER=20; |
to define the maximum number of iterations for the global relaxation optimization process (20 is the default) |
NITER= 20; |
to define the maximum number of iterations for one optimization inside the relaxation optimization process (20 is the default) |
BRANCH;FIXBRANCH=3; |
to optimize the branch length except branch 3 . |
PARAM;FIXPARAM=2; |
to optimize the parameters except the parameter 2 (the number is the number of the parameter in the list defined above: parameter 1 is s1 for 8HKY and k1 for 4HKY) |
LIKPERSITE |
to output the likelihood for each site . The number of invariant sites is also output. |
Several computations on the same sequences and topology
To perform several computation on the same set of sequences and with the same topology, it is sufficient to repeat computation definitions after the topology definition
// example of several computation on the same topology and sequences
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)
4JC
BRANCH;
4HKY
k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30
8HKY
k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1;
This file defines three computations
Several computations with different topologies
In the same ways, you can run in a same execution computations with different topologies
// example: 3 computations on the first topology and one on the second
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)
4JC
BRANCH;
4HKY
k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30
8HKY
k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1;
TOPOLOGY
(((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3):0.2,Taxon4:0.25,Taxon5:0.25)
8HKY
k1=4.36364;s1=1;s2=1;BRANCH;PARAM;
Several computations with different sequence sets
Moreover in all these cases, when the sequences are defined with an auxiliary
file (DNAFILE
keyword), all the computations are run on all the
sequences set of the auxiliary file. This file is as
follows for two set of sequences:
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
The output file recalls the parameter values used for the computation and outputs the result of the optimization or likelihood map. In case of an optimization, the gradient and approximate Hessian are written. This can be very useful to detect whether a local maximum has been found.
All the likelihood values are log-likelihood, except with the
LIKPERSITE
keyword.
When the sequences are defined using an
auxiliary file (DNAFILE
keyword), the creation of an output file named
"PM_FileName_sumup_YYYY_Month_DD.txt" is triggered,
containing a sumup for each dataset and
each computation: model used, maximum likelihood
value, MLE, gradient and approximate Hessian values for
each involved parameter . This file is created eventhough there is only one
defined set .