PMCov

A maximum likelihood implementation of the covarion model

Christine Keribin and Marie-Anne Poursat

Université Paris-Sud 11
Laboratoire de Mathématiques
Equipe "Probabilité, Statistiques et Modélisation" UMR 8628

Introduction

Requirements

Running PMCov

Input file format

The nucleotide sequences
The topology
The computation: model types, parameters
Several computations on the same sequences and topology
Several computations with different topologies
Several computations with different sequence sets

Output file

Introduction

The basic process in the evolution of DNA sequences is the substitution of nucleotides (A, T, C, G) with the time along the phylogenic tree. Many programs or packages propose to infer phylogenetic models on different types of models. PMCov will compute the maximum likelihood estimator  (MLE)  for the covarion model, given the fact the topology of tree is known

The covarion model is characterized by two processes, each of them being modelled by a continuous time Markov chain:

where (pt,pc,pa,pg) = π  is the stationary process.
Q hence defined is the Tamura-Nei (1993) model; k1=k2 defines the Hasegawa, Kishino and Yano (HKY84) model; k1=k2=1 and pa=pc=pg=.25 is the Jukes-Cantor (1969) model.

The global covarion process can then be represented by the instantaneous rate matrix W (8x8) as follows:

ON OFF
W= ( Q/t-s1*son*I4  s1*son*I4 )  ON
  ( s2*son*I4 -s2*son*I4 ) OFF

where I4 is the identity matrix of dimension 4, t =- trace (diag(π) * Q) is the normalization of Q and son=s2/(s1+s2) is stationary probability of the switch process to be ON.

Hence, trace (diag( πW) * W)=-son*(1+ν), with  ν=2*s1*son is the average number of switchs. Define now P(T)=exp(T*W/son); then T represents the number of substitions of nucleotides, i.e. the branch length for nucleotide substitution (and therefore, ν*T is the average number of switchs, i.e. the branch length for switch substitution).

Requirements

PMCov is a program written in C++. It should be easily run on WINDOWS or UNIX system or workstation. This paper describes the use of PMCov on a WINDOWS machine. The application requires an amount of memory proportional to the size of each simulated sequence data set.

The executable can be download here.

Running PMCov

To run PMCov, type

PMCov [FileName]

where [FileName]is the name of the input file containing the definition of the computation (see below the description of this file). The results are written  in the same directory in an output file named "PM_FileName_YYYY_Month_DD.txt" . Some summary file can also be produced, see below. The errors are output on the standard output and are displayed on the screen.

Input file format

You have to described three types of information  (and in the following order) to define a computation

A comment line always begins with //.

The nucleotide sequences

They must always be described at the beginning of the file, either by directly writing them in the file (keyword DNA), or by defining the file containing this definition (keyword DNAFILE)

//description of nucleotide sequence: 5 is the number of taxa and 100 is an upper bound of the sequence length.
//No blank in names. Name  maximum length is 30.
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

//sequences can also be written as follows, but always one line for one sequence:
DNA
5 100
Taxon1 GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2 AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3 GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4 ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5 GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

//example using an auxiliary file: in this case the file contains the above description, except the DNA keyword.
DNAFILE SG4JC5000s200.seq

When you use the DNAFILE keyword, the auxiliary file can contains several set of sequences of same length and with the same taxa number. In this case, the computation will be done for each set successively. The DNAFILE keyword triggers the creation of an output file named "PM_FileName_sumup_YYYY_Month_DD.txt", that outputs for each dataset and each computation, the model used, the maximum likelihood, and the values of the MLE.

The topology

The tree format is the same as used by PHYLIP (also called the ‘Newick’ format). This is a nested set of bifurcations defined by brackets and commas.

//example: keyword TOPOLOGY on one line. Definition of the topology on the next line
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)

In this example, the tree is unrooted, as there are three branches at the higher level. Each node has three neighbors (except than the root that can also have two neighbors, in case of rooted tree).
Notice that the order in which the taxa are defined in the DNA part of the file doest not need to match the order in which these taxa are used in the TOPOLOGY definition.

The computation

Once defined the sequences and the topology, you have to declare the computation to run on these data. The computation is defined on two lines, one for selecting the model type, one for defining the input parameters.

Model types

model covarion keyword model parameters
Jukes-Cantor (1969) no 4JC no
HKY (1984) no 4HKY k1
TN(1993) no 4TN93 k1, k2
HKY (1984) yes 8HKY s1, s2, k1
TN(1993) yes 8TN93 s1, s2, k1, k2

Parameters

To define one computation, they must be on a same line, using ";" as separator.

k1=1;k2=2; to set an initial value to a model parameter (k1, k2, s1, s2)
BRANCH; to optimize on all the branches
ONEBRANCH=2; to optimize on one branch (the number is the number of the branch when running the branch tree in the left-descending order)
PARAM; to optimize on all the parameters:
PARAM=2; to optimize on one parameter (the number is the number of the parameter in the list defined above: parameter 1  is s1 for 8HKY and k1 for 4HKY)
ITERLAW; to optimize the stationary law
BRANCHMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1; to output the likelihood map according to one branch length (NBSTEP: number of computations, INITSTEP: value of the branch length for the first computation, LSTEP: increment length)
PARAMMAP=1;NBSTEP=20;LSTEP=0.1;INITSTEP=0.1; to output the likelihood map according to one parameter (NBSTEP: number of computations, INITSTEP: value of the parameter for the first computation, LSTEP: increment length)
NGLOBALITER=20; to define the maximum  number of iterations for the global relaxation optimization process (20 is the default)
NITER=  20; to define the maximum  number of iterations for one optimization inside the relaxation optimization process (20 is the default)
BRANCH;FIXBRANCH=3; to optimize the branch length except  branch 3 .
PARAM;FIXPARAM=2; to optimize the parameters except the parameter 2 (the number is the number of the parameter in the list defined above: parameter 1  is s1 for 8HKY and k1 for 4HKY)
LIKPERSITE to output the likelihood for each site . The number of  invariant sites is also output.

Several computations on the same sequences and topology

To perform several computation on the same set of sequences and with the same topology, it is sufficient to repeat computation definitions after the topology definition

// example of several computation on the same topology and sequences
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)
4JC
BRANCH;
4HKY
k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30
8HKY
k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1;
 

This file defines three computations

Several computations with different topologies

In the same ways, you can run in a same execution computations with different topologies

// example: 3 computations on the first topology and one on the second
DNA
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
TOPOLOGY
((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3,(Taxon4:0.25,Taxon5:0.25):0.1)
4JC
BRANCH;
4HKY
k1=4.36364;PARAM;BRANCH;;NGLOBALITER=30
8HKY
k1=4.36364;s1=1;s2=1;PARAMMAP=2;NBSTEP=20;LSTEP=0.1;
TOPOLOGY
(((Taxon1:0.1,Taxon2:0.2):0.05,Taxon3:0.3):0.2,Taxon4:0.25,Taxon5:0.25)
8HKY
k1=4.36364;s1=1;s2=1;BRANCH;PARAM;

Several computations with different sequence sets

Moreover in all these cases, when the sequences are defined with an auxiliary file (DNAFILE keyword), all the computations are run on all the sequences set of the  auxiliary file. This file is as follows for two set of sequences:

5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG
5 100
Taxon1
GTGCCCTCTTAGCTTCGGGGGGCTCGGCCCCTGACTTCGGTCCGGCCACCGCCCTCCCCCCTCCAGCTAGGGAGGGGCGGCCAAGGGGTGGTTCTGCCGG
Taxon2
AAGCCCTCTTAGCTTCGGGCTGCTCGGCCCCTGATCCCGGGCCGGCCGTCGTCCTCCCTCCTCTAGCCGAAGAGGGACGGCCAAGGGGCGGCTTTGCCGG
Taxon3
GCGCCCCCTTATCCCCGGGCCGCCCCGCACGGGCCCTCGGCCCGGCCACCGCCCTCCTTCTTCTGGCCGAGGGGGGGCGGCAAGGGGGTGGTTCCGCCGG
Taxon4
ATGCCCCCTTAGCTGTGGGCTGCCCGGTCCCGGACCTTAGGCCGGCTACCGTGCTCCCCCCTCGAGCTGAGGAGGGGCGGCCGAAAGGTGGTTGCGCCGG
Taxon5
GAGCCCATTTGCCTCTGGGTCGCTTGGCCCTTGGCGCCGGCCCGGCCACCGTGCTGCCTCCCCTAGTCGAGGGAGGCGGGCCGACGGATGGCCTCGCCGG

Output File 

The output file recalls the parameter values used for the computation and outputs the result of the optimization or likelihood map. In case of  an optimization, the gradient and approximate Hessian are written. This can be very useful to detect whether a local maximum has been found.

All the likelihood values are log-likelihood, except with the LIKPERSITE keyword.

When the sequences are defined using   an auxiliary file (DNAFILE keyword), the creation of an output file named "PM_FileName_sumup_YYYY_Month_DD.txt" is triggered, containing a sumup  for each dataset and each computation: model used,  maximum likelihood value, MLE, gradient and approximate Hessian values for each involved parameter . This file is created eventhough there is only one defined set .