How do I understand the output of seqpdist when the Jukes Cantor distance is not defined?

Using the Bioinformatics Toolbox, run the following commands:
Seq1='AAAAAA'
Seq2='GGGGGG'
SS={Seq1,Seq2}
seqpdist(SS,'Alphabet','NT')
The sequences are then reported to be a distance of 27.032740041837865 apart.
The Jukes-Cantor distance should not be defined in this case, since the sequences differ at more than 75% of the sites. Does anyone know:
1) why this output occurs? or what heuristic is used?
2) whether seqpdist always returns finite real numbers for pairwise distances?
  2 Comments
Paola Favaretto on 6 May 2015
Hi Elizabeth,
If the fractional dissimilarity of two sequences is greater than 3/4, the straightforward Jukes-Cantor formula requires the logarithm of a negative number, which for this application can be considered undefined. In the Bioinformatics Toolbox, the formula used is -3/4 * log(max(eps,1-4*f/3)), where f is the fractional dissimilarity (i.e. the fraction of sites that differ). This avoids taking the log of a negative number and returns a value that you can interpret as a "large" distance.
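As a rough check, the formula quoted above can be evaluated by hand for the two sequences in the question. This is only a sketch using base MATLAB functions, assuming the quoted expression is what seqpdist applies; jcClamped is a hypothetical helper name, not a toolbox function.

```matlab
% Sketch of the clamped Jukes-Cantor distance quoted in the comment above.
% jcClamped is a hypothetical helper name, not part of the toolbox.
jcClamped = @(f) -3/4 * log(max(eps, 1 - 4*f/3));

seq1 = 'AAAAAA';
seq2 = 'GGGGGG';

% Fractional dissimilarity: fraction of sites at which the sequences differ.
f = mean(seq1 ~= seq2);      % every site differs, so f is 1

% 1 - 4*f/3 is negative here, so max() substitutes eps and
% the result is -3/4 * log(eps).
d = jcClamped(f);
fprintf('f = %g, d = %.4f\n', f, d);

% Below the 3/4 threshold the clamp has no effect and the
% ordinary Jukes-Cantor formula applies.
dSmall = jcClamped(0.25);    % same as -3/4 * log(2/3)
```

Because eps is the double-precision machine epsilon (2^-52), the clamped value is a fixed constant of about 27.03, which matches the distance reported in the question.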
Indeed, under the Jukes-Cantor assumptions, two completely unrelated sequences are expected to have a dissimilarity fraction of 3/4, because 1/4 of the sites would agree by chance if every base is drawn at random from a uniform distribution. Thus, any two related sequences that differ at more than 3/4 of their sites will have a distance comparable to that of two unrelated sequences.
Several other methods are implemented in seqpdist; some of them are more sophisticated and overcome the limitations of the Jukes-Cantor model. I suggest trying them to see whether those models make more sense for your data.
Hope this helps. -Paola
Elizabeth on 7 May 2015
This is helpful. I couldn't figure out that undefined 'equals' -.75*log(eps), and this is exactly what I needed to know since I'm comparing results with other programs which return NaN.
Thanks!
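For comparing against programs that return NaN, one possible approach is to mark the clamped entries after the fact. The sketch below assumes the clamp value is -3/4*log(eps), as described in this thread; D is a made-up stand-in for a vector of pairwise distances, and the tolerance is an arbitrary choice.

```matlab
% Value returned when the Jukes-Cantor argument is clamped to eps
% (assumption: the formula quoted earlier in this thread).
clampValue = -3/4 * log(eps);

% Hypothetical distances standing in for seqpdist output.
D = [0.3041, clampValue, 1.2];

% Compare with a small tolerance rather than exact equality.
tol = 1e-9;
isClamped = abs(D - clampValue) < tol;

% Replace clamped distances with NaN to match the other programs.
D(isClamped) = NaN;
disp(D)
```

Any pair whose dissimilarity is at or above 3/4 ends up as NaN this way, while ordinary distances are left unchanged.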
