Computational phylogenetics

Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on the computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing the optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data.[1][2] Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms used to search for the optimal or best phylogenetic tree. The space and landscape searched for the optimal phylogenetic tree is known as the phylogeny search space.

The maximum likelihood (also likelihood) optimality criterion is the process of finding the tree topology, along with its branch lengths, that provides the highest probability of observing the sequence data, while the parsimony optimality criterion is the fewest number of state-evolutionary changes required for a phylogenetic tree to explain the sequence data.[1][2]

Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms, while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding proteins as the basis for classification.

Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed.[citation needed] The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species.

Types of phylogenetic trees and networks

Phylogenetic trees generated by computational phylogenetics can be either rooted or unrooted depending on the input data and the algorithm used. A rooted tree is a directed graph that explicitly identifies a most recent common ancestor (MRCA),[citation needed] usually an imputed sequence that is not represented in the input. Genetic distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA. Identification of a root usually requires the inclusion in the input data of at least one "outgroup" known to be only distantly related to the sequences of interest.

By contrast, unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent. An unrooted tree can always be produced from a rooted tree, but a root cannot usually be placed on an unrooted tree without additional data on divergence rates, such as the assumption of the molecular clock hypothesis.[3]

The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional "tree space" through which search paths can be traced by optimization algorithms. Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology, it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters.[2]
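
As a rough illustration of how quickly tree space grows, the sketch below counts topologies under one common convention: strictly bifurcating trees with distinctly labeled leaves. Under that assumption, which the paragraph above does not fix, n leaves admit (2n-5)!! unrooted and (2n-3)!! rooted topologies. The function names are illustrative only.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_leaves: int) -> int:
    """Unrooted bifurcating topologies on n labeled leaves (assumed convention)."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    return double_factorial(2 * n_leaves - 5)


def num_rooted_trees(n_leaves: int) -> int:
    """Rooted bifurcating topologies on n labeled leaves (assumed convention)."""
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    return double_factorial(2 * n_leaves - 3)
```

With ten input sequences this convention already gives 2,027,025 unrooted and 34,459,425 rooted topologies, which is why exhaustive search is rarely practical.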

Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks, which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer.

Coding characters and defining homology

Morphological analysis

The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier. The types of phenotypic data used to construct this matrix depend on the taxa being compared; for individual species, they may involve measurements of average body size, lengths or sizes of particular bones or other physical features, or even behavioral manifestations. Of course, since not every possible phenotypic characteristic could be measured and encoded for analysis, the selection of which features to measure is a major inherent obstacle to the method. The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant.[4] Morphological studies can be confounded by examples of convergent evolution of phenotypes.[5] A major challenge in constructing useful classes is the high likelihood of inter-taxon overlap in the distribution of the phenotype's variation. The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records, but has been shown to have a significant effect on the trees produced; in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data.[6]

Some phenotypic classifications, particularly those used when analyzing very diverse groups of taxa, are discrete and unambiguous; classifying organisms as possessing or lacking a tail, for example, is straightforward in the majority of cases, as is counting features such as eyes or vertebrae. However, the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution. A common method is simply to sort the measurements of interest into two or more classes, rendering continuous observed variation as discretely classifiable (e.g., all examples with humerus bones longer than a given cutoff are scored as members of one state, and all members whose humerus bones are shorter than the cutoff are scored as members of a second state). This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements.[7]
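
The cutoff-based coding described above can be sketched in a few lines. The measurements, the cutoff value, and the rule that a value exactly equal to a cutoff falls into the lower state are all assumptions made for this illustration.

```python
from bisect import bisect_left


def score_state(measurement: float, cutoffs: list[float]) -> int:
    """Map a continuous measurement to a discrete character state (0, 1, 2, ...).

    `cutoffs` must be sorted in ascending order. A value exactly equal to a
    cutoff is placed in the lower state; that tie rule is an arbitrary choice.
    """
    return bisect_left(cutoffs, measurement)


# Made-up humerus lengths in millimetres, scored against a single cutoff.
humerus_mm = {"taxon_a": 41.5, "taxon_b": 58.0, "taxon_c": 73.2}
states = {taxon: score_state(x, [50.0]) for taxon, x in humerus_mm.items()}
```

Here two taxa that differ by more than 15 mm receive the same state, which is the kind of information loss the criticism refers to.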

Because morphological data is extremely labor-intensive to collect, whether from literature sources or from field observations, reuse of previously compiled data matrices is not uncommon, although this may propagate flaws in the original matrix into multiple derivative analyses.[8]

Molecular analysis

The problem of character coding is very different in molecular analyses, as the characters in biological sequence data are immediate and discretely defined - distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences. However, defining homology can be challenging due to the inherent difficulties of multiple sequence alignment. For a given gapped MSA, several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are "mutations" versus ancestral characters, and which events are insertion mutations or deletion mutations. For example, given only a pairwise alignment with a gap region, it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion. The problem is magnified in MSAs with unaligned and nonoverlapping gaps. In practice, sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation.

Distance-matrix methods

Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic distance" between the sequences being classified, and therefore, they require an MSA as an input. Distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches.[3] Distance methods attempt to construct an all-to-all matrix from the sequence query set describing the distance between each sequence pair. From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between sequences. Distance-matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignments. The main disadvantage of distance-matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.[2]
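
A minimal sketch of the mismatch-fraction distance described above follows. It assumes the sequences are rows of one alignment with "-" as the gap character. The `gaps_as_mismatch` flag and the decision to skip columns that are gaps in both sequences are illustrative choices.

```python
def p_distance(a: str, b: str, gaps_as_mismatch: bool = False) -> float:
    """Fraction of compared aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap in both sequences: nothing to compare
        if (x == "-" or y == "-") and not gaps_as_mismatch:
            continue  # ignore positions where either sequence has a gap
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(seqs: list[str], gaps_as_mismatch: bool = False) -> list[list[float]]:
    """All-to-all matrix of pairwise p-distances."""
    return [[p_distance(s, t, gaps_as_mismatch) for t in seqs] for s in seqs]
```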

UPGMA and WPGMA

The UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean) methods produce rooted trees and require a constant-rate assumption - that is, they assume an ultrametric tree in which the distances from the root to every branch tip are equal.[9]
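
The following sketch shows one way the two methods can be implemented. It assumes a symmetric distance matrix given as nested lists, and represents the tree as nested tuples, which is a representation chosen only for this example. The two methods differ only in how the distance from a merged cluster to the remaining clusters is averaged.

```python
from itertools import combinations


def pair_group_cluster(names, dist, weighted=False):
    """Return (tree, root_height) using UPGMA, or WPGMA if `weighted` is true."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    height = 0.0
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        i, j = sorted(pair)
        height = d.pop(pair) / 2          # ultrametric: node sits at half the distance
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            if weighted:                  # WPGMA: simple mean of the two distances
                merged = (d_ik + d_jk) / 2
            else:                         # UPGMA: mean weighted by cluster size
                merged = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
            d[frozenset((next_id, k))] = merged
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height
```

Because every join is placed at half the inter-cluster distance, all tips end up equidistant from the root, which is the ultrametric assumption in concrete form.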

Neighbor-joining

Neighbor-joining methods apply general cluster analysis techniques to sequence analysis using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.[10]
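
The text above does not spell out the joining criterion. As a sketch of a single step of the standard neighbor-joining procedure, the code below selects the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i, and computes the two branch lengths to the new interior node. The function name and return layout are illustrative.

```python
def nj_pick_pair(dist):
    """Return (i, j, branch_i, branch_j) for one neighbor-joining step.

    `dist` is a symmetric matrix (nested lists) over at least three taxa.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 taxa")
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the new node; they need not be equal,
    # so no constant rate across lineages is imposed.
    branch_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return i, j, branch_i, branch_j
```

A full implementation would then replace the joined pair with the new node, recompute distances to it, and repeat until the tree is resolved.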

Fitch–Margoliash method

The Fitch–Margoliash method uses a weighted least squares method for clustering based on genetic distance.[11] Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances - a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.[2] Another modification of the algorithm can be helpful, especially in the case of concentrated distances (see the concentration of measure phenomenon and the curse of dimensionality); that modification[12] has been shown to improve the efficiency of the algorithm and its robustness.
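
As an illustration of the correction step, the sketch below applies the standard Jukes-Cantor formula d = -(3/4) ln(1 - (4/3) p), which converts an observed mismatch fraction p into an estimated number of substitutions per site. The function name is illustrative.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed mismatch fraction p for multiple hits at one site."""
    if not 0.0 <= p < 0.75:
        # At p >= 0.75 the sequences are saturated and the estimate is undefined.
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The corrected value is always at least as large as p, and the gap widens as p grows, reflecting the increasing share of hidden back mutations.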

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete,[13] so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance-matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set.[3] This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance and thus a longer branch length than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup and too distant adds noise to the analysis.[3] Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.

Maximum parsimony

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely - for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration - considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard;[2] consequently a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the best in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.
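
Whichever search strategy is used, each candidate tree has to be scored. One standard way to count the minimum number of changes on a fixed rooted binary tree with equal costs is Fitch's algorithm, which the text above does not name; the sketch below uses it as an assumed scoring routine. Trees are nested tuples of leaf names and sequences are equal-length strings, both representations chosen for the example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):                 # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                                # children can agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, seqs):
    """Sum the per-site minimum change counts over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[k] for name, seq in seqs.items()})[1]
        for k in range(n_sites)
    )
```

For the four made-up sequences A = "AAG", B = "AAG", C = "ACT", D = "GCT", the tree (("A", "B"), ("C", "D")) needs 3 changes while (("A", "C"), ("B", "D")) needs 5, so the first would be preferred.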

Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems; it was first applied to phylogenetics in the early 1980s.[14] Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules[15] severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences.[16] The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events (an alternative view holds that the trees to be favored are those that maximize the amount of sequence similarity that can be interpreted as homology, a point of view that may lead to different optimal trees[17]). The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method is often used in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.[2]

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring, but not necessarily optimal, trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA.[18] However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events.[19] This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to find the trees that maximize the amount of sequence similarity that can be interpreted as homology.[17][20]

Maximum likelihood

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but finding the maximum-likelihood tree is believed to be computationally intractable because the problem is NP-hard.[21]

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees.[2] The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general numerical optimization tools such as the Newton–Raphson method are often used.
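
A minimal sketch of the pruning recursion for a single site is given below. It assumes the Jukes-Cantor model with equal base frequencies, branch lengths in expected substitutions per site, and a tree written as nested tuples of (child, branch_length) pairs; all of these are choices made for the illustration rather than part of the method's definition.

```python
import math

BASES = "ACGT"


def jc_prob(x: str, y: str, t: float) -> float:
    """Jukes-Cantor probability that base x becomes base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def conditional_likelihoods(node):
    """For each base x, P(observed tips below node | node is in state x)."""
    if isinstance(node, str):                       # leaf holding an observed base
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:                           # children are combined independently
        below = conditional_likelihoods(child)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result


def site_likelihood(root) -> float:
    """Average the root's conditional likelihoods over the stationary frequencies."""
    return sum(0.25 * v for v in conditional_likelihoods(root).values())
```

Because this model is reversible, `site_likelihood((("A", 0.1), ("C", 0.3)))` equals `site_likelihood((("A", 0.4), ("C", 0.0)))`: only the total path length between the tips matters, so the data cannot locate the root.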

Some tools that use maximum likelihood to infer phylogenetic trees from variant allelic frequency data (VAFs) include AncesTree and CITUP.[22][23]

Bayesian inference

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.[2]

Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step[24] and swapping descendant subtrees of a random internal node between two related trees.[25] The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work.[2] Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques,[26] although they are better able to accommodate missing data.[27]

Whereas likelihood methods find the tree that maximizes the probability of the data, a Bayesian approach recovers a tree that represents the most likely clades, by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their 'support') can be quite wide of the mark, especially in clades that are not overwhelmingly likely. As such, other methods have been put forward to estimate posterior probability.[28]

Some tools that use Bayesian inference to infer phylogenetic trees from variant allelic frequency data (VAFs) include Canopy, EXACT, and PhyloWGS.[29][30][31]

Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related.[32] The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.[2]

Types of models

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate.[2] More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model known as the general 12-parameter model breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages.[2] One possible variation on this theme adjusts the rates so that overall GC content - an important measure of DNA double helix stability - varies over time.[33]
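
To make the parameter counts concrete, the sketch below builds a time-reversible rate matrix from six exchangeability parameters and four base frequencies, scaled so that the overall substitution rate is 1. The dictionary layout and function name are assumptions of this example. Setting all exchangeabilities equal and all frequencies to 0.25 recovers the Jukes-Cantor case, where each off-diagonal rate is one-third.

```python
from itertools import combinations

BASES = "ACGT"


def gtr_rate_matrix(exch, freqs):
    """Build a normalized GTR rate matrix Q as a dict of dicts.

    exch  maps frozenset({x, y}) -> exchangeability for the six base pairs.
    freqs maps base -> equilibrium frequency (should sum to 1).
    """
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x, y in combinations(BASES, 2):
        r = exch[frozenset((x, y))]
        q[x][y] = r * freqs[y]        # rate x -> y
        q[y][x] = r * freqs[x]        # rate y -> x, which keeps the model reversible
    for x in BASES:
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)   # rows sum to zero
    scale = -sum(freqs[x] * q[x][x] for x in BASES)        # mean substitution rate
    return {x: {y: q[x][y] / scale for y in BASES} for x in BASES}


jc_exch = {frozenset(pair): 1.0 for pair in combinations(BASES, 2)}
jc_q = gtr_rate_matrix(jc_exch, {b: 0.25 for b in BASES})
```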

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code.[32] A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution.[2] Finally, a more conservative estimate of rate variations known as the covarion method allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.[34]
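
The random-rate approach can be sketched as follows, assuming a gamma distribution whose shape parameter (here called `alpha`) is paired with scale 1/alpha so that the mean rate is 1. That parameterization and the function name are assumptions of this example; implementations often use a small number of discrete rate categories instead of continuous draws.

```python
import random


def draw_site_rates(n_sites: int, alpha: float, rng: random.Random) -> list[float]:
    """Draw one relative rate per site from Gamma(shape=alpha, scale=1/alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
```

Small values of alpha give a strongly skewed distribution in which most sites evolve slowly and a few evolve very quickly, while large values approach a uniform rate across sites.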

Choosing the best model

The selection of an appropriate model is critical for the production of good phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit.[32] The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data.[32] However, care must be taken in using these results, since a more complex model with more parameters will always have a higher likelihood than a simplified version of the same model, which can lead to the naive selection of models that are overly complex.[2] For this reason model selection computer programs will choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.[35]
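
For nested models, the test is usually carried out by comparing twice the difference in maximized log-likelihoods against a chi-square distribution whose degrees of freedom equal the number of extra free parameters. The sketch below assumes that form; the chi-square tail is computed with the standard library for integer degrees of freedom, and the function names are illustrative.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two degrees at a time.
    if df % 2:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2) * math.exp(-half) / math.gamma(k / 2 + 1)
        k += 2
    return q


def likelihood_ratio_test(lnl_simple: float, lnl_complex: float, extra_params: int):
    """Return (statistic, p_value) for a simple model nested in a complex one."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf(stat, extra_params)
```

A small p-value indicates that the extra parameters improve the fit by more than chance alone would explain; otherwise the simpler model is retained.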

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback–Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models.[32] The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.[32] Determining the most suitable model for phylogeny reconstruction constitutes a fundamental step in numerous evolutionary studies. However, various criteria for model selection are leading to debate over which criterion is preferable. It has recently been shown that, when topologies and ancestral sequence reconstruction are the desired output, choosing one criterion over another is not crucial. Instead, using the most complex nucleotide substitution model, GTR+I+G, leads to similar results for the inference of tree topology and ancestral sequences.[36]
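
In their usual form the two criteria are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, ln L the maximized log-likelihood, and n the sample size. The sketch below assumes n is the number of alignment sites, which is a common but not universal choice; the model names and numbers in any call are up to the user.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion; the penalty grows with ln(n_sites)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(fits: dict, n_sites: int, criterion: str = "aic") -> str:
    """Pick the model name with the lowest score.

    `fits` maps model name -> (log_likelihood, n_params).
    """
    def score(item):
        lnl, k = item[1]
        return aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_sites)
    return min(fits.items(), key=score)[0]
```

Each model is scored on its own, so the result does not depend on the order in which models are examined.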

A comprehensive step-by-step protocol for constructing phylogenetic trees, including DNA/amino acid contiguous sequence assembly, multiple sequence alignment, model testing (testing best-fitting substitution models), and phylogeny reconstruction using maximum likelihood and Bayesian inference, is available at Protocol Exchange.[37]

A non-traditional way of evaluating a phylogenetic tree is to compare it with a clustering result. One can use a multidimensional scaling technique, so-called Interpolative Joining, to perform dimensionality reduction and visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.[38]

Evaluating tree support

As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).

Nodal support

The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees.[39] In a *strict consensus,* only nodes found in every tree are shown, and the rest are collapsed into an unresolved polytomy. Less conservative methods, such as the *majority-rule consensus* tree, consider nodes that are supported by a given percentage of trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support on phylogenies reconstructed with Bayesian inference (see below).
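
The two consensus rules can be sketched by counting how often each clade appears across a set of rooted trees. The nested-tuple tree format and the function names are assumptions of this example, and the result is returned as a set of clades rather than as an assembled tree.

```python
from collections import Counter


def clades(tree):
    """Return (leaves below this node, set of clades found in this subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found |= child_found
    found.add(leaves)
    return leaves, found


def clade_support(trees):
    """Fraction of trees containing each clade (the all-taxa clade is skipped)."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        counts.update(c for c in found if c != all_leaves)
    return {c: n / len(trees) for c, n in counts.items()}


def consensus_clades(trees, strict=False):
    """Clades kept by strict consensus, or by majority rule (more than 50%)."""
    support = clade_support(trees)
    if strict:
        return {c for c, f in support.items() if f == 1.0}
    return {c for c, f in support.items() if f > 0.5}
```

Clades that fail the chosen rule are the ones that would be collapsed into a polytomy when the consensus tree is drawn.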

Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node.[40]
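
The resampling step can be sketched as below, assuming the character matrix is stored as a dictionary from taxon name to an equal-length string of character states. That layout and the function name are illustrative; the tree-building step applied to each pseudoreplicate is left to whatever method produced the original tree.

```python
import random


def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping every taxon."""
    n_cols = len(next(iter(alignment.values())))
    # Draw column indices with replacement: some columns repeat, some drop out.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}
```

Each column is drawn as a whole, so the states of all taxa at a given site stay together in the pseudoreplicate.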

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories,[41] finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, nodes with values above 70% are generally regarded as supported, and it is left to the researcher or reader to evaluate confidence. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data—for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.

Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. When the input data is variant allelic frequency data (VAF), the tool EXACT can compute the probabilities of trees exactly, for small, biologically relevant tree sizes, by exhaustively searching the entire tree space.[29]

Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) which contain the node.
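
The burn-in and counting steps can be sketched as follows. The example assumes each sampled tree has already been reduced to a set of clades (each a frozenset of taxon names) and that a fixed fraction of the chain is discarded; the 25% default is only a placeholder, not a recommendation from the text.

```python
from collections import Counter


def posterior_clade_support(samples, burn_in=0.25):
    """Fraction of post-burn-in samples that contain each clade.

    `samples` is a list, in chain order, of sets of clades.
    """
    if not 0.0 <= burn_in < 1.0:
        raise ValueError("burn_in must lie in [0, 1)")
    kept = samples[int(len(samples) * burn_in):]   # drop the early part of the chain
    if not kept:
        raise ValueError("no samples left after burn-in")
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}
```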

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model.[42] Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.
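As a rough illustration, Bremer support for a clade found in the most parsimonious tree can be computed from tree lengths alone: it is the length of the shortest tree that lacks the clade minus the length of the shortest tree that contains it. The sketch below assumes those lengths (in steps) have already been obtained from parsimony searches; the function name `bremer_support` is hypothetical.

```python
def bremer_support(lengths_with_clade, lengths_without_clade):
    """Extra steps needed before the clade is contradicted.

    lengths_with_clade:    parsimony lengths of trees that contain the clade.
    lengths_without_clade: parsimony lengths of trees that do not contain it.
    """
    return min(lengths_without_clade) - min(lengths_with_clade)
```

For example, if the shortest tree containing a clade has 100 steps and the shortest tree lacking it has 102 steps, the clade's Bremer support is 2.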

Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.[43]

Bootstrap support can provide high estimates of node support as a result of noise in the data rather than the true existence of a clade.[44]

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree.[45] Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As the time since the divergence of two taxa increases, so does the probability of multiple substitutions at the same site, or of back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree, which is a somewhat circular method. Even so, weighting against homoplasious characters does indeed lead to better-supported trees.[45] Further refinement can be achieved by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.[46]
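The effect of multiple substitutions on molecular data can be made concrete with a simple substitution model. The sketch below uses the Jukes-Cantor (JC69) model purely as an illustrative assumption; the paragraph above does not name a model, and the function names are hypothetical. Under JC69 the expected fraction of sites that differ between two sequences approaches 3/4 as the true number of substitutions per site grows, so observed differences increasingly understate the actual amount of change.

```python
import math

def jc69_observed_fraction(d):
    """Expected fraction of differing sites after d substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc69_corrected_distance(p):
    """Estimate substitutions per site from an observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For small divergences the observed fraction is close to the true distance, while at large divergences repeated and back substitutions hide most of the change.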

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring, and horizontal (also called lateral) gene transfer occurs when genes jump between unrelated organisms, a common phenomenon especially in prokaryotes; a good example of this is the acquired antibiotic resistance as a result of gene exchange between various bacteria leading to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation, introgressions and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is that species split neatly in a bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants.[47][48] Paraphyletic speciation is also common, which makes the assumption of a bifurcating pattern unsuitable and leads to phylogenetic networks rather than trees.[49][50] Introgression can also move genes between otherwise distinct species and sometimes even genera,[51] complicating phylogenetic analysis based on genes.[52] This phenomenon can contribute to "incomplete lineage sorting" and is thought to be common across a number of groups. In species-level analysis this can be dealt with by larger sampling or better whole-genome analysis.[53] Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree.[54][55] This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.[56]

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding.[57] In the original form of gap coding:[57]

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated ... and differences between adjacent means ... are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores ... if the means are separated by a "gap" greater than the within-group standard deviation ... times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.[57]
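The original form of gap coding quoted above can be sketched in Python. This is a minimal illustration under stated assumptions: the function name `gap_code` and the `constant` parameter are hypothetical, each group is assumed to supply several raw measurements of one character, and the pooled within-group standard deviation is computed from the summed within-group squared deviations divided by the pooled degrees of freedom.

```python
import math

def gap_code(groups, constant=1.0):
    """Assign integer character states to groups by simple gap coding.

    groups:   dict mapping group (taxon) name -> list of measurements.
    constant: the arbitrary multiplier applied to the pooled standard deviation.
    """
    means = {name: sum(vals) / len(vals) for name, vals in groups.items()}

    # Pooled within-group standard deviation.
    sum_sq = sum((x - means[name]) ** 2 for name, vals in groups.items() for x in vals)
    dof = sum(len(vals) - 1 for vals in groups.values())
    if dof <= 0:
        raise ValueError("need at least one group with two or more measurements")
    pooled_sd = math.sqrt(sum_sq / dof)

    # Order group means by size and start a new state at every sufficiently large gap.
    ordered = sorted(means, key=means.get)
    codes, state = {}, 0
    for prev, cur in zip([None] + ordered, ordered):
        if prev is not None and means[cur] - means[prev] > constant * pooled_sd:
            state += 1
        codes[cur] = state
    return codes
```

With three groups whose means are 1.0, 1.1 and 5.0 and a pooled standard deviation of 0.2, the first two groups share state 0 and the third receives state 1. Adding many intermediate groups shrinks the gaps between adjacent means, which is the loss of information that generalized gap coding is meant to avoid.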

Missing data

In general, the more data that are available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.[58]

The role of fossils

Because many characters are embryological, soft-tissue or molecular features that (at best) hardly ever fossilize, and because the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. Despite these limitations, the inclusion of fossils is invaluable: they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa.[59] Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record;[1] stratocladistics incorporates age information into data matrices for phylogenetic analyses.

See also

References

  1. ^ a b c Khalafvand, Tyler (2015). "Finding Structure in the Phylogeny Search Space". Dalhousie University.
  2. ^ a b c d e f g h i j k l m n o Felsenstein J (2004). Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates. ISBN 978-0-87893-177-4.
  3. ^ a b c d Mount DM (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press. ISBN 978-0-87969-712-9.
  4. ^ Swiderski DL, Zelditch ML, Fink WL (September 1998). "Why morphometrics is not special: coding quantitative data for phylogenetic analysis". Systematic Biology. 47 (3): 508–19. JSTOR 2585256. PMID 12066691.
  5. ^ Gaubert P, Wozencraft WC, Cordeiro-Estrela P, Veron G (December 2005). "Mosaics of convergences and noise in morphological phylogenies: what's in a viverrid-like carnivoran?". Systematic Biology. 54 (6): 865–94. doi:10.1080/10635150500232769. PMID 16282167.
  6. ^ Strait DS, Grine FE (December 2004). "Inferring hominoid and early hominid phylogeny using craniodental characters: the role of fossil taxa". Journal of Human Evolution. 47 (6): 399–452. doi:10.1016/j.jhevol.2004.08.008. PMID 15566946.
  7. ^ Wiens JJ (2001). "Character analysis in morphological phylogenetics: problems and solutions". Systematic Biology. 50 (5): 689–99. doi:10.1080/106351501753328811. PMID 12116939.
  8. ^ Jenner RA (2001). "Bilaterian phylogeny and uncritical recycling of morphological data sets". Systematic Biology. 50 (5): 730–42. doi:10.1080/106351501753328857. PMID 12116943.
  9. ^ Sokal R, Michener C (1958). "A statistical method for evaluating systematic relationships". University of Kansas Science Bulletin. 38: 1409–1438.
  10. ^ Saitou N, Nei M (July 1987). "The neighbor-joining method: a new method for reconstructing phylogenetic trees". Molecular Biology and Evolution. 4 (4): 406–25. doi:10.1093/oxfordjournals.molbev.a040454. PMID 3447015.
  11. ^ Fitch WM, Margoliash E (January 1967). "Construction of phylogenetic trees". Science. 155 (3760): 279–84. Bibcode:1967Sci...155..279F. doi:10.1126/science.155.3760.279. PMID 5334057.
  12. ^ Lespinats S, Grando D, Maréchal E, Hakimi MA, Tenaillon O, Bastien O (2011). "How Fitch-Margoliash Algorithm can Benefit from Multi Dimensional Scaling". Evolutionary Bioinformatics Online. 7: 61–85. doi:10.4137/EBO.S7048. PMC 3118699. PMID 21697992.
  13. ^ Day WH (1987). "Computational complexity of inferring phylogenies from dissimilarity matrices". Bulletin of Mathematical Biology. 49 (4): 461–7. doi:10.1007/BF02458863. PMID 3664032. S2CID 189885258.
  14. ^ Hendy MD, Penny D (1982). "Branch and bound algorithms to determine minimal evolutionary trees". Mathematical Biosciences. 59 (2): 277–290. doi:10.1016/0025-5564(82)90027-X.
  15. ^ Ratner VA, Zharkikh AA, Kolchanov N, Rodin S, Solovyov S, Antonov AS (1995). Molecular Evolution. Biomathematics Series. Vol. 24. New York: Springer-Verlag. ISBN 978-3-662-12530-4.
  16. ^ Sankoff D, Morel C, Cedergren RJ (October 1973). "Evolution of 5S RNA and the non-randomness of base replacement". Nature. 245 (147): 232–4. doi:10.1038/newbio245232a0. PMID 4201431.
  17. ^ a b De Laet J (2005). "Parsimony and the problem of inapplicables in sequence data.". In Albert VA (ed.). Parsimony, phylogeny and genomics. Oxford University Press. pp. 81–116. ISBN 978-0-19-856493-5.
  18. ^ Wheeler WC, Gladstein DS (1994). "MALIGN: a multiple nucleic acid sequence alignment program". Journal of Heredity. 85 (5): 417–418. doi:10.1093/oxfordjournals.jhered.a111492.
  19. ^ Simmons MP (June 2004). "Independence of alignment and tree search". Molecular Phylogenetics and Evolution. 31 (3): 874–9. doi:10.1016/j.ympev.2003.10.008. PMID 15120385.
  20. ^ De Laet J (2015). "Parsimony analysis of unaligned sequence data: maximization of homology and minimization of homoplasy, not Minimization of operationally defined total cost or minimization of equally weighted transformations". Cladistics. 31 (5): 550–567. doi:10.1111/cla.12098. PMID 34772278. S2CID 221582410.
  21. ^ Chor B, Tuller T (June 2005). "Maximum likelihood of evolutionary trees: hardness and approximation". Bioinformatics. 21 (Suppl 1): i97–106. doi:10.1093/bioinformatics/bti1027. PMID 15961504.
  22. ^ El-Kebir M, Oesper L, Acheson-Field H, Raphael BJ (June 2015). "Reconstruction of clonal trees and tumor composition from multi-sample sequencing data". Bioinformatics. 31 (12): i62-70. doi:10.1093/bioinformatics/btv261. PMC 4542783. PMID 26072510.
  23. ^ Malikic S, McPherson AW, Donmez N, Sahinalp CS (May 2015). "Clonality inference in multiple tumor samples using phylogeny". Bioinformatics. 31 (9): 1349–56. doi:10.1093/bioinformatics/btv003. PMID 25568283.
  24. ^ Mau B, Newton MA (1997). "Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo". Journal of Computational and Graphical Statistics. 6 (1): 122–131. doi:10.2307/1390728. JSTOR 1390728.
  25. ^ Yang Z, Rannala B (July 1997). "Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method". Molecular Biology and Evolution. 14 (7): 717–24. doi:10.1093/oxfordjournals.molbev.a025811. PMID 9214744.
  26. ^ Kolaczkowski B, Thornton JW (December 2009). Delport W (ed.). "Long-branch attraction bias and inconsistency in Bayesian phylogenetics". PLOS ONE. 4 (12): e7891. Bibcode:2009PLoSO...4.7891K. doi:10.1371/journal.pone.0007891. PMC 2785476. PMID 20011052.
  27. ^ Simmons MP (2012). "Misleading results of likelihood-based phylogenetic analyses in the presence of missing data". Cladistics. 28 (2): 208–222. doi:10.1111/j.1096-0031.2011.00375.x. PMID 34872185. S2CID 53123024.
  28. ^ Larget B (July 2013). "The estimation of tree posterior probabilities using conditional clade probability distributions". Systematic Biology. 62 (4): 501–11. doi:10.1093/sysbio/syt014. PMC 3676676. PMID 23479066.
  29. ^ a b Ray S, Jia B, Safavi S, van Opijnen T, Isberg R, Rosch J, Bento J (22 August 2019). "Exact inference under the perfect phylogeny model". arXiv:1908.08623. Bibcode:2019arXiv190808623R.
  30. ^ Jiang Y, Qiu Y, Minn AJ, Zhang NR (September 2016). "Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next-generation sequencing". Proceedings of the National Academy of Sciences of the United States of America. 113 (37): E5528-37. Bibcode:2016PNAS..113E5528J. doi:10.1073/pnas.1522203113. PMC 5027458. PMID 27573852.
  31. ^ Deshwar AG, Vembu S, Yung CK, Jang GH, Stein L, Morris Q (February 2015). "PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors". Genome Biology. 16 (1): 35. doi:10.1186/s13059-015-0602-8. PMC 4359439. PMID 25786235.
  32. ^ a b c d e f Sullivan J, Joyce P (2005). "Model Selection in Phylogenetics". Annual Review of Ecology, Evolution, and Systematics. 36 (1): 445–466. doi:10.1146/annurev.ecolsys.36.102003.152633. PMC 3144157. PMID 20671039.
  33. ^ Galtier N, Gouy M (July 1998). "Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis". Molecular Biology and Evolution. 15 (7): 871–9. doi:10.1093/oxfordjournals.molbev.a025991. PMID 9656487.
  34. ^ Fitch WM, Markowitz E (October 1970). "An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution". Biochemical Genetics. 4 (5): 579–93. doi:10.1007/bf00486096. PMID 5489762. S2CID 26638948.
  35. ^ Pol D (December 2004). "Empirical problems of the hierarchical likelihood ratio test for model selection". Systematic Biology. 53 (6): 949–62. doi:10.1080/10635150490888868. PMID 15764562.
  36. ^ Abadi S, Azouri D, Pupko T, Mayrose I (February 2019). "Model selection may not be a mandatory step for phylogeny reconstruction". Nature Communications. 10 (1): 934. Bibcode:2019NatCo..10..934A. doi:10.1038/s41467-019-08822-w. PMC 6389923. PMID 30804347.
  37. ^ Bast F (2013). "Sequence similarity search, Multiple Sequence Alignment, Model Selection, Distance Matrix and Phylogeny Reconstruction". Protocol Exchange. doi:10.1038/protex.2013.065.
  38. ^ Ruan Y, House GL, Ekanayake S, Schütte U, Bever JD, Tang H, Fox G (26 May 2014). "Integration of clustering and multidimensional scaling to determine phylogenetic trees as spherical phylograms visualized in 3 dimensions". 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE. pp. 720–729. doi:10.1109/CCGrid.2014.126. ISBN 978-1-4799-2784-5. S2CID 9581901.
  39. ^ Baum DA, Smith SD (2013). Tree Thinking: An Introduction to Phylogenetic Biology. Roberts. p. 442. ISBN 978-1-936221-16-5.
  40. ^ Felsenstein J (July 1985). "Confidence Limits on Phylogenies: An Approach Using the Bootstrap". Evolution; International Journal of Organic Evolution. 39 (4): 783–791. doi:10.2307/2408678. JSTOR 2408678. PMID 28561359.
  41. ^ Hillis DM, Bull JJ (1993). "An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis". Systematic Biology. 42 (2): 182–192. doi:10.1093/sysbio/42.2.182. ISSN 1063-5157.
  42. ^ Huelsenbeck J, Rannala B (December 2004). "Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models". Systematic Biology. 53 (6): 904–13. doi:10.1080/10635150490522629. PMID 15764559.
  43. ^ Chemisquy MA, Prevosti FJ (2013). "Evaluating the clade size effect in alternative measures of branch support". Journal of Zoological Systematics and Evolutionary Research. 51 (4): 260–273. doi:10.1111/jzs.12024. hdl:11336/4144.
  44. ^ Phillips MJ, Delsuc F, Penny D (July 2004). "Genome-scale phylogeny and the detection of systematic biases" (PDF). Molecular Biology and Evolution. 21 (7): 1455–8. doi:10.1093/molbev/msh137. PMID 15084674.
  45. ^ a b Goloboff PA, Carpenter JM, Arias JS, Esquivel DR (2008). "Weighting against homoplasy improves phylogenetic analysis of morphological data sets". Cladistics. 24 (5): 758–773. doi:10.1111/j.1096-0031.2008.00209.x. hdl:11336/82003. S2CID 913161.
  46. ^ Goloboff PA (1997). "Self-Weighted Optimization: Tree Searches and Character State Reconstructions under Implied Transformation Costs". Cladistics. 13 (3): 225–245. doi:10.1111/j.1096-0031.1997.tb00317.x. PMID 34911233. S2CID 196595734.
  47. ^ Arnold ML (1996). Natural Hybridization and Evolution. New York: Oxford University Press. p. 232. ISBN 978-0-19-509975-1.
  48. ^ Wendel JF, Doyle JJ (1998). "DNA Sequencing". In Soltis DE, Soltis PS, Doyle JJ (eds.). Molecular Systematics of Plants II. Boston: Kluwer. pp. 265–296. ISBN 978-0-19-535668-7.
  49. ^ Funk DJ, Omland KE (2003). "Species-level paraphyly and polyphyly: Frequency, causes, and consequences, with insights from animal mitochondrial DNA". Annual Review of Ecology, Evolution, and Systematics. 34: 397–423. doi:10.1146/annurev.ecolsys.34.011802.132421. S2CID 33951905.
  50. ^ "Genealogy of Life (GoLife)". National Science Foundation. Retrieved 5 May 2015. The GoLife program builds upon the AToL program by accommodating the complexity of diversification patterns across all of life's history. Our current knowledge of processes such as hybridization, endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted - for every branch of the tree - as a single, typological, bifurcating tree.
  51. ^ Kutschera VE, Bidon T, Hailer F, Rodi J, Fain SR, Janke A (2014). "Bears in a forest of gene trees: phylogenetic inference is complicated by incomplete lineage sorting and gene flow". Molecular Biology and Evolution. 31 (8): 2004–2017. doi:10.1093/molbev/msu186. PMC 4104321. PMID 24903145.
  52. ^ Qu Y, Zhang R, Quan Q, Song G, Li SH, Lei F (December 2012). "Incomplete lineage sorting or secondary admixture: disentangling historical divergence from recent gene flow in the Vinous-throated parrotbill (Paradoxornis webbianus)". Molecular Ecology. 21 (24): 6117–33. Bibcode:2012MolEc..21.6117Q. doi:10.1111/mec.12080. PMID 23095021. S2CID 22635918.
  53. ^ Pollard DA, Iyer VN, Moses AM, Eisen MB (October 2006). "Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting". PLOS Genetics. 2 (10): e173. doi:10.1371/journal.pgen.0020173. PMC 1626107. PMID 17132051.
  54. ^ Zwickl DJ, Hillis DM (August 2002). "Increased taxon sampling greatly reduces phylogenetic error". Systematic Biology. 51 (4): 588–98. doi:10.1080/10635150290102339. PMID 12228001.
  55. ^ Wiens JJ (February 2006). "Missing data and the design of phylogenetic analyses". Journal of Biomedical Informatics. 39 (1): 34–42. doi:10.1016/j.jbi.2005.04.001. PMID 15922672.
  56. ^ Blomberg SP, Garland T, Ives AR (April 2003). "Testing for phylogenetic signal in comparative data: behavioral traits are more labile". Evolution; International Journal of Organic Evolution. 57 (4): 717–45. doi:10.1111/j.0014-3820.2003.tb00285.x. PMID 12778543. S2CID 221735844.
  57. ^ a b c Archie JW (1985). "Methods for coding variable morphological features for numerical taxonomic analysis". Systematic Zoology. 34 (3): 326–345. doi:10.2307/2413151. JSTOR 2413151.
  58. ^ Prevosti FJ, Chemisquy MA (2009). "The impact of missing data on real morphological phylogenies: Influence of the number and distribution of missing entries". Cladistics. 26 (3): 326–339. doi:10.1111/j.1096-0031.2009.00289.x. hdl:11336/69010. PMID 34875786. S2CID 86850694.
  59. ^ Cobbett A, Wilkinson M, Wills MA (October 2007). "Fossils impact as hard as living taxa in parsimony analyses of morphology". Systematic Biology. 56 (5): 753–66. doi:10.1080/10635150701627296. PMID 17886145.

Further reading

  • Semple C, Steel M (2003). Phylogenetics. Oxford University Press. ISBN 978-0-19-850942-4.
  • Cipra BA (2007). (PDF). SIAM News. 40 (6). Archived from the original (PDF) on 3 March 2016.
  • Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8. Archived from the original on 11 August 2011. Retrieved 17 August 2011.
  • Huson DH, Rupp R, Scornavacca C (2010). Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press. ISBN 978-1-139-49287-4.

External links

  •   Media related to Computational phylogenetics at Wikimedia Commons

computational, phylogenetics, major, contributor, this, article, appears, have, close, connection, with, subject, require, cleanup, comply, with, wikipedia, content, policies, particularly, neutral, point, view, please, discuss, further, talk, page, february, . A major contributor to this article appears to have a close connection with its subject It may require cleanup to comply with Wikipedia s content policies particularly neutral point of view Please discuss further on the talk page February 2024 Learn how and when to remove this template message Computational phylogenetics phylogeny inference or phylogenetic inference focuses on computational and optimization algorithms heuristics and approaches involved in phylogenetic analyses The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes species or taxa Maximum likelihood parsimony Bayesian and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data 1 2 Nearest Neighbour Interchange NNI Subtree Prune and Regraft SPR and Tree Bisection and Reconnection TBR known as tree rearrangements are deterministic algorithms to search for optimal or the best phylogenetic tree The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space Maximum Likelihood also likelihood optimality criterion is the process of finding the tree topology along with its branch lengths that provides the highest probability observing the sequence data while parsimony optimality criterion is the fewest number of state evolutionary changes required for a phylogenetic tree to explain the sequence data 1 2 Traditional phylogenetics relies on morphological data obtained by measuring and quantifying the phenotypic properties of representative organisms while the more recent field of molecular phylogenetics uses nucleotide sequences encoding genes or amino acid sequences encoding 
proteins as the basis for classification Many forms of molecular phylogenetics are closely related to and make extensive use of sequence alignment in constructing and refining phylogenetic trees which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species The phylogenetic trees constructed by computational methods are unlikely to perfectly reproduce the evolutionary tree that represents the historical relationships between the species being analyzed citation needed The historical species tree may also differ from the historical tree of an individual homologous gene shared by those species Contents 1 Types of phylogenetic trees and networks 2 Coding characters and defining homology 2 1 Morphological analysis 2 2 Molecular analysis 3 Distance matrix methods 3 1 UPGMA and WPGMA 3 2 Neighbor joining 3 3 Fitch Margoliash method 3 4 Using outgroups 4 Maximum parsimony 4 1 Branch and bound 4 2 Sankoff Morel Cedergren algorithm 4 3 MALIGN and POY 5 Maximum likelihood 6 Bayesian inference 7 Model selection 7 1 Types of models 7 2 Choosing the best model 8 Evaluating tree support 8 1 Nodal support 8 1 1 Consensus tree 8 1 2 Bootstrapping and jackknifing 8 1 3 Posterior probability 8 1 4 Step counting methods 8 2 Shortcomings 9 Limitations and workarounds 9 1 Homoplasy 9 2 Horizontal gene transfer 9 3 Hybrids speciation introgressions and incomplete lineage sorting 9 4 Taxon sampling 9 5 Phylogenetic signal 9 6 Continuous characters 9 7 Missing data 10 The role of fossils 11 See also 12 References 13 Further reading 14 External linksTypes of phylogenetic trees and networks editPhylogenetic trees generated by computational phylogenetics can be either rooted or unrooted depending on the input data and the algorithm used A rooted tree is a directed graph that explicitly identifies a most recent common ancestor MRCA citation needed usually an imputed sequence that is not represented in the input Genetic 
distance measures can be used to plot a tree with the input sequences as leaf nodes and their distances from the root proportional to their genetic distance from the hypothesized MRCA Identification of a root usually requires the inclusion in the input data of at least one outgroup known to be only distantly related to the sequences of interest By contrast unrooted trees plot the distances and relationships between input sequences without making assumptions regarding their descent An unrooted tree can always be produced from a rooted tree but a root cannot usually be placed on an unrooted tree without additional data on divergence rates such as the assumption of the molecular clock hypothesis 3 The set of all possible phylogenetic trees for a given group of input sequences can be conceptualized as a discretely defined multidimensional tree space through which search paths can be traced by optimization algorithms Although counting the total number of trees for a nontrivial number of input sequences can be complicated by variations in the definition of a tree topology it is always true that there are more rooted than unrooted trees for a given number of inputs and choice of parameters 2 Both rooted and unrooted phylogenetic trees can be further generalized to rooted or unrooted phylogenetic networks which allow for the modeling of evolutionary phenomena such as hybridization or horizontal gene transfer Coding characters and defining homology editMorphological analysis edit The basic problem in morphological phylogenetics is the assembly of a matrix representing a mapping from each of the taxa being compared to representative measurements for each of the phenotypic characteristics being used as a classifier The types of phenotypic data used to construct this matrix depend on the taxa being compared for individual species they may involve measurements of average body size lengths or sizes of particular bones or other physical features or even behavioral manifestations 
Of course since not every possible phenotypic characteristic could be measured and encoded for analysis the selection of which features to measure is a major inherent obstacle to the method The decision of which traits to use as a basis for the matrix necessarily represents a hypothesis about which traits of a species or higher taxon are evolutionarily relevant 4 Morphological studies can be confounded by examples of convergent evolution of phenotypes 5 A major challenge in constructing useful classes is the high likelihood of inter taxon overlap in the distribution of the phenotype s variation The inclusion of extinct taxa in morphological analysis is often difficult due to absence of or incomplete fossil records but has been shown to have a significant effect on the trees produced in one study only the inclusion of extinct species of apes produced a morphologically derived tree that was consistent with that produced from molecular data 6 Some phenotypic classifications particularly those used when analyzing very diverse groups of taxa are discrete and unambiguous classifying organisms as possessing or lacking a tail for example is straightforward in the majority of cases as is counting features such as eyes or vertebrae However the most appropriate representation of continuously varying phenotypic measurements is a controversial problem without a general solution A common method is simply to sort the measurements of interest into two or more classes rendering continuous observed variation as discretely classifiable e g all examples with humerus bones longer than a given cutoff are scored as members of one state and all members whose humerus bones are shorter than the cutoff are scored as members of a second state This results in an easily manipulated data set but has been criticized for poor reporting of the basis for the class definitions and for sacrificing information compared to methods that use a continuous weighted distribution of measurements 7 Because 
morphological data is extremely labor intensive to collect whether from literature sources or from field observations reuse of previously compiled data matrices is not uncommon although this may propagate flaws in the original matrix into multiple derivative analyses 8 Molecular analysis edit The problem of character coding is very different in molecular analyses as the characters in biological sequence data are immediate and discretely defined distinct nucleotides in DNA or RNA sequences and distinct amino acids in protein sequences However defining homology can be challenging due to the inherent difficulties of multiple sequence alignment For a given gapped MSA several rooted phylogenetic trees can be constructed that vary in their interpretations of which changes are mutations versus ancestral characters and which events are insertion mutations or deletion mutations For example given only a pairwise alignment with a gap region it is impossible to determine whether one sequence bears an insertion mutation or the other carries a deletion The problem is magnified in MSAs with unaligned and nonoverlapping gaps In practice sizable regions of a calculated alignment may be discounted in phylogenetic tree construction to avoid integrating noisy data into the tree calculation Distance matrix methods editMain article Distance matrices in phylogeny Distance matrix methods of phylogenetic analysis explicitly rely on a measure of genetic distance between the sequences being classified and therefore they require an MSA as an input Distance is often defined as the fraction of mismatches at aligned positions with gaps either ignored or counted as mismatches 3 Distance methods attempt to construct an all to all matrix from the sequence query set describing the distance between each sequence pair From this is constructed a phylogenetic tree that places closely related sequences under the same interior node and whose branch lengths closely reproduce the observed distances between 
sequences. Distance matrix methods may produce either rooted or unrooted trees, depending on the algorithm used to calculate them. They are frequently used as the basis for progressive and iterative types of multiple sequence alignments. The main disadvantage of distance matrix methods is their inability to efficiently use information about local high-variation regions that appear across multiple subtrees.[2]

UPGMA and WPGMA

Main articles: UPGMA and WPGMA

The UPGMA (Unweighted Pair Group Method with Arithmetic mean) and WPGMA (Weighted Pair Group Method with Arithmetic mean) methods produce rooted trees and require a constant-rate assumption; that is, they assume an ultrametric tree in which the distances from the root to every branch tip are equal.[9]

Neighbor-joining

Main article: Neighbor joining

Neighbor-joining methods apply general cluster analysis techniques to sequence analysis, using genetic distance as a clustering metric. The simple neighbor-joining method produces unrooted trees, but it does not assume a constant rate of evolution (i.e., a molecular clock) across lineages.[10]

Fitch-Margoliash method

The Fitch-Margoliash method uses a weighted least squares method for clustering based on genetic distance.[11] Closely related sequences are given more weight in the tree construction process to correct for the increased inaccuracy in measuring distances between distantly related sequences. The distances used as input to the algorithm must be normalized to prevent large artifacts in computing relationships between closely related and distantly related groups. The distances calculated by this method must be linear; the linearity criterion for distances requires that the expected values of the branch lengths for two individual branches equal the expected value of the sum of the two branch distances, a property that applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. This correction is done through
the use of a substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is only necessary in practice when the evolution rates differ among branches.[2] Another modification of the algorithm can be helpful, especially in the case of concentrated distances (see the concentration of measure phenomenon and the curse of dimensionality): that modification, described in [12], has been shown to improve the efficiency of the algorithm and its robustness.

The least-squares criterion applied to these distances is more accurate but less efficient than the neighbor-joining methods. An additional improvement that corrects for correlations between distances that arise from many closely related sequences in the data set can also be applied at increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete,[13] so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space.

Using outgroups

Independent information about the relationship between sequences or groups can be used to help reduce the tree search space and root unrooted trees. Standard usage of distance matrix methods involves the inclusion of at least one outgroup sequence known to be only distantly related to the sequences of interest in the query set.[3] This usage can be seen as a type of experimental control. If the outgroup has been appropriately chosen, it will have a much greater genetic distance, and thus a longer branch length, than any other sequence, and it will appear near the root of a rooted tree. Choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest; too close a relationship defeats the purpose of the outgroup, and too distant a relationship adds noise to the analysis.[3] Care should also be taken to avoid situations in which the species from which the sequences were taken are distantly related, but the gene encoded by the
sequences is highly conserved across lineages. Horizontal gene transfer, especially between otherwise divergent bacteria, can also confound outgroup usage.

Maximum parsimony

Main article: Maximum parsimony (phylogenetics)

Maximum parsimony (MP) is a method of identifying the potential phylogenetic tree that requires the smallest total number of evolutionary events to explain the observed sequence data. Some ways of scoring trees also include a "cost" associated with particular types of evolutionary events and attempt to locate the tree with the smallest total cost. This is a useful approach in cases where not every possible type of event is equally likely, for example, when particular nucleotides or amino acids are known to be more mutable than others.

The most naive way of identifying the most parsimonious tree is simple enumeration: considering each possible tree in succession and searching for the tree with the smallest score. However, this is only possible for a relatively small number of sequences or species because the problem of identifying the most parsimonious tree is known to be NP-hard;[2] consequently, a number of heuristic search methods for optimization have been developed to locate a highly parsimonious tree, if not the best in the set. Most such methods involve a steepest descent-style minimization mechanism operating on a tree rearrangement criterion.

Branch and bound

The branch and bound algorithm is a general method used to increase the efficiency of searches for near-optimal solutions of NP-hard problems, first applied to phylogenetics in the early 1980s.[14] Branch and bound is particularly well suited to phylogenetic tree construction because it inherently requires dividing a problem into a tree structure as it subdivides the problem space into smaller regions. As its name implies, it requires as input both a branching rule (in the case of phylogenetics, the addition of the next species or sequence to the tree) and a bound (a rule that excludes certain regions of the
search space from consideration, thereby assuming that the optimal solution cannot occupy that region). Identifying a good bound is the most challenging aspect of the algorithm's application to phylogenetics. A simple way of defining the bound is a maximum number of assumed evolutionary changes allowed per tree. A set of criteria known as Zharkikh's rules[15] severely limit the search space by defining characteristics shared by all candidate "most parsimonious" trees. The two most basic rules require the elimination of all but one redundant sequence (for cases where multiple observations have produced identical data) and the elimination of character sites at which two or more states do not occur in at least two species. Under ideal conditions these rules and their associated algorithm would completely define a tree.

Sankoff-Morel-Cedergren algorithm

The Sankoff-Morel-Cedergren algorithm was among the first published methods to simultaneously produce an MSA and a phylogenetic tree for nucleotide sequences.[16] The method uses a maximum parsimony calculation in conjunction with a scoring function that penalizes gaps and mismatches, thereby favoring the tree that introduces a minimal number of such events (an alternative view holds that the trees to be favored are those that maximize the amount of sequence similarity that can be interpreted as homology, a point of view that may lead to different optimal trees[17]). The imputed sequences at the interior nodes of the tree are scored and summed over all the nodes in each possible tree. The lowest-scoring tree sum provides both an optimal tree and an optimal MSA given the scoring function. Because the method is highly computationally intensive, an approximate method can be used in which initial guesses for the interior alignments are refined one node at a time. Both the full and the approximate version are in practice calculated by dynamic programming.[2]

MALIGN and POY

More recent phylogenetic tree/MSA methods use heuristics to isolate high-scoring
but not necessarily optimal trees. The MALIGN method uses a maximum-parsimony technique to compute a multiple alignment by maximizing a cladogram score, and its companion POY uses an iterative method that couples the optimization of the phylogenetic tree with improvements in the corresponding MSA.[18] However, the use of these methods in constructing evolutionary hypotheses has been criticized as biased due to the deliberate construction of trees reflecting minimal evolutionary events.[19] This, in turn, has been countered by the view that such methods should be seen as heuristic approaches to find the trees that maximize the amount of sequence similarity that can be interpreted as homology.[17][20]

Maximum likelihood

The maximum likelihood method uses standard statistical techniques for inferring probability distributions to assign probabilities to particular possible phylogenetic trees. The method requires a substitution model to assess the probability of particular mutations; roughly, a tree that requires more mutations at interior nodes to explain the observed phylogeny will be assessed as having a lower probability. This is broadly similar to the maximum-parsimony method, but maximum likelihood allows additional statistical flexibility by permitting varying rates of evolution across both lineages and sites. In fact, the method requires that evolution at different sites and along different lineages must be statistically independent. Maximum likelihood is thus well suited to the analysis of distantly related sequences, but it is believed to be computationally intractable due to its NP-hardness.[21]

The "pruning" algorithm, a variant of dynamic programming, is often used to reduce the search space by efficiently calculating the likelihood of subtrees.[2] The method calculates the likelihood for each site in a "linear" manner, starting at a node whose only descendants are leaves (that is, the tips of the tree) and working backwards toward the "bottom" node in nested sets. However, the
trees produced by the method are only rooted if the substitution model is irreversible, which is not generally true of biological systems. The search for the maximum-likelihood tree also includes a branch length optimization component that is difficult to improve upon algorithmically; general numerical optimization tools such as the Newton-Raphson method are often used.

Some tools that use maximum likelihood to infer phylogenetic trees from variant allelic frequency data (VAFs) include AncesTree and CITUP.[22][23]

Bayesian inference

Main article: Bayesian inference in phylogeny

Bayesian inference can be used to produce phylogenetic trees in a manner closely related to the maximum likelihood methods. Bayesian methods assume a prior probability distribution of the possible trees, which may simply be the probability of any one tree among all the possible trees that could be generated from the data, or may be a more sophisticated estimate derived from the assumption that divergence events such as speciation occur as stochastic processes. The choice of prior distribution is a point of contention among users of Bayesian-inference phylogenetics methods.[2]

Implementations of Bayesian methods generally use Markov chain Monte Carlo sampling algorithms, although the choice of move set varies; selections used in Bayesian phylogenetics include circularly permuting leaf nodes of a proposed tree at each step[24] and swapping descendant subtrees of a random internal node between two related trees.[25] The use of Bayesian methods in phylogenetics has been controversial, largely due to incomplete specification of the choice of move set, acceptance criterion, and prior distribution in published work.[2] Bayesian methods are generally held to be superior to parsimony-based methods; they can be more prone to long-branch attraction than maximum likelihood techniques,[26] although they are better able to accommodate missing data.[27]

Whereas likelihood methods find the tree that maximizes the probability of the data,
a Bayesian approach recovers a tree that represents the most likely clades, by drawing on the posterior distribution. However, estimates of the posterior probability of clades (measuring their "support") can be quite wide of the mark, especially in clades that aren't overwhelmingly likely. As such, other methods have been put forward to estimate posterior probability.[28]

Some tools that use Bayesian inference to infer phylogenetic trees from variant allelic frequency data (VAFs) include Canopy, EXACT, and PhyloWGS.[29][30][31]

Model selection

Molecular phylogenetics methods rely on a defined substitution model that encodes a hypothesis about the relative rates of mutation at various sites along the gene or amino acid sequences being studied. At their simplest, substitution models aim to correct for differences in the rates of transitions and transversions in nucleotide sequences. The use of substitution models is necessitated by the fact that the genetic distance between two sequences increases linearly only for a short time after the two sequences diverge from each other (alternatively, the distance is linear only shortly before coalescence). The longer the amount of time after divergence, the more likely it becomes that two mutations occur at the same nucleotide site. Simple genetic distance calculations will thus undercount the number of mutation events that have occurred in evolutionary history. The extent of this undercount increases with increasing time since divergence, which can lead to the phenomenon of long branch attraction, or the misassignment of two distantly related but convergently evolving sequences as closely related.[32] The maximum parsimony method is particularly susceptible to this problem due to its explicit search for a tree representing a minimum number of distinct evolutionary events.[2]

Types of models

Main article: Substitution model

All substitution models assign a set of weights to each possible change of state represented in the sequence. The most common model
types are implicitly reversible because they assign the same weight to, for example, a G>C nucleotide mutation as to a C>G mutation. The simplest possible model, the Jukes-Cantor model, assigns an equal probability to every possible change of state for a given nucleotide base. The rate of change between any two distinct nucleotides will be one-third of the overall substitution rate.[2] More advanced models distinguish between transitions and transversions. The most general possible time-reversible model, called the GTR model, has six mutation rate parameters. An even more generalized model, known as the general 12-parameter model, breaks time-reversibility, at the cost of much additional complexity in calculating genetic distances that are consistent among multiple lineages.[2] One possible variation on this theme adjusts the rates so that overall GC content, an important measure of DNA double helix stability, varies over time.[33]

Models may also allow for the variation of rates with positions in the input sequence. The most obvious example of such variation follows from the arrangement of nucleotides in protein-coding genes into three-base codons. If the location of the open reading frame (ORF) is known, rates of mutation can be adjusted for the position of a given site within a codon, since it is known that wobble base pairing can allow for higher mutation rates in the third nucleotide of a given codon without affecting the codon's meaning in the genetic code.[32] A less hypothesis-driven example that does not rely on ORF identification simply assigns to each site a rate randomly drawn from a predetermined distribution, often the gamma distribution or log-normal distribution.[2] Finally, a more conservative estimate of rate variations, known as the covarion method, allows autocorrelated variations in rates, so that the mutation rate of a given site is correlated across sites and lineages.[34]

Choosing the best model

The selection of an appropriate model is critical for the production of good
phylogenetic analyses, both because underparameterized or overly restrictive models may produce aberrant behavior when their underlying assumptions are violated, and because overly complex or overparameterized models are computationally expensive and the parameters may be overfit.[32] The most common method of model selection is the likelihood ratio test (LRT), which produces a likelihood estimate that can be interpreted as a measure of "goodness of fit" between the model and the input data.[32] However, care must be taken in using these results, since a more complex model with more parameters will always have a likelihood at least as high as that of a simplified version of the same model, which can lead to the naive selection of models that are overly complex.[2] For this reason, model selection computer programs will choose the simplest model that is not significantly worse than more complex substitution models. A significant disadvantage of the LRT is the necessity of making a series of pairwise comparisons between models; it has been shown that the order in which the models are compared has a major effect on the one that is eventually selected.[35]

An alternative model selection method is the Akaike information criterion (AIC), formally an estimate of the Kullback-Leibler divergence between the true model and the model being tested. It can be interpreted as a likelihood estimate with a correction factor to penalize overparameterized models.[32] The AIC is calculated on an individual model rather than a pair, so it is independent of the order in which models are assessed. A related alternative, the Bayesian information criterion (BIC), has a similar basic interpretation but penalizes complex models more heavily.[32] Determining the most suitable model for phylogeny reconstruction constitutes a fundamental step in numerous evolutionary studies. However, various criteria for model selection are leading to debate over which criterion is preferable. It has recently been shown that, when topologies and ancestral sequence
reconstruction are the desired output, choosing one criterion over another is not crucial. Instead, using the most complex nucleotide substitution model, GTR+I+G, leads to similar results for the inference of tree topology and ancestral sequences.[36]

A comprehensive step-by-step protocol on constructing phylogenetic trees, including DNA/amino acid contiguous sequence assembly, multiple sequence alignment, model testing (testing best-fitting substitution models) and phylogeny reconstruction using Maximum Likelihood and Bayesian Inference, is available at Protocol Exchange.[37]

A non-traditional way of evaluating the phylogenetic tree is to compare it with a clustering result. One can use a Multidimensional Scaling technique, so-called Interpolative Joining, to do dimensionality reduction to visualize the clustering result for the sequences in 3D, and then map the phylogenetic tree onto the clustering result. A better tree usually has a higher correlation with the clustering result.[38]

Evaluating tree support

As with all statistical analysis, the estimation of phylogenies from character data requires an evaluation of confidence. A number of methods exist to test the amount of support for a phylogenetic tree, either by evaluating the support for each sub-tree in the phylogeny (nodal support) or evaluating whether the phylogeny is significantly different from other possible trees (alternative tree hypothesis tests).

Nodal support

The most common method for assessing tree support is to evaluate the statistical support for each node on the tree. Typically, a node with very low support is not considered valid in further analysis, and visually may be collapsed into a polytomy to indicate that relationships within a clade are unresolved.

Consensus tree

Many methods for assessing nodal support involve consideration of multiple phylogenies. The consensus tree summarizes the nodes that are shared among a set of trees.[39] In a strict consensus, only nodes found in every tree are shown, and the rest are
collapsed into an unresolved polytomy. Less conservative methods, such as the majority-rule consensus tree, consider nodes that are supported by a given percentage of trees under consideration (such as at least 50%).

For example, in maximum parsimony analysis, there may be many trees with the same parsimony score. A strict consensus tree would show which nodes are found in all equally parsimonious trees, and which nodes differ. Consensus trees are also used to evaluate support on phylogenies reconstructed with Bayesian inference (see below).

Bootstrapping and jackknifing

In statistics, the bootstrap is a method for inferring the variability of data that has an unknown distribution using pseudoreplications of the original data. For example, given a set of 100 data points, a pseudoreplicate is a data set of the same size (100 points) randomly sampled from the original data, with replacement. That is, each original data point may be represented more than once in the pseudoreplicate, or not at all. Statistical support involves evaluation of whether the original data has similar properties to a large set of pseudoreplicates.

In phylogenetics, bootstrapping is conducted using the columns of the character matrix. Each pseudoreplicate contains the same number of species (rows) and characters (columns) randomly sampled from the original matrix, with replacement. A phylogeny is reconstructed from each pseudoreplicate, with the same methods used to reconstruct the phylogeny from the original data. For each node on the phylogeny, the nodal support is the percentage of pseudoreplicates containing that node.[40]

The statistical rigor of the bootstrap test has been empirically evaluated using viral populations with known evolutionary histories,[41] finding that 70% bootstrap support corresponds to a 95% probability that the clade exists. However, this was tested under ideal conditions (e.g. no change in evolutionary rates, symmetric phylogenies). In practice, values above 70% are generally regarded as supported, and it is left to the researcher
or reader to evaluate confidence. Nodes with support lower than 70% are typically considered unresolved.

Jackknifing in phylogenetics is a similar procedure, except the columns of the matrix are sampled without replacement. Pseudoreplicates are generated by randomly subsampling the data; for example, a "10% jackknife" would involve randomly sampling 10% of the matrix many times to evaluate nodal support.

Posterior probability

Reconstruction of phylogenies using Bayesian inference generates a posterior distribution of highly probable trees given the data and evolutionary model, rather than a single "best" tree. The trees in the posterior distribution generally have many different topologies. When the input data is variant allelic frequency data (VAF), the tool EXACT can compute the probabilities of trees exactly, for small, biologically relevant tree sizes, by exhaustively searching the entire tree space.[29]

Most Bayesian inference methods utilize a Markov-chain Monte Carlo iteration, and the initial steps of this chain are not considered reliable reconstructions of the phylogeny. Trees generated early in the chain are usually discarded as burn-in. The most common method of evaluating nodal support in a Bayesian phylogenetic analysis is to calculate the percentage of trees in the posterior distribution (post-burn-in) which contain the node.

The statistical support for a node in Bayesian inference is expected to reflect the probability that a clade really exists given the data and evolutionary model.[42] Therefore, the threshold for accepting a node as supported is generally higher than for bootstrapping.

Step counting methods

Bremer support counts the number of extra steps needed to contradict a clade.

Shortcomings

These measures each have their weaknesses. For example, smaller or larger clades tend to attract larger support values than mid-sized clades, simply as a result of the number of taxa in them.[43]

Bootstrap support can provide high estimates of node support as a result of noise in
the data rather than the true existence of a clade.[44]

Limitations and workarounds

Ultimately, there is no way to measure whether a particular phylogenetic hypothesis is accurate or not, unless the true relationships among the taxa being examined are already known (which may happen with bacteria or viruses under laboratory conditions). The best result an empirical phylogeneticist can hope to attain is a tree with branches that are well supported by the available evidence. Several potential pitfalls have been identified:

Homoplasy

Main article: Convergent evolution

Certain characters are more likely to evolve convergently than others; logically, such characters should be given less weight in the reconstruction of a tree.[45] Weights in the form of a model of evolution can be inferred from sets of molecular data, so that maximum likelihood or Bayesian methods can be used to analyze them. For molecular sequences, this problem is exacerbated when the taxa under study have diverged substantially. As time since the divergence of two taxa increases, so does the probability of multiple substitutions on the same site, or back mutations, all of which result in homoplasies. For morphological data, unfortunately, the only objective way to determine convergence is by the construction of a tree, a somewhat circular method. Even so, weighting homoplasious characters does indeed lead to better-supported trees.[45] Further refinement can be brought by weighting changes in one direction higher than changes in another; for instance, the presence of thoracic wings almost guarantees placement among the pterygote insects because, although wings are often lost secondarily, there is no evidence that they have been gained more than once.[46]

Horizontal gene transfer

In general, organisms can inherit genes in two ways: vertical gene transfer and horizontal gene transfer. Vertical gene transfer is the passage of genes from parent to offspring, and horizontal (also called lateral) gene transfer occurs when genes
jump between unrelated organisms, a common phenomenon especially in prokaryotes; a good example of this is the acquired antibiotic resistance as a result of gene exchange between various bacteria, leading to multi-drug-resistant bacterial species. There have also been well-documented cases of horizontal gene transfer between eukaryotes.

Horizontal gene transfer has complicated the determination of phylogenies of organisms, and inconsistencies in phylogeny have been reported among specific groups of organisms depending on the genes used to construct evolutionary trees. The only way to determine which genes have been acquired vertically and which horizontally is to parsimoniously assume that the largest set of genes that have been inherited together have been inherited vertically; this requires analyzing a large number of genes.

Hybrids, speciation, introgressions and incomplete lineage sorting

The basic assumption underlying the mathematical model of cladistics is a situation where species split neatly in bifurcating fashion. While such an assumption may hold on a larger scale (bar horizontal gene transfer, see above), speciation is often much less orderly. Research since the cladistic method was introduced has shown that hybrid speciation, once thought rare, is in fact quite common, particularly in plants.[47][48] Also, paraphyletic speciation is common, making the assumption of a bifurcating pattern unsuitable, leading to phylogenetic networks rather than trees.[49][50] Introgression can also move genes between otherwise distinct species and sometimes even genera,[51] complicating phylogenetic analysis based on genes.[52] This phenomenon can contribute to "incomplete lineage sorting" and is thought to be a common phenomenon across a number of groups. In species-level analysis this can be dealt with by larger sampling or better whole-genome analysis.[53] Often the problem is avoided by restricting the analysis to fewer, not closely related specimens.

Taxon sampling

Owing to the development of
advanced sequencing techniques in molecular biology, it has become feasible to gather large amounts of data (DNA or amino acid sequences) to infer phylogenetic hypotheses. For example, it is not rare to find studies with character matrices based on whole mitochondrial genomes (~16,000 nucleotides, in many animals). However, simulations have shown that it is more important to increase the number of taxa in the matrix than to increase the number of characters, because the more taxa there are, the more accurate and more robust is the resulting phylogenetic tree.[54][55] This may be partly due to the breaking up of long branches.

Phylogenetic signal

Another important factor that affects the accuracy of tree reconstruction is whether the data analyzed actually contain a useful phylogenetic signal, a term that is used generally to denote whether a character evolves slowly enough to have the same state in closely related taxa as opposed to varying randomly. Tests for phylogenetic signal exist.[56]

Continuous characters

Morphological characters that sample a continuum may contain phylogenetic signal, but are hard to code as discrete characters. Several methods have been used, one of which is gap coding, and there are variations on gap coding.[57] In the original form of gap coding:[57]

group means for a character are first ordered by size. The pooled within-group standard deviation is calculated, and differences between adjacent means are compared relative to this standard deviation. Any pair of adjacent means is considered different and given different integer scores if the means are separated by a "gap" greater than the within-group standard deviation times some arbitrary constant.

If more taxa are added to the analysis, the gaps between taxa may become so small that all information is lost. Generalized gap coding works around that problem by comparing individual pairs of taxa rather than considering one set that contains all of the taxa.[57]

Missing data

In general, the more data that are
available when constructing a tree, the more accurate and reliable the resulting tree will be. Missing data are no more detrimental than simply having fewer data, although the impact is greatest when most of the missing data are in a small number of taxa. Concentrating the missing data across a small number of characters produces a more robust tree.[58]

The role of fossils

Because many characters involve embryological, or soft-tissue or molecular characters that (at best) hardly ever fossilize, and the interpretation of fossils is more ambiguous than that of living taxa, extinct taxa almost invariably have higher proportions of missing data than living ones. However, despite these limitations, the inclusion of fossils is invaluable, as they can provide information in sparse areas of trees, breaking up long branches and constraining intermediate character states; thus, fossil taxa contribute as much to tree resolution as modern taxa.[59] Fossils can also constrain the age of lineages and thus demonstrate how consistent a tree is with the stratigraphic record;[1] stratocladistics incorporates age information into data matrices for phylogenetic analyses.

See also

List of phylogenetics software
Bayesian network
Bioinformatics
Cladistics
Computational biology
Disk-covering method
Evolutionary dynamics
Microbial phylogenetics
PHYLIP
Phylogenetic comparative methods
Phylogenetic tree
Phylogenetics
Population genetics
Quantitative comparative linguistics
Statistical classification
Systematics
Taxonomy (biology)

References

1. Khalafvand, Tyler (2015). Finding Structure in the Phylogeny Search Space. Dalhousie University.
2. Felsenstein J (2004). Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates. ISBN 978-0-87893-177-4.
3. Mount DM (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press. ISBN 978-0-87969-712-9.
4. Swiderski DL, Zelditch ML, Fink WL (September 1998). Why morphometrics is not
special: coding quantitative data for phylogenetic analysis. Systematic Biology. 47 (3): 508-19. JSTOR 2585256. PMID 12066691.
5. Gaubert P, Wozencraft WC, Cordeiro-Estrela P, Veron G (December 2005). Mosaics of convergences and noise in morphological phylogenies: what's in a viverrid-like carnivoran? Systematic Biology. 54 (6): 865-94. doi:10.1080/10635150500232769. PMID 16282167.
6. Strait DS, Grine FE (December 2004). Inferring hominoid and early hominid phylogeny using craniodental characters: the role of fossil taxa. Journal of Human Evolution. 47 (6): 399-452. doi:10.1016/j.jhevol.2004.08.008. PMID 15566946.
7. Wiens JJ (2001). Character analysis in morphological phylogenetics: problems and solutions. Systematic Biology. 50 (5): 689-99. doi:10.1080/106351501753328811. PMID 12116939.
8. Jenner RA (2001). Bilaterian phylogeny and uncritical recycling of morphological data sets. Systematic Biology. 50 (5): 730-42. doi:10.1080/106351501753328857. PMID 12116943.
9. Sokal R, Michener C (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 38: 1409-1438.
10. Saitou N, Nei M (July 1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 4 (4): 406-25. doi:10.1093/oxfordjournals.molbev.a040454. PMID 3447015.
11. Fitch WM, Margoliash E (January 1967). Construction of phylogenetic trees. Science. 155 (3760): 279-84. Bibcode:1967Sci...155..279F. doi:10.1126/science.155.3760.279. PMID 5334057.
12. Lespinats S, Grando D, Marechal E, Hakimi MA, Tenaillon O, Bastien O (2011). How Fitch-Margoliash Algorithm can Benefit from Multi Dimensional Scaling. Evolutionary Bioinformatics Online. 7: 61-85. doi:10.4137/EBO.S7048. PMC 3118699. PMID 21697992.
13. Day WH (1987). Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology. 49 (4): 461-7. doi:10.1007/BF02458863. PMID 3664032. S2CID 189885258.
14. Hendy MD, Penny D (1982). Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences. 59 (2): 277-290. doi:10.1016/0025-5564(82)90027-X.
15. Ratner VA,
Zharkikh AA Kolchanov N Rodin S Solovyov S Antonov AS 1995 Molecular Evolution Biomathematics Series Vol 24 New York Springer Verlag ISBN 978 3 662 12530 4 Sankoff D Morel C Cedergren RJ October 1973 Evolution of 5S RNA and the non randomness of base replacement Nature 245 147 232 4 doi 10 1038 newbio245232a0 PMID 4201431 a b De Laet J 2005 Parsimony and the problem of inapplicables in sequence data In Albert VA ed Parsimony phylogeny and genomics Oxford University Press pp 81 116 ISBN 978 0 19 856493 5 Wheeler WC Gladstein DS 1994 MALIGN a multiple nucleic acid sequence alignment program Journal of Heredity 85 5 417 418 doi 10 1093 oxfordjournals jhered a111492 Simmons MP June 2004 Independence of alignment and tree search Molecular Phylogenetics and Evolution 31 3 874 9 doi 10 1016 j ympev 2003 10 008 PMID 15120385 De Laet J 2015 Parsimony analysis of unaligned sequence data maximization of homology and minimization of homoplasy not Minimization of operationally defined total cost or minimization of equally weighted transformations Cladistics 31 5 550 567 doi 10 1111 cla 12098 PMID 34772278 S2CID 221582410 Chor B Tuller T June 2005 Maximum likelihood of evolutionary trees hardness and approximation Bioinformatics 21 Suppl 1 i97 106 doi 10 1093 bioinformatics bti1027 PMID 15961504 El Kebir M Oesper L Acheson Field H Raphael BJ June 2015 Reconstruction of clonal trees and tumor composition from multi sample sequencing data Bioinformatics 31 12 i62 70 doi 10 1093 bioinformatics btv261 PMC 4542783 PMID 26072510 Malikic S McPherson AW Donmez N Sahinalp CS May 2015 Clonality inference in multiple tumor samples using phylogeny Bioinformatics 31 9 1349 56 doi 10 1093 bioinformatics btv003 PMID 25568283 Mau B Newton MA 1997 Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo Journal of Computational and Graphical Statistics 6 1 122 131 doi 10 2307 1390728 JSTOR 1390728 Yang Z Rannala B July 1997 Bayesian phylogenetic inference using DNA 
sequences a Markov Chain Monte Carlo Method Molecular Biology and Evolution 14 7 717 24 doi 10 1093 oxfordjournals molbev a025811 PMID 9214744 Kolaczkowski B Thornton JW December 2009 Delport W ed Long branch attraction bias and inconsistency in Bayesian phylogenetics PLOS ONE 4 12 e7891 Bibcode 2009PLoSO 4 7891K doi 10 1371 journal pone 0007891 PMC 2785476 PMID 20011052 Simmons MP 2012 Misleading results of likelihood based phylogenetic analyses in the presence of missing data Cladistics 28 2 208 222 doi 10 1111 j 1096 0031 2011 00375 x PMID 34872185 S2CID 53123024 Larget B July 2013 The estimation of tree posterior probabilities using conditional clade probability distributions Systematic Biology 62 4 501 11 doi 10 1093 sysbio syt014 PMC 3676676 PMID 23479066 a b Ray S Jia B Safavi S van Opijnen T Isberg R Rosch J Bento J 22 August 2019 Exact inference under the perfect phylogeny model arXiv 1908 08623 Bibcode 2019arXiv190808623R a href Template Cite journal html title Template Cite journal cite journal a Cite journal requires journal help Jiang Y Qiu Y Minn AJ Zhang NR September 2016 Assessing intratumor heterogeneity and tracking longitudinal and spatial clonal evolutionary history by next generation sequencing Proceedings of the National Academy of Sciences of the United States of America 113 37 E5528 37 Bibcode 2016PNAS 113E5528J doi 10 1073 pnas 1522203113 PMC 5027458 PMID 27573852 Deshwar AG Vembu S Yung CK Jang GH Stein L Morris Q February 2015 PhyloWGS reconstructing subclonal composition and evolution from whole genome sequencing of tumors Genome Biology 16 1 35 doi 10 1186 s13059 015 0602 8 PMC 4359439 PMID 25786235 a b c d e f Sullivan J Joyce P 2005 Model Selection in Phylogenetics Annual Review of Ecology Evolution and Systematics 36 1 445 466 doi 10 1146 annurev ecolsys 36 102003 152633 PMC 3144157 PMID 20671039 Galtier N Gouy M July 1998 Inferring pattern and process maximum likelihood implementation of a nonhomogeneous model of DNA sequence 
evolution for phylogenetic analysis Molecular Biology and Evolution 15 7 871 9 doi 10 1093 oxfordjournals molbev a025991 PMID 9656487 Fitch WM Markowitz E October 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution Biochemical Genetics 4 5 579 93 doi 10 1007 bf00486096 PMID 5489762 S2CID 26638948 Pol D December 2004 Empirical problems of the hierarchical likelihood ratio test for model selection Systematic Biology 53 6 949 62 doi 10 1080 10635150490888868 PMID 15764562 Abadi S Azouri D Pupko T Mayrose I February 2019 Model selection may not be a mandatory step for phylogeny reconstruction Nature Communications 10 1 934 Bibcode 2019NatCo 10 934A doi 10 1038 s41467 019 08822 w PMC 6389923 PMID 30804347 Bast F 2013 Sequence similarity search Multiple Sequence Alignment Model Selection Distance Matrix and Phylogeny Reconstruction Protocol Exchange doi 10 1038 protex 2013 065 Ruan Y House GL Ekanayake S Schutte U Bever JD Tang H Fox G 26 May 2014 Integration of clustering and multidimensional scaling to determine phylogenetic trees as spherical phylograms visualized in 3 dimensions 2014 14th IEEE ACM International Symposium on Cluster Cloud and Grid Computing IEEE pp 720 729 doi 10 1109 CCGrid 2014 126 ISBN 978 1 4799 2784 5 S2CID 9581901 Baum DA Smith SD 2013 Tree Thinking An Introduction to Phylogenetic Biology Roberts p 442 ISBN 978 1 936221 16 5 Felsenstein J July 1985 Confidence Limits on Phylogenies An Approach Using the Bootstrap Evolution International Journal of Organic Evolution 39 4 783 791 doi 10 2307 2408678 JSTOR 2408678 PMID 28561359 Hillis DM Bull JJ 1993 An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis Systematic Biology 42 2 182 192 doi 10 1093 sysbio 42 2 182 ISSN 1063 5157 Huelsenbeck J Rannala B December 2004 Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex 
substitution models Systematic Biology 53 6 904 13 doi 10 1080 10635150490522629 PMID 15764559 Chemisquy MA Prevosti FJ 2013 Evaluating the clade size effect in alternative measures of branch support Journal of Zoological Systematics and Evolutionary Research 51 4 260 273 doi 10 1111 jzs 12024 hdl 11336 4144 Phillips MJ Delsuc F Penny D July 2004 Genome scale phylogeny and the detection of systematic biases PDF Molecular Biology and Evolution 21 7 1455 8 doi 10 1093 molbev msh137 PMID 15084674 a b Goloboff PA Carpenter JM Arias JS Esquivel DR 2008 Weighting against homoplasy improves phylogenetic analysis of morphological data sets Cladistics 24 5 758 773 doi 10 1111 j 1096 0031 2008 00209 x hdl 11336 82003 S2CID 913161 Goloboff PA 1997 Self Weighted Optimization Tree Searches and Character State Reconstructions under Implied Transformation Costs Cladistics 13 3 225 245 doi 10 1111 j 1096 0031 1997 tb00317 x PMID 34911233 S2CID 196595734 Arnold ML 1996 Natural Hybridization and Evolution New York Oxford University Press p 232 ISBN 978 0 19 509975 1 Wendel JF Doyle JJ 1998 DNA Sequencing In Soltis DE Soltis PS Doyle JJ eds Molecular Systematics of Plants II Boston Kluwer pp 265 296 ISBN 978 0 19 535668 7 Funk DJ Omland KE 2003 Species level paraphyly and polyphyly Frequency causes and consequences with insights from animal mitochondrial DNA Annual Review of Ecology Evolution and Systematics 34 397 423 doi 10 1146 annurev ecolsys 34 011802 132421 S2CID 33951905 Genealogy of Life GoLife National Science Foundation Retrieved 5 May 2015 The GoLife program builds upon the AToL program by accommodating the complexity of diversification patterns across all of life s history Our current knowledge of processes such as hybridization endosymbiosis and lateral gene transfer makes clear that the evolutionary history of life on Earth cannot accurately be depicted for every branch of the tree as a single typological bifurcating tree Kutschera VE Bidon T Hailer F Rodi J Fain SR 
Janke A 2014 Bears in a forest of gene trees phylogenetic inference is complicated by incomplete lineage sorting and gene flow Molecular Biology and Evolution 31 8 2004 2017 doi 10 1093 molbev msu186 PMC 4104321 PMID 24903145 Qu Y Zhang R Quan Q Song G Li SH Lei F December 2012 Incomplete lineage sorting or secondary admixture disentangling historical divergence from recent gene flow in the Vinous throated parrotbill Paradoxornis webbianus Molecular Ecology 21 24 6117 33 Bibcode 2012MolEc 21 6117Q doi 10 1111 mec 12080 PMID 23095021 S2CID 22635918 Pollard DA Iyer VN Moses AM Eisen MB October 2006 Widespread discordance of gene trees with species tree in Drosophila evidence for incomplete lineage sorting PLOS Genetics 2 10 e173 doi 10 1371 journal pgen 0020173 PMC 1626107 PMID 17132051 Zwickl DJ Hillis DM August 2002 Increased taxon sampling greatly reduces phylogenetic error Systematic Biology 51 4 588 98 doi 10 1080 10635150290102339 PMID 12228001 Wiens JJ February 2006 Missing data and the design of phylogenetic analyses Journal of Biomedical Informatics 39 1 34 42 doi 10 1016 j jbi 2005 04 001 PMID 15922672 Blomberg SP Garland T Ives AR April 2003 Testing for phylogenetic signal in comparative data behavioral traits are more labile Evolution International Journal of Organic Evolution 57 4 717 45 doi 10 1111 j 0014 3820 2003 tb00285 x PMID 12778543 S2CID 221735844 a b c Archie JW 1985 Methods for coding variable morphological features for numerical taxonomic analysis Systematic Zoology 34 3 326 345 doi 10 2307 2413151 JSTOR 2413151 Prevosti FJ Chemisquy MA 2009 The impact of missing data on real morphological phylogenies Influence of the number and distribution of missing entries Cladistics 26 3 326 339 doi 10 1111 j 1096 0031 2009 00289 x hdl 11336 69010 PMID 34875786 S2CID 86850694 Cobbett A Wilkinson M Wills MA October 2007 Fossils impact as hard as living taxa in parsimony analyses of morphology Systematic Biology 56 5 753 66 doi 10 1080 10635150701627296 
Further reading

Semple C, Steel M (2003). Phylogenetics. Oxford University Press. ISBN 978-0-19-850942-4.
Cipra BA (2007). "Algebraic Geometers See Ideal Approach to Biology" (PDF). SIAM News. 40 (6). Archived from the original (PDF) on 3 March 2016.
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007). "Section 16.4. Hierarchical Clustering by Phylogenetic Trees". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8. Archived from the original on 11 August 2011. Retrieved 17 August 2011.
Huson DH, Rupp R, Scornavacca C (2010). Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press. ISBN 978-1-139-49287-4.

Retrieved from https://en.wikipedia.org/w/index.php?title=Computational_phylogenetics&oldid=1213322627
