## Clusters and tree structure from genealogical data

Modern biology provides a wealth of interesting mathematical challenges in the modeling and reconstruction of evolution. A new eprint explores the theoretical prospects for defining a phylogenetic tree structure, despite complications like lateral gene transfer, hybridization, and the difference between gene and species trees:

A. Dress, et al. Species, Clusters and the ‘Tree of Life’: A graph-theoretic perspective. eprint: 0908.2885

Abstract: A hierarchical structure describing the inter-relationships of species has long been a fundamental concept in systematic biology, from Linnean classification through to the more recent quest for a ‘Tree of Life.’ In this paper we use an approach based on discrete mathematics to address a basic question: Could one delineate this hierarchical structure in nature purely by reference to the ‘genealogy’ of present-day individuals, which describes how they are related with one another by ancestry through a continuous line of descent? We describe several mathematically precise ways by which one can naturally define collections of subsets of present day individuals so that these subsets are nested (and so form a tree) based purely on the directed graph that describes the ancestry of these individuals. We also explore the relationship between these and related clustering constructions.

The starting point is an extremely large graph with all individuals that ever lived as nodes and (directed) edges from parents to their children. That is, the graph is $G = (V,E)$ with $V$ containing all individuals and $(u,v)\in E$ if $v$ directly inherited genetic material from $u$. Dress et al. address the following question: Using only this graph, and in particular without recourse to any species concept, how can we obtain a tree structure? This question is general enough to require a framework that can handle lateral gene transfer and hybridization, as well as the more normal (to an H. sapiens) transmission of genetic material from parents to children. The authors show how different types of clusters can be defined on a subset of extant or observed individuals $X \subseteq V$ using the genealogical data in $G$. The graph-theoretical notions involved are technical enough to make the paper a slow read, and I haven’t understood everything in enough detail to provide a summary more concise than the paper itself. Instead, here’s a bit of discussion on the extent to which the tree structure survives discoveries of lateral gene transfer:

“For prokaryotes, where a tree structure is most vigorously called into question, the concept of a tree is still well defined, but it may indeed be poorly resolved (depending on the type of cluster considered, and the extent to which a LGT event from individual x to y might be counted as an arc in G from x to y – for example, one could indicate all such instances or just those for which the gene transfers survives to a present copy). In cases where LGT (and other types of reticulate evolution) are extensive and on-going, then set systems such as weak hierarchies may give a more informative picture of evolution than a tree. We have described one way to generate such a hierarchy above, but it may be useful to explore other approaches.”
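
To make the setup a little more concrete, here is a minimal Python sketch of the kind of object involved: a toy genealogy $G = (V,E)$ stored as a list of parent-to-child arcs, a set $X$ of extant individuals, and one simple way to read clusters off the graph. The "descendant clusters" used here (the extant descendants of each individual) are my own illustrative stand-in and are not claimed to match any of the cluster types defined in the paper; the function names and the toy data are likewise hypothetical. The two checks use the standard definitions: a hierarchy is a set system in which any two sets are nested or disjoint, and a weak hierarchy is one in which the intersection of any three sets equals the intersection of some two of them.

```python
from itertools import combinations


def descendants(children, start):
    """Return all nodes reachable from `start` along parent -> child arcs,
    including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def descendant_clusters(edges, extant):
    """Illustrative cluster notion (an assumption, not the paper's definition):
    for every individual v, collect the extant individuals descended from v."""
    children = {}
    nodes = set(extant)
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))
    clusters = set()
    for v in nodes:
        cluster = frozenset(descendants(children, v) & extant)
        if cluster:  # ignore lineages with no extant descendants
            clusters.add(cluster)
    return clusters


def is_hierarchy(clusters):
    """True if any two clusters are nested or disjoint (so they form a tree)."""
    return all(a <= b or b <= a or not (a & b)
               for a, b in combinations(clusters, 2))


def is_weak_hierarchy(clusters):
    """True if, for any three clusters, their common intersection equals
    the intersection of some two of them."""
    return all((a & b & c) in (a & b, a & c, b & c)
               for a, b, c in combinations(clusters, 3))


# Toy data (hypothetical): three extant individuals x1, x2, x3.
extant = {"x1", "x2", "x3"}

# Purely tree-like descent: every individual has a single parent.
tree_edges = [("r", "a"), ("r", "b"), ("a", "x1"), ("a", "x2"), ("b", "x3")]

# Same genealogy, but x2 also inherits material from b
# (a hybridization or lateral-transfer arc).
hybrid_edges = tree_edges + [("b", "x2")]

tree_clusters = descendant_clusters(tree_edges, extant)
hybrid_clusters = descendant_clusters(hybrid_edges, extant)

print(is_hierarchy(tree_clusters))         # True
print(is_hierarchy(hybrid_clusters))       # False
print(is_weak_hierarchy(hybrid_clusters))  # True
```

In the tree-like genealogy the descendant clusters are nested, so they form a hierarchy. Adding the single extra arc into `x2` makes the clusters of `a` and `b` overlap in `x2` without either containing the other, so this naive notion no longer gives a tree, although the clusters still form a weak hierarchy. This is only meant to illustrate why the choice of cluster definition matters once reticulate events enter the graph.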