Next: Discussion Up: No Title Previous: Content Sensors

Integrated Gene Finding Methods

Signal and content sensors alone cannot solve the genefinding problem. The statistical signals they are trying to recognize are too weak [1], and there are dependencies between signals and contents that they cannot capture [11], such as the possible correlation between splice site strength and exon size [78]. During the last five years, a number of systems have been developed that combine signal and content sensors to try to identify complete gene structure. Such systems are capable, in principle, of handling more complex interdependencies between gene features. A linguistic metaphor is sometimes applied here, likening the process of breaking down a sequence of DNA into genes, each of which is a series of exons and introns, to the process of parsing a sentence by breaking it down into its constituent grammatical parts. Indeed this parsing metaphor can be pushed deeper. Searls[60,16] was the first major proponent of describing gene structure in linguistic terms using a formal grammar. His genefinding program, GenLang, was one of the earliest integrated genefinders, following on the pioneering work of Gelfand [26], Gelfand and Roytberg [30], Fields and Soderlund [24], and Phil Green's GeneFinder[69], and was one of the inspirations for significant later work (e.g. [6,5] and the HMM methods described below.)

Nearly all integrated genefinders use dynamic programming to combine candidate exons and other scored regions and sites into an complete gene prediction with maximal total score. A brief and lucid tutorial on this topic can be found in [41] and a more detailed exposition in [17]. Gelfand, et al, proposed a dynamic programming scheme, embodied in the genefinder GREAT[29], that calculates the set of all so-called Pareto-optimal gene structure predictions, which include the optimal predictions for a wide variety of different scoring functions. Dynamic programming methods are also used in Grail II [73], GeneParser [62], FGENEH [63], and recent versions of GeneID [31].

Dynamic programming methods find the candidate gene structure with the best overall score. The key to success in these methods is developing the right score function. A fruitful approach here has been to define a statistical model of genes that includes parameters describing codon dependencies in exons, characteristics of splice sites (e.g. the parameters of a weight matrix for splice sites), as well as ``linguistic" information on what functional features are likely to follow other features (see Figure 1). In this approach the observed DNA sequences are actually modeled as if they were manifestations of a stochastic process that generates gene-containing DNA. This process includes a latent (or ``hidden") variable associated with each nucleotide that represents the functional role or position of that nucleotide, e.g. a G residue might be part of a GT consensus donor splice site or it might be in the third position of a start codon. Taken together, the states of these hidden variables define a candidate gene structure. The linguistic rules for what functional features follow what other features are expressed by the parameters of a Markov process on the hidden variables. For this reason, these models are called hidden Markov models, or HMMs. Because a Markov process is just a finite state machine with probabilities on the state transitions, genefinding HMMs are merely a stochastic version of the genefinding finite state machines (regular grammars) introduced by Searls.


 
Figure: A simplified diagram representing the liguistic rules for what might follow what when parsing a sequence consisting of a multiple exon gene. The arcs represent contents and the nodes represent signals. The contents are J5' : 5' UTR, EI : Initial Exon, E : Exon, I : Intron, E : Internal Exon, EF: Final Exon, ES : Single Exon, and J3' : 3' UTR. The signals are B : Begin sequence, S : Start Translation, D : Donor splice site, A : Acceptor splice site, T : Stop Translation, F : End sequence. A candidate gene structure is created by tracing a path in this figure from B to F. An HMM (GHMM) is defined by attaching stochastic models to each of the arcs and nodes. Figure taken from [44].  
\begin{figure*}
\begin{center}
\parbox{0pt}{
\epsfxsize=0.75\textwidth
\epsfbox{genesigHMM.eps}
}\end{center} \end{figure*}

The advantage of HMMs is that, being probabilistic models, they define a natural score function. Let X denote the DNA sequence, Q denote a possible sequence of hidden states, one for each nucleotide in X, and $\theta$ denote the parameters of the HMM. Since Q represents a candidate gene structure for X, to find the genes in X, we want to find the Q that is most likely given the sequence X, i.e., we want to find the Q that maximizes $P(Q\vert X,\theta)$, the probability of the gene structure Q given the DNA sequence X and the parameters $\theta$. Equivalently, we can maximize $\log P(Q\vert X,\theta)$. This is the score function that is optimized in a genefinding HMM. It can be optimized using standard dynamic programming methods.

Early genfinding HMMs were EcoParse (for E. coli [42], also recently used in the annotation of the M. Tuberculosis genome [14]) and Xpound (for human) [70]. More recent programs are GeneMark-HMM (for bacterial genomes) [48] Veil [33] and HMMgene (for human) [41]. A somewhat more general class of probabilistic models, called generalized HMMs (GHMMs) or (hidden) semi-Markov models, have their roots in GeneParser [62], and were more fully developed in Genie [44,57,45] and then GenScan [9] (see also [72]).

The probabilistic approach has further advantages. For example, for any given feature, such as a 5' splice site, and any position in the DNA sequence X, we can calculate the probability that that feature occurs at that position. If we do this for separately for each feature of our overall predicted gene structure, then this gives us a kind of individual ``confidence" value for each part of our prediction. GeneParser [62] pioneered this methodology (see further theoretical discussion in [68]), and it is used to give highly accurate confidence values for predicted exons in Genscan [9]. In addition, the probabilistic formulation provides various new ways to estimate the parameters $\theta$ of the gene-finding model. Given a large ``training" DNA contig (or set of contigs) X and its correct state sequence annotation Q, we can find $\theta$ to maximize $P(X,Q\vert\theta)$(the maximum likelihood approach), $P(\theta\vert X,Q)$ (the maximum a posteriori approach), or $P(Q\vert X,\theta)$ (the conditional maximum likelihood approach) [40]. It is even possible to estimate the parameters $\theta$from partially annotated training sequences using the expectation-maximization method [17].

So far we have focused on genefinders that predict gene structure based only on general features of genes, rather than using explicit comparisons to other, previously known genes, or auxiliary information such as expressed sequence tag (EST) matches. One way to include information about previously known genes is to use the database of known proteins as a basis for gene prediction. Current state-of-the-art genefinding systems combine multiple statistical measures with database homology searches, obtained by translating the DNA to protein in all possible reading frames, and then searching the protein databases for similar protein sequences. Examples are Genie [45], GeneID+ [10], GeneParser3 [62], and recent versions of Grail [75]. The program AAT [34] and new versions of Grail also take into account EST information [74]. Database homology has long been used as a post hoc method to validate gene predictions, but these systems were among the first to integrate database homology directly into the genefinding algorithm itself. This approach has been taken to its extreme limit in a genefinding program developed by Gelfand, Mironov, and Pevzner[28]. This system, called Procrustes, requires the user to provide a close protein homolog of the gene to be predicted. Then a ``spliced alignment'' algorithm, similar to a Smith-Waterman[61] alignment, is used to derive a putative gene structure by aligning the DNA to the homolog. The major disadvantage to this method is the requirement of a close homolog. It is often the case that homologs are unknown or are remote, in which case this system would be inappropriate. Nevertheless, in the presence of a very close homolog, Procrustes is an extremely effective gene finding method. Recent related methods, based on HMM models, have been developed by Birney and Durbin [5] and are currently being developed by Kulp [43].

In 1995, a number of different integrated genefinders were tested on a benchmark set of 570 vertebrate genes by Burset and Guigó [10]. They looked at not only how many bases were predicted correctly as either coding or non-coding, but how many exons were predicted exactly, with both splice sites located correctly. In the former case, accuracy was about 75-80%. In the latter it was about 40-60%. These numbers are for systems that do not employ protein database homology searches. When database homology is employed, the upper limit for the accuracy increases about 10% in both categories. Integrated eukaryotic genefinding systems based on HMM and GHMM models, starting with Genie, and followed by Veil, Genscan and HMMgene have pushed beyond these early performance numbers, with the latter two programs now obtaining upwards of 90% accuracy at the level of individual nucleotides and 80% for exact exon prediction, without the use of database homologies. A new category of completely correct gene prediction has been added to the list of performance measurements, and Genscan achieves an accuracy of about 40% on the Burset and Guigó dataset in this category [9]. Tests have also been conducted on the identification of promoters, showing that the accuracy of currently available methods is much lower on this task [22].

The currently available genefinding performance results must be approached with extreme caution. The primary reason is that they depend very strongly on the difficulty of the genes in the test set, and for some genefinders, on the homology overlap between the genes in the test set and those in the training set that is used to optimize the parameters of the models [31,41]. The latter is a factor even when no homology is explicitly used by the genefinding method. To avoid this problem, it is best to compare genefinders by training and testing on the same genes, and to avoid homologies between genes used for training and testing. Reese has constructed benchmark sets for human and for Drosophila genes of this type that are randomly partitioned into specified parts for use in cross-validated train-test experiments. These have been used by Genie, Genscan and HMMgene (See LBNL site). Reese's human dataset is a bit harder than the original Burset and Guigó dataset as well, so genefinding programs get overall lower scores on it. Furthermore, the variance in performance from one train-test partition to another is quite high, since some parts by chance ended up with more ``hard-to-predict" genes (usually genes with many exons and or long introns) than others. This graphically demonstrates the unreliability of the currently available genefinding performance figures: if by chance a different set of human genes had been included in Genbank, the numbers would have been quite different, and probably lower, since Genbank is biased towards genes with fewer exons and shorter introns. We need a much larger sample of human genes before we can get stable performance numbers.

Reese's datasets, like those of Burset and Guigó, contain exactly one gene per sequence. Little is known about the accuracy of genefinders on large genomic sequences containing multiple genes. Some harder and more realistic human genomic data, consisting of large annotated contigs, is available; see URLs. The Sanger center also proposes a standardized format, Gene Finding Format or GFF, for both gene annotation and comparing the results of various genefinders. It would greatly aid the maturation of this field if we could agree on a simple standard data interchange format like this. Once this is established, we could then share a set of tools for the display, comparison, analysis and combination of different gene predictions, along with auxiliary sequence annotation.


Next: Discussion Up: No Title Previous: Content Sensors
David Haussler
10/14/1998