Next: Acknowledgments Up: No Title Previous: Integrated Gene Finding Methods

Discussion

It is important to distinguish two different goals in genefinding research. The first goal is to provide computational methods to aid in the annotation of the large volume of genomic data that is produced by genome sequencing efforts. The second goal is to provide a computational model to help elucidate the mechanisms involved in transcription, splicing, polyadenylation and other critical processes in the pathway from genome to proteome. While there is some overlap in these goals, there is also some conflict. No one computational genefinding approach will be optimal for both goals.

A "purist" system that mimics the cellular processes cannot take advantage of homologies with other proteins and matches to EST sequences when deciding where to splice. It presumably should not use codon statistics, frame consistency between exons, or lack of in-frame stop codons to predict overall gene structure, although there is some evidence that absence of early in-frame stop codons may be involved in biological start site selection [39]. One would think that these restrictions would completely cripple computational genefinding methods; however, Guigó has shown that just using simple weight matrices to find the best combination of splice site signals and translation start and stop signals, along with the standard syntactic constraints on gene structure (frame consistency, no in-frame stop codons, minimum intron size), gives results on his benchmark data set that are comparable to those obtained by most of the genefinders he and Burset tested in 1995 [31]. These results are not competitive with the older genefinders that use protein homology, nor with the newer methods that use exon coding potential but not homology, but they nevertheless indicate a surprising potential for purist genefinding models. More detailed models of the splicing process, the selection of translation start and the process of polyadenylation may significantly improve such purist models.
These models may prove useful in human genome annotation for finding rapidly evolving and rarely expressed genes, especially those with unusual codon usage. However, if we simply want to produce genefinders that give the most reliable annotation in "everyday" genome center annotation efforts, it is clear that more work needs to be done to incorporate EST information along with protein homology and powerful statistical models.
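
To make the two ingredients of the purist approach concrete, the following is a minimal sketch of a log-odds weight matrix for scoring a signal such as a donor splice site, together with a check of the syntactic constraints named above (frame consistency, no in-frame stop codons, minimum intron size). It is an illustration under stated assumptions, not a reconstruction of the method in [31]: the function names, the toy training sites, the uniform background, the pseudocount and the tiny intron threshold are all hypothetical choices made for the example.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}

def log_odds_matrix(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length example sites.

    Assumes a uniform background frequency; real systems estimate this from data.
    """
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return matrix

def score_site(matrix, window):
    """Sum the per-position log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, window))

def spliced_cds(seq, exons):
    """Concatenate exons given as half-open (start, end) intervals, in order."""
    return "".join(seq[start:end] for start, end in exons)

def obeys_syntax(seq, exons, min_intron=4):
    """Check minimum intron size, frame consistency and in-frame stop codons.

    min_intron=4 is a toy value chosen only so the short example below works.
    """
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start - prev_end < min_intron:
            return False
    cds = spliced_cds(seq, exons)
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # The only stop codon allowed is the final one.
    return codons[-1] in STOP_CODONS and not any(c in STOP_CODONS for c in codons[:-1])

# Toy data: three made-up donor-like sites and an 18-base "gene".
donor_matrix = log_odds_matrix(["GTAAGT", "GTGAGT", "GTAAGA"])
toy_seq = "ATGGC" + "GTAAGTCAG" + "TTAA"   # exon, intron, exon
toy_exons = [(0, 5), (14, 18)]
```

In a full purist genefinder, scores of this kind for candidate splice, start and stop sites would be combined, and the highest-scoring combination that passes the syntactic check would be reported as the predicted gene structure; the search over combinations is omitted here.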

There are other key issues that will affect future research in both of the above computational genefinding paradigms. One is the issue of alternative splicing. No currently available genefinders handle alternative splicing in an effective manner. Intimately tied to this issue is that of gene regulation. The abundant regulatory signals flanking genes, and appearing in introns (and sometimes in exons [52]), combined with regulatory proteins specific to the cell type and cell state, determine the expression of the gene. Gene annotation is not complete until these signals are identified, and the cellular conditions that give rise to differing expression levels for different transcripts are worked out. This implies, among other things, that future genefinders will need to explicitly take into account experimental data relating to differential expression, along with the other types of data we have discussed (see e.g. [38]). It may be anticipated that this task will occupy genefinding researchers for some years to come.


David Haussler
10/14/1998