It is important to distinguish two different goals in genefinding research.
The first goal is to provide computational methods to aid in the
annotation of the large volume of genomic data that is produced by
genome sequencing efforts. The second goal is to provide a computational
model to help elucidate the mechanisms involved in transcription, splicing,
polyadenylation and other critical processes in the pathway from genome to proteome.
While there is some overlap in these goals, there is also some conflict. No one
computational genefinding approach will be optimal for both goals. A "purist" system that
mimics the cellular processes cannot take advantage of homologies with other proteins
and matches to EST sequences when deciding where to splice. It presumably should not
use codon statistics,
frame consistency between exons, or lack
of in-frame stop codons to predict overall gene structure, although there is
some evidence that absence of early in-frame stop codons may be involved in biological start site selection
[39]. One would think that these restrictions would completely cripple
computational genefinding methods; however, Guigó has shown that just using simple weight matrices
to find the best combination of splice site signals, translation start and stop signals, along with
the standard syntactic constraints on gene structure (frame consistency, no in-frame stop
codons, minimum intron size), gives results on his benchmark data set that are
comparable to those obtained by most of the genefinders he and Burset tested in 1995 [31].
These results are not competitive with the older genefinders that use protein homology,
nor with the newer methods that use exon coding potential but not
homology, but they nevertheless indicate a surprising potential for purist
genefinding models. More detailed models of the splicing process, the selection
of translation start and the process of polyadenylation may significantly
improve such purist models.
These models may prove useful in human genome annotation for finding
rapidly evolving and rarely expressed genes, especially those with
unusual codon usage.
However, if we simply want to produce genefinders that
give the most reliable annotation in "everyday" genome
center annotation efforts, it is clear that more work needs to
be done to incorporate EST information along with protein homology and powerful
statistical models.
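The purist approach described above (weight matrices for signals, combined with syntactic constraints on gene structure) can be illustrated with a minimal sketch. This is not Guigó's implementation: the function names, the pseudocount, the uniform background frequency, and the minimum intron length of 60 are all assumptions chosen for illustration only.

```python
import math

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def build_log_odds_matrix(sites, background=0.25, pseudocount=1.0):
    """Estimate a per-position log-odds weight matrix from aligned example sites.

    The uniform background and the pseudocount are illustrative assumptions.
    """
    matrix = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log2((counts[b] / total) / background) for b in BASES}
        )
    return matrix


def score_site(matrix, window):
    """Sum the per-position log-odds scores for a candidate signal window."""
    return sum(column[base] for column, base in zip(matrix, window))


def satisfies_syntax(seq, exons, min_intron=60):
    """Check the syntactic constraints on a candidate gene structure.

    exons: sorted list of half-open (start, end) coordinates into seq.
    min_intron=60 is a hypothetical threshold, not a value from the text.
    """
    # Minimum intron size between consecutive exons.
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        if next_start - prev_end < min_intron:
            return False
    coding = "".join(seq[start:end] for start, end in exons)
    # Frame consistency: the spliced coding sequence must be whole codons.
    if len(coding) < 6 or len(coding) % 3 != 0:
        return False
    codons = [coding[i:i + 3] for i in range(0, len(coding), 3)]
    # Translation start and stop signals at the ends.
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # No in-frame stop codons before the final one.
    return not any(codon in STOP_CODONS for codon in codons[:-1])
```

A full purist genefinder would search over all combinations of high-scoring signals for the best-scoring structure that passes these checks; the sketch shows only the scoring and the constraint test for one candidate.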
There are other key issues that will affect future research in both of the
above computational genefinding paradigms. One is the issue of alternative splicing.
No currently available genefinders handle alternative splicing in an effective
manner. Intimately tied with this issue is that of gene regulation. The abundant
regulatory signals flanking genes, and appearing in introns (and sometimes in exons
[52]), combined with
regulatory proteins specific to the cell type and cell state, determine
the expression of the gene. Gene annotation is not complete until these signals are
identified, and the cellular conditions that give rise to differing expression
levels for different transcripts are worked out. This implies, among
other things, that future
genefinders will need to explicitly take into account experimental data relating
to differential expression, along with the other types of data we have discussed
(see, e.g., [38]).
It may be anticipated that this task will occupy genefinding researchers
for some years to come.
David Haussler
10/14/1998