The most important and most studied content sensor is the one that predicts coding
regions. An extensive review of computational methods for detecting coding regions is
given by Fickett and Tung [23] (see also [20,21]). In prokaryotes,
it is still common to locate genes simply by looking for long open
reading frames (ORFs); this is certainly not adequate for higher eukaryotes.
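The ORF heuristic mentioned above can be sketched in a few lines. This is a minimal illustration, not a method taken from the cited literature: the function name `find_orfs`, the `min_codons` threshold, and the choice to require an ATG start codon are all assumptions made for the example, and only the forward strand is scanned.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    `min_codons` (a hypothetical threshold) counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A full scan would also run the same function on the reverse complement of the sequence; that step is omitted here to keep the sketch short.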
To discriminate
coding from non-coding regions in eukaryotes, exon content sensors often
use in-frame hexamer counts
or, what is nearly equivalent, a set of three fifth-order Markov models, one for each
of the three nucleotide positions within a codon, as pioneered in the gene finder GeneMark
[7]. It is also
important to consider local compositional biases, since codon preferences differ
considerably between genes in G+C rich regions and genes in A+T rich regions [55,18,7].
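The idea of three position-specific fifth-order Markov models can be sketched as a log-odds score. The sketch below is an illustration under stated assumptions, not GeneMark's actual implementation: the names `train`, `log_prob`, and `coding_score` are hypothetical, training coding sequences are assumed to start at the first position of a codon, a single non-periodic fifth-order model stands in for non-coding DNA, sequences are assumed to contain only A, C, G, and T, and add-one smoothing is an arbitrary choice. Each context plus the base that follows it is a hexamer, which is why this is close to counting in-frame hexamers.

```python
import math
from collections import defaultdict

ORDER = 5  # fifth-order: each base is conditioned on the previous five

def train(seqs, periodic):
    """Count (5-base context, next base) pairs.

    With periodic=True, three tables are kept, one per codon position
    (training sequences are assumed to start in frame); otherwise one table.
    """
    n_phase = 3 if periodic else 1
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_phase)]
    for s in seqs:
        for i in range(ORDER, len(s)):
            counts[i % n_phase][s[i - ORDER:i]][s[i]] += 1
    return counts

def log_prob(counts, seq, frame=0, pseudo=1.0):
    """Log-probability of seq under the model, with add-`pseudo` smoothing."""
    n_phase = len(counts)
    total = 0.0
    for i in range(ORDER, len(seq)):
        table = counts[(i - frame) % n_phase].get(seq[i - ORDER:i], {})
        num = table.get(seq[i], 0) + pseudo
        den = sum(table.values()) + 4 * pseudo
        total += math.log(num / den)
    return total

def coding_score(window, coding, noncoding):
    """Best log-odds (coding vs. non-coding) over the three possible frames."""
    null = log_prob(noncoding, window)
    return max(log_prob(coding, window, f) - null for f in range(3))
```

A positive score favors the coding model in at least one frame. To reflect local compositional bias, one could train separate pairs of models on G+C rich and A+T rich training sets and pick the pair matching the window's composition; that refinement is not shown.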
While many other measures of coding potential have been investigated (Fickett tested 19
different measures taken from the literature [21]), few have
proved as effective.
However, combinations of several measures can be effective,
as in the popular GRAIL exon detector, in which several coding measures are combined
along with base composition and signal sensor output for flanking splice sites, and fed into a neural
net to predict exons [71].
Other content sensors include sensors for CpG islands, regions often
found near the beginnings of genes in which the
dinucleotide CG is not as depleted as it typically is in the rest of the genome [4,25,47],
and sensors for repetitive DNA, such as Alu sequences [36,35,51].
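One common way to quantify how depleted CG is in a window is the ratio of observed CG dinucleotides to the number expected from the C and G counts alone. The sketch below assumes that measure; the function names and the default thresholds (`min_gc`, `min_obs_exp`) are illustrative assumptions, not values taken from the references cited here.

```python
def cpg_stats(window):
    """Return (G+C fraction, observed/expected CpG ratio) for a window.

    Expected CpG count is (#C * #G) / length, so the ratio is
    (#CG * length) / (#C * #G).
    """
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cg = w.count("CG")
    gc_frac = (c + g) / n if n else 0.0
    obs_exp = cg * n / (c * g) if c and g else 0.0
    return gc_frac, obs_exp

def looks_like_cpg_island(window, min_gc=0.5, min_obs_exp=0.6):
    """Flag a window whose G+C content and CpG ratio both exceed the
    (assumed, illustrative) thresholds."""
    gc_frac, obs_exp = cpg_stats(window)
    return gc_frac > min_gc and obs_exp > min_obs_exp
```

In practice such a test would be applied to sliding windows of a few hundred bases, with overlapping positive windows merged into candidate islands.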
Sensors for repetitive DNA are often
used as masks or filters that completely remove the repetitive DNA, leaving only the remaining DNA to be
analyzed.
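A masking step of this kind can be sketched as follows. The intervals are assumed to come from some upstream repeat sensor that is not shown; the function name `mask_intervals` and the use of `N` as the mask character are assumptions for the example. Replacing bases with `N` rather than deleting them is a design choice that keeps the coordinates of the remaining DNA unchanged.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each half-open interval [start, end) of seq with mask_char.

    `intervals` is a list of (start, end) pairs, 0-based, as might be
    reported by a (hypothetical) repeat sensor. Out-of-range ends are clipped.
    """
    chars = list(seq)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)
```

Downstream content sensors then need to skip or down-weight windows that contain the mask character.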
David Haussler
10/14/1998