
Content Sensors

The most important and most studied content sensor is the one that predicts coding regions. Fickett and Tung give an extensive review of computational methods for detecting coding regions [23] (see also [20,21]). In prokaryotes, it is still common to locate genes by simply looking for long open reading frames (ORFs); this is certainly not adequate for higher eukaryotes, where the coding sequence of a gene is typically interrupted by introns. To discriminate coding from non-coding regions in eukaryotes, exon content sensors often use in-frame hexamer counts or, what is nearly equivalent, a set of three fifth-order Markov models, one for each of the three nucleotide positions within a codon, as pioneered in the genefinder GeneMark [7]. It is also important to consider local compositional biases, since codon preferences differ considerably between genes in G+C-rich regions and genes in A+T-rich regions [55,18,7]. Many other measures of coding potential have been investigated (Fickett tested 19 different measures taken from the literature [21]), but few have proved as effective. Combinations of several measures can nevertheless be effective, as in the popular GRAIL exon detector, which combines several coding measures with base composition and signal sensor output for flanking splice sites, and feeds them into a neural net to predict exons [71].
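As a rough illustration of the in-frame hexamer idea, the following Python sketch builds a log-odds table from a set of known coding sequences and a background sequence, then scores a candidate region in a chosen reading frame. It is a minimal sketch under stated assumptions, not the method used by GeneMark or any other program cited here: the function names, the add-one pseudocount, and the use of all overlapping hexamers as the non-coding background are choices made only for this example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def hexamer_counts(seq, start=0, step=1):
    """Count hexamers beginning at positions start, start+step, ...
    Hexamers containing symbols other than A, C, G, T are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(start, len(seq) - 5, step):
        hexamer = seq[i:i + 6]
        if set(hexamer) <= set(ALPHABET):
            counts[hexamer] += 1
    return counts

def log_odds_table(coding_counts, background_counts, pseudocount=1.0):
    """Log ratio of smoothed hexamer frequencies, coding versus background.
    The pseudocount (an assumption of this sketch) avoids division by zero."""
    n_hexamers = 4 ** 6
    coding_total = sum(coding_counts.values()) + pseudocount * n_hexamers
    background_total = sum(background_counts.values()) + pseudocount * n_hexamers
    table = {}
    for hexamer in map("".join, product(ALPHABET, repeat=6)):
        p_coding = (coding_counts[hexamer] + pseudocount) / coding_total
        p_background = (background_counts[hexamer] + pseudocount) / background_total
        table[hexamer] = math.log(p_coding / p_background)
    return table

def coding_score(seq, table, frame=0):
    """Sum the log-odds of the hexamers that start on codon boundaries
    of the given frame (0, 1, or 2). Unknown hexamers contribute 0."""
    seq = seq.upper()
    return sum(table.get(seq[i:i + 6], 0.0)
               for i in range(frame, len(seq) - 5, 3))

def train(coding_seqs, background_seq):
    """Build a table from in-frame coding hexamers (frame 0 assumed)
    and from all overlapping hexamers of the background sequence."""
    coding = Counter()
    for s in coding_seqs:
        coding.update(hexamer_counts(s, start=0, step=3))
    background = hexamer_counts(background_seq, start=0, step=1)
    return log_odds_table(coding, background)
```

A positive score suggests that the region resembles the coding training set in that frame more than it resembles the background; scoring all three frames and comparing them gives a crude frame prediction. To account for the compositional biases mentioned above, one could train separate tables on G+C-rich and A+T-rich sequences, though that refinement is not shown here.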

Other content sensors include sensors for CpG islands, regions that often occur near the beginnings of genes and in which the frequency of the dinucleotide CG is not as low as it typically is in the rest of the genome [4,25,47], and sensors for repetitive DNA, such as Alu sequences [36,35,51]. The latter are often used as masks or filters that completely remove the repetitive DNA, leaving the remaining DNA to be analyzed.
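A simple CpG island sensor can be sketched as a sliding window that compares the observed number of CG dinucleotides with the number expected from the window's C and G counts. The Python sketch below is illustrative only: the function names are hypothetical, and the default thresholds (a 200-base window, G+C fraction of at least 0.5, observed/expected ratio of at least 0.6) follow commonly quoted criteria and are assumptions of this example, not values taken from the text above.

```python
def cpg_window_stats(window):
    """Return (G+C fraction, observed/expected CpG ratio) for one window.
    Expected CpG count is (#C * #G) / length, so the ratio is
    (#CG * length) / (#C * #G); it is reported as 0.0 if C or G is absent."""
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cg = window.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def cpg_island_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Return start positions of windows meeting both thresholds.
    Thresholds are assumptions of this sketch and should be tuned."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc_fraction, obs_exp = cpg_window_stats(seq[start:start + window])
        if gc_fraction >= min_gc and obs_exp >= min_obs_exp:
            hits.append(start)
    return hits
```

Overlapping hit windows would normally be merged into a single candidate island before being passed on to a gene finder; that step is omitted to keep the sketch short.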


David Haussler
10/14/1998