"A Hidden Markov Model that finds genes in E. coli DNA"
Anders Krogh, I. Saira Mian and David Haussler
Abstract:
A hidden Markov model (HMM) has been developed to find protein coding genes
in E. coli DNA using E. coli genome DNA sequence from the EcoSeq6 database
maintained by Kenn Rudd. This HMM includes states that model the codons and
their frequencies in E. coli genes, as well as the patterns found in the
intergenic region, including repetitive extragenic palindromic sequences and
the Shine-Dalgarno motif. To account for potential sequencing errors and/or
frameshifts in raw genomic DNA sequence, it allows for the (very unlikely)
possibility of insertions and deletions of individual nucleotides within a
codon. The parameters of the HMM are estimated using approximately one
million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a
disjoint set of contigs containing about 325,000 nucleotides. The HMM finds
the exact locations of about 80% of the known E. coli genes, and approximate
locations for about 10%. It also finds several potentially new genes, and
locates several places where insertion or deletion errors and/or frameshifts
may be present in the contigs.
In Nucleic Acids Research, v. 22 (1994) p. 4768-4778
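
As a rough illustration of how an HMM can label regions of a DNA sequence,
the sketch below runs Viterbi decoding on a toy two-state model
("intergenic" and "coding"). It is not the model described in the abstract,
which uses codon states, intergenic-region models, and insertion/deletion
states. The abstract also does not say which decoding procedure is used;
Viterbi is assumed here only because it is the standard way to obtain the
most probable state path. All state names and probabilities are hypothetical
values chosen for the example.

```python
import math

# Hypothetical toy parameters (NOT taken from the paper).
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
# Assumed emissions: AT-rich intergenic regions, GC-rich coding regions.
EMIT = {
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] is the log probability of the best path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Pick the predecessor that maximizes the path score into s.
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][symbol])
            )
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    toy_sequence = "ATATAT" + "GCGCGCGC" + "ATATAT"
    for base, label in zip(toy_sequence, viterbi(toy_sequence)):
        print(base, label)
```

Working in log space avoids numerical underflow on long sequences, which
matters at the scale of the contigs mentioned in the abstract. A model of the
kind the abstract describes would replace the single "coding" state with
states for individual codons and add low-probability states for nucleotide
insertions and deletions.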