"Dirichlet Mixtures: A Method for Improving Detection of Weak but
Significant Protein Sequence Homology"
Kimmen Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I.S. Mian,
and D. Haussler
Abstract:
This paper presents the mathematical foundations of Dirichlet mixtures,
which have been used to improve database search results for homologous
sequences when a variable number of sequences from a protein family
or domain are known. We present a method for condensing the information
in a protein database into a mixture of Dirichlet densities. These
mixtures are designed to be combined with observed amino acid frequencies,
to form estimates of expected amino acid probabilities at each position
in a profile, hidden Markov model, or other statistical model. These
estimates give a statistical model greater generalization capacity, such
that remotely related family members can be more reliably recognized by
the model. Dirichlet mixtures have been shown to outperform substitution
matrices and other methods for computing these expected amino acid
distributions in database search, resulting in fewer false positives
and false negatives for the families tested. This paper corrects a
previously published formula for estimating these expected probabilities,
and contains complete derivations of the Dirichlet mixture formulas,
methods for optimizing the mixtures to match particular databases, and
suggestions for efficient implementation.
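
As a rough illustration of how a Dirichlet mixture can be combined with observed amino acid counts, the sketch below computes the standard posterior-mean estimate under a Dirichlet mixture prior. Each component's posterior weight is proportional to its mixture weight times B(alpha + n) / B(alpha), and the estimate is the weighted average of (n_i + alpha_i) / (|n| + |alpha|) over components. The function names, the three-letter toy alphabet, and the parameter values are assumptions made for this example and are not taken from the paper, which should be consulted for the authoritative derivation and the fitted mixtures.

```python
from math import exp, lgamma, log


def log_beta(v):
    """Log of the multivariate Beta function B(v) for a vector of positive reals."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def posterior_mean(counts, weights, alphas):
    """Posterior-mean probabilities under a Dirichlet mixture prior.

    counts  : observed counts n_i at one position (hypothetical input)
    weights : mixture coefficients q_j, summing to 1
    alphas  : one Dirichlet parameter vector alpha_j per component
    """
    # Log posterior weight of each component, up to a shared constant:
    # log q_j + log B(alpha_j + n) - log B(alpha_j)
    log_post = []
    for q, alpha in zip(weights, alphas):
        combined = [n + a for n, a in zip(counts, alpha)]
        log_post.append(log(q) + log_beta(combined) - log_beta(alpha))

    # Normalize in log space to avoid underflow with large counts.
    peak = max(log_post)
    unnorm = [exp(lp - peak) for lp in log_post]
    total = sum(unnorm)
    post = [u / total for u in unnorm]

    # Weighted average of the per-component pseudocount estimates.
    n_total = sum(counts)
    probs = [0.0] * len(counts)
    for w, alpha in zip(post, alphas):
        denom = n_total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            probs[i] += w * (n + a) / denom
    return probs


# Toy example with a 3-letter alphabet and two made-up components.
toy_weights = [0.5, 0.5]
toy_alphas = [[1.0, 1.0, 2.0], [2.0, 1.0, 1.0]]
estimate = posterior_mean([3, 0, 1], toy_weights, toy_alphas)
```

With no observations the estimate reduces to the prior mean of the mixture, and as counts grow it approaches the observed frequencies, which is the behaviour the abstract describes as giving a model greater generalization capacity from few sequences.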