Arend Sidow
Departments of Pathology and Genetics, Stanford University
“Quantification of predictive constraints on the structure
and function of proteins”
Tuesday, January 7, 2003
2-3:45PM
Baskin Engineering 330
Abstract:
Molecular evolutionary analyses allow inference of
local constraints in protein sequences. These constraints
are predictive of regions of structural and functional
importance. I will present methodology and case studies
of the inference of constraints at the level of domains,
small regions, and individual amino acids.
Katherine Pollard
UC Berkeley Biostatistics
Postdoctoral Candidate, Haussler Lab
“Computationally intensive statistical methods for
analysis of gene expression data”
Tuesday, January 14, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Exploratory methods are being widely applied to the
high dimensional data structures produced by genome
scale research. For example, our methodological research
is focused on gene expression data analysis where
inferences can be made for thousands of genes simultaneously.
In such settings, it is essential that exploratory
analyses be accompanied by assessments of reliability
and reproducibility. These statistical assessments
are particularly crucial with the huge number of genes
being studied based on relatively small samples. To
this end, we have proposed a statistical framework
for the analysis of gene expression data. By viewing
quantities of interest (e.g., differences in gene
expression between populations, gene cluster membership,
associations between gene expression and survival)
as parameters of an underlying data generating distribution
and their observed values as parameter estimates,
we are able to formally study statistical properties,
such as consistency and confidence, of these quantities.
I will illustrate the utility of this statistical
framework for assessing the reliability of gene and
sample clustering methods, including novel clustering
algorithms we have proposed, using bootstrap resampling.
Other interesting applications include (i) multiple
testing to identify significantly differentially expressed
genes, (ii) sample size calculations for gene expression
studies, and (iii) clustering genes based on patterns
of association between gene expression and an outcome
or DNA sequence motif.
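As a rough illustration of the bootstrap resampling mentioned above, the sketch below treats the difference in mean expression of one gene between two populations as the parameter of interest and attaches a percentile interval to its estimate. It is not the speaker's framework; the function name, the default number of replicates, and the toy expression values are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_mean_difference(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b).

    group_a and group_b are lists of expression values for one gene in
    two populations (hypothetical data layout for this sketch).
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    diffs = []
    for _ in range(n_boot):
        # Resample each population with replacement, keeping its sample size.
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    tumor = [5.1, 4.9, 5.3, 5.0, 5.2]    # made-up log expression values
    normal = [3.0, 3.2, 2.9, 3.1, 3.3]   # made-up log expression values
    print(bootstrap_mean_difference(tumor, normal))
```

With only five samples per group the interval is crude, which is the small-sample situation the abstract warns about.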
Bob Edgar
Independent Scientist
“A new approach to sequence alignment and tree construction
using profile HMMs”
Tuesday, January 21, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Aligning multiple proteins based on sequence information
alone is challenging if sequence identity is low or
there is a significant degree of structural divergence.
We present a novel algorithm (SATCHMO), developed
in collaboration with Kimmen Sjolander, that is designed
to address this challenge. SATCHMO simultaneously
constructs a tree and a set of multiple sequence alignments,
one for each internal node of the tree. The alignment
at a given node contains all sequences within its
sub-tree, and predicts which positions in those sequences
are alignable and which are not. Aligned regions therefore
typically get shorter on a path from a leaf to the
root as sequences diverge in structure. Current methods
either regard all positions as alignable (e.g., ClustalW),
or align only those positions believed to be homologous
across all sequences (e.g., profile HMM methods); by
contrast SATCHMO makes different predictions of alignable
regions in different subgroups. SATCHMO generates
profile hidden Markov models at each node; these are
used to determine branching order, to align sequences
and to predict structurally alignable regions. In
experiments on the BAliBASE benchmark alignment database,
SATCHMO is shown to produce alignments that, on average,
have the same accuracy as ClustalW and the UCSC SAM
HMM software. Initial trials suggest that the structural
classification produced by the SATCHMO tree closely
approximates those in SCOP and CATH.
Christopher Workman
Scientific Consultant, GeneData AG, Switzerland
Postdoctoral Candidate, Haussler Lab
“Gene Expression Analysis: from DNA microarrays to
comparative genomics”
Tuesday, January 28, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Global gene expression analysis can be used to study
the molecular basis for disease and implicate genes
involved in various diseases. In this talk, specific
aspects of high density oligonucleotide and cDNA microarray
data analysis are addressed and some example analyses
are presented for human bladder and colon cancer.
Expression profiling and cluster analysis studies
also suggest sets of co-regulated genes, and their
promoter regions, which may contain common regulatory
elements such as transcription factor binding sites.
Computational pattern finding methods are capable
of discovering these DNA elements, although the large
intergenic regions in higher eukaryotes make the
search for these patterns quite difficult. Comparative
genomics studies focusing on conserved non-coding
DNA will prove helpful in defining smaller regions
in which to search for regulatory patterns. Initial
results are presented from a comparison of human and
mouse intergenic regions and suggest that conserved
regions are more likely to contain regulatory elements
than non-conserved intergenic sequence.
Richard Myers
Department of Genetics, Stanford University School
of Medicine
“Large-scale identification and analysis of functional
elements in the human genome”
Tuesday, February 4, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Our group has been both a producer and a user of human
genomic sequence. Working closely with the Joint Genome
Institute at Walnut Creek, California, we have produced
finished sequence of more than 300 Mb of the human
genome, including chromosomes 5 and 19 and much of
chromosome 16. Our finishing pipeline has produced
sequence of quality far exceeding the standards set
by the international public sequencing community,
and is now being applied to additional genomes. We
recently used the same experimental and computational
tools that we apply to sequence finishing to perform
an independent quality assessment of finished human
sequence generated by the other large public sequencing
groups.
In addition to producing this large body of finished
genomic sequence, as well as more than 8,000 full-length
cDNA sequences, we have applied a combination of sequence
analysis and experiments, mostly in the form of cultured
cell transfections of reporter constructs, to identify
and study cis-acting transcriptional control regions
on a genome-wide scale in the human genome. In the
most mature of these projects, we have identified
more than 18,000 human transcriptional promoters and
have analyzed their distributions and behaviors in
the genome. We have observed a number of frequent
occurrences of features that were previously thought
to be unusual. One such example is the identification
of more than 625 pairs of genes that are arranged
head-to-head, in a "bidirectional" orientation,
such that their 5' flanking segments are abutted,
on average within less than 200 bp from one another.
We are currently determining whether these genes belong
to particular biological classes, whether they are
co-expressed in various human tissues or whether they
are expressed exclusive of one another. In addition,
we have found evidence suggesting that more than 20%
of the genes transcribed by the set of promoters we
have identified have one or more additional, alternative
promoters, sometimes producing mRNAs that generate
different versions of the protein encoded by the gene.
Finally, we have used chromatin immunoprecipitation
methods, combined with real-time PCR or DNA microarrays
containing human promoters, to identify the genomic
targets bound by transcription factors in living cells.
We have applied this approach especially towards identifying
targets of the two human heat shock factor transcriptional
regulators, and have found new targets that are regulated
by these proteins, as well as differences in their
responses to cell differentiation compared to heat
shock. We have begun to build microarrays containing
18,000 human promoters to use as a tool for identifying
genes that are bound and regulated by these, as well
as other, transcription factors.
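The head-to-head arrangement described above can be made concrete with a small sketch. It assumes each gene is reduced to a chromosome, a strand, and a transcription start site (TSS) coordinate, and it reports neighboring pairs in which a minus-strand gene sits immediately upstream of a plus-strand gene. The Gene record, the max_gap threshold, and the coordinates are hypothetical and are not taken from the speaker's pipeline.

```python
from collections import namedtuple

# Hypothetical minimal gene record: name, chromosome, strand ("+" or "-"), TSS.
Gene = namedtuple("Gene", "name chrom strand tss")


def head_to_head_pairs(genes, max_gap=1000):
    """Return (minus_gene, plus_gene, gap) for adjacent divergent gene pairs.

    A pair counts as head-to-head when, walking along the chromosome, a
    minus-strand TSS is followed by a plus-strand TSS no more than
    max_gap bp away, so the two 5' flanking regions abut.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.tss)
        # Only neighboring TSSs are compared in this sketch.
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            gap = right.tss - left.tss
            if left.strand == "-" and right.strand == "+" and gap <= max_gap:
                pairs.append((left.name, right.name, gap))
    return pairs


if __name__ == "__main__":
    toy_genes = [
        Gene("A", "chr5", "-", 1000),
        Gene("B", "chr5", "+", 1150),
        Gene("C", "chr5", "+", 5000),
    ]
    print(head_to_head_pairs(toy_genes))
```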
Relevant citations:
MGC Program Team. (2002). Generation and initial analysis
of more than 14,000 non-redundant, full-length human
and mouse cDNA sequences by the NIH Mammalian Gene
Collection Program. Proc. Natl. Acad. Sci. USA. 99:
16899-16903.
Trinklein, N., Force Aldred, S., Saldanha, A. and
Myers, R. M. (2002). Identification and functional
analysis of human transcriptional promoters. Genome
Res. In press.
Noonan, J. P., Li, J., Nguyen, L., Caoile, C., Dickson,
M., Grimwood, J., Schmutz, J., Feldman, M. W. and
Myers, R. M. (2002). Extensive linkage disequilibrium,
a common 16.7 kb deletion and evidence of balancing
selection in the human protocadherin alpha cluster.
Amer. J. Hum. Genet. In press.
Trinklein, N., Chen, W. C., Kingston, R. E. and Myers,
R. M. (2002). Transcriptional regulation and binding
of HSF1 and HSF2 to 32 human heat shock genes during
thermal stress and differentiation. Mol. Cell. Biol.
Submitted.
Jeremy Minshull
Vice-President, Core Technology, Maxygen Inc.
“Better Engineering Through Sex”
Tuesday, February 11, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Recombination-based directed evolution methods are
analogous to sex in the way that they explore phenotypic
diversity. The high throughput screens used to measure
changes in protein properties during directed evolution
can be laborious and expensive to develop and implement,
and may not accurately measure the protein properties
that are ultimately required. By combining combinatorial
molecular biology techniques with more traditional
engineering approaches we have greatly reduced the
number of variants that need to be screened in iterative
protein optimization cycles. Statistical and structural
approaches to designing protein function will be compared,
with a passing reference to the headpiece of the staff
of Ra.
Soumya Raychaudhuri
Joint M.D./Ph.D. Program, Stanford University School
of Medicine
BME Faculty Recruitment Candidate
“Towards incorporating free text scientific literature
into the analysis of biological data”
Tuesday, February 25, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
With the completion of the genome projects, complete
lists of genes for many organisms are rapidly becoming
available. Additionally, high throughput technologies
permit rapid characterization of many genes simultaneously;
methods include gene expression measurements by microarray
assay and identification of protein interactions by
two-hybrid screens. The current challenge in bioinformatics
is to devise methods to interpret the results of such
large-scale experimental assays so that the properties
and interactions of individual genes can be identified.
To do this effectively, computational methods must
integrate significant background information such
as sequence information, gene annotations, other experimental
results, and knowledge from the published literature.
Since all biological discoveries are recorded in the
scientific literature, the corpus of scientific text
may constitute the most valuable knowledge resource.
Here I discuss computational approaches that automatically
access the corpus of scientific literature to analyze
large-scale biological data. These methods draw on
concepts from machine learning and natural language
processing. Specifically, I will focus on examples
from my work in gene expression analysis. The methods
presented here are relevant to the analysis of any
data for which significant textual documentation is
available.
Soumya Raychaudhuri spent
his formative years growing up in Rochester, NY. He
completed his undergraduate degrees in mathematics
and biophysics from SUNY Buffalo in 1997 as a Goldwater
Fellow and was the Outstanding Student in the Natural
Sciences that year. In August 2002, he completed a
doctoral degree from Stanford University in Russ Altman’s
lab, funded by an NIH pre-doctoral fellowship. Currently
he is working as a post-doctoral fellow in the same
lab and supervising a student. He is concurrently
finishing his medical degree. His research interests
are in the area of computational biology and large-scale
data analysis. Lately he has focused his efforts on
integrating knowledge from free text sources into
biological data analysis. In addition, Soumya has
worked in diverse areas of computational biology including
gene expression analysis and protein structure analysis.
Recently he has presented papers and tutorials at
Intelligent Systems in Biocomputing, Pacific Symposium
on Biocomputing, European Molecular Biology Organization,
and the Joint Statistical Meeting.
William J. Bruno
Theoretical Biology & Biophysics
Los Alamos National Laboratory
Biomolecular Engineering Faculty Recruitment Candidate
“Statistics in Tree Space: True P-Values from Bootstrap
Fractions”
Tuesday, March 4, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Evolutionary biologists have long used bootstrapping
or jackknifing (a simple procedure of iterating an
analysis on random portions of the dataset) to assign
confidence measures to evolutionary trees. Evolutionary
trees are critically relevant to many tasks in bioinformatics.
Two examples are proper identification of orthologs,
and deriving a position-specific scoring matrix from
an alignment ("the sequence weighting problem").
Meanwhile, there are other hierarchical clustering
problems (e.g., gene clustering based on expression
microarray data) where bootstrapping would also be
very useful. There has been some disagreement in the
literature over the interpretation of bootstrap scores.
I will show that, under certain general conditions,
the bootstrap score and the desired p-value for a
tree branch are related by a nonlinear function that
can be computed by numerical integration and well
approximated by a no-parameter formula derived from
the asymptotics. The result is that the p-value for
a bootstrap support of 99.9% is an order of magnitude
smaller than previously thought, and this ratio of
corrected p-value to raw p-value goes to zero for
small p. One of the prerequisites for applying this
formula, or for bootstrapping to have any meaning,
is that the reconstruction method be unbiased. I will
show that corrected bootstrapping statistics are accurate
if the trees are built using Weighbor or DNAml, but
not if Neighbor Joining trees are used.
Paul Harrison
MB&B Department, Yale University
BME Faculty Recruitment Candidate
“Insights into proteome evolution in eukaryotes from
genome-scale analysis of gene and pseudogene populations”
Wednesday, March 5, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Pseudogenes are copies of genes that cannot produce
a protein. They are either made through duplication
or as retrotransposed copies of mRNAs (the latter
type is termed a processed pseudogene). We discuss
the assignment of pseudogenes in the genomes of four
eukaryotes (budding yeast, fruit fly, nematode and human),
and the implications of these pseudogene populations
for gene annotation and for the evolution of the proteomes
of eukaryotes. For example, we find that, in budding
yeast, duplicated pseudogenes tend to cluster near
the telomeres of chromosomes, and occur for families
that are more specific to the budding yeast proteome.
Also, in the human genome, we detail a study of over
2,000 processed pseudogenes for ribosomal proteins,
where we found that they are strikingly unevenly distributed
amongst the different types of ribosomal protein.
We conclude with some discussion of over-arching themes
for pseudogene populations and proteome evolution.
Michael I. Jordan
Department of Electrical Engineering and Computer
Science / Department of Statistics, UC Berkeley
“Machine learning and the integration of multiple
data sources”
Tuesday, March 11, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Machine learning systems are increasingly being called
upon to integrate across data sources having varying
formats, semantics and degrees of reliability. I describe
two classes of techniques (one “generative” and the
other “discriminative”) that aim at solving large-scale
data integration problems. The first class makes use
of probabilistic graphical models, a formalism that
exploits the conjoined talents of graph theory and
probability theory to build complex models out of
simpler pieces. I describe graphical model algorithms
that implement a general “empirical Bayesian” approach
to data integration. The second class is based on
“kernel methods,” an area of machine learning that
makes significant use of convex optimization techniques.
I show how multiple kernels can be combined, yielding
a problem that is still within the scope of convex
optimization. I illustrate these ideas with examples
from information retrieval and bioinformatics.
(with Peter Bartlett, David Blei, Nello Cristianini,
Laurent El Ghaoui, Gert Lanckriet, and Andrew Ng)
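To make the idea of combining multiple kernels concrete, the sketch below forms a non-negative weighted sum of two Gram matrices computed on the same points. A non-negative combination of valid kernels is itself a valid kernel, which is one reason the combined problem can stay convex. The weights here are fixed by hand rather than learned, so this does not reproduce the optimization described in the talk. The function names, the gamma value, and the toy points are assumptions.

```python
import math


def linear_kernel(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel with an assumed bandwidth parameter gamma."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(kernel, points):
    """Evaluate a kernel on every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


def combine_kernels(gram_matrices, weights):
    """Non-negative weighted sum of Gram matrices built on the same points."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative to keep a valid kernel")
    n = len(gram_matrices[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, gram_matrices)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    combined = combine_kernels(
        [gram_matrix(linear_kernel, points), gram_matrix(rbf_kernel, points)],
        [0.5, 0.5],
    )
    for row in combined:
        print(row)
```

In a data integration setting each Gram matrix would come from a different data source describing the same objects, for example one from sequence and one from expression.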
Shirley Pepke
Computational Biology Postdoctoral Fellow, Berlex
Biosciences
“Computation and Utility of Word Frequency Annotations
for the Human Genome”
Tuesday, March 18, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330
Abstract:
Whole genome sequence data for organisms presents
a rich opportunity for application of statistical
measures to DNA functional classification. A computational
approach for classification may be particularly useful
for noncoding DNA sequence such as long-range regulatory
elements, for example locus control regions (LCRs), as the
functional behavior of these elements is sensitive to
experimental context. In particular,
I will discuss distributed suffix arrays for efficient
computation of measures based upon word counts in
a region or across the whole genome. The resulting
word-count data for the human genome clearly reflect
long length scale DNA sequence correlations. More
intriguing is the possibility of using such simple
statistical measures to distinguish LCR functional
regions from other DNA. One approach to doing this
will be examined in the context of the human beta
globin LCR.
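As a small illustration of counting words with a suffix array, the sketch below builds the array naively for a short string and counts occurrences of a word with two binary searches over the sorted suffixes. It runs on a single machine and in memory, so it says nothing about the distributed construction discussed in the talk. The function names and the toy sequence are assumptions.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting suffixes; fine for short strings only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_word(text, suffix_array, word):
    """Count (possibly overlapping) occurrences of word in text.

    Suffixes starting with word form one contiguous block in the suffix
    array, so two binary searches on the length-k prefixes find its bounds.
    """
    k = len(word)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + k]

    # First rank whose prefix is >= word.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose prefix is > word.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= word:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


if __name__ == "__main__":
    sequence = "GATTACAGATTACA"
    sa = build_suffix_array(sequence)
    for word in ("A", "TT", "GATTACA", "CCC"):
        print(word, count_word(sequence, sa, word))
```

Once the array exists, each word count costs a logarithmic number of comparisons, which is what makes per-region or genome-wide word statistics practical.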