UCSC

Center for Biomolecular Science & Engineering

WINTER 2003 BIOINFORMATICS SEMINARS

Tuesday, January 7
Arend Sidow: Quantification of predictive constraints on the structure and function of proteins

Tuesday, January 14
Katherine Pollard: Computationally intensive statistical methods for analysis of gene expression data

Tuesday, January 21
Bob Edgar: A new approach to sequence alignment and tree construction using profile HMMs

Tuesday, January 28
Christopher Workman: Gene Expression Analysis: from DNA microarrays to comparative genomics

Tuesday, February 4
Richard Myers: Large-scale identification and analysis of functional elements in the human genome

Tuesday, February 11
Jeremy Minshull: Better Engineering Through Sex

Tuesday, February 25
Soumya Raychaudhuri: Towards incorporating free text scientific literature into the analysis of biological data

Tuesday, March 4
Bill Bruno: Statistics in Tree Space: True P-Values from Bootstrap Fractions

Wednesday, March 5
Paul Harrison: Insights into proteome evolution in eukaryotes from genome-scale analysis of gene and pseudogene populations

Tuesday, March 11
Michael Jordan: Machine learning and the integration of multiple data sources

Tuesday, March 18
Shirley Pepke: Computation and Utility of Word Frequency Annotations for the Human Genome

BIOs & ABSTRACTS


Arend Sidow
Departments of Pathology and Genetics, Stanford University
“Quantification of predictive constraints on the structure and function of proteins”
Tuesday, January 7, 2003
2-3:45PM
Baskin Engineering 330

Abstract:
Molecular evolutionary analyses allow inference of local constraints in protein sequences. These constraints are predictive of regions of structural and functional importance. I will present methodology and case studies of the inference of constraints at the level of domains, small regions, and individual amino acids.



Katherine Pollard
UC Berkeley Biostatistics
Postdoctoral Candidate, Haussler Lab
“Computationally intensive statistical methods for analysis of gene expression data”
Tuesday, January 14, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Exploratory methods are being widely applied to the high dimensional data structures produced by genome scale research. For example, our methodological research is focused on gene expression data analysis where inferences can be made for thousands of genes simultaneously. In such settings, it is essential that exploratory analyses be accompanied by assessments of reliability and reproducibility. These statistical assessments are particularly crucial with the huge number of genes being studied based on relatively small samples. To this end, we have proposed a statistical framework for the analysis of gene expression data. By viewing quantities of interest (e.g., differences in gene expression between populations, gene cluster membership, associations between gene expression and survival) as parameters of an underlying data generating distribution and their observed values as parameter estimates, we are able to formally study statistical properties, such as consistency and confidence, of these quantities. I will illustrate the utility of this statistical framework for assessing the reliability of gene and sample clustering methods, including novel clustering algorithms we have proposed, using bootstrap resampling. Other interesting applications include (i) multiple testing to identify significantly differentially expressed genes, (ii) sample size calculations for gene expression studies, and (iii) clustering genes based on patterns of association between gene expression and an outcome or DNA sequence motif.
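As an illustration only, and not the speaker's own method, the idea of treating a between-population expression difference as a parameter and attaching a confidence statement to its estimate can be sketched with a percentile bootstrap. The function name, the toy expression values, and the choice of a percentile interval are all assumptions made for this sketch.

```python
import random
import statistics


def bootstrap_mean_diff(group_a, group_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(group_a) - mean(group_b).

    Hypothetical helper for illustration; returns (estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimate = statistics.fmean(group_a) - statistics.fmean(group_b)

    diffs = []
    for _ in range(n_boot):
        # Resample each population with replacement, keeping its sample size.
        resampled_a = rng.choices(group_a, k=len(group_a))
        resampled_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))

    # Read the interval off the sorted bootstrap distribution.
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return estimate, lower, upper


# Toy log-expression values for one gene in two populations (made-up data).
tumor = [5.1, 4.8, 5.6, 5.3, 4.9, 5.4]
normal = [3.9, 4.2, 4.0, 4.4, 3.8, 4.1]
estimate, lower, upper = bootstrap_mean_diff(tumor, normal)
print(f"difference = {estimate:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

With thousands of genes, an interval like this would be computed per gene, which is where the multiple testing issue mentioned in the abstract comes in.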


Bob Edgar
Independent Scientist
“A new approach to sequence alignment and tree construction using profile HMMs”
Tuesday, January 21, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Aligning multiple proteins based on sequence information alone is challenging if sequence identity is low or there is a significant degree of structural divergence. We present a novel algorithm (SATCHMO), developed in collaboration with Kimmen Sjolander, that is designed to address this challenge. SATCHMO simultaneously constructs a tree and a set of multiple sequence alignments, one for each internal node of the tree. The alignment at a given node contains all sequences within its sub-tree, and predicts which positions in those sequences are alignable and which are not. Aligned regions therefore typically get shorter on a path from a leaf to the root as sequences diverge in structure. Current methods either regard all positions as alignable (e.g., ClustalW), or align only those positions believed to be homologous across all sequences (e.g., profile HMM methods); by contrast, SATCHMO makes different predictions of alignable regions in different subgroups. SATCHMO generates profile hidden Markov models at each node; these are used to determine branching order, to align sequences and to predict structurally alignable regions. In experiments on the BAliBASE benchmark alignment database, SATCHMO is shown to produce alignments that, on average, have the same accuracy as ClustalW and the UCSC SAM HMM software. Initial trials suggest that the structural classification produced by the SATCHMO tree closely approximates those in SCOP and CATH.


Christopher Workman
Scientific Consultant, GeneData AG, Switzerland
Postdoctoral Candidate, Haussler Lab
“Gene Expression Analysis: from DNA microarrays to comparative genomics”
Tuesday, January 28, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Global gene expression analysis can be used to study the molecular basis for disease and implicate genes involved in various diseases. In this talk, specific aspects of high density oligonucleotide and cDNA microarray data analysis are addressed and some example analyses are presented for human bladder and colon cancer. Expression profiling and cluster analysis studies also suggest sets of co-regulated genes, and their promoter regions, which may contain common regulatory elements such as transcription factor binding sites. Computational pattern finding methods are capable of discovering these DNA elements, though the large intergenic regions in higher eukaryotes make the search for these patterns quite difficult. Comparative genomics studies focusing on conserved non-coding DNA will prove helpful in defining smaller regions in which to search for regulatory patterns. Initial results are presented from a comparison of human and mouse intergenic regions and suggest that conserved regions are more likely to contain regulatory elements than non-conserved intergenic sequence.


Richard Myers
Department of Genetics, Stanford University School of Medicine
“Large-scale identification and analysis of functional elements in the human genome”
Tuesday, February 4, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Our group has been both a producer and a user of human genomic sequence. Working closely with the Joint Genome Institute at Walnut Creek, California, we have produced finished sequence of more than 300 Mb of the human genome, including chromosomes 5 and 19 and much of chromosome 16. Our finishing pipeline has produced sequence of quality far exceeding the standards set by the international public sequencing community, and is now being applied to additional genomes. We recently used the same experimental and computational tools that we apply to sequence finishing to perform an independent quality assessment of finished human sequence generated by the other large public sequencing groups.
In addition to producing this large body of finished genomic sequence, as well as more than 8,000 full-length cDNA sequences, we have applied a combination of sequence analysis and experiments, mostly in the form of cultured cell transfections of reporter constructs, to identify and study cis-acting transcriptional control regions on a genome-wide scale in the human genome. In the most mature of these projects, we have identified more than 18,000 human transcriptional promoters and have analyzed their distributions and behaviors in the genome. We have observed a number of frequent occurrences of features that were previously thought to be unusual. One such example is the identification of more than 625 pairs of genes that are arranged head-to-head, in a "bidirectional" orientation, such that their 5' flanking segments are abutted, on average within less than 200 bp from one another. We are currently determining whether these genes belong to particular biological classes, whether they are co-expressed in various human tissues or whether they are expressed exclusive of one another. In addition, we have found evidence suggesting that more than 20% of the genes transcribed by the set of promoters we have identified have one or more additional, alternative promoters, sometimes producing mRNAs that generate different versions of the protein encoded by the gene.
Finally, we have used chromatin immunoprecipitation methods, combined with real-time PCR or DNA microarrays containing human promoters, to identify the genomic targets bound by transcription factors in living cells. We have applied this approach especially towards identifying targets of the two human heat shock factor transcriptional regulators, and have found new targets that are regulated by these proteins, as well as differences in their responses to cell differentiation compared to heat shock. We have begun to build microarrays containing 18,000 human promoters to use as a tool for identifying genes that are bound and regulated by these, as well as other, transcription factors.

Relevant citations:
MGC Program Team. (2002). Generation and initial analysis of more than 14,000 non-redundant, full-length human and mouse cDNA sequences by the NIH Mammalian Gene Collection Program. Proc. Natl. Acad. Sci. USA. 99: 16899-16903.
Trinklein, N., Force Aldred, S., Saldanha, A. and Myers, R. M. (2002). Identification and functional analysis of human transcriptional promoters. Genome Res. In press.
Noonan, J. P., Li, J., Nguyen, L., Caoile, C., Dickson, M., Grimwood, J., Schmutz, J., Feldman, M. W. and Myers, R. M. (2002). Extensive linkage disequilibrium, a common 16.7 kb deletion and evidence of balancing selection in the human protocadherin alpha cluster. Amer. J. Hum. Genet. In press.
Trinklein, N., Chen, W. C., Kingston, R. E. and Myers, R. M. (2002). Transcriptional regulation and binding of HSF1 and HSF2 to 32 human heat shock genes during thermal stress and differentiation. Mol. Cell. Biol. Submitted.


Jeremy Minshull
Vice-President, Core Technology, Maxygen Inc.
“Better Engineering Through Sex”
Tuesday, February 11, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Recombination-based directed evolution methods are analogous to sex in the way that they explore phenotypic diversity. The high throughput screens used to measure changes in protein properties during directed evolution can be laborious and expensive to develop and implement, and may not accurately measure the protein properties that are ultimately required. By combining combinatorial molecular biology techniques with more traditional engineering approaches we have greatly reduced the number of variants that need to be screened in iterative protein optimization cycles. Statistical and structural approaches to designing protein function will be compared, with a passing reference to the headpiece of the staff of Ra.


Soumya Raychaudhuri
Joint M.D./Ph.D. Program, Stanford University School of Medicine
BME Faculty Recruitment Candidate
“Towards incorporating free text scientific literature into the analysis of biological data”
Tuesday, February 25, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
With the completion of the genome projects, a complete list of genes in many organisms is rapidly becoming available. Additionally, high throughput technologies permit rapid characterization of many genes simultaneously; methods include gene expression measurements by microarray assay and identification of protein interactions by two-hybrid screens. The current challenge in bioinformatics is to devise methods to interpret the results of such large-scale experimental assays so that the properties and interactions of individual genes can be identified. To do this effectively, computational methods must integrate significant background information such as sequence information, gene annotations, other experimental results, and knowledge from the published literature. Since all biological discoveries are recorded in the scientific literature, the corpus of scientific text may constitute the most valuable knowledge resource. Here I discuss computational approaches that automatically access the corpus of scientific literature to analyze large-scale biological data. These methods draw on concepts from machine learning and natural language processing. Specifically, I will focus on examples from my work in gene expression analysis. The methods presented here are relevant to the analysis of any data for which significant textual documentation is available.

Soumya Raychaudhuri spent his formative years growing up in Rochester, NY. He completed his undergraduate degrees in mathematics and biophysics from SUNY Buffalo in 1997 as a Goldwater Fellow and was the Outstanding Student in the Natural Sciences that year. In August 2002, he completed a doctoral degree from Stanford University in Russ Altman’s lab, funded by an NIH pre-doctoral fellowship. Currently he is working as a post-doctoral fellow in the same lab and supervising a student. He is concurrently finishing his medical degree. His research interests are in the area of computational biology and large-scale data analysis. Lately he has focused his efforts on integrating knowledge from free text sources into biological data analysis. In addition, Soumya has worked in diverse areas of computational biology including gene expression analysis and protein structure analysis. Recently he has presented papers and tutorials at Intelligent Systems in Biocomputing, Pacific Symposium on Biocomputing, European Molecular Biology Organization, and the Joint Statistical Meeting.


William J. Bruno
Theoretical Biology & Biophysics
Los Alamos National Laboratory
Biomolecular Engineering Faculty Recruitment Candidate
“Statistics in Tree Space: True P-Values from Bootstrap Fractions”
Tuesday, March 4, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Evolutionary biologists have long used bootstrapping or jackknifing (a simple procedure of iterating an analysis on random portions of the dataset) to assign confidence measures to evolutionary trees. Evolutionary trees are critically relevant to many tasks in bioinformatics. Two examples are proper identification of orthologs, and deriving a position-specific scoring matrix from an alignment ("the sequence weighting problem"). Meanwhile, there are other hierarchical clustering problems (e.g., gene clustering based on expression microarray data) where bootstrapping would also be very useful. There has been some disagreement in the literature over the interpretation of bootstrap scores. I will show that, under certain general conditions, the bootstrap score and the desired p-value for a tree branch are related by a nonlinear function that can be computed by numerical integration and well approximated by a no-parameter formula derived from the asymptotics. The result is that the p-value for a bootstrap support of 99.9% is an order of magnitude smaller than previously thought, and this ratio of corrected p-value to raw p-value goes to zero for small p. One of the prerequisites for applying this formula, or for bootstrapping to have any meaning, is that the reconstruction method be unbiased. I will show that corrected bootstrapping statistics are accurate if the trees are built using Weighbor or DNAml, but not if Neighbor Joining trees are used.


Paul Harrison
MB&B Department, Yale University
BME Faculty Recruitment Candidate
“Insights into proteome evolution in eukaryotes from genome-scale analysis of gene and pseudogene populations”
Wednesday, March 5, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Pseudogenes are copies of genes that cannot produce a protein. They are either made through duplication or as retrotransposed copies of mRNAs (the latter type is termed a processed pseudogene). We discuss the assignment of pseudogenes in the genomes of four eukaryotes (budding yeast, fruit fly, nematode and human), and the implications of these pseudogene populations for gene annotation and for the evolution of the proteomes of eukaryotes. For example, we find that, in budding yeast, duplicated pseudogenes tend to cluster near the telomeres of chromosomes, and occur for families that are more specific to the budding yeast proteome. Also, in the human genome, we detail a study of over 2,000 processed pseudogenes for ribosomal proteins, where we found that they are strikingly unevenly distributed amongst the different types of ribosomal protein. We conclude with some discussion of over-arching themes for pseudogene populations and proteome evolution.


Michael I. Jordan
Department of Electrical Engineering and Computer Science / Department of Statistics, UC Berkeley
“Machine learning and the integration of multiple data sources”
Tuesday, March 11, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Machine learning systems are increasingly being called upon to integrate across data sources having varying formats, semantics and degrees of reliability. I describe two classes of techniques (one “generative” and the other “discriminative”) that aim at solving large-scale data integration problems. The first class makes use of probabilistic graphical models, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I describe graphical model algorithms that implement a general “empirical Bayesian” approach to data integration. The second class is based on “kernel methods,” an area of machine learning that makes significant use of convex optimization techniques. I show how multiple kernels can be combined, yielding a problem that is still within the scope of convex optimization. I illustrate these ideas with examples from information retrieval and bioinformatics.
(with Peter Bartlett, David Blei, Nello Cristianini, Laurent El Ghaoui, Gert Lanckriet, and Andrew Ng)


Shirley Pepke
Computational Biology Postdoctoral Fellow, Berlex Biosciences
“Computation and Utility of Word Frequency Annotations for the Human Genome”
Tuesday, March 18, 2003
2-3:00PM, Q&A period follows
Baskin Engineering 330

Abstract:
Whole genome sequence data for organisms presents a rich opportunity for application of statistical measures to DNA functional classification. A computational approach for classification may be particularly useful for noncoding DNA sequence such as long range regulatory elements (LCRs), as the functional behavior of these elements is sensitive to experimental context. In particular, I will discuss distributed suffix arrays for efficient computation of measures based upon word counts in a region or across the whole genome. The resulting word count data for the human genome clearly reflect long length scale DNA sequence correlations. More intriguing is the possibility of using such simple statistical measures to distinguish LCR functional regions from other DNA. One approach to doing this will be examined in the context of the human beta globin LCR.
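For readers unfamiliar with the data structure, the following is a minimal single-machine sketch of how a suffix array supports word counting; it is not the distributed implementation discussed in the talk. The function names and the toy sequence are assumptions, and the naive construction shown here would not scale to a genome.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    # Naive construction that compares whole suffixes; toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_word(text, suffix_array, word):
    """Count (possibly overlapping) occurrences of word via binary search."""
    k = len(word)

    def first_index(strict):
        # First suffix whose k-letter prefix is >= word (or > word if strict).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + k]
            if prefix < word or (strict and prefix == word):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # All suffixes starting with word form one contiguous block.
    return first_index(strict=True) - first_index(strict=False)


# Toy DNA sequence (made-up data).
sequence = "GATTACAGATTACA"
suffixes = build_suffix_array(sequence)
for word in ("GATTACA", "TTA", "A", "CCC"):
    print(word, count_word(sequence, suffixes, word))
```

Once the array is built, each query costs two binary searches, so counting many words across the same sequence avoids rescanning it.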

 


©2003 Center for Biomolecular Science & Engineering
Baskin School of Engineering, University of California, Santa Cruz
1156 High St., Room 373, Santa Cruz, CA 95064
(831) 459-1544 swalton@soe.ucsc.edu
Last modified June 2003