Center for Biomolecular Science & Engineering: Promoting discovery and invention in the post-genomic age
Baskin School of Engineering
UCSC and the Human Genome Project
 

The International Human Genome Project (IHGP) came to UC Santa Cruz in December 1999 when Eric Lander, the director of the Whitehead sequencing center (Whitehead Institute/MIT Center for Genome Research), invited David Haussler to help annotate the human genome. In particular, Lander wanted help in discovering the locations of the genes, which make up only approximately 1.5% of the sequence.

Haussler had previously applied a mathematical technique known as hidden Markov models (HMMs) to the task of computer gene-finding. This application of HMMs had quickly become the dominant gene-finding methodology and had been used successfully on the Drosophila melanogaster (fruit fly) genome.

Haussler enlisted Jim Kent, then a graduate student in UCSC’s Department of Molecular, Cell, & Developmental Biology, along with systems engineer Patrick Gavin and graduate students Terrence Furey and David Kulp (who had led the gene-finding effort on the Drosophila genome). This was the birth of the UCSC Genome Bioinformatics Group.

THE FIRST WORKING DRAFT
It was a crucial time for the international project. The private company Celera Genomics had announced its intention to assemble the human genome sequence well in advance of the public effort, raising the fear that the sequence would be protected by patents and thus not freely available to scientists. At this point, a number of groups within the IHGP were trying to assemble the genome sequence, which turned out to be like an extremely difficult jigsaw puzzle having many similar-looking, noncontiguous, overlapping pieces. The progress was slow and arduous.

Motivated to prevent Celera and its clients from locking up significant portions of the human genome in patents, Kent dropped his other work in May of 2000 to focus on the assembly problem. Within four weeks, he developed a 10,000-line computer program that assembled the working draft of the human genome. The program, called GigAssembler, finished the job on June 22, 2000, just days before Celera completed its first assembly.

On July 7, 2000, after further examination by the principal scientists of the public genome project, the UCSC Genome Bioinformatics Group released this first working draft on the web at http://genome.ucsc.edu. The scientific community downloaded one-half trillion bytes of information from the UCSC genome server in the first 24 hours of free and unrestricted access to the assembled blueprint of our human species.


Image: Nature Magazine cover - The Human Genome
With the genome assembly 90% complete, the assembled genome was published along with the findings of hundreds of researchers worldwide in the February 15, 2001 issue of Nature, which was largely devoted to the human genome.

 
 
 

How are genomes sequenced?
No laboratory system is available that can read along the entire length of a DNA strand to determine the order of nucleotide bases (A, G, C, and T for adenine, guanine, cytosine, and thymine). DNA sequences are determined by a variety of methods, some automated. They all involve breaking DNA into fragments by some chemical method, such as the use of enzymes, and then determining the order of the nucleotides in the fragments.

A single sequencing experiment will yield at most a few hundred base pairs (each A pairs with a T, and each G pairs with a C on the opposite strand of a DNA double helix). The human genome contains 3 billion nucleotides (bases), so sequencing experiments alone will not reveal the genome. Sequences need to be assembled, which is much like solving a colossal jigsaw puzzle where the parts overlap. One way to solve the puzzle is called the shotgun approach. You break the genome up into random, overlapping segments, sequence the segments, and stitch them back together using bioinformatics. This can be done even if you know nothing about the genome and the locations of specific genes and elements. But with a set of 3 billion nucleotides, it's a huge task.
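The stitching step can be illustrated with a toy example. The Python sketch below is a minimal greedy overlap merger, written only to show the idea: it repeatedly joins the two fragments whose ends overlap the most. It is not GigAssembler or any production assembler, and the function names (overlap, greedy_assemble), the min_len parameter, and the sample reads are hypothetical. It also assumes error-free reads that all come from the same strand, which real data do not guarantee.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of a that is also a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        # Compare every ordered pair of reads and remember the largest overlap.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads  # nothing left to join
        # Join the pair, writing the shared bases only once.
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)


# Three overlapping fragments of the made-up sequence ATGGCGTACGTTAGC.
fragments = ["GCGTACGTT", "ACGTTAGC", "ATGGCGTA"]
print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

Comparing every pair of reads is fine for three fragments but far too slow for billions of bases, and repeated stretches of DNA can mislead a greedy merger, which is part of why assembling a real genome is so much harder than this sketch suggests.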

The task is further complicated by the fact that to get an accurate map, you need considerable redundancy in the sequenced segments. So the sequenced segments contain several times the number of bases in the genome being studied. A supercomputer (such as UCSC’s PitaKluster) tackling this task will spit out a series of longer assembled segments that are contiguous and represent non-overlapping portions of the genome. These are called contigs. To join the contigs together, researchers must go back to the wet lab and get sequences of the gaps between the contigs. They home in on the missing sequences using the ends of the existing ones.
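The redundancy requirement can be put in rough numbers. The Python sketch below computes how many reads are needed so that the reads together contain a chosen multiple of the genome's length (often called coverage). The 500-base read length and the fivefold coverage are illustrative assumptions, not figures from the Human Genome Project, and the function names are hypothetical. The second function uses the standard simplification that reads land uniformly at random, under which the fraction of the genome left unsequenced is about e raised to the power of minus the coverage.

```python
import math


def reads_needed(genome_size: int, read_length: int, coverage: float) -> int:
    """Number of reads whose combined length is coverage times the genome size."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Approximate fraction of bases hit by no read, assuming random read placement."""
    return math.exp(-coverage)


GENOME_SIZE = 3_000_000_000  # bases in the human genome, as stated in the text
READ_LENGTH = 500            # assumed: "a few hundred base pairs" per experiment
COVERAGE = 5                 # assumed: "several times" the genome length

print(reads_needed(GENOME_SIZE, READ_LENGTH, COVERAGE))  # 30000000
print(f"{expected_uncovered_fraction(COVERAGE):.2%}")    # 0.67%
```

Under these assumptions, even tens of millions of reads would be expected to leave a small fraction of the genome unsequenced, which is consistent with the need to return to the wet lab to close the gaps between contigs.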

The shotgun approach can be more effective if it is informed by other knowledge of the genome that is already available. The human genome resides on 23 pairs of chromosomes. The locations of many genes on these chromosomes are already known, which allows some sequences to be placed on the map. The genome can then be pieced together from these fixed segments. This is a bit like solving a jigsaw puzzle using the picture on the cover of the box as a guide.

 

 

 

 

THE FINISHED SEQUENCE
The initial assembled human genome sequence was referred to as a working draft because there remained gaps where DNA sequence was missing, due either to a lack of raw sequence data or to ambiguities in the positions of the fragments. In the months following the release of the working draft, the UCSC team worked with other researchers worldwide to fill in the gaps. The resulting finished sequence made its debut in April of 2003. It encompasses 99% of the gene-containing regions of the human genome and is 99.99% accurate.

 

 

© January 2005,
CBSE

Updated 5/2008