Haussler
had previously applied a mathematical
technique known as hidden
Markov models (HMMs) to the
task of computer gene-finding. This
application of HMMs had quickly
become the dominant gene-finding
methodology and had been used
successfully on the Drosophila
melanogaster (fruit fly)
genome.
Haussler
enlisted Jim Kent, then a graduate student in UCSC’s
Department
of Molecular, Cell, & Developmental Biology, along with systems engineer
Patrick Gavin and graduate students Terrence Furey and David Kulp (who had
led the gene-finding effort on the Drosophila genome). This was the birth of
the UCSC Genome Bioinformatics Group.

THE FIRST WORKING DRAFT
It
was a crucial time for the
international project. The
private company Celera Genomics
had announced its intention
to assemble the human genome
sequence well in advance of
the public effort, raising
the fear that the sequence
would be protected by patents
and thus not freely available
to scientists. At this point,
a number of groups within
the IHGP were trying to assemble
the genome sequence, which
turned out to be like an
extremely difficult jigsaw puzzle having
many similar-looking, noncontiguous,
overlapping pieces. The progress
was slow and arduous.
Motivated
to prevent Celera and
its clients from locking up significant portions of
the human genome in
patents, Kent dropped his other
work in May of 2000 to focus
on the assembly problem.
Within 4 weeks, he developed a
10,000-line computer
program that assembled the working draft
of the human genome.
The program, called GigAssembler,
finished the job on
June 22, 2000, just days before Celera
completed its first
assembly.
On July 7, 2000, after
further examination
by the principal scientists of the public
genome project, the
UCSC Genome Bioinformatics Group released this first
working draft on the
web at http://genome.ucsc.edu.
The scientific community
downloaded one-half trillion bytes
of information from
the UCSC genome server in the first
24 hours of free and
unrestricted access to the assembled
blueprint of our human
species.
How are genomes sequenced?
There
isn't a laboratory system available to
read along the entire length
of a DNA strand to determine
the order of nucleotide bases
(A, G, C, and T for adenine,
guanine, cytosine, and thymine).
DNA sequences are determined
by a variety of methods, some
automated. They all involve
breaking DNA into fragments
by some chemical method such
as the use of enzymes and then
determining the order of the
nucleotides in the fragments.
A single sequencing experiment will yield at most a few hundred
base pairs (each A pairs with a T, and each G pairs with a C on
the opposite strand of a DNA double helix). The human genome
contains 3 billion nucleotides (bases), so sequencing experiments
alone will not reveal the genome. Sequences need to be assembled,
which is much like solving a colossal jigsaw puzzle where the
parts overlap. One way to solve the puzzle is called the shotgun
approach. You break the genome up into random, overlapping
segments, sequence the segments, and stitch them back together
using bioinformatics. This can be done even if you know nothing
about the genome and the locations of specific genes and
elements. But with a set of 3 billion nucleotides, it's a huge
task.
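
The stitching step can be illustrated with a toy example. The sketch below is not GigAssembler or any production assembler; it is a minimal greedy merge that repeatedly joins the two reads with the longest suffix-to-prefix overlap. The function names, the `min_len` threshold, and the sample reads are all hypothetical, and the sketch assumes error-free reads from a single strand with no repeated regions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Merge reads greedily by largest overlap; return the remaining pieces."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        # Discard reads that lie entirely inside another read.
        reads = [r for r in reads
                 if not any(r != other and r in other for other in reads)]
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return reads  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


# Three hypothetical overlapping reads, given out of order.
pieces = greedy_assemble(["CATCCGTA", "AGCTTAGG", "TAGGCATC"])
print(pieces)
```

Real genomes contain long repeats and sequencing errors, which is why a greedy merge like this one is only an illustration of the idea and not a workable method at the scale of 3 billion nucleotides.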
The
task is further complicated by the fact that
to get
an accurate map, you
need considerable redundancy
in the sequenced segments.
So the sequenced segments contain several times the number
of
bases in the genome being studied. A supercomputer (such
as UCSC’s PitaKluster)
tackling this task will spit out a series of longer assembled
segments that are contiguous and represent non-overlapping
portions of the genome.
These are called contigs.
To join the contigs
together, researchers must go back to the wet lab and get
sequences of the gaps between the contigs.
They home in on the missing
sequences using the ends of the existing ones.
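
The redundancy requirement and the gaps between contigs can be put in rough numbers. The sketch below assumes an idealized model in which reads land uniformly at random along the genome, so the chance that a given base is missed by every read is approximately e^(-c) at c-fold coverage. The read length of 500 bases and the 5-fold target are illustrative assumptions, not figures from the project.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_uncovered_fraction(c: float) -> float:
    """Fraction of bases expected to be missed by all reads (idealized random model)."""
    return math.exp(-c)


GENOME_SIZE = 3_000_000_000  # about 3 billion bases, as stated in the text
READ_LENGTH = 500            # assumed: "a few hundred" bases per experiment
TARGET = 5                   # assumed: "several times" redundancy

n = reads_needed(TARGET, READ_LENGTH, GENOME_SIZE)
print(n, coverage(n, READ_LENGTH, GENOME_SIZE))
print(expected_uncovered_fraction(TARGET) * GENOME_SIZE)
```

Even under these favorable assumptions, a small fraction of a very large genome remains unsequenced, which is consistent with the need to return to the lab to close gaps between contigs.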
The
shotgun approach can be more effective if
it is informed by other knowledge
of the genome that
is already
available. The
human genome resides on 23 pairs of chromosomes. The locations of
many genes on these chromosomes
are already known, so this allows
some sequences to be placed on the map. Then the genome
can be pieced together from
these fixed segments.
This is a bit like
solving a jigsaw puzzle using the picture on the cover
of the box as a guide.