Introduction (2024)

All living organisms are composed of cells, each no wider than a human hair. Each of our cells contains the same complement of DNA constituting the human genome (Figure 1-1). The DNA sequence of every person's genome is the blueprint for his or her development from a single cell to a complex, integrated organism that is composed of more than 10¹³ (10 million million) cells. Encoded in the DNA sequence are fundamental determinants of those mental capacities (learning, language, memory) essential to human culture. Encoded there as well are the mutations and variations that cause or increase susceptibility to many diseases responsible for much human suffering. Unprecedented advances in molecular and cellular biology, in biochemistry, in genetics, and in structural biology, all occurring at an accelerating rate over the past decade, define this as a unique and opportune moment in our history: For the first time we can envision obtaining easy access to the complete sequence of the 3 billion nucleotides in human DNA and deciphering much of the information contained therein. Converging developments in recombinant DNA technology and genetics make obtaining a complete ordered DNA clone collection indexed to the human genetic linkage map a realistic immediate goal. Even determination of the complete nucleotide sequence is attainable, although ambitious. The DNA in the human genome is remarkably stable, as it must be to provide a reliable blueprint for building a new organism. For this reason, obtaining complete genetic linkage and physical maps and deciphering the sequence will provide a permanent base of knowledge concerning all human beings, a base whose utility for all activities of biology and medicine will increase with future analysis, research, and experimentation.

Even the complete sequence of DNA in the human genome will not by itself explain human biology. It will, however, serve as a great resource, an essential data bank, facilitating future research in mammalian biology and medicine. Humans, like all living organisms, are composed largely of proteins. Human proteins are roughly estimated to be of 100,000 different kinds. In general, each gene codes for the production of a single protein, and a gene and its protein can be related to each other by means of the genetic code. Therefore, scientists will be able to turn to the DNA sequence of the human genome and obtain detailed information on both the structure and function of any gene or protein of interest. In addition, all genes and proteins will be classified into large family groups that provide valuable clues to their functions. In this way, many previously unknown human genes and proteins will become available for biochemical, physiological, and medical studies. The knowledge gained will have a major impact on health care and disease prevention; it will also raise challenging issues regarding rational, wise, and ethical uses of science and technology.

Genomes, Genes, and Genomic Maps

To understand the importance of knowledge about the human genome, one must first understand the genome's functions.

Genomes Consist of DNA Molecules That Contain Many Genes

The genome of all living organisms consists of DNA, a very long two-stranded chemical polymer (Figure 2-1). Each DNA strand is composed of four different units, called nucleotides, that are linked end to end to form a long chain (Figure 2-2). These four nucleotides are symbolized as A, G, C, and T, which stand for the four bases (adenine, guanine, cytosine, and thymine) that are parts of the nucleotides. One DNA molecule, which together with some associated proteins constitutes a chromosome, differs from another in its length and in the order of its nucleotides. Each DNA molecule contains many genes, which are its functional units. These genes are arranged in a defined order along the DNA molecule. Most genes code for protein molecules (enzymes or structural elements) that determine the characteristics of a cell. In bacteria, the coding sequences of a gene are continuous strings of nucleotides, but in mammals the coding segments in a gene (called exons) are generally separated from one another by noncoding segments (called introns) (Figure 2-3). Often each exon will encode a different structural region (or domain) of a larger protein molecule. Many exons have been found to be part of a family of related coding sequences that are used in the construction of many different genes (Doolittle et al., 1986). Because of the many introns in mammalian genes, a single gene is often more than 10,000 nucleotides long, and genes that span 100,000 nucleotides are not uncommon (Table 2-1).

Figure 2-1

Two ways of representing the DNA double helix. Diagrams are of a very short section of the DNA molecule in each chromosome. The human genome contains about 200 million times the amount of DNA shown. The two strands of the DNA double helix run in opposite ...

Figure 2-2

The nucleotides that form a DNA molecule. (A) Specific hydrogen bond interactions between G and C and between A and T bases generate complementary nucleotide pairs (that is, G always bonds with C and A always bonds with T). A haploid human genome contains ...

Figure 2-3

How genes are expressed in human cells. Each gene can specify the synthesis of a particular protein. Whether a gene is off or on depends on signals that act on the regulatory region of the gene. When the gene is on, the entire gene is transcribed into ...

TABLE 2-1

The Size of Some Human Genes.

For the information in the coding sequences of a gene to be expressed, the DNA of a gene must first be transcribed into an RNA molecule (Figure 2-3). Before the RNA strand leaves the cell's nucleus, the intron sequences are cut out of this RNA strand by a process called RNA splicing, thereby bringing the exon sequences into contiguity. Then the RNA can be translated into a protein molecule according to the genetic code (every group of three nucleotides codes for one amino acid). Nucleotide sequences adjacent to the coding sequences in each gene encode regulatory signals for activating or inactivating transcription of the gene. Gene activity is a dynamic process; at any given time and in any given cell type, only a subset of genes is active. These active genes determine the course of embryological development and the characteristics of cells and organisms.
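The sequence of steps described above (transcription, RNA splicing, and translation by the genetic code) can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the two-exon gene, its intron, and the exon coordinates are hypothetical, and the codon table is deliberately limited to the six codons the example needs.

```python
# Partial genetic code: only the codons used in this toy example.
CODON_TABLE = {
    "AUG": "Met", "GUG": "Val", "CAC": "His",
    "CUG": "Leu", "ACU": "Thr", "UAA": "Stop",
}

def transcribe(dna_coding_strand):
    """Return the RNA copy of the coding strand (U replaces T)."""
    return dna_coding_strand.replace("T", "U")

def splice(pre_mrna, exons):
    """Cut out introns by joining the exon segments.

    `exons` is a list of (start, end) half-open index pairs.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

def translate(mrna):
    """Read the RNA three nucleotides at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

# Hypothetical gene: exon 1, a short intron, exon 2.
exon1, intron, exon2 = "ATGGTGCAC", "GTAAGTCCAG", "CTGACTTAA"
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

protein = translate(splice(transcribe(gene), exons))
print("-".join(protein))  # Met-Val-His-Leu-Thr
```

Real genes are far longer and carry regulatory sequences that this sketch ignores; it is meant only to show how removing the intron brings the exons into contiguity before the codons are read.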

The Human Genome Is Composed of 24 Different Types of DNA Molecules

Human DNA is packaged into physically separate units called chromosomes. Humans are diploid organisms, containing two sets of genetic information, one set inherited from the mother and one from the father. Thus, each somatic cell has 22 pairs of chromosomes called autosomes (one member of each pair from each parent) and two sex chromosomes (an X and a Y chromosome in males and two X chromosomes in females). Each chromosome contains a single very long, linear DNA molecule. In the smallest human chromosomes this DNA molecule is composed of about 50 million nucleotide pairs; the largest chromosomes contain some 250 million nucleotide pairs.

The diploid human genome is thus composed of 46 DNA molecules of 24 distinct types. Because human chromosomes exist in pairs that are almost identical, only 3 billion nucleotide pairs (the haploid genome) need to be sequenced to gain complete information concerning a representative human genome. The human genome is thus said to contain 3 billion nucleotide pairs, even though most human cells contain 6 billion nucleotide pairs.
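The counts in the two preceding paragraphs follow from simple arithmetic, which the short Python sketch below spells out. The variable names are illustrative, and the 3 billion figure is the round estimate quoted in the text.

```python
# Chromosome counts as stated in the text.
AUTOSOME_TYPES = 22
SEX_CHROMOSOME_TYPES = 2  # X and Y

# Distinct kinds of DNA molecule in the human genome.
distinct_types = AUTOSOME_TYPES + SEX_CHROMOSOME_TYPES

# DNA molecules in one diploid somatic cell:
# 22 pairs of autosomes plus two sex chromosomes.
molecules_per_cell = 2 * AUTOSOME_TYPES + 2

# Nucleotide pairs in the haploid genome versus a diploid cell.
HAPLOID_PAIRS = 3_000_000_000
diploid_pairs = 2 * HAPLOID_PAIRS

print(distinct_types, molecules_per_cell, diploid_pairs)
# 24 46 6000000000
```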

DNA is a double helix: Each nucleotide on a strand of DNA has a complementary nucleotide on the other strand. The information on one DNA strand is therefore redundant to that on the other (that is, because of complementary base pairing (Figure 2-2A), one can in principle determine the nucleotide sequence of one strand from the other). However, it is currently necessary to determine the sequences of the nucleotides on the two DNA strands separately to achieve the desired accuracy of any DNA sequence, with the sequence of each strand being used as a check on the other. For this reason, a total of 6 billion nucleotides must actually be sequenced to order the 3 billion nucleotide pairs in the haploid human genome.
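The redundancy between the two strands can be sketched in a few lines of Python. This is a minimal illustration, not a description of any actual sequencing pipeline: the function names and the short example sequences are hypothetical. Because the two strands run in opposite directions, the partner strand is obtained by complementing each base and reversing the order.

```python
# Complementary base pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Derive the partner strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def strands_agree(read_a, read_b):
    """Use one independently sequenced strand as a check on the other."""
    return read_a == reverse_complement(read_b)

top = "ATGGTGCAC"                      # hypothetical read of one strand
bottom = reverse_complement(top)       # what the other strand should be
print(bottom)                          # GTGCACCAT

print(strands_agree(top, "GTGCACCAT"))  # True
print(strands_agree(top, "GTGCACCAA"))  # False: a one-base disagreement

# Sequencing both strands doubles the number of nucleotides read.
nucleotides_to_sequence = 2 * 3_000_000_000
```

A disagreement between the two reads does not say which strand is wrong; it only flags the position for re-examination.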

The average size of a protein molecule allows one to predict that there are approximately 1,000 nucleotide pairs of coding sequence per gene. Since humans are thought to have about 100,000 genes, a total of about 100 million nucleotide pairs of coding DNA must be present in the human genome. That this is only about 3 percent of the total size of the genome leads one to conclude that less than 5 percent of the human genome codes for proteins. The vast bulk of human DNA lies between genes and in the introns. Some of the noncoding DNA plays a role in regulating gene activity, while other portions are believed to be important for organizing the DNA into chromosomes and for chromosome replication (Alberts et al., 1983; Lewin, 1987). The function of most noncoding regions of the human genome, however, is unknown; much of this DNA may have no function at all.
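The estimate in this paragraph can be reproduced directly. The sketch below uses only the round figures quoted in the text (about 1,000 coding nucleotide pairs per gene, about 100,000 genes, and 3 billion nucleotide pairs in the haploid genome); the variable names are illustrative.

```python
# Round estimates quoted in the text.
CODING_PAIRS_PER_GENE = 1_000
ESTIMATED_GENES = 100_000
HAPLOID_GENOME_PAIRS = 3_000_000_000

# Total coding DNA and its share of the haploid genome.
coding_pairs = CODING_PAIRS_PER_GENE * ESTIMATED_GENES
coding_fraction = coding_pairs / HAPLOID_GENOME_PAIRS

print(f"{coding_pairs:,} coding nucleotide pairs")   # 100,000,000 coding nucleotide pairs
print(f"{coding_fraction:.1%} of the genome")        # 3.3% of the genome
```

Because both inputs are rough estimates, the result is best read as an order of magnitude, which is why the text states the conclusion as "less than 5 percent" rather than as an exact figure.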

The Human Genome Can Be Mapped in Many Different Ways

It would be enormously useful to determine the order and spacing of all the genes that make up the genome. Such information is said to constitute a gene or genome map. Since there are 24 different DNA molecules in the human genome, a complete human gene map consists of 24 maps, each in the linear form of the DNA molecule itself.

Figure 2-4

The DNA sequence of the human gene for beta-globin (a protein of 146 amino acids that forms part of the hemoglobin molecule that carries oxygen in the blood). The sequence of only one of the two DNA strands is given since the other one has a precisely ...

Medical Implications of Detailed Human Genome Maps

Advances in molecular genetics made over the past two decades are already having a major impact on medical research and clinical care. The ability to clone and analyze individual genes and to deduce the amino acid sequences of encoded proteins has greatly increased our understanding of genetic disorders, the immune system, endocrine abnormalities, coronary artery disease, infectious diseases, and cancer. A few proteins produced on a commercial scale by recombinant DNA methods are available for therapeutic use or in clinical trials, and many more are in earlier developmental stages. Recent progress in determining the genetic basis for such neurological and behavioral disorders as Huntington's disease (Gusella et al., 1983), Alzheimer's disease (St George-Hyslop et al., 1987), and manic-depressive illness (Egeland et al., 1987) promises new insights into these common and serious conditions. Higher resolution maps of the human genome will accelerate progress in understanding disease pathogenesis and in developing new approaches to diagnosis, treatment, and prevention in many areas of medicine. In Chapter 3 the potential medical impact of a detailed human genomic map is discussed further.

Implications for Basic Biology

The generation of a physical map of the human genome and the determination of its nucleotide sequence will provide an important research tool for basic biology. This is especially true because we expect a human genome project to support mapping and sequencing investigations that are carried out concurrently in other extensively studied organisms, including the Escherichia coli bacterium, the lower eukaryote Saccharomyces cerevisiae (a yeast), the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the mouse Mus musculus, and possibly also a plant such as maize or Arabidopsis. Analyzing these genomes will approximately double the total amount of DNA to be mapped and sequenced. But the additional effort will make it possible to take genes that have been identified in humans and test their function in other organisms that are experimentally accessible and for which powerful genetic techniques exist. It will thereby be possible to firmly establish the exact role of these genes in important biological processes. Conversely, proteins that are discovered to be of special interest in any of these other organisms can be immediately identified by amino acid homology in the human, thereby enabling investigators to conduct well-focused studies of the function of the corresponding human protein and its gene. The extensive DNA sequence and functional comparisons that are generated will also represent an invaluable resource for evolutionary biologists. These and other implications for basic biology are discussed in greater detail in Chapter 3.

Expected Technological Developments Generated by a Human Genome Project and Their Impact on Biological Research

The process of mapping and sequencing the human genome is likely to have important spin-offs in the form of new technologies with broad applicability in both basic and applied biological research. For example, efficient methods for mapping complex genomes are still being developed, and a human genome project would accelerate this process. Such methods include improvements in the production, separation, and cloning of large pieces of DNA and methods for constructing an ordered set of genomic clones (see Chapter 4). This methodology will be directly applicable to the development of a physical map of the genomes of many experimentally and commercially important animals and plants.

Similarly, an effort to sequence the human genome will require much more efficient nucleotide sequencing technology than now exists (see Chapter 5). These improvements will greatly reduce the time spent on DNA sequencing in individual research laboratories. In the future, the development of institution-wide or regional sequencing facilities equipped with highly automated instruments could serve a large number of scientists, freeing them to concentrate on more advanced stages of their research problems.

Finally, the generation of a detailed map of the human genome will require new computer-based methods for collecting, storing, and analyzing the large amount of information expected (see Chapter 6). These methods can easily be adapted to handling analogous data from other organisms. Scientists will thus have immediately available through computer networks an enormous store of biological information supported by methods for using it, such as clone collections; these resources are likely to have a major beneficial impact on the way that individual scientists do research.

Impact on the Research by Small Groups

One of the key features and attractions of biomedical research today is that it is based primarily on the efforts of small, independent groups of scientists. The major advances of the past decades can be traced to the creativity of these groups, or even to single individuals, often near the beginnings of their careers. Mapping and sequencing the human genome, on the other hand, is likely to require organizational arrangements on a considerably larger scale than is customary in other biological research. Some see this as a threat to the independence of individual investigators. In the committee's view, however, a mapping and sequencing project should have as its primary goal an increase in the power and range of the research potential of small groups of individuals.

The complete nucleotide sequences of the genomes of the several organisms of major experimental interest will provide a critical reference data base for interpreting and studying the many human genes that will be discovered. To take just one example, an individual cancer researcher who discovers a new oncogene in a human tumor will have immediate access by computer search to all the proteins that are likely to have a related function in lower organisms. Since these genes can be experimentally manipulated in ways that are impossible in humans, the function of the corresponding gene can be determined much more readily in a fruit fly, a nematode worm, or a yeast cell. The results are certain to provide important insights into human cancer that could not be obtained by direct research on humans. Conversely, researchers interested primarily in yeast cells will benefit from the information about yeast genes that can be derived from studies on their homologues that are initially conducted with another organism.

Even among researchers whose efforts are confined exclusively to humans, small group efforts will be encouraged. The human genome map and an ordered set of human DNA clones will be available as a resource for the use of all investigators, enabling them to concentrate on the most interesting parts of their research. In addition, new areas of research are likely to emerge as a result of this resource, particularly in relation to human health. In short, the committee believes that the mapping and sequencing project will make an important contribution to primary research conducted by small groups of independent investigators, extending their reach into currently inaccessible problems.

A project to map and sequence the human genome has many different components. In the following sections of this report, we examine implications for medicine and science (Chapter 3), mapping (Chapter 4), sequencing (Chapter 5), data handling and analysis (Chapter 6), implementation and management strategies (Chapter 7), and commercial, legal, and ethical implications (Chapter 8).

References

Alberts, B., D. Bray, J. Lewis, M. Raff, K. Roberts, and J. D. Watson. 1983. Molecular Biology of the Cell. Garland, New York. 1146 pp.
Alberts, B., D. Bray, J. Lewis, M. Raff, K. Roberts, and J. D. Watson. 1989. Molecular Biology of the Cell, 2nd ed. Garland, New York. In press.
Doolittle, R. F., D. F. Feng, M. S. Johnson, and M. A. McClure. 1986. Relationships of human protein sequences to those of other organisms. Cold Spring Harbor Symp. Quant. Biol. 51:447–455. [PubMed: 3472734]
Egeland, J. A., D. S. Gerhard, D. L. Pauls, J. N. Sussex, K. K. Kidd, C. Allen, A. M. Hostetter, and D. E. Housman. 1987. Bipolar affective disorders linked to DNA markers on chromosome 11. Nature 325:783–787. [PubMed: 2881209]
Gusella, J. F., N. S. Wexler, P. M. Conneally, S. L. Naylor, M. A. Anderson, R. E. Tanzi, P. C. Watkins, K. Ottina, M. R. Wallace, A. Y. Sakaguchi, A. B. Young, I. Shoulson, E. Bonilla, and J. B. Martin. 1983. A polymorphic DNA marker genetically linked to Huntington's disease. Nature 306:234–238. [PubMed: 6316146]
Lewin, B. 1987. Genes, 3rd ed. John Wiley & Sons, New York. 737 pp.
St George-Hyslop, P. H., R. E. Tanzi, R. J. Polinsky, J. L. Haines, L. Nee, P. C. Watkins, R. H. Myers, R. G. Feldman, D. Pollen, D. Drachman, J. Growdon, A. Bruni, J.-F. Foncin, D. Salmon, P. Frommelt, L. Amaducci, S. Sorbi, S. Piacentini, G. D. Stewart, W. J. Hobbs, P. M. Conneally, and J. F. Gusella. 1987. The genetic defect causing familial Alzheimer's disease maps on chromosome 21. Science 235:885–890. [PubMed: 2880399]
Watson, J. D., J. Tooze, and D. T. Kurtz. 1983. Recombinant DNA: A Short Course. W. H. Freeman, San Francisco.