It is also called ‘classical genomics’. The first step in understanding genome structure is genome mapping.
Genome Mapping
Genome maps describe the locations of genes on a chromosome. Genome maps are of three types, namely genetic linkage maps or genetic maps, physical maps and cytologic maps.
Genetic linkage maps or genetic maps: These identify the relative positions of genetic markers on a chromosome. Genetic markers are regions of chromosomes whose inheritance pattern can be followed. For many eukaryotes, genetic markers represent morphologic phenotypes. Genetic maps also reveal how frequently the markers are inherited together. The closer two genetic markers are, the more likely it is that they are inherited together, because they are less likely to be separated by a genetic crossing-over event. The distance between two genetic markers is measured in centiMorgans (cM). The centiMorgan or map unit (m.u.) is a unit of recombination frequency for measuring genetic linkage: 1 cM corresponds to a recombination frequency of 1% between two markers (in humans, 1 cM is approximately 1,000 Kb on average).
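The relation between recombination frequency and map distance can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple test cross in which recombinant offspring can be counted directly; the function name and the offspring counts are hypothetical. The direct proportionality holds only for closely linked markers, because double crossovers go undetected over longer distances.

```python
def map_distance_cm(recombinants: int, total: int) -> float:
    """Map distance in cM, taken as the percentage of recombinant offspring."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    # 1 cM corresponds to a recombination frequency of 1%.
    return 100.0 * recombinants / total


# Hypothetical cross: 120 recombinants among 1,000 offspring.
distance = map_distance_cm(120, 1000)
print(f"{distance:.1f} cM")
```

Two markers that recombine in 12% of offspring are thus placed 12 cM apart on the genetic map.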
Physical maps: These are maps of identifiable regions on the genomic DNA. The distance between genetic markers is measured directly in kilobases (Kb) or megabases (Mb). As the distance in this case is expressed in physical units, it is more accurate and reliable than the cM used in genetic maps. Physical maps are constructed using chromosome walking techniques. In ‘chromosome walking’, a number of radio-labelled probes are hybridized to a library of DNA clones. By identifying the overlapping clones detected by the probes, a relative order of the cloned fragments can be established.
Cytologic maps: These refer to the banding patterns of stained chromosomes, which can be observed directly under a microscope. The observed light and dark bands are the markers in this case, i.e., a genetic marker can be associated with a specific chromosomal region or band. The banding patterns are, however, not constant; they vary with the degree of chromosomal contraction. The distance between two bands is expressed in units called ‘Dustin units’.
Genome Sequencing
DNA sequencing is carried out using the Sanger method (refer to the section ‘DNA Isolation and Sequencing’ of Chapter 9). The fluorescent traces of the DNA sequences are read by a computer program that assigns bases for each peak in a chromatogram. This process is called ‘base calling’. There are two approaches for whole genome sequencing, namely the ‘shotgun approach’ and the ‘hierarchical approach’.
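The idea behind base calling can be illustrated with a toy Python sketch. It assumes that each peak in the chromatogram has already been reduced to four fluorescence intensities, one per base; the data, the function name and the `min_ratio` threshold are all hypothetical. Real base-calling programs work on the raw trace and also assign a quality score to each base, which this sketch does not attempt.

```python
def call_bases(peaks, min_ratio=2.0):
    """Call one base per peak; emit 'N' when the two strongest signals are too close."""
    calls = []
    for intensities in peaks:
        # Rank the four channels from strongest to weakest signal.
        ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
        (best_base, best), (_, second) = ranked[0], ranked[1]
        if second > 0 and best / second < min_ratio:
            calls.append("N")  # ambiguous peak
        else:
            calls.append(best_base)
    return "".join(calls)


# Hypothetical intensities for four consecutive peaks.
toy_peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 20},
    {"A": 35, "C": 25, "G": 870, "T": 50},
    {"A": 400, "C": 380, "G": 20, "T": 15},  # A and C are nearly equal
    {"A": 10, "C": 20, "G": 30, "T": 950},
]
print(call_bases(toy_peaks))  # AGNT
```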
Shotgun approach
This method randomly sequences clones from both ends of cloned DNA. The various steps involved in the process can be discussed as follows (Figure 11.1).
- The genomic DNA of the organism to be sequenced is isolated.
- It is then randomly sheared and restriction digested to yield DNA fragments of about 2 Kb and 10 Kb.
- The smaller (2 Kb) and larger (10 Kb) fragments are then ligated to plasmid vectors and transformed into bacterial cells and cultured. These two collections of plasmids containing the 2-Kb and 10-Kb DNA fragments are known as plasmid libraries.
- The plasmid libraries are then sequenced. Each cloned DNA fragment is sequenced from both ends. Every sequencing reaction generates about 500 bp of sequence data. Thus, millions of sequence reads are generated.
- Overlapping sequence data are identified and the regions of contiguous sequences are assembled.
- Computer algorithms are used to assemble the millions of sequenced fragments into a continuous stretch that represents the complete genome.
- Gaps are identified, and the coding regions and regulatory regions are predicted.
Figure 11.1 Whole genome shotgun sequencing method

Hierarchical shotgun approach
This is also known as clone-by-clone or BAC-to-BAC sequencing. This method is slow, but the results are more accurate. The various steps involved in the process are as follows (Figure 11.2).
- DNA is cut into pieces of about 150 Kb and inserted into bacterial artificial chromosome (BAC) vectors, which are transformed into E. coli, where they are replicated and stored. This collection of BAC clones is known as a BAC library.
- The BAC inserts are isolated and mapped to determine the order of each cloned 150-Kb fragment. This is referred to as the Golden Tiling Path.
- Each BAC fragment in the Golden Path is fragmented randomly into smaller pieces (1.5 Kb) and each piece is cloned into an M13 vector. An M13 library is thus generated.
- The M13 libraries are sequenced. These sequences are aligned so that identical regions overlap. The contiguous pieces are then assembled into a finished sequence once each strand has been sequenced about four times, producing 8X coverage of high-quality data.
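The meaning of ‘8X coverage’ can be made concrete with a short calculation. Average coverage is the total number of bases sequenced divided by the length of the target. The Python sketch below assumes a BAC insert of roughly 150 Kb and reads of about 500 bp, as quoted for the shotgun approach; the function names are hypothetical, and the figures are illustrative rather than taken from any particular project.

```python
import math


def average_coverage(num_reads: int, read_length_bp: int, target_length_bp: int) -> float:
    """Average coverage = total sequenced bases / target length."""
    return num_reads * read_length_bp / target_length_bp


def reads_needed(target_length_bp: int, read_length_bp: int, coverage: float) -> int:
    """Smallest number of reads whose combined length reaches the requested coverage."""
    return math.ceil(coverage * target_length_bp / read_length_bp)


# Assumed values: one BAC insert of about 150 Kb, reads of about 500 bp, 8X coverage.
n = reads_needed(150_000, 500, 8)
print(n)                                  # 2400
print(average_coverage(n, 500, 150_000))  # 8.0
```

Under these assumptions, each BAC clone needs about 2,400 reads. This is an average; because the pieces are generated at random, some positions are covered more often and others less.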
Figure 11.2 Hierarchical shotgun sequencing

Genome Sequence Assembly
The genome sequencing reaction generates short sequences of about 500 bp. These short fragments are joined to form larger fragments after removing the overlaps. These longer merged sequences are called contigs. These are usually about 5,000–10,000 bp long. Overlapping contigs are then merged to form ‘scaffolds’ (30,000–50,000 bp). These are also called ‘super contigs’. Overlapping scaffolds are then connected to create the map of the genome.
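The merging of overlapping reads into a contig can be sketched with a simple greedy procedure: repeatedly find the pair of sequences with the longest suffix-to-prefix overlap and join them. The Python code below is a toy illustration only, assuming error-free reads from one strand, no repeats and no read contained entirely in another; the reads and function names are hypothetical, and this is not a description of how any particular assembly program works.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads drawn from the toy sequence ATGGCGTACGTTAGC, in arbitrary order.
reads = ["TACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Real genomes contain repeats and reads contain errors, which is why assembly at full scale is far harder than this sketch suggests.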
Assembling all shotgun fragments into a full genome is a computationally very challenging step. There are a variety of programs available for processing the raw sequence data. Examples:
- Phrap (www.phrap.org/) is a UNIX program for sequence assembly.
- VecScreen (www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html) is a web-based program that helps to detect contaminating bacterial vector sequences.
- TIGR assembler (www.tigr.org/) is a UNIX program for assembly of large shotgun sequence fragments.
- ARACHNE (www.genome.wi.mit.edu/wga/) is a free UNIX program for the assembly of whole genome shotgun reads.
- EULER (http://nbcr.sdsc.edu/euler) is an assembly algorithm.
Genome Annotation
Before the assembled sequence is deposited into a database, it has to be analysed for useful biological features. The genome annotation provides comments for such features. Annotation in simple terms means the process of identifying the coding regions of genes, their respective locations in a genome and determining the functions of these genes after the genome has been sequenced. Annotation is of two types namely:
- Structural annotation identifies genes on the genome and is also called gene finding. This can be done by computer analysis using automatic annotation tools, for example, Open Reading Frame (ORF) Finder (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and Glimmer (Gene Locator and Interpolated Markov Model ER), a system for finding genes in microbial DNA (http://www.cbcb.umd.edu/software/glimmer/).
- Functional annotation is the process of attaching biological information, such as the function of the gene products and the regulation of their expression, to the identified sequences.
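The simplest form of gene finding, scanning for open reading frames, can be sketched as follows. This minimal Python example looks for stretches that begin with ATG and end at the first in-frame stop codon in the three forward reading frames. The function name, the `min_codons` threshold and the test sequence are hypothetical; a fuller tool would also scan the reverse complement, and statistical gene finders such as Glimmer rely on Markov models rather than on this rule alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in the third reading frame.
dna = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])  # 2 ATGAAATTTTAG
```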
Gene annotation is a combination of theoretical prediction and experimental verification. Gene structures are first predicted by programmes such as GenScan or FgenesH. These predictions are then verified by BLAST (Basic Local Alignment Search Tool) searches against a sequence database. The predicted genes are further compared with experimentally determined sequences using pairwise alignment programmes such as GeneWise and Spidey. Once all predictions are checked and the ORFs are determined, the functional assignment of the encoded proteins is carried out by homology searching using BLAST searches against a protein database (a database is an organized collection of data for one or more purposes, usually in digital form). Functional descriptions are then added by searching protein motif and domain databases, for example, Pfam and InterPro.
GenBank
GenBank is a DNA sequence database maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine at the National Institutes of Health in Bethesda, Maryland. It is an annotated collection of all publicly available DNA sequences.
DNA sequences can be submitted to a database prior to publication in journals, so that an accession number may appear in the paper. The various options for submitting data to GenBank are:
- BankIt, a WWW-based submission tool for convenient and quick submission of sequence data.
- Sequin, NCBI’s stand-alone submission software.
- tbl2asn, a command-line program that automates the creation of sequence records for submission to GenBank. It is used primarily for the submission of complete genomes and large batches of sequences.
- Barcode submission tool, a WWW-based tool for the submission of GenBank sequences and trace data for barcode of life projects.
There are several ways to search data from GenBank:
- Search GenBank for sequence identifiers and annotations with Entrez Nucleotide, which is divided into three divisions, namely CoreNucleotide (the main collection), dbEST (expressed sequence tags) and dbGSS (genome survey sequences).
- Search and align GenBank sequences to a query sequence using BLAST. BLAST searches CoreNucleotide, dbEST and dbGSS independently.
The GenBank database is designed to provide access within the scientific community to the most updated and comprehensive DNA sequence information.
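Entrez searches can also be issued from a program through NCBI’s E-utilities web interface. The Python sketch below only builds the URL of an ESearch query and does not contact the server. The database name `nuccore`, the helper function and the example search term are choices made for this illustration and are not taken from the text above.

```python
from urllib.parse import urlencode

# Base address of the ESearch service in NCBI's E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "nuccore", retmax: int = 20) -> str:
    """Build an ESearch URL; db, term and retmax are standard ESearch parameters."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: nucleotide records for a named gene in a named organism.
url = build_esearch_url("BRCA1[Gene Name] AND Homo sapiens[Organism]", retmax=5)
print(url)
```

Opening the resulting URL returns the identifiers of the matching records, which can then be retrieved in a second step.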
Gene Ontology
The description of gene functions uses natural language, which is often imprecise. Scientists working on different organisms tend to apply different terms to the same types of genes or proteins. Therefore, protein functional descriptions must be standardized. This necessitated the development of the ‘Gene Ontology project’, which uses a standard vocabulary to describe molecular functions, biological processes and cellular components. This standardization provides consistency in describing protein functions. The standard vocabulary is organized such that a protein function is linked to the cellular function through a hierarchy of descriptions with increasing specificity. The top of the hierarchy gives a picture of the broad functional class, while the lower levels specify the particular functional role.
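The hierarchy of descriptions can be pictured as a set of links from each specific term to a more general one. The Python sketch below uses a small, simplified chain of molecular function terms chosen for illustration; the real ontology contains intermediate terms that are omitted here, and a term there may have more than one parent. The function name is hypothetical.

```python
# Simplified child -> parent links; intermediate terms of the real ontology are omitted.
IS_A = {
    "protein kinase activity": "kinase activity",
    "kinase activity": "transferase activity",
    "transferase activity": "catalytic activity",
    "catalytic activity": "molecular_function",
}


def path_to_root(term: str):
    """Follow parent links from a specific term up to the most general one."""
    path = [term]
    while path[-1] in IS_A:
        path.append(IS_A[path[-1]])
    return path


# From the specific functional role to the broad functional class.
print(" -> ".join(path_to_root("protein kinase activity")))
```

A protein annotated with the most specific term is thereby also described by every more general term on the path, which lets proteins from different organisms be compared at any level of detail.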