18.7 Annotating the genome

The process of ‘annotating the genome’ starts once the genome sequence has been established and its assembly completed. Annotation is the association of the component sequences of the genome with specific functions and, if the S. cerevisiae example is a guide, this process can continue for a long time. Annotation requires sophisticated computation; that is, it is an in silico analysis. Gene identification is probably the most difficult problem, and relies on programs that align sequences together with dedicated ‘gene finder’ programs. Gene finding is easier with bacterial genomes, in which computer programs can find 97-99% of all genes automatically. In eukaryotes both gene finding and gene function assignment remain difficult tasks.

The problem can be likened to identifying the beginning and end of every word in a book when the text has lost all punctuation and you have no clear idea of the language and vocabulary used in the book.

For example, can you identify the structure of the following verse sentence (shown in the form of two overlapping 'contigs')?



It is English, but maybe not as you know it because it’s a quotation from Chaucer’s Canterbury Tales (Prologue, line 733) written in about the year 1390 from which all spaces and punctuation have been removed. Not easy, is it?

Sense is made of genome sequences by annotation in silico to:

  • identify ORFs by their start and stop codons, allowing for the minimum length of functional proteins (Fig. 17);
  • detect the presence of recognisable functional motifs in segments of the deduced gene or protein;
  • compare against known protein or DNA sequences using homologous genes from the same or other genomes (Fig. 18).
Fig. 17. Searching for ORFs in DNA sequences, every one of which has 6 reading frames.
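
The ORF search summarised in Fig. 17 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a real gene finder: it assumes the standard start codon ATG and the three standard stop codons, ignores introns entirely, and uses a default minimum length of 100 codons as an arbitrary cut-off. The function names and the short demonstration sequence are invented for the example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a real gene finder).
# Assumptions: ATG is the only start codon, the three standard stop codons
# apply, and introns are ignored.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames and return a list of ORFs.

    Each ORF is (strand, frame, start, end, codons). start and end are
    0-based offsets on the strand being read; end is exclusive and
    includes the stop codon. codons counts from the start codon up to,
    but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    # Three frames on the given strand, three on its reverse complement.
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ORF in this frame
    return orfs


# Invented toy sequence: one short ORF in the third frame of the + strand.
demo = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
print(find_orfs(demo, min_codons=3))
```

With the default cut-off of 100 codons the toy sequence yields no ORFs at all, which shows how a minimum-length filter suppresses the many short chance ORFs that occur in any sequence.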


Fig. 18. Sequence annotation with homologous genes from the same or other genomes.
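
Comparison with known sequences ultimately rests on a measure of similarity between aligned sequences. As a toy illustration only, the sketch below computes percent identity between two sequences that are assumed to be already aligned, with '-' marking gaps. Real annotation pipelines use dedicated alignment and database search programs; the function name and the two short peptide strings here are invented for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over the ungapped columns of a pairwise alignment.

    Both inputs must be the same length; '-' marks a gap. Columns with a
    gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap column: not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Invented example: five ungapped columns, four of them identical.
identity = percent_identity("MKV-LA", "MKIALA")
print(f"{identity:.1f}% identity")
```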

Further annotation is done experimentally by:

  • classical gene cloning and functional analysis;
  • analysis of cDNA clones or EST sequences (an expressed sequence tag, or EST, is a short subsequence of a transcribed cDNA and is therefore a portion of an expressed gene), and gene expression data.

No single method of genome annotation is comprehensive; all have their limitations, so they must be used in concert.

Many of the genes identified in sequencing projects will be ‘new’ in the sense that when the sequence is identified the gene function is unknown. Establishing the cellular role of such new ORFs requires a different set of bioinformatics tools that integrate sequence information with the accumulated knowledge of metabolism so that conjectures can be made about likely functions. Those predictions are then tested experimentally by using heterologous expression, gene knockouts, and characterisation of purified proteins. Parallel analysis of phylogenetically diverse genomes can also help in understanding the physiology of the organism whose genome is being sequenced.

When the sequence of the whole genome has been established and annotated, the genome can be compared with others in the databases. Prokaryotic genomes are generally much smaller than those of eukaryotes. The Escherichia coli genome, for example, is composed of 4.64 Mb (megabase pairs) of DNA; that of Streptomyces coelicolor is 8 Mb, while the yeast genome, at 12.1 Mb, is about three times the size of the E. coli genome, and the human genome is 3 300 Mb (see the section entitled Nuclear genetics in Chapter 5). The physical organisation is also different, because in prokaryotes the genome is typically contained in a single, circular DNA molecule. Eukaryotic nuclear genomes are divided into linear DNA molecules, each contained in a different chromosome. In addition, all eukaryotes have mitochondria and these possess small, usually circular, mitochondrial genomes. Photosynthetic eukaryotes (plants, algae, some protists) have a third small genome in their chloroplasts.

The size range of the genome corresponds to some extent with the degree of complexity of the organism, but the fit is not exact by any means because this correlation depends on the structure and organisation of the genes. For example, the Escherichia coli genome has 4 397 genes and the yeast genome comprises about 5 800 genes, so you might feel confident about believing that yeast has more genes because it is a eukaryote, and you can understand why it doesn’t have many more, because it’s a fairly simple eukaryote. However, the genome of the streptomycete bacterium Streptomyces coelicolor contains more than 7 000 genes. This organism is a prokaryote, but it has nearly 30% more genes than the model eukaryote, yeast. Admittedly, Streptomyces is a fairly complex bacterium and highly advanced in an evolutionary sense; but it is a bacterium. The arithmetic difference lies in the fact that the average yeast gene is 2 200 base pairs long, while the average S. coelicolor gene is only 1 200 base pairs long. But we can’t explain why such a difference in gene size exists.
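
The comparison above can be checked with a little arithmetic. The sketch below uses only the genome sizes and gene counts quoted in this section (taking 7 000 as a lower bound for S. coelicolor) and divides one by the other to give the average stretch of genome per gene. This spacing is not the same thing as average gene length, because it includes intergenic DNA, but it shows the same pattern. The variable and function names are illustrative.

```python
# Genome size (bp) and approximate gene count, as quoted in the text.
# The S. coelicolor gene count is a lower bound ("more than 7 000").
GENOMES = {
    "Escherichia coli": (4_640_000, 4_397),
    "Streptomyces coelicolor": (8_000_000, 7_000),
    "Saccharomyces cerevisiae": (12_100_000, 5_800),
}


def bp_per_gene(genome_bp, gene_count):
    """Average length of genome available per gene, in base pairs."""
    return genome_bp / gene_count


spacing = {name: bp_per_gene(*values) for name, values in GENOMES.items()}

for name, bp in spacing.items():
    print(f"{name}: about {bp:.0f} bp of genome per gene")
```

On these figures yeast has roughly twice as much genome per gene as either bacterium, so a larger genome does not by itself imply a larger gene count.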

The yeast Saccharomyces cerevisiae is a well-established model organism with a long history in physiology, biochemistry and molecular biology (see the section entitled The fungus as a model eukaryote in Chapter 5); its genome continues to be a useful model for eukaryotes, comprising a grand total of 12.1 Mb distributed over 16 chromosomes, which range in size between 250 kb and more than 2.5 Mb. The yeast genome-sequencing project was started in 1989. The sequence of chromosome III was the first to be published, in 1992; chromosomes II and XI followed in 1994, and the sequence of the entire genome was released in April 1996. Quality control measures ensured a 99.97% level of accuracy of the sequence.

Approximately 70% of the complete genome sequence of S. cerevisiae is taken up by 6 607 open reading frames (ORFs) which possibly encode metabolically active proteins, but about 811 are considered dubious, and there are 21 pseudogenes (pseudogenes are related to known yeast genes but contain internal stop codons) (data as of May 16, 2009 from the Genome Inventory on the Saccharomyces Genome Database website at http://www.yeastgenome.org/). On average, a protein‑encoding gene is found every two kb in the yeast genome. The ORFs vary from 100 to more than 4 000 codons, although two-thirds are less than 500 codons, and they are fairly evenly distributed on the two strands of the DNA. In addition to these, the yeast genome contains 120 rRNA genes in a large tandem array on chromosome XII, 40 genes for small nuclear RNAs, 274 tRNA genes (belonging to 42 codon families) scattered across the chromosomes, and 51 copies of the yeast retrotransposons (Ty elements). There are also non‑chromosomal elements, most notably the yeast mitochondrial genome (80 kb) and the 6 kb 2μ plasmid DNA, but there may be other plasmids, too (Moore & Novak Frazer, 2002).
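
As a rough cross-check of the inventory figures above, the sketch below subtracts the dubious ORFs from the total and works out the average spacing of ORFs along the 12.1 Mb genome. It is a back-of-envelope calculation only: it ignores the pseudogenes and the RNA genes, and the function and variable names are illustrative.

```python
# Figures quoted in the text (SGD Genome Inventory, May 2009).
GENOME_BP = 12_100_000
TOTAL_ORFS = 6_607
DUBIOUS_ORFS = 811


def orf_summary(genome_bp, total_orfs, dubious_orfs):
    """Return the non-dubious ORF count and average ORF spacing in kb."""
    credible = total_orfs - dubious_orfs
    return {
        "credible_orfs": credible,
        "kb_per_orf_all": genome_bp / total_orfs / 1000,
        "kb_per_orf_credible": genome_bp / credible / 1000,
    }


summary = orf_summary(GENOME_BP, TOTAL_ORFS, DUBIOUS_ORFS)
print(summary)
```

Setting aside the dubious ORFs leaves a count close to the figure of about 5 800 genes used earlier in this section, and a spacing close to one ORF every two kb.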

Fewer than 5% of the protein‑encoding genes are interrupted by introns. There is usually a single intron (only two genes have two introns), which is generally at the extreme 5'‑end of the gene, sometimes even before its coding region. This lack of introns is an exceptional feature of S. cerevisiae; the genes of other fungi, including all filamentous ascomycetes that have been studied, contain more introns. Even the genome of the fission yeast, Schizosaccharomyces pombe, has a lower gene density (one gene per 2.3 kb) than Saccharomyces cerevisiae, and about 40% of fission yeast genes contain introns. Genes of higher eukaryotes can have many introns. For example, in humans muscular dystrophy is caused by a lesion in a gene on the X-chromosome, which contains 80 introns; in this case the gene sequence is 2.3 Mbp long but the mRNA is only 14 kb long, so less than 1% of the chromosomal sequence of the gene is found in the mature messenger.

Intergenic regions between consecutive ORFs can be extremely short in S. cerevisiae because of the high gene density. This leaves limited space for regulatory sequences involved in DNA transcription, replication, and chromosome maintenance. Several transcription control elements have been identified, including upstream acting sequences (UAS) and upstream repressing sequences (URS). Some terminator sequences have also been defined, but no consensus sequences are evident. Regulatory elements are not confined to intergenic regions but can also be located within the coding sequences of upstream neighbouring genes. Such an arrangement clearly imposes evolutionary constraint on the sequence because of its dual function: selection must operate on the DNA sequence in relation to both the function of the protein specified by its coding sequence and its ability to regulate a gene of unrelated function some way downstream (Moore & Novak Frazer, 2002).

About 66% of the total ORFs in the yeast genome represent novel genes of previously unknown function; those that remain of undiscovered function are called orphan genes. Approximately 2 300 ORFs (over 40%) specify yeast membrane proteins, and although many of these fall into families (for example, 33 mitochondrial transporters, 200 sugar and amino acid transporters, etc.), about 1 600 of them are unique, with no homologues elsewhere in the genome. Dedication of such a large proportion of the genome to membrane proteins clearly emphasises the importance of membrane processes in eukaryotes.

Given the small size of the genome, the level of genetic redundancy in yeast is a considerable surprise. Up to 40% of the gene sequences are duplicated. In most cases the duplicated sequences are so similar that their protein products are identical and, presumably, functionally redundant. These redundant proteins can substitute for each other if one is mutated, and this explains why so many experimental single gene disruptions in yeast do not impair growth or cause abnormal phenotypes. A wide variety of these identical genes locate to different chromosomes. Examples include histone genes, genes for ribosomal proteins, ATPases, amino acid and sugar transporters, and genes for enzymes of the glycolytic pathway. However, a sequence difference in the promoters of duplicated genes implies differences in regulation; so expression of the different copies may depend on the nutritional or differentiation status of the yeast cell. None of these duplications appear to be pseudogenes, of which there are rather few in the yeast genome. Chromosome I, which is the smallest eukaryotic chromosome so far known, is exceptional in having four pseudogenes at each end.

More surprising even than the duplicated genes themselves are numerous large segments on two or more chromosomes that share duplicated genes arranged in the same order and with the same transcription orientations. These are called cluster homology regions (CHRs) and there are 50 of them in the yeast genome. Ten of these CHRs (shared with chromosomes II, V, VIII, XII and XIII) are located on chromosome IV, and the whole of chromosome XIV is made up of regions duplicated on other chromosomes. Outside the coding regions of these clusters the DNA sequence has diverged, implying that the duplication events are ancient. The greatest level of duplication occurs in genes of unknown function. Duplication of metabolic proteins has not occurred on a major scale, but genes for proteins involved in membrane processes, control of protein conformation, and in DNA or RNA processing are highly redundant. This might mean that duplications improve environmental fitness by affecting integration and coordination of the major metabolic functions. It is interesting that genes that are crucial to the most basic cell functions (protein conformation, membrane transport and DNA/RNA processing) are also surmised to have arisen from ancient duplications, suggesting that there has been a definite drive to conserve these sequences throughout their evolution (Moore & Novak Frazer, 2002).

It is important to emphasise that the yeast genome sequence we have discussed so far is that of a specific laboratory strain, code number αS288C. Being a laboratory strain it may have been affected by unconscious artificial selection during its ‘domestication’ and it may or may not be representative of the natural populations of Saccharomyces cerevisiae. The comparisons made so far between domesticated and wild strains indicate that sequence variation in the coding regions of individual genes is rare and does not contribute significantly to polymorphism between strains of S. cerevisiae. Rather, polymorphisms between yeast strains are mainly caused by differences in the number of gene copies within families of repeated genes, the distribution of Ty elements, and variation in the genetic redundancy and telomeric repeats which are found at all chromosome ends. Chromosome restructuring also differentiates yeast strains. Chromosome breakage can cause an altered karyotype, and deletions can give rise to chromosome length polymorphisms.

Finally, the alleged purpose of studying a model organism like yeast is that its analysis will enable the identification of genes relevant to disease in humans; and this expectation seems to be fulfilled. Comparing the sequences of human genes available in the sequence databases with yeast ORFs shows that over 30% of yeast genes have homologues among the human sequences, most of these representing basic cell functions. Finding this sort of homology can contribute to the understanding of human disease. The first example of this seems to be Friedreich ataxia, which is the most common type of inherited ataxia (loss of control of bodily movements) in humans, the biochemistry of which was uncovered by demonstrating homology to a yeast ORF of known function. Friedreich ataxia is caused by enlargement of a GAA repeat in an intron that results in decreased expression of the frataxin gene. Frataxin is a human mitochondrial protein that has homologues in yeast. In yeast, mutants defective in the frataxin homologue accumulate iron in mitochondria and show increased sensitivity to oxidative stress. This suggests that Friedreich ataxia is caused by mitochondrial dysfunction and may point towards novel methods of treatment (Koutnikova et al., 1997). In many ways, this kind of comparison alone can justify all the effort devoted to sequencing the yeast genome.

Functional genomics studies the roles of genes and proteins to define gene/protein function. The outcome is known as the Gene Ontology. Originally, ontology was a branch of metaphysics; a philosophical inquiry into the nature of being. For the computer scientist, ontology is the rigorous collection and organisation of knowledge about a specific feature. The aims of Gene Ontology (GO) are to:

  • develop and standardise a species-neutral vocabulary for the attributes of genes and gene products, equally applicable to prokaryotes and eukaryotes, and to uni- and multicellular organisms;
  • annotate genes and gene products within sequences, and coordinate understanding and distribution of annotation data;
  • and provide bioinformatics tools to aid access to all these data.

To achieve all this, there are three organising principles of GO to describe the function of any gene/protein sequence as follows:

  • Biological process; effectively the answer to the question why does the sequence exist? This can be cast in very broad terms describing the biological goals accomplished by function of the sequence, for example mitosis, meiosis, mating, purine metabolism, etc.
  • Molecular function; effectively what does the sequence do? The tasks performed by individual gene products, for example transcription factor, DNA helicase, kinase, phosphatase, phosphodiesterase, dehydrogenase, etc.
  • Cellular component; where is that function exercised? The location in subcellular structures and macromolecular complexes. For example, nucleus, telomere, cell wall, plasma membrane, endoplasmic reticulum lumen, etc.
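
The three organising principles listed above map naturally onto a simple data structure. The sketch below is one hypothetical way of holding such an annotation in Python; it is not the format used by the Gene Ontology project or by any database. The class name, the aspect keys and the gene "HYPO1" are invented for illustration, and the example terms are simply borrowed from the lists above rather than being a real annotation of any gene.

```python
from dataclasses import dataclass, field

# The three GO organising principles, used here as dictionary keys.
ASPECTS = ("biological_process", "molecular_function", "cellular_component")


@dataclass
class GeneAnnotation:
    """Terms attached to one gene, grouped by the three GO aspects."""
    gene: str
    terms: dict = field(default_factory=lambda: {a: [] for a in ASPECTS})

    def add(self, aspect, term):
        """Attach a term under one of the three aspects."""
        if aspect not in ASPECTS:
            raise ValueError(f"unknown GO aspect: {aspect}")
        self.terms[aspect].append(term)


# Hypothetical gene annotated with example terms taken from the text.
annotation = GeneAnnotation("HYPO1")
annotation.add("biological_process", "purine metabolism")  # why it exists
annotation.add("molecular_function", "kinase")             # what it does
annotation.add("cellular_component", "nucleus")            # where it acts

for aspect in ASPECTS:
    print(aspect, "->", annotation.terms[aspect])
```

Keeping the three aspects separate means a gene product can be described as fully or as partially as current knowledge allows, one question at a time.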

The ontology data are freely available from the Gene Ontology website at this URL: http://www.geneontology.org/ and the most up-to-date information on the genes of any organism in which you are interested can be obtained from the website devoted to that organism (accessible through the Broad Institute’s listings at http://www.broad.mit.edu/). For example, we show in Fig. 19 just a fraction of the information held in the Saccharomyces Genome Database (SGD™) about the Saccharomyces cerevisiae gene, GDH1, which codes for NADP+-dependent glutamate dehydrogenase. The SGD domain is at the URL http://www.yeastgenome.org.

Fig. 19. A fraction of the information held in the Saccharomyces Genome Database (SGD™) about the Saccharomyces cerevisiae gene, GDH1, which codes for NADP+-dependent glutamate dehydrogenase, taken from http://www.yeastgenome.org/cgi-bin/locus.fpl?dbid=S000005902 as it stood on May 17, 2009. This is just the top left corner of the HTML page the database creates in response to a search for the gene name. The rest of the page contains much more detail about the molecular structure of the gene and its product, the location on the chromosome, offers a hyperlink to download the DNA sequence, and hyperlinks to published references.

Annotation has been automated by programs that quickly identify open reading frames for hypothetical genes in a genome. Many sequences are conserved across large evolutionary distances, so many functional assignments can be inferred from information already available from other organisms; this sequence search and comparison process can also be automated. Annotating the genes of filamentous fungi, even other Ascomycota and close relatives of S. cerevisiae, is more demanding because their genomes are much larger and their gene structure more complex than those of yeast. In particular, genes of filamentous fungi often contain multiple introns, with introns within the open reading frame of the gene (very few yeast genes contain introns, and those that do have a single intron at the start of the coding sequence, often interrupting the initiation codon). The greater complexity of gene structure in filamentous fungi demands that independent data on gene expression be used alongside the gene sequence to make confident functional assignments. Methods have been described that use cDNA or EST sequence alignments and gene expression data to predict reliably the function of Aspergillus nidulans genes. We recommend you read the discussion and explanation of the approach by Sims et al. (2004).

And, just in case you are still puzzling over that Geoffrey Chaucer quotation; here's the original:

Who so shall telle a tale after a man,
He moste reherse, as neighe as ever he can,
Everich word, if it be in his charge,
All speke he never so rudely and so large;
Or elles he moste tellen his tale untrewe,
Or feinen thinges, or finden wordes newe.

Updated December 17, 2016