18.11 Annotating the genome


The process of ‘annotating the genome’ starts once the genome sequence has been established and its assembly completed. Annotation is the association of its component sequences with specific functions and, if the Saccharomyces cerevisiae example is a guide, this process can continue for a long time. Annotation requires sophisticated computation; that is, it is an in silico analysis. Gene identification is probably the most difficult problem; it relies on sequence-alignment software and dedicated ‘gene finder’ programs.

Gene finding is easier with bacterial genomes, in which computer programs can find 97-99% of all genes automatically. In eukaryotes, both gene finding and gene function assignment remain challenging tasks. The problem can be likened to identifying the beginning and end of every word in a book when the text has lost all punctuation and you have no clear idea of the language and vocabulary used in the book.

Sense is made of genome sequences by annotation in silico to:

  • identify ORFs by their start and stop codons, allowing for the minimum length of functional proteins (Fig. 15);
  • detect the presence of recognisable functional motifs in segments of the deduced gene or protein;
  • compare against known protein or DNA sequences using homologous genes from the same or other genomes (Fig. 16).
Fig. 15. Searching for ORFs in DNA sequences, every one of which has 6 reading frames.
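
The ORF search summarised above and in Fig. 15 can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular gene finder: it assumes the standard start codon (ATG) and stop codons (TAA, TAG, TGA), ignores introns entirely, and the function names and the `min_codons` threshold are hypothetical choices made for this example.

```python
# Minimal six-frame ORF scan (illustrative only; real gene finders do far more).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames and return a list of ORFs.

    Each ORF is a tuple (strand, frame, start, end, codons). It runs from the
    first ATG seen in a frame (after any previous stop) to the next in-frame
    stop codon. Coordinates are 0-based on the strand being scanned, `end`
    is exclusive and includes the stop codon, and `codons` excludes it.
    min_codons is an assumed minimum length, not a value from the text.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3
                    if codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, codons))
                    start = None  # look for the next start codon
    return orfs
```

Calling `find_orfs(sequence, min_codons=100)` on a DNA string returns the candidate ORFs on both strands. The minimum-length filter matters because short ATG-to-stop stretches occur by chance in any sequence, which is one reason this naive approach works far better for bacterial genomes than for intron-containing eukaryotic genes.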


Fig. 16. Sequence annotation with homologous genes from the same or other genomes.

Further annotation is done experimentally by:

  • classical gene cloning and functional analysis;
  • analysis of cDNA clones or EST sequences (an expressed sequence tag or EST is a short component sequence of a transcribed cDNA, so it is a portion of an expressed gene), and gene expression data.

No single method of genome annotation is comprehensive; all have their limitations, so they must be used in concert. Many of the genes identified in sequencing projects will be ‘new’ in the sense that when the sequence is identified the gene function is unknown. Establishing the cellular role of such new ORFs requires a different set of bioinformatics tools that integrate sequence information with the accumulated knowledge of metabolism so that conjectures can be made about likely functions. Those predictions are then tested experimentally by using heterologous expression, gene knockouts, and characterisation of purified proteins. Parallel analysis of phylogenetically diverse genomes can also help in understanding the physiology of the organism whose genome is being sequenced.

When the sequence of the whole genome has been established and annotated, the genome can be compared with others on the databases. Prokaryotic genomes are generally much smaller than those of eukaryotes. The Escherichia coli genome, for example, is composed of 4.64 Mb (megabase pairs) of DNA; that of Streptomyces coelicolor is 8 Mb, while the yeast genome, at 12.1 Mb, is roughly two and a half times the size of the E. coli genome, and the human genome is 3,300 Mb (see Table 5.2).

The physical organisation is also different, because in prokaryotes the genome is typically contained in a single, circular DNA molecule. Eukaryotic nuclear genomes are divided into linear DNA molecules, each contained in a different chromosome. In addition, all eukaryotes have mitochondria, and these possess small, usually circular, mitochondrial genomes. Photosynthetic eukaryotes (plants, algae, some protists) have a third small genome in their chloroplasts.

The size range of the genome corresponds to some extent with the degree of complexity of the organism, but the fit is not exact by any means because this correlation depends on the structure and organisation of the genes. For example, the Escherichia coli genome has 4,397 genes and the yeast genome comprises about 5,800 genes, so you might feel confident about believing that yeast has more genes because it is a eukaryote, and you can understand why it doesn’t have many more, because it’s a simple eukaryote. However, the genome of the streptomycete bacterium Streptomyces coelicolor contains more than 7,000 genes. This organism is a prokaryote, but it has nearly 30% more genes than the model eukaryote, yeast.

Admittedly, Streptomyces is a complex bacterium and highly advanced in an evolutionary sense; but it is a bacterium. The arithmetic difference lies in the fact that the average yeast gene is 2,200 base pairs long, while the average Streptomyces coelicolor gene is only 1,200 base pairs long. But we can’t explain why such a difference in gene size exists.
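
The arithmetic can be checked roughly from the figures quoted above. The sketch below divides genome size by gene count, which gives the average amount of DNA per gene (coding sequence plus the surrounding non-coding DNA) rather than a true mean gene length, so treat the results only as a sanity check. The value of 7,000 genes for Streptomyces coelicolor is an assumed lower bound taken from ‘more than 7,000’, and the function name is hypothetical.

```python
# Genome size (Mb) and approximate gene count, as quoted in the text.
# 7000 for S. coelicolor is a lower bound ("more than 7,000 genes").
GENOMES = {
    "Escherichia coli": (4.64, 4397),
    "Saccharomyces cerevisiae": (12.1, 5800),
    "Streptomyces coelicolor": (8.0, 7000),
}


def bp_per_gene(size_mb, gene_count):
    """Average number of base pairs of genome per gene (not true gene length)."""
    return size_mb * 1_000_000 / gene_count


for organism, (size_mb, gene_count) in GENOMES.items():
    print(f"{organism}: about {round(bp_per_gene(size_mb, gene_count))} bp per gene")
```

On these figures yeast has roughly 2.1 kb of genome per gene and Streptomyces coelicolor roughly 1.1 kb, which is in line with the average gene sizes quoted above.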

The yeast Saccharomyces cerevisiae is a well-established model organism with a long history in physiology, biochemistry and molecular biology (see Section 5.2); its genome continues to be a useful model for eukaryotes, comprising a grand total of 12.1 Mb distributed over 16 chromosomes, which range in size between 250 kb and more than 2.5 Mb. The yeast genome-sequencing project was started in 1989. The sequence of chromosome III was the first to be published in 1992, chromosomes II and XI followed in 1994, and the sequence of the entire genome was released in April 1996. Quality control measures ensured a 99.97% level of accuracy of the sequence.

Today, the place to learn about this genome is the Saccharomyces Genome Database (SGD) website at https://www.yeastgenome.org/ and the Yeast Genome Snapshot at https://www.yeastgenome.org/genomesnapshot. As of January 2017, there were 6,572 open reading frames (ORFs) which possibly encode metabolically active proteins, of which 5,138 were verified, 754 were uncharacterised, and 680 were considered dubious.

On average, a protein‑encoding gene is found every two kb in the yeast genome. The ORFs vary from 100 to more than 4,000 codons, although two-thirds are less than 500 codons, and they are evenly distributed on the two strands of the DNA. In addition to these, the yeast genome contains 27 rRNA genes in a large tandem array on chromosome XII, 77 genes for small nuclear RNAs, 277 tRNA genes (belonging to 42 codon families) scattered across the chromosomes, and 51 copies of the yeast retrotransposons (Ty elements).

There are also non‑chromosomal elements, most notably the yeast mitochondrial genome (80 kb) and the 6 kb 2μ plasmid DNA, but there may be other plasmids, too.

So, 21 years after the genome was sequenced, only about 80% of the ORFs had been verified. That rate of progress makes it even more remarkable that, on April 17, 2018, SGD announced a single publication in the journal Nature, by a team of researchers jointly led by Joseph Schacherer and Gianni Liti, reporting the whole-genome sequences and phenotypes of no fewer than 1,011 different Saccharomyces cerevisiae yeast strains (Peter et al., 2018).

Isolates of Saccharomyces cerevisiae were gathered from many diverse geographical locations and ecological niches; from wine, beer and bread, but also from rotting bananas, sea water, human blood, sewage, termite mounds, and more. The authors then surveyed the evolutionary relationships among the strains to describe the worldwide population distribution of this species and deduce its historical spread. This unusually large-scale population genomic survey demonstrates that the likely geographic origin of S. cerevisiae lies somewhere in East Asia. Budding yeast began spreading around the globe about 15,000 years ago, undergoing several independent domestication events during its worldwide journey. For example, whereas genomic markers of domestication appeared about 4,000 years ago in sake yeast, such markers appeared in wine yeast only 1,500 years ago. While domesticated isolates exhibit high variation in ploidy, aneuploidy and genome content, genome evolution in wild isolates was mainly driven by the accumulation of single nucleotide polymorphisms, most of which are present at very low frequencies.

The stated purpose of studying a model organism like yeast is the expectation that its analysis will enable the identification of genes relevant to disease in humans, and this expectation seems to be fulfilled. Comparing the sequences of human genes available in the sequence databases with yeast ORFs shows that over 30% of yeast genes have homologues among the human sequences, most of these representing basic cell functions. Finding this sort of homology can contribute to the understanding of human disease.

The first example of this seems to be Friedreich’s ataxia, the most common type of inherited ataxia (loss of control of bodily movements) in humans, the biochemistry of which was uncovered by demonstrating homology to a yeast ORF of known function. Friedreich’s ataxia is caused by expansion of a GAA repeat in an intron, which results in decreased expression of the frataxin gene. Frataxin is a highly conserved iron-binding mitochondrial protein present in most organisms, including yeast, and Friedreich’s ataxia pathology is associated with disruption of iron-sulfur cluster biosynthesis, mitochondrial iron overload, and oxidative stress. In yeast, mutants defective in the frataxin homologue accumulate iron in mitochondria and show increased sensitivity to oxidative stress. Biosynthesis of Fe-S clusters in yeast is a vital process involving the delivery of elemental iron and sulfur to scaffold proteins, and the architecture of the protein complex to which frataxin contributes is essential to ensure concerted and protected transfer of potentially toxic iron and sulfur atoms to the mitochondrion. This comparison suggests that Friedreich’s ataxia is caused by mitochondrial dysfunction and may point towards novel methods of treatment (Pastore & Puccio, 2013; Ranatunga et al., 2016).

In many ways, this kind of comparison alone can justify all the effort devoted to sequencing the yeast genome.

Functional genomics studies the roles of genes and proteins to define gene/protein function. The outcome is known as the Gene Ontology. Originally, ontology was a branch of metaphysics; a philosophical inquiry into the nature of being. For the computer scientist, ontology is the rigorous collection and organisation of knowledge about a specific feature.

The aims of Gene Ontology (GO) are to:

  • develop and standardise a vocabulary describing the attributes of genes and gene products that is species-neutral and equally applicable to prokaryotes and eukaryotes, and to uni- and multicellular organisms;
  • annotate genes and gene products within sequences, and coordinate understanding and distribution of annotation data;
  • and provide bioinformatics tools to aid access to all these data.

To achieve all this, there are three organising principles of GO to describe the function of any gene/protein sequence as follows:

  • Biological process; effectively the answer to the question why does the sequence exist? This can be cast in very broad terms describing the biological goals accomplished by function of the sequence, for example mitosis, meiosis, mating, purine metabolism, etc.
  • Molecular function; effectively what does the sequence do? The tasks performed by individual gene products, for example transcription factor, DNA helicase, kinase, phosphatase, phosphodiesterase, dehydrogenase, etc.
  • Cellular component; where is that function exercised? The location in subcellular structures and macromolecular complexes. For example, nucleus, telomere, cell wall, plasma membrane, endoplasmic reticulum lumen, etc.
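
To make the three organising principles concrete, here is a minimal sketch of how an annotation record might be held in a program. It is not the data model used by the Gene Ontology Consortium: the class, field and function names are hypothetical, the gene names are placeholders, and the terms are simply the plain-language examples given in the list above rather than official GO identifiers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GOAnnotation:
    """One gene product described under the three GO aspects (illustrative)."""
    gene: str
    biological_process: List[str] = field(default_factory=list)   # why?
    molecular_function: List[str] = field(default_factory=list)   # what?
    cellular_component: List[str] = field(default_factory=list)   # where?


def genes_in_component(annotations, component):
    """Return the names of genes annotated to a given cellular component."""
    return [a.gene for a in annotations if component in a.cellular_component]


# Placeholder records; "geneA" and "geneB" are invented names for illustration.
records = [
    GOAnnotation("geneA", ["mitosis"], ["kinase"], ["nucleus"]),
    GOAnnotation("geneB", ["purine metabolism"], ["dehydrogenase"],
                 ["plasma membrane"]),
]
print(genes_in_component(records, "nucleus"))
```

Because every gene product is described with the same three fields, records from different organisms can be queried in the same way, which is the point of a species-neutral vocabulary.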

The ontology data are freely available from the Gene Ontology Consortium’s website at this URL: http://www.geneontology.org/. General information about genomics is accessible through the Broad Institute’s listings at https://www.broadinstitute.org/.

Annotation has been automated by annotation programs (available online) that quickly identify ORFs for hypothetical genes in a genome. Many sequences are conserved across large evolutionary distances, so many functional assignments can be inferred using information already available from other organisms; this sequence search and comparison process can also be automated.

Annotating the genes of filamentous fungi, even other Ascomycota and close relatives of Saccharomyces cerevisiae, is more demanding because their genomes are much larger and their gene structure more complex than those of yeast. Genes of filamentous fungi often contain multiple introns (Section 18.6), with some within the open reading frame of the gene (very few yeast genes contain introns; those that do have a single intron at the start of the coding sequence, often interrupting the initiation codon). Also, the intron-boundary sequences may not become evident until the transcriptome is analysed, and alternative splicing events catalogued (Section 18.7).

The greater complexity of gene structure in filamentous fungi demands independent data on gene expression to make confident functional assignments. Methods have been described that use cDNA or EST sequence alignments, and gene expression data to predict reliably the function of Aspergillus nidulans genes. We recommend you read the discussion and explanation of the approach by Sims et al. (2004).

Yandell & Ence (2012) have published ‘A beginner’s guide to eukaryotic genome annotation’ and further information and advice is freely available online at:

The most up-to-date information on the genes of any organism in which you are interested can be obtained from the website devoted to that organism (use your preferred web search engine to find it). For example, entering ‘coprinopsis cinerea genome’ into the search engine finds the Coprinopsis cinerea home page, which gives you general information about the organism and its genome, on the JGI Genome Portal [https://genome.jgi.doe.gov/Copci1/Copci1.home.html]. This page has a menu of hyperlinks across the top that give access to the deepest detail about the genome of this species.

The main Internet sites for fungal genomic data are discussed in the next Section (Section 18.12). Bioinformatics is essentially the use of computers to process biological information when computation is necessary to manage, process, and understand very large amounts of data. Although there are many bioinformatics tools and databases, using them effectively often requires specialised knowledge; where this is lacking, the BioStar platform can help. BioStar is an online forum where experts and those seeking solutions to problems of computational biology exchange ideas. BioStar can be accessed at https://www.biostars.org/ (Parnell et al., 2011).

Bioinformatics is particularly important as an adjunct to genomics research, because of the large amount of complex data this type of research generates, so to a great extent the word, and the approaches it encompasses, have become synonymous with the use of computers to store, search and characterise the genetic code of genes (genomics), the transcription products of those genes (transcriptomics), the proteins related to each gene (proteomics) and their associated functions (metabolomics) (see Section 18.10). But there are other large data sets in need of analysis that rightly fall within range of the fundamental definition of the word ‘bioinformatics’.

These are large data sets arising from:

  • Survey data and censuses, particularly, but not only, those involving automatic data capture, and 'surveys of surveys' (metadata) (for example see Section 13.17).
  • Data generated by mathematical models that seek to simulate a biological system and its behaviour in time (for example see Section 4.9).

The aim of functional genomics is to determine the biological function of all the genes and their products, how they are regulated and how they interact with other genes and gene products. Add interactions with the environment and this is fully integrated biology; what has come to be known as systems biology (Klipp et al., 2009; Nagasaki et al., 2009; Horgan & Kenny, 2011). Comprehensive studies of such large collections of molecules as occur in the transcriptome, proteome, and metabolome require what are described as high throughput methods of analysis at each stage from the generation of mutants through to the determination of which proteins are associated with which functions. Each stage generates massive amounts of data that are qualitatively and quantitatively different, which must be integrated to allow construction of realistic models of the living system (Delneri et al., 2001).

Functional genomic analysis of the yeast Saccharomyces cerevisiae established the key concepts, approaches and techniques, although research on filamentous fungi is expanding (Foster et al., 2006). Considerable progress was made in analysis of yeast gene function using mutants with deletions of ORFs. However, genetic redundancy in the genome, resulting perhaps from gene duplication(s) during evolution, can be a problem in this type of analysis. In retrospect, analysis of yeast shows that much of the redundancy in the yeast genome is made up of identical, or almost identical, gene products fulfilling distinct physiological roles due to differential expression of the genes under different physiological conditions, and/or targeting the similar proteins to different cellular compartments.

Nevertheless, more extensive studies require more extensive collections of mutants; those in which entire gene families are deleted and, ultimately, a collection in which all genes are represented by appropriate mutants. There is scope for large scale international collaboration in this sort of exercise and 1999 saw the establishment of a collection of mutant yeast strains, each bearing a defined deletion in one of 6,000+ potential protein encoding genes in yeast (Winzeler et al., 1999). This is the EUROSCARF collection (EUROpean Saccharomyces Cerevisiae ARchive for Functional analysis; see http://www.euroscarf.de/). Using a PCR-based gene disruption strategy, mutant strains with a deletion of most of the ORFs in the genome were prepared in this systematic deletion project. In addition, each deleted ORF was flanked by two 20 base pair sequences unique to each deletion. These allow the sequences to be detected easily; effectively they act as molecular barcodes that allow large numbers of deletion strains, potentially the whole library, to be analysed in parallel.
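
The barcode idea lends itself to a simple computational illustration. The sketch below assumes, purely for the sake of example, that a pooled culture of deletion strains has been read out as short DNA sequences and that each strain is identified by exact match to one of its 20 base pair barcodes; the barcodes, strain names and function name are all invented, and the original work used its own detection methods rather than this code.

```python
from collections import Counter

# Hypothetical 20-bp barcodes mapped to hypothetical deletion strains.
BARCODES = {
    "ACGTACGTACGTACGTACGT": "deletion_strain_1",
    "TTGACCATGGCATCAGTTCA": "deletion_strain_2",
}


def count_barcodes(reads, barcode_to_strain):
    """Count reads per strain by exact match to a known barcode.

    A read is assigned to the first barcode found anywhere within it;
    reads containing no known barcode are ignored.
    """
    counts = Counter()
    for read in reads:
        for barcode, strain in barcode_to_strain.items():
            if barcode in read:
                counts[strain] += 1
                break
    return counts


pooled_reads = [
    "GG" + "ACGTACGTACGTACGTACGT" + "AA",
    "CC" + "TTGACCATGGCATCAGTTCA",
    "GG" + "ACGTACGTACGTACGTACGT",
    "AAAAAA",  # no barcode: ignored
]
print(count_barcodes(pooled_reads, BARCODES))
```

Comparing such counts before and after growth under a chosen condition would indicate which deletion strains have become rarer or more common in the pool, which is what makes parallel analysis of the whole library feasible.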

Another approach used a transposon that created gene fusions in a yeast clone library so that the protein products of the mutated yeast genes could be identified and analysed by immunofluorescence using antibodies to the peptide introduced by the transposon. In the original work a yeast genomic DNA library was mutagenised in Escherichia coli with a multipurpose minitransposon derived from the bacterial transposable element known as Tn3. The minitransposon contained cloning sites and a 274-base pair sequence encoding 93 amino acids, called a HAT tag, which was inserted into the yeast target proteins.

The HAT tag allows immunodetection of the mutated yeast protein. Transposon mutagenesis generated 10⁶ independent transformants. Subsequently, individual transformant colonies were selected and stored in 96-well plates. Plasmids were prepared from these strains and transformed into a diploid yeast strain in which homologous recombination integrated each fragment at its corresponding genomic locus, thereby replacing its genomic copy. Then, 92,544 plasmid preparations and yeast transformations were carried out, identifying a collection of over 11,000 strains, each carrying a transposon inserted within a region of the genome expressed during vegetative growth and/or sporulation. These insertions affected nearly 2,000 annotated genes, distributed over all 16 yeast chromosomes and representing about one-third of the yeast genome. The study demonstrated the value of a particular strategy for mutant generation and detection, but it also indicated the scale of what has been called ‘new yeast genetics’.

Finding methods that generate large numbers of gene mutants and simultaneously identify the mutants and/or their products in ways amenable to automation was the start of the high throughput approach (Ross-Macdonald et al., 1999; Cho et al., 2006; Caracuel-Rios & Talbot, 2008; Honda & Selker, 2009).

Messenger RNA molecules are the subject of transcriptome analyses and can be studied in a fully comprehensive manner using hybridisation-array analysis, which is described as a massively parallel technique because it allows so many sequences to be examined at one time. Remember, though, that mRNA molecules transmit instructions for synthesising proteins; they do not function otherwise in the workings of the cell, so transcriptome analyses are considered to be an indirect approach to functional genomics. The transcriptome comprises the complete set of mRNAs synthesised in the cell under any given well-defined set of physiological conditions. Unlike the genome, which has a fixed collection of sequences, the transcriptome is context dependent, which means that its content of sequences depends on the cell response to the current set of physiological circumstances, and the make up of that set will change when the physiological circumstances change.

Those physiological circumstances will be adapted in response to changes in both the intracellular and extracellular environment of the cell; its nutritional status, state of differentiation, age, etc. The mRNA of genes that are newly expressed (up-regulated) will appear in the sequence collection, and the mRNA of genes that are not expressed (down-regulated) in the new circumstance will disappear from, or be greatly reduced in, the sequence collection. Determination of the nature and sequence content of the transcriptome in all these circumstances is precisely what transcriptome analysis is intended to achieve, because the pattern of mRNA content in the transcriptome reveals the pattern of gene regulation.

Hybridisation arrays are now used widely to study the transcriptome because of their ability to measure the expression of many genes with great efficiency. Microarrays permit assessment of the relative expression levels of hundreds, even thousands, of genes in a single experiment. Hybridisation arrays are also called DNA micro- or macroarrays, DNA chips, gene chips, and bio chips (Nowrousian, 2007, 2014a). The web definition of DNA microarray is: a collection of microscopic DNA spots attached to a solid surface forming an array; used to measure the expression levels of many genes simultaneously (https://en.wiktionary.org/wiki/DNA_microarray).

The array of single-stranded DNA molecules is typically distributed on glass, a nylon membrane, or silicon wafer (any of which might be called ‘a chip’), each being immobilised at a specific location on the chip in a predetermined (and computer-recorded) grid formation. Microarrays and macroarrays differ in the size of the sample spots of DNA; in macroarrays the size of the spot is over 300 µm, in microarrays it is less than 200 µm. Macroarrays are normally spotted by high-speed robotics onto nylon membranes, whereas microarrays are made on glass (usually called custom arrays) or quartz surfaces (GeneChip®, from Affymetrix Inc.; see https://www.affymetrix.com/site/mainPage.affx) (Lipshutz et al., 1999). The immobilisation onto the solid matrix is the most crucial aspect of the technique as it must preserve the biological activity of the molecules. The spotted material can be genomic DNA, cDNA, PCR products (any of these sized between 500 and 5,000 base pairs) or oligonucleotides (20 to 80-mer oligos). The identities and locations of the single-stranded DNAs are known, so when the chip is treated with a suspension of experimental cDNA molecules prepared from a set of mRNAs, the cDNAs complementary to those on the chip will bind to those specific spots. The complementary binding pattern can be detected and, since the DNAs at each position on each grid are known, this binding pattern indicates the pattern of gene expression in the sample.

Macroarrays are hybridised using a radioactive probe, normally labelled with ³³P, an isotope of phosphorus that decays by β-emission, so that the decay, and therefore the position of the complementary binding, can be imaged with a phosphorimager. In this device, β-particle emissions excite the phosphor molecules on the plate in a way that can be detected by scanning the plate with a laser; the attached computer converts the energy it detects to an image in which different colours represent different levels of radioactivity.

Microarrays are exposed to a set of targets either separately (single dye experiment) or in a mixture (two dye experiment) to determine the identity/abundance of complementary sequences. Laser excitation of the spots yields an emission with a spectrum characteristic of the dye(s), which is measured using a scanning confocal laser microscope. Monochrome images from the scanner are imported into software in which the images are pseudo-coloured and merged and combined with information about the DNAs immobilised on the chip. The software outputs an image which shows whether expression of each gene represented on the chip is unchanged, increased (up-regulated) or decreased (down-regulated) relative to a reference sample. In addition, data is accumulated from multiple experiments and can be examined using any number of data mining software tools.
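
The final classification step can be illustrated with a short sketch. It assumes that each spot yields one intensity for the experimental sample and one for the reference, and it calls a gene up- or down-regulated when the log2 ratio of the two passes a threshold. The two-fold cut-off (a log2 ratio of 1.0) is an assumed convention rather than a value from the text, the function name is hypothetical, and real analysis software also performs background correction, normalisation and statistical testing, all of which are omitted here.

```python
import math


def classify_expression(sample, reference, threshold=1.0):
    """Classify a spot from its sample and reference intensities.

    threshold is in log2 units; 1.0 corresponds to a two-fold change
    (an assumed cut-off chosen for illustration).
    """
    if sample <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(sample / reference)
    if log_ratio >= threshold:
        return "up-regulated"
    if log_ratio <= -threshold:
        return "down-regulated"
    return "unchanged"


# Invented intensities for three hypothetical spots: (sample, reference).
spots = {"spot_1": (800, 200), "spot_2": (100, 400), "spot_3": (210, 200)}
for name, (sample, reference) in spots.items():
    print(name, classify_expression(sample, reference))
```

Working with log ratios rather than raw ratios treats increases and decreases symmetrically: a four-fold increase and a four-fold decrease give values of +2 and -2 respectively.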

There are many uses for DNA microarrays. Apart from the expression profiling to examine the effect of physiological circumstance on gene expression on which we have so far concentrated, hybridisation arrays can be used to:

  • dissect metabolic pathways and signalling networks;
  • establish transcription factor regulatory patterns, target genes and binding sites;
  • compare gene expression in normal tissue with that of diseased tissue, initially to establish which genes are involved in response to disease, and when that is done to diagnose disease;
  • identify gene expression of different tissues and different states of cell differentiation to establish tissue-specific and/or differentiation-specific genes;
  • study reaction to specific drugs, agrochemicals, antibiotics or toxins to identify drug targets, side effects, and resistance mechanisms.

The proteome is the complete set of proteins synthesised in the cell under a given set of conditions. The traditional method for quantitative proteome analysis combines protein separation by high-resolution 2-dimensional isoelectric focusing (IEF)/SDS-PAGE (2DE) with mass spectrometric (MS) or tandem mass spectrometric (MS/MS) identification of selected protein spots detected in the 2DE gels by use of specific protein stains. Continued improvement in technology is steadily increasing the throughput of protein identifications from complex mixtures and permitting quantification of protein expression levels and how they change in different circumstances (Aebersold, 2003; Bhadauria et al., 2007; Rokas, 2009).

An important feature that arises from analysis of the proteome is the enormous extent and complexity of the network of interactions among proteins and between proteins and other components of the cells. These networks can be visualised as maps of cellular function, depicting potential interactive complexes and signalling pathways.

‘Metabolomics consists of strategies to quantitatively identify cellular metabolites and to understand how trafficking of these biochemical messengers through the metabolic network influences phenotype’ (quoted from Jewett et al., 2006).

Metabolomics is particularly important in fungi because these organisms are widely used to produce chemicals. The main difficulty in metabolome analysis is not technical as there are sufficient analytical tools and mathematical strategies available for extensive metabolite analyses. However, the indirect relationship between the metabolome and the genome raises conceptual difficulties. The biosynthesis or degradation of a single metabolite may involve many genes, and the metabolite itself may impact on many more. Consequently, the bioinformatics tools and software required must be exceptionally powerful.

Ultimately, you may think in terms of applying all this knowledge to the creation of something entirely new. That is, to developing a biological system of some form that does not already exist in the biosphere. In the past this was achieved by the evolutionary process of artificial selection (selective breeding), producing crop species (like maize) or domesticated animals (like high milk-yield cattle) that simply could not exist in the wild.

The ‘modern’ version of this is called synthetic biology, and with the current passion for applying management definitions to long standing activities it has been defined as the area of science that applies engineering principles to biological systems to design and build novel biological functions and systems.

Wikipedia defines synthetic biology as:

‘…the design and construction of novel artificial biological pathways, organisms or devices, or the redesign of existing natural biological systems…’ (visit https://en.wikipedia.org/wiki/Synthetic_biology).

Kaznessis (2007) adds the crucial rider that synthetic biological engineering is emerging from molecular biology as a distinct discipline based on quantification. And that’s the real defining feature: this is a branch of biology that depends on large scale computer processing of large amounts of numerical data. In fact, this is a branch of biology that verges on engineering (Silver et al., 2014).

Updated July, 2018

December 17, 2016