18.10 Sequencing fungal genomes

Very little of what we have described in the above four Sections could have been written before the complete sequences of fungal genomes became available, so the topic of genome sequencing deserves some discussion.

The first whole DNA genome to be sequenced was that of the bacteriophage ΦX174 (phi-X-174). This bacteriophage has a circular single-stranded DNA genome consisting of 5,386 nucleotides that encode 11 proteins. The genome was sequenced at the University of Cambridge by a team led by Frederick Sanger which developed and used a DNA sequencing technology that became the backbone of the first part of the genome era.

Sanger’s technique, the chain-termination method, is now referred to as the ‘first generation technology’ of genome sequencing. As of 2018, Fred Sanger is the only person to have been awarded the Nobel Prize in Chemistry twice. The Nobel Prize in Chemistry 1958 was awarded to Frederick Sanger ‘for his work on the structure of proteins, especially that of insulin’ [https://www.nobelprize.org/nobel_prizes/chemistry/laureates/1958/]. The Nobel Prize in Chemistry 1980 was divided, one half awarded to Paul Berg ‘for his fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA’, the other half jointly to Walter Gilbert and Frederick Sanger ‘for their contributions concerning the determination of base sequences in nucleic acids’ [https://www.nobelprize.org/nobel_prizes/chemistry/laureates/1980/].

The priority of genome sequencing is to establish the number, disposition and function of genes in an organism. Genomics is the systematic study of an organism’s genome. Consideration of the many uses of a genome sequence started by focussing on the human genome (Sharman, 2001) and came up with these activities:

  • studying the proteins and RNA of the proteome and transcriptome (and perhaps deciding how to change them to serve our own purposes);
  • establishing the genetic basis of interactions between organisms, especially pathogenesis and the mechanisms of disease, but including more benign relationships such as mutualisms and mycorrhizas;
  • comparing genome sequences from related organisms to examine genome evolution and relationships between organisms at the genomic level: for example, how/if genes are conserved in different species; how relationships between genomes compare with conventional taxonomic classifications, which are of course based upon the outcome of information encoded in the genome; and studying mechanisms of speciation.

The Human Genome Project began with Sanger sequencing technology. This method relies on dideoxynucleotide triphosphates (ddNTPs), modified deoxynucleotide triphosphates (dNTPs) that lack a 3' hydroxyl group and carry a hydrogen atom in its place. It is the sugar molecule which is ‘dideoxy’: in the normal deoxynucleotide the deoxyribose already has a hydroxyl group replaced by a hydrogen atom on carbon atom-2 (symbolised: 2'), and the sugar’s 3' and 5' hydroxyls are used to covalently link adjacent nucleotides in the growing DNA chain. During synthesis the 3'-OH is joined through a phosphate group to the 5'-carbon of the adjacent nucleotide’s sugar, forming the sugar-phosphate backbone of the polynucleotide.

Because dideoxynucleotides lack that 3' hydroxyl group, once incorporated into the growing DNA chain they terminate replication: no further nucleotide can be covalently linked to them. To perform Sanger sequencing, primers are added to a solution containing the DNA to be sequenced, which is then divided into four reactions; each reaction contains a nucleotide mix in which a small proportion of one of the four nucleotides is replaced by the corresponding ddNTP (the A, T, G and C reactions). The DNA polymerase incorporates the dideoxynucleotide efficiently but is then prevented from elongating the growing chain any further. That’s the ‘chain termination’ part of chain termination DNA sequencing.

As an example, consider the result of carrying out DNA synthesis in the presence of dideoxy-ATP. Including ddATP in the reaction means that DNA daughter chains will be terminated at random at all those points where the template has a thymine (the complementary base to the dideoxy-ATP). This produces a family of adenine-nucleotide-terminated chains having lengths equivalent to all the stretches in the template that extend from the 3'-end of the primer to each thymine in the sequence. Consequently, these fragments together report the position of every thymine in the template. When this family of sequences is electrophoresed in polyacrylamide gel, the fragments in the population will migrate at a rate dependent on their exact length, and a series of bands are obtained, each band corresponding to one of those ‘3'-end to thymine’ stretches in diminishing size down the gel.
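The arithmetic of that ddATP reaction can be sketched in a few lines of Python. This is only an illustration of the logic; the template string below is an arbitrary example, not data from any real experiment:

```python
# Toy sketch of the ddATP 'lane' in chain-termination sequencing.
# Wherever the template carries a thymine (T), a terminating ddATP can
# be incorporated opposite it, yielding one fragment of that length.
# The template is written in the order the polymerase copies it.

template = "TACGGTATCG"

# One fragment per thymine; its length reports the thymine's position.
ddATP_fragment_lengths = [position + 1
                          for position, base in enumerate(template)
                          if base == "T"]

print(ddATP_fragment_lengths)  # -> [1, 6, 8]: one fragment per thymine
```

Reading the fragment lengths 1, 6 and 8 off the gel therefore reports a thymine at positions 1, 6 and 8 of the template.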

Of course, there are four nucleotide bases, so to get a complete picture you need four such reactions; then, alongside the ‘3'-end to thymine’ family, there will be three other families of terminated chains corresponding to the ‘3'-end to adenine’, ‘3'-end to cytosine’, and ‘3'-end to guanine’ stretches. When first developed, the banding pattern was visualised by autoradiography, achieved by including a radioactively labelled nucleotide in each reaction.

Products of the four reactions, that is the ddATP, ddCTP, ddGTP and ddTTP-terminated reactions, were loaded into adjacent lanes of the gel. The smallest molecules ran fastest during electrophoresis, so the sequence could then be read from the bottom of the autoradiograph by noting the position of the band in any one of the four lanes: bands in the lane loaded with the ddATP reaction products reported the positions of thymine in the original template (remember, you are reading the radio-label of the complementary copy), bands in the lane loaded with ddCTP-products report the locations of guanine, ddGTP reports cytosine, and ddTTP-terminated reactions report adenine locations in the template.

Consequently, the sequence could be read from its 3'-end by reading up the four lanes of the gel. The smallest, fastest running molecule represented an oligonucleotide terminated at the first base position after the primer site in the template, the second band corresponded to the second base position after the primer, and so on. The sequence of the DNA template could be deduced by continuing to read the banding pattern upwards on a gel until the point near the top of the gel at which the bands could no longer be resolved. In a good separation this corresponded to about 1500 nucleotides.
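The whole bottom-up reading procedure can be mimicked in a short sketch. This assumes idealised conditions (a terminated fragment at every position, perfectly resolved bands); the function name `sanger_read` is ours, not part of any sequencing software:

```python
# Minimal sketch of reading a four-lane Sanger gel. Each ddNTP reaction
# yields fragments ending wherever the complementary base occurs in the
# template; sorting all fragments by length (reading the gel bottom-up)
# gives the synthesized strand, and complementing each terminating base
# recovers the template sequence itself.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sanger_read(template):
    # Map each fragment length to its terminating ddNTP (the base
    # complementary to the template base at that position).
    fragments = {}
    for position, base in enumerate(template, start=1):
        fragments[position] = COMPLEMENT[base]
    # Read from the smallest (fastest-running) fragment upwards...
    synthesized = "".join(fragments[length] for length in sorted(fragments))
    # ...then complement the copy to deduce the template.
    return "".join(COMPLEMENT[b] for b in synthesized)

print(sanger_read("ATGCCGTA"))  # -> "ATGCCGTA": the template is recovered
```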

This procedure is technically undemanding but time-consuming and labour-intensive. Over the years various aspects were automated; the development of dideoxynucleotides carrying four different fluorescent labels (‘fluorochrome labelling’) enabled all four dideoxynucleotides to be used in one reaction tube, and also permitted the development of sequencing machines with a fluorescence detector able to discriminate between the different labels.

Automatic sequencing machines rely on capillary electrophoresis rather than slab gels. In capillary electrophoresis the capillary is filled with a buffered, flowable sieving polymer, and the fluorescently labelled reaction products of various lengths are separated within it according to their size.

Because every added nucleotide contributes the same negative charge, DNA fragments all have essentially the same charge-to-size ratio; it is the sieving polymer that makes the separation size-dependent, shorter fragments threading through it faster. As the fragments are driven toward the positive electrode of the capillary by the electric field, they pass a laser beam that triggers a flash of light from the fluorochrome attached to the terminal ddNTP, and the colour of that flash is characteristic of the base type (for example, green for A, yellow for T, blue for G, red for C).

In this way, the sequence is read by the machine in one pass; and, of course, the machine can run many capillaries at once. A single machine can sequence half a million bases per day, and then continue into the night without complaint. This was the start of the improvement in speed and accuracy, and the reduction in labour and cost, of the genome sequencing technology that was used to complete the Human Genome Project in 2003.

Subsequent replacement of the electrophoretic capillary with a flow cell, miniaturisation, and use of high-throughput and massively parallel processing brought us to present-day Next-Generation Sequencing (NGS), also called second-generation sequencing. Next-generation sequencing is the general term used to describe several different modern sequencing technologies, which differ in engineering configurations and sequencing chemistry.

Some of these platforms can generate from one million to 43 billion ‘short reads’ (sequence fragments of 50-400 bases each) per instrument run [view: https://en.wikipedia.org/wiki/Massive_parallel_sequencing]. For more details, we suggest you check out the European Bioinformatics Institute (EMBL-EBI) online video lectures at this URL: https://www.ebi.ac.uk/training/online/course/ebi-next-generation-sequencing-practical-course.
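The reason such short reads are useful is that overlapping reads can be stitched back together into longer sequences. The toy greedy assembler below illustrates the principle only; real NGS assemblers use far more sophisticated methods (for example, de Bruijn graph algorithms), and the reads and names here are invented:

```python
# Toy illustration of assembling overlapping short reads into a contig.
# Greedy suffix-prefix merging, for demonstration only; not how
# production assemblers work.

def merge(a, b, min_overlap=3):
    """Append b to a if a prefix of b overlaps the end of a."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

def greedy_assemble(reads):
    contig = reads[0]
    remaining = reads[1:]
    while remaining:
        for read in remaining:
            merged = merge(contig, read)
            if merged:
                contig = merged
                remaining.remove(read)
                break
        else:
            break  # no remaining read overlaps the contig; stop
    return contig

reads = ["ATGCCGTA", "CGTATTGC", "TTGCAAGG"]
print(greedy_assemble(reads))  # -> "ATGCCGTATTGCAAGG"
```

The three 8-base reads overlap by four bases at each junction, so the assembler recovers a 16-base contig from them.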

Genomics is the systematic study of the genome of an organism and ‘systematic study’ may well involve comparison with the genomic sequences of other organisms; and phylogenetic study may involve comparisons with many other genomes. Genomics characteristically involves large data sets because it deals with DNA sequences by the megabase. Overall the word genomics has come to embrace a considerable range of activities that can be ‘structural’ (these have defined endpoints that are reached when the structural determination is complete) or ‘functional’ (which are more open-ended because additional aspects of function can be added continually).

Genomics requires a combination of different approaches, including:

  • DNA mapping and sequencing;
  • cataloguing genome variation;
  • analysis of the transcriptional control of genes;
  • analysis of transcriptional networks that integrate the functions of, potentially, many genes;
  • mapping of protein interaction networks, which are similarly potentially very extensive;
  • analysis of signalling networks.

Genomics has enabled an expansionist approach to biology. Rather than being restricted by technique to concentrating on how individual parts of the organism work in isolation, the biologist can now expect to investigate how many (ultimately, perhaps all) parts of the organism work together. The expression ‘omics’, although originally informal, is increasingly used to refer to fields of study of genome biology named with the ending ‘-omics’. The related suffix ‘‑ome’ is used to describe the objects of study of such fields.

Some examples we have already used in this book are:

  • Genomics/genome, the complete gene complement of an organism;
  • Transcriptomics/transcriptome, all expressed mRNA transcripts;
  • Proteomics/proteome, all translated proteins;
  • Metabolomics/metabolome, the set of metabolites, the small molecule intermediates and products, of primary and secondary metabolism.

All of these fields of study contribute to Systems Biology, which is an holistic (rather than reductionist) scientific approach focussing, often with mathematical and computational modelling, on a wide range of complex interactions in biological systems (see the Wikipedia definition at https://en.wikipedia.org/wiki/Systems_biology).

Horgan & Kenny (2011) explain the rationale this way:

‘…The basic aspect of these approaches is that a complex system can be understood more thoroughly if considered as a whole...Systems biology and omics experiments differ from traditional studies, which are largely hypothesis-driven or reductionist. By contrast, systems biology experiments are hypothesis-generating, using holistic approaches where no hypothesis is known or prescribed but all data are acquired and analysed to define a hypothesis that can be further tested…’

Apart from the four ‘omics’ fields of study outlined above, there are several others you may come across:

  • Taxonomics/taxome: the sum of all the described species and higher groups (genera, families, phyla) of all life, or the sum of all valid taxa of a particular lifeform (often specified: for example, beetle taxome, rust taxome, etc.);
  • Phylogenomics (at the time of writing ‘phylogenome’ is not defined): the reconstruction of evolutionary relationships by comparing sequences of whole genomes or portions of genomes;
  • Interactome: the whole set of molecular interactions in a specific biological cell;
  • Functome: the complete set of functional molecular units in biological cells.

The omics wiki site [http://omics.org/] describes many more. Check out ‘History of Omics as a generic name for various omics and a standalone biology discipline’ by Jong Bhak at this URL: http://omics.org/index.php/History_of_Omics. He describes using a computer program to generate tens of thousands of omics terms. Well, I suppose it will keep textbook authors in honest employment. There’s a word for that, too, actually: textome, which is ‘the complete set of biological literature that contain useful information when combined to generate new information through bioinformatics’.
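In the same playful spirit, coining ‘-ome’ and ‘-omics’ terms by machine takes only a couple of lines. The word roots below are our own arbitrary choices, not taken from the omics.org lists:

```python
# Toy '-omics' term generator, in the spirit of the programmatic term
# lists described on omics.org. The roots are illustrative only.

roots = ["gen", "transcript", "prote", "metabol", "interact", "text"]

objects = [root + "ome" for root in roots]    # the objects of study
fields = [root + "omics" for root in roots]   # the fields of study

print(objects)
# -> ['genome', 'transcriptome', 'proteome', 'metabolome',
#     'interactome', 'textome']
```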

The ability to sequence and compare complete genomes is improving our understanding of many areas of biology. Such data more directly reveal evolutionary relationships and indicate how pathogens spread and cause disease. They enable us to approach a comprehensive understanding of the activities of living cells and how they are controlled at the molecular level. The information has practical value, too. This is why so many pharmaceutical companies are involved in genome projects: the hope is that it will be possible to identify genes responsible for, or which have influence on, diseases, and then design therapies to combat disease directly (Sharma, 2015; Taylor et al., 2017).

Physical and molecular analyses originally moved the genetical focus away from the functional gene and towards the DNA sequence; now, functional genomics dominates. Sequencing an entire genome is only the beginning of functional studies of the transcriptome (all the transcripts made from the genome), the proteome (all the polypeptides made from the transcriptome), and the metabolome (all the metabolic reactions governed by the proteome).

Now, in the twenty-first century, our understanding is that the genome is not context-sensitive, because it is the full set of genetic information. The transcriptome, proteome and metabolome, by contrast, are all context-sensitive, because what they comprise depends upon the instantaneous regulatory status of the cell.

Or, as the old-time segregational (Mendelian) geneticists at the beginning of the twentieth century put it: phenotype = genotype + environment.

Updated July, 2018