18.6 Sequencing fungal genomes

Genomics characteristically involves large data sets because it deals with DNA sequences by the megabase. Overall the word genomics has come to embrace a considerable range of activities that can be ‘structural’ (these have defined endpoints that are reached when the structural determination is complete) or ‘functional’ (which are more open-ended because additional aspects of function can be added continually). Genomics requires the use of a combination of different methods, including:

  • DNA mapping and sequencing;
  • Collecting genome variation;
  • Transcriptional control of genes;
  • Transcriptional networks that integrate functions of, potentially, many genes;
  • Protein interaction networks, which are similarly potentially very extensive;
  • Signalling networks.

Genomics has enabled the expansionist approach to be taken to biology. Rather than being restricted by the techniques to concentrate on how individual parts of the organism work in isolation, the biologist can now expect to investigate how many (ultimately, perhaps all) parts of the organism work together. This is Systems Biology, which is an holistic (rather than reductionist) approach focussing, often with mathematical and computational modelling, on a wide range of complex interactions in biological systems (see the Wikipedia definition at http://en.wikipedia.org/wiki/Systems_biology).

The ability to sequence and compare complete genomes is improving our understanding of many areas of biology. Such data more directly reveal evolutionary relationships, and indicate how pathogens spread and cause disease. They enable us to approach a comprehensive understanding of the activities of living cells and how they are controlled at the molecular level. The information has practical value, too. This is why so many pharmaceutical companies are involved in genome projects: the hope is that it will be possible to identify genes responsible for, or which have influence on, diseases, and then design therapies to combat disease directly.

The nucleotide sequence of DNA is its ultimate physical map, and there are two basic procedures for DNA sequencing:

  • the chain termination method (also known as dideoxy-sequencing) sequences a single‑stranded DNA molecule using enzymic synthesis of complementary polynucleotides which terminate at particular nucleotide positions when a nucleotide triphosphate lacking a 3'‑hydroxyl group is incorporated;
  • the chemical degradation method uses different chemicals to cleave the DNA after a particular base or bases (either A, G, T, or C).

Both procedures generate a complete set of sequence fragments potentially covering a size range from 10 to 1500 nucleotides that differ in length by just one nucleotide from either preceding or succeeding fragments (such a collection of fragments is known as a nested array). This array of DNA molecules can be separated by polyacrylamide gel electrophoresis, the end nucleotide being identified by some sort of label added to it during the sequencing procedure. The chain termination procedure has become the principal technique because it has been possible to automate the process to the extent that large-scale sequencing can be completed at reasonable cost and in a reasonable time-scale (Talbot, 2001; Moore & Novak Frazer, 2002).

Chain termination sequencing relies on the experimental synthesis of new (fragmented) sequences of DNA. Consequently, a clone of the DNA molecule to be sequenced, prepared in single‑stranded form to serve as the DNA synthesis template, is the starting material for chain termination sequencing. The clone may be made as part of the sub-cloning procedure from a high-capacity cloning vector, which may be a:

  • Cosmid, a hybrid plasmid/bacteriophage cloning vector that contains cos sequences from the λ (lambda) bacteriophage. Cosmids carry plasmid replication origins and antibiotic resistance selection tools that enable transfected host cells to be selected following the initial cloning procedure. In addition, though, the cos sequences (‘cos’ stands for ‘cohesive ends’) allow packaging into bacteriophage capsids, which allows much larger foreign sequences to be transferred into cloning cells by transduction.
  • BAC (bacterial artificial chromosome), which is based on a functional fertility plasmid (or F-plasmid), used for transforming and cloning in bacteria, particularly Escherichia coli. F-plasmids can mobilise the whole bacterial chromoneme for transfer from donor to recipient cell and BACs use this function to transfer large foreign sequences (150-350 kbp, but possibly 700 kbp or more).
  • YAC (yeast artificial chromosome), which can be used to clone DNA fragments larger than 100 kb and up to 3000 kb. The YAC construct contains yeast telomere, centromere, and replication sequences needed for chromosome replication during mitosis in yeast cells. Built initially as circular plasmids, YACs can be grown to quantity in that form and then linearised (with a restriction enzyme) so that DNA ligase can add the sequence of interest within the linear molecule.

Cloning in a plasmid vector yields double‑stranded DNA, which is first denatured to single strands (with alkali or by boiling). Both complementary single‑strands can then be analysed separately to provide independent sequences of both ends of the cloned DNA. Alternatively, the DNA can be cloned in a vector based on the M13 bacteriophage, which is designed specifically to produce single-stranded templates for DNA sequencing.

Mature M13 bacteriophage particles contain the single‑stranded copies of the cloned DNA molecules; the plasmids are harvested from host cells and then rigorously purified from other polynucleotides. M13 bacteriophages are readily purified, but inserts longer than about 3 kb can suffer deletions and rearrangements when cloned in M13. A compromise vector is a phagemid, which is a plasmid containing an M13 origin of replication. When combined with a helper phage, single‑stranded copies of the phagemid cloning construct are packaged into virus particles. Phagemid vectors can accommodate inserts up to 10 kb. PCR can also be used to prepare template DNA and one of the PCR primers can then be used as a primer for the template‑dependent DNA polymerase.

Once you have your template, the first step in sequencing is to anneal to it a short oligonucleotide to serve as the primer for DNA synthesis by a DNA polymerase enzyme. The sequence of this primer determines where the sequencing process will start, so it is up to the experimenter to decide the most appropriate primer for the sequence of interest. In addition to the enzyme itself, DNA synthesis requires the four deoxyribonucleotide triphosphates (dATP, dCTP, dGTP, and dTTP) as its usual substrates. The reaction mixture also contains small amounts of a dideoxynucleotide, which has a hydrogen atom rather than a hydroxyl group on its 3'‑carbon atom. The DNA polymerase incorporates the dideoxynucleotide efficiently, but is then prevented from elongating the growing chain any further. This is because the dideoxynucleotide lacks the 3'-OH group that normally forms the phosphodiester link to the 5'-phosphate of the next nucleotide, so the sugar-phosphate backbone of the growing strand cannot be extended. That’s the ‘chain termination’ part of chain termination DNA sequencing.

As an example, consider the result of carrying out DNA synthesis in the presence of dideoxy-ATP. Including ddATP in the reaction means that DNA daughter chains will be terminated at random at all those points where the template has a thymine (the complementary base to the dideoxy-ATP). This produces a family of adenine-nucleotide-terminated chains having lengths equivalent to all the stretches in the template that extend from the 3'-end of the primer to each thymine in the sequence. Consequently, these fragments together report the position of every thymine in the template. When this family of sequences is electrophoresed in polyacrylamide gel, the fragments in the population will migrate at a rate dependent on their exact length, and a series of bands is obtained, each band corresponding to one of those ‘3'-end to thymine’ stretches in diminishing size down the gel.

Of course, there are four nucleotide bases, so to get a complete picture you need four such reactions, and, alongside the ‘3'-end to thymine’ family, there will be three other families of terminated chains corresponding to the ‘3'-end to adenine’, ‘3'-end to cytosine’, and ‘3'-end to guanine’ stretches.

When the method was first developed, the banding pattern was visualised by autoradiography, achieved by including a radioactively labelled nucleotide in each reaction. Products of the four reactions, that is the ddATP-, ddCTP-, ddGTP- and ddTTP-terminated reactions, were loaded into adjacent lanes of the gel. The smallest molecules ran fastest during electrophoresis, so the sequence could then be read from the bottom of the autoradiograph by noting the position of the band in any one of the four lanes: bands in the lane loaded with the ddATP reaction products reported the positions of thymine in the original template (remember, you are reading the radio-label of the complementary copy), bands in the lane loaded with ddCTP products reported the locations of guanine, ddGTP products reported cytosine, and ddTTP products reported adenine locations in the template.

Consequently, the sequence could be read from its 3'-end by reading up the four lanes of the gel. The smallest, fastest running molecule represented an oligonucleotide terminated at the first base position after the primer site in the template, the second band corresponded to the second base position after the primer, and so on. The sequence of the DNA template could be deduced by continuing to read the banding pattern upwards on a gel until the point near the top of the gel at which the bands could no longer be resolved. In a good separation this corresponded to about 1500 nucleotides (Moore & Novak Frazer, 2002).
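The logic of reading the four lanes can be sketched in a few lines of Python. This is an illustrative simulation only, not laboratory software: the function names and the eight-base template are hypothetical, and the template is written in the order the polymerase reads it (3' to 5'), starting at the first base after the primer.

```python
# Watson-Crick pairing used to copy the template.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def terminated_lengths(template_3to5, dd_base):
    """Lengths (in bases after the primer) of chains ending in dd_base.

    A chain can terminate at position i when the base copied opposite
    template position i is the dideoxynucleotide supplied.
    """
    return [i for i, t in enumerate(template_3to5, start=1)
            if COMPLEMENT[t] == dd_base]


def read_gel(template_3to5):
    """Read the four lanes from the bottom (shortest band) upwards.

    Returns the newly synthesised strand, 5' to 3'.
    """
    lanes = {base: terminated_lengths(template_3to5, base) for base in "ACGT"}
    band_to_base = {length: base
                    for base, lengths in lanes.items()
                    for length in lengths}
    return "".join(band_to_base[n] for n in sorted(band_to_base))


template = "TACGGTCA"                      # hypothetical template, 3' to 5'
new_strand = read_gel(template)            # complementary copy, 5' to 3'
deduced = "".join(COMPLEMENT[b] for b in new_strand)  # template, 3' to 5'
print(terminated_lengths(template, "A"))   # bands in the ddATP lane
print(new_strand, deduced)
```

In this toy case the ddATP lane has bands at the positions of the two thymines in the template, and complementing the sequence read up the gel recovers the template itself.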

This procedure is technically demanding. Over the years various aspects have been automated but the real breakthrough came with the development of dideoxynucleotides labelled with four different fluorescent labels, which allow the dideoxy-terminated bands on the gel to be distinguished by the colour of their fluorescence. Fluorochrome labelling enables all four dideoxynucleotides to be used in one reaction tube, but it has also permitted the development of sequencing machines, which have a fluorescence detector that can discriminate between the different labels.

Sequencing machines can carry out the sequencing reactions, using capillary electrophoresis separation rather than gel slabs, and screen the fluorescence of the separated bands as they emerge and pass in front of the detector, identifying and recording the sequence automatically. Automation allows enormous improvement in throughput, and the accuracy of the procedure is approximately 99.9% (= one erroneous base for every 1000 bases sequenced). Even greater accuracy is achieved if the complementary strand of the target DNA molecule is also sequenced and the two sequences compared by the computer program.
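Comparing the two strands can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the algorithm used by any particular sequencing machine: both reads are assumed to cover exactly the same region with no insertions or deletions, and the function names and example reads are hypothetical.

```python
# Translation table for complementing a DNA string.
_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_PAIRING)[::-1]


def discrepancies(forward_read, complementary_read):
    """Positions (0-based, forward-strand coordinates) where the reads disagree.

    The read from the complementary strand is reverse-complemented so that
    it can be compared base by base with the forward read.
    """
    expected = reverse_complement(complementary_read)
    if len(expected) != len(forward_read):
        raise ValueError("this sketch assumes reads of equal length")
    return [i for i, (a, b) in enumerate(zip(forward_read, expected))
            if a != b]


forward = "ATGCCAGT"                          # hypothetical forward read
print(discrepancies(forward, "ACTGGCAT"))     # reads agree everywhere
print(discrepancies(forward, "ACTGGCAA"))     # one base call disagrees
```

Positions flagged in this way would need to be resolved by further reads; agreement between the strands is what raises confidence above that of a single read.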

Further progress has led to the development of more radical approaches leading to microchips carrying arrays of different oligonucleotides to establish sequences by determining the hybridisation patterns of test molecules to the components of the array. Continued advances in miniaturisation and computer data processing, combined with electronic detection of hybridisation, make such arrays a viable means of sequencing large molecules.
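The idea of deducing a sequence from a hybridisation pattern can be illustrated with a toy reconstruction. The sketch below assumes an idealised array that reports, without error, every oligonucleotide of a fixed length k present in the test molecule, and assumes that no stretch of k-1 bases occurs twice; real arrays satisfy neither assumption, and the function name and example are hypothetical.

```python
def reconstruct_from_spectrum(kmers):
    """Chain k-mers that overlap by k-1 bases into one sequence.

    Raises ValueError if the set of k-mers does not define a single
    unambiguous chain under the assumptions stated above.
    """
    by_prefix = {}
    for kmer in set(kmers):
        if kmer[:-1] in by_prefix:
            raise ValueError("two k-mers share a prefix: ambiguous")
        by_prefix[kmer[:-1]] = kmer
    suffixes = {kmer[1:] for kmer in by_prefix.values()}
    # The first k-mer is the only one whose prefix is not another's suffix.
    starts = [kmer for prefix, kmer in by_prefix.items()
              if prefix not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting k-mer")
    sequence = starts[0]
    k = len(sequence)
    for _ in range(len(by_prefix) - 1):
        nxt = by_prefix.get(sequence[-(k - 1):])
        if nxt is None:
            raise ValueError("k-mers do not form a single chain")
        sequence += nxt[-1]
    return sequence


# Hypothetical hybridisation result: the 3-mers detected on the array.
detected = ["CCA", "ATG", "AGT", "GCC", "CAG", "TGC"]
print(reconstruct_from_spectrum(detected))
```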

On average, a single sequencing run can establish the sequence of a fragment from a few hundred to just less than a thousand bases long. Longer sequences have to be ‘stitched together’ from smaller ones, and there are two principal strategies for this:

  • shotgun sequencing (Fig. 15)
  • ordered (or directed) clone contig assembly (Fig. 16). A ‘contig’ was originally defined as a set of gel readings that are related to one another by overlap of their sequences; the overlapping segments of DNA form into a contiguous consensus sequence, the length of which is the length of the contig.

Fig. 15. Flow chart for the shotgun approach to polynucleotide sequencing.
Fig. 16. Flowchart diagram of the ordered (or directed) clone contig approach to polynucleotide sequencing.

Shotgun sequencing assembles the complete sequence directly by looking for overlaps between the shorter fragments obtained from the output of sequencing experiments. It can be done without prior knowledge of a genetic or physical map, but it does require that many fragments are sequenced. The target DNA sequence is broken into random fragments with restriction enzymes or by physical treatment like shearing or sonication. The fragments are then electrophoresed and those in the 2 kb size range cloned in a plasmid or phage vector for sequencing, and the many short sequences are matched by computer to establish the overall consensus sequence. Sequence assembly is done by computer programs that compare the short sequences, find overlapping ends and put together the contiguous sequence (that is, the contig).
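The overlap-and-merge step can be sketched with a simple greedy procedure. This is a teaching illustration, not the algorithm of any real assembly program: it assumes error-free reads all taken from the same strand, ignores repeats, and uses hypothetical function names and reads.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # remaining reads do not overlap: separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads


fragments = ["TTACGGA", "ATGCCAG", "CCAGTTAC"]   # hypothetical short reads
print(greedy_assemble(fragments))
```

Each pass compares every read with every other read, which is why the computing effort grows so quickly as the number of reads increases.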

To ensure complete coverage it is necessary to sequence far more DNA than is represented by the actual sequence length of the target, amounting to possibly thousands of sequencing experiments. It is a matter of statistics: to achieve an expected coverage of over 99.8% of a sequence it is necessary to sequence randomly chosen fragments totalling 6.5 to 8 times the true length of the molecule of interest.
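These figures are consistent with a simple Poisson approximation, sketched below. The model is an assumption introduced here, not taken from the text: reads are taken to start at random positions, and read-length and end effects are ignored, so that the chance of any one base being missed at c-fold redundancy is about e^(-c). The genome and read sizes in the example are hypothetical.

```python
import math


def expected_fraction_covered(redundancy):
    """Expected fraction of the target sequenced at least once."""
    return 1.0 - math.exp(-redundancy)


def redundancy_needed(fraction):
    """Fold-redundancy at which the expected covered fraction is reached."""
    return -math.log(1.0 - fraction)


def reads_required(target_length, read_length, redundancy):
    """Number of reads whose total length is redundancy x target_length."""
    return math.ceil(redundancy * target_length / read_length)


for c in (6.5, 8.0):
    print(f"{c}-fold: {expected_fraction_covered(c):.4%} expected coverage")
print(f"99.8% needs about {redundancy_needed(0.998):.2f}-fold redundancy")
# Hypothetical project: a 5 Mb target read in 500-base runs.
print(reads_required(5_000_000, 500, 6.5), "reads")
```

Under this approximation a little over six-fold redundancy is the point at which the expected coverage passes 99.8%.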

The shotgun strategy was the first to be successfully applied, being used to sequence the genomes of several prokaryotes. By adapting it to something like a production‑line, with each team member having a specific (and highly repetitive) task to perform, the procedure reached the point of revealing the complete sequence of any genome less than about 5 Mb within about one year. A weakness of the approach is that every sequence must be compared with every other sequence to identify overlaps so the data analysis needed to find the consensus contigs is demanding.
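The scale of the all-against-all comparison can be estimated by counting pairs of reads. The read counts below are hypothetical, and the calculation assumes a naive approach in which every pair is examined once.

```python
import math


def pairwise_comparisons(n_reads):
    """Number of distinct pairs among n_reads sequences: n(n-1)/2."""
    return math.comb(n_reads, 2)


for n in (1_000, 10_000, 65_000):   # hypothetical project sizes
    print(f"{n:>6} reads -> {pairwise_comparisons(n):,} comparisons")
```

A ten-fold increase in the number of reads gives roughly a hundred-fold increase in the number of comparisons.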

If the study is interested in particular genes for which some sequence information (and, therefore, hybridisation probes) is available, these can be used to select out sequences of interest in the early stages of a shotgun-sequencing project. This approach aims at concentrating attention on aspects of particular interest rather than sequencing an entire molecule, and relies on the expectation that there is a good chance that the first few sequence fragments, since they are obtained at random, will include recognisable parts of the gene(s) of interest. Several other short cuts to greater efficiency have been found, but more structured strategies are required for larger sequences.

The clone contig strategy uses information from previously established genetic and/or physical maps. This approach uses the mapping information to ‘anchor’ the cloned fragments derived from restriction digests onto the previously established maps. Consequently, sequence data from the clones can be ordered (made ‘contiguous’) by reference to the pre-existing map. This is the ‘clone contig’ aspect of the strategy, which is well suited to BAC and YAC clones.

Clone contigs are built up by using the DNA sequence from the starting clone to make a hybridisation probe which is used to screen the rest of the library to identify a second clone with which the first overlaps and then the sequence of the second to identify a third clone with which the second overlaps, and so on. This is the essence of chromosome walking. The procedure can be confused if the target DNA contains many repetitive sequences but these are relatively uncommon in fungi.
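Chromosome walking can be mimicked on a toy clone library. In this sketch the ‘hybridisation probe’ is simply the last few bases of the current clone, and a clone ‘hybridises’ if it contains that string exactly; the clone names, sequences and probe length are all hypothetical, and the walk proceeds in one direction only.

```python
def walk(library, start, probe_len=4):
    """Return clone names in the order visited, walking rightwards.

    library maps clone names to insert sequences.
    """
    order = [start]
    current = start
    while True:
        probe = library[current][-probe_len:]
        hits = [name for name, seq in library.items()
                if name not in order and probe in seq]
        if not hits:
            return order
        # Choose the hit that extends furthest beyond the probe site.
        current = max(hits, key=lambda name:
                      len(library[name]) - library[name].index(probe))
        order.append(current)


# Hypothetical overlapping clones from one region.
clones = {
    "c3": "GATCCTAGGCA",
    "c1": "ATGCCAGTTA",
    "c2": "GTTACGGATC",
}
print(walk(clones, "c1"))
```

If the probe sequence occurred in several unrelated clones, as it would within a repetitive region, the walk could jump to the wrong clone, which is the confusion described above.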

The slow process of chromosome walking can be avoided if high-resolution restriction maps are available, or are prepared, that have sufficient detail to act as fingerprints of the clones. Clone contigs can then be assembled from clones that have restriction fingerprint features in common in their overlap regions; these clones being identified by appropriate hybridisation probes.

The ordered (or directed) shotgun approach (Fig. 16) makes use of the sequence as it is obtained. Sequence information for the end of each completed segment is used to synthesise PCR primers, which are used to recover the adjacent sections. The process can be repeated until the entire region of interest has been sequenced and assembled.

Of course, these different approaches are complementary, rather than mutually exclusive, and can be used in combination to fit the needs of the project. As we have described it, sequencing of microbial genomes may appear to be routine, but it is a major undertaking. We noted above that approximately 600 scientists were involved in the yeast genome-sequencing project, over a period of about six years. That’s more than 3000 person-years of sequencing effort on an organism that had regularly featured in research over the previous hundred years and for which, at the outset of the sequencing project, there was already a conventional linkage map featuring some 1200 genes encoding either RNA or protein products.

Physical and molecular analyses originally moved the genetical focus away from the functional gene and towards the DNA sequence; now, functional genomics dominates. Sequencing an entire genome is only the beginning of functional studies of the transcriptome (all the transcripts made from the genome), the proteome (all the polypeptides made from the transcriptome), and the metabolome (all the metabolic reactions governed by the proteome). Now, at the beginning of the twenty-first century our understanding is that the genome is not context sensitive because it is the full set of genetic information. Instead, the transcriptome, proteome and metabolome are all context sensitive because what they comprise depends upon the instantaneous regulatory status of the cell.

Or, as the old-time geneticists at the beginning of the twentieth century put it: phenotype = genotype + environment.

Updated December 17, 2016