18.12 Bioinformatics in mycology; manipulating very large data sets

Bioinformatics is essentially the use of computers to manage, process and understand very large amounts of biological information. Bioinformatics is particularly important as an adjunct to genomics research because of the large amount of complex data this type of research generates. To a great extent, therefore, the word, and the approaches it encompasses, have become synonymous with the use of computers to store, search and characterise the genetic code of genes (genomics), the transcription products of those genes (transcriptomics), the proteins related to each gene (proteomics) and their associated functions (metabolomics), and we will concentrate our attention on these aspects.
 
But there are other large data sets in need of analysis that rightly fall within the scope of the fundamental definition of the word ‘bioinformatics’. These are large data sets arising from:

  • Survey data and censuses, particularly, but not only, those involving automatic data capture.
  • Data generated by mathematical models that seek to simulate a biological system and its behaviour in time.

We will briefly mention examples of these non-genomic investigations towards the end of this chapter.

The ultimate aim of functional genomics is to determine the biological function of all the genes and their products, how they are regulated and how they interact with other genes and gene products. Add interactions with the environment and this is fully integrated biology; what has come to be known as systems biology (Klipp et al., 2009; Nagasaki et al., 2009).

The approach involves analysis at three levels:

  • mRNAs, the transcriptome,
  • proteins, the proteome, and
  • low molecular weight intermediates, the metabolome.

Comprehensive studies of such large collections of molecules require what are described as high-throughput methods of analysis at each stage from the generation of mutants through to the determination of which proteins are associated with which functions. Each stage generates massive amounts of data that are qualitatively and quantitatively different, which must be integrated to allow construction of realistic models of the living system (Delneri, Brancia & Oliver, 2001).

Functional genomic analysis of the yeast Saccharomyces cerevisiae established the key concepts, approaches and techniques, although research on filamentous fungi is expanding (Foster, Monahan & Bradshaw, 2006). Considerable progress was made in analysis of yeast gene function using mutants with deletions of single open reading frames (ORFs). However, genetic redundancy in the genome, resulting perhaps from gene duplication(s) during evolution, can be a problem in this type of analysis. In retrospect, analysis of yeast shows that much of the redundancy in the yeast genome is made up of identical, or almost identical, gene products fulfilling distinct physiological roles due to differential expression of the genes under different physiological conditions, and/or targeting of the similar proteins to different cellular compartments. Nevertheless, more extensive studies require more extensive collections of mutants: those in which entire gene families are deleted and, ultimately, a collection in which all genes are represented by appropriate mutants. There is scope for large-scale international collaboration in this sort of exercise, and 1999 saw the establishment of a collection of mutant yeast strains, each bearing a defined deletion in one of the 6,000+ potential protein-encoding genes in yeast (Winzeler et al., 1999). This is the EUROSCARF collection (EUROpean Saccharomyces Cerevisiae ARchive for Functional analysis; see http://web.uni-frankfurt.de/fb15/mikro/euroscarf/col_index.html). Using a PCR-based gene disruption strategy (see Fig. 18.21), mutant strains deleted for most of the open reading frames in the genome were prepared in this systematic deletion project. In addition, each deletion was flanked by two 20 base pair sequences unique to that deletion. These allow the sequences to be detected easily; effectively they act as molecular barcodes that allow large numbers of deletion strains, potentially the whole library, to be analysed in parallel.
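
The barcode idea can be illustrated with a minimal sketch. In the hypothetical Python example below the strain names and 20-base-pair barcode sequences are invented for illustration (they are not taken from the EUROSCARF collection): because every deletion strain carries a unique tag, reads recovered from a pooled culture can be assigned to strains simply by counting barcode occurrences.

```python
# Minimal sketch of barcode deconvolution for a pooled deletion library.
# Strain names and barcode sequences below are invented for illustration.
from collections import Counter

# Each deletion strain is tagged with a unique 20-bp 'molecular barcode'.
barcode_to_strain = {
    "ACGTACGTACGTACGTACGT": "yfg1-delta",   # hypothetical strain names
    "TTGCAATTGCAATTGCAATT": "yfg2-delta",
    "GGCCTTAAGGCCTTAAGGCC": "yfg3-delta",
}

def count_strains(reads):
    """Tally how often each strain's barcode appears in a pool of reads."""
    counts = Counter()
    for read in reads:
        strain = barcode_to_strain.get(read)
        if strain is not None:
            counts[strain] += 1
    return counts

# Example: reads recovered from a pooled competitive-growth experiment.
reads = ["ACGTACGTACGTACGTACGT"] * 5 + ["GGCCTTAAGGCCTTAAGGCC"] * 2
print(count_strains(reads))   # relative abundance of each deletion strain
```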

Another approach used a transposon that created gene fusions in a yeast clone library so that the protein products of the mutated yeast genes could be identified and analysed by immunofluorescence using antibodies to the peptide introduced by the transposon (Ross-Macdonald et al., 1999). In the original work a yeast genomic DNA library was mutagenised in Escherichia coli with a multipurpose minitransposon derived from the bacterial transposable element known as Tn3. The minitransposon contained cloning sites and a 274-base pair sequence encoding 93 amino acids, called a HAT tag, that was inserted into the yeast target proteins. The HAT tag allows immunodetection of the mutated yeast protein. Transposon mutagenesis generated 10⁶ independent transformants. Subsequently, individual transformant colonies were selected and stored in 96-well plates. Plasmids were prepared from these strains and transformed into a diploid yeast strain in which homologous recombination integrated each fragment at its corresponding genomic locus, thereby replacing its genomic copy. In all, 92,544 plasmid preparations and yeast transformations were carried out, identifying a collection of over 11,000 strains, each carrying a transposon inserted within a region of the genome expressed during vegetative growth and/or sporulation. These insertions affected nearly 2,000 annotated genes, distributed over all 16 yeast chromosomes and representing about one-third of the yeast genome (Ross-Macdonald et al., 1999). This study demonstrates the value of a particular strategy for mutant generation and detection, but it also indicates the scale of what has been called ‘new yeast genetics’. Finding methods that generate large numbers of gene mutants and simultaneously identify the mutants and/or their products in ways amenable to automation is the start of the high-throughput approach (see http://ygac.med.yale.edu/default.stm) (Cho et al., 2006; Caracuel-Rios & Talbot, 2008; Honda & Selker, 2009).
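
Mapping tens of thousands of insertion sites onto annotated genes is a simple but representative bioinformatics bookkeeping task. The Python sketch below uses invented gene coordinates and insertion positions, not the real Ross-Macdonald data; it only illustrates the basic logic of asking what fraction of annotated genes has been hit at least once.

```python
# Rough sketch: tallying transposon insertions per annotated gene to gauge
# coverage of the genome. Gene names and coordinates are invented examples.

# (chromosome, start, end) intervals for a few hypothetical annotated genes
genes = {
    "GENE1": ("chrI", 1000, 2500),
    "GENE2": ("chrI", 3000, 4200),
    "GENE3": ("chrII", 500, 1800),
}

# (chromosome, position) of mapped transposon insertion sites (invented)
insertions = [("chrI", 1200), ("chrI", 2400), ("chrI", 5000), ("chrII", 900)]

hits = {name: 0 for name in genes}
for chrom, pos in insertions:
    for name, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            hits[name] += 1

covered = sum(1 for n in hits.values() if n > 0)
print(hits)
print(f"{covered}/{len(genes)} genes hit at least once")
```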

Messenger RNA molecules are the subject of transcriptome analyses and can be studied in a fully comprehensive manner using hybridisation-array analysis, which is described as a massively parallel technique because it allows so many sequences to be examined at one time. Remember, though, that mRNA molecules transmit instructions for synthesising proteins; they do not function otherwise in the workings of the cell, so transcriptome analyses are considered to be an indirect approach to functional genomics.

The transcriptome comprises the complete set of mRNAs synthesised in the cell under any given well-defined set of physiological conditions. Unlike the genome, which has a fixed collection of sequences, the transcriptome is context dependent, which means that its content of sequences depends on the cell response to the current set of physiological circumstances, and the make-up of that set will change when the physiological circumstances change. Those physiological circumstances will be adjusted in response to changes in both the intracellular and extracellular environment of the cell: its nutritional status, state of differentiation, age, and so on. The mRNA of genes that are newly expressed (up-regulated) will appear in the sequence collection, and the mRNA of genes that are no longer expressed (down-regulated) in the new circumstance will disappear from the sequence collection. Determination of the nature and sequence content of the transcriptome in all of these circumstances is precisely what transcriptome analysis is intended to achieve, because the pattern of mRNA content in the transcriptome reveals the pattern of gene regulation.
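
As a toy illustration of this appearance and disappearance of transcripts, the following Python sketch compares two hypothetical transcriptome ‘snapshots’. The gene names are familiar yeast examples but the abundance values are invented, and the two-fold threshold is arbitrary; real analyses are considerably more careful.

```python
# Illustrative sketch of comparing two transcriptome 'snapshots'.
# Abundance values are hypothetical; the two-fold threshold is arbitrary.

reference = {"HSP104": 12.0, "ACT1": 55.0, "CUP1": 0.0}     # condition A
treated   = {"HSP104": 250.0, "ACT1": 50.0, "CUP1": 30.0}   # condition B

for gene in sorted(set(reference) | set(treated)):
    before = reference.get(gene, 0.0)
    after = treated.get(gene, 0.0)
    if after > 2 * before:
        status = "up-regulated"
    elif before > 2 * after:
        status = "down-regulated"
    else:
        status = "unchanged"
    print(f"{gene}: {before} -> {after} ({status})")
```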

Hybridisation arrays are now used widely to study the transcriptome because of their ability to measure the expression of a large number of genes with great efficiency. Microarrays permit assessment of the relative expression levels of hundreds, even thousands, of genes in a single experiment. Hybridisation arrays are also called DNA micro- or macroarrays, DNA chips, gene chips, and biochips (Duggan et al., 1999; Nowrousian, 2007). The web definition of DNA microarray is: a collection of microscopic DNA spots attached to a solid surface forming an array; used to measure the expression levels of large numbers of genes simultaneously (http://en.wiktionary.org/wiki/DNA_microarray). The arrayed single-stranded DNA molecules are typically distributed on glass, a nylon membrane, or a silicon wafer (any of which might be called ‘a chip’), each being immobilised at a specific location on the chip in a predetermined grid formation. Microarrays and macroarrays differ in the size of the sample spots of DNA: in macroarrays the size of the spot is over 300 µm, in microarrays less than 200 µm. Macroarrays are normally spotted onto nylon membranes; microarrays are made on glass surfaces (usually called custom arrays) or quartz (GeneChip®, from Affymetrix Inc.; see http://www.affymetrix.com/products_services/index.affx) (Lipshutz et al., 1999) by high-speed robotics. The immobilisation onto the solid matrix is the most crucial aspect of the technique as it must preserve the biological activity of the molecules. The photolithographic technique most used in making DNA arrays was developed by Affymetrix Inc. The spotted material can be genomic DNA, cDNA, PCR products (any of these sized between 500 and 5,000 base pairs) or oligonucleotides (20- to 80-mer oligos).
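
The ‘predetermined grid’ is the key to later interpretation: every position on the chip must be unambiguously associated with the identity of the DNA spotted there. A minimal Python sketch of that bookkeeping, using invented gene names and an arbitrarily chosen grid width, might look like this:

```python
# Sketch of the 'predetermined grid' idea behind a spotted array: every
# (row, column) position is assigned a known probe. Gene names are examples.
probes = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5", "GENE6"]

n_cols = 3
grid = {}               # (row, col) -> identity of the immobilised DNA
for index, gene in enumerate(probes):
    row, col = divmod(index, n_cols)
    grid[(row, col)] = gene

# Knowing the layout lets a scanned signal at any position be attributed
# to a specific gene.
print(grid[(1, 2)])     # -> GENE6
```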

The identities and locations of the single-stranded DNAs are known, so when the chip is treated with a suspension of experimental cDNA molecules prepared from a set of mRNAs, the cDNAs complementary to those on the chip will bind to those specific spots. The complementary binding pattern can be detected and, since the DNAs at each position on each grid are known, the complementary binding pattern indicates the pattern of gene expression in the sample. Macroarrays are hybridised using a radioactive probe, normally ³³P, an isotope of phosphorus that decays by β-emission, so the decay, and therefore the position of complementary binding, can be imaged with a phosphorimager. In this device β-particle emissions excite the phosphor molecules on the plate; the plate is then scanned with a laser and the attached computer converts the energy it detects into an image in which different colours represent different levels of radioactivity. Microarrays are exposed to a set of targets either separately (single dye experiment) or in a mixture (two dye experiment) to determine the identity/abundance of complementary sequences. Laser excitation of the spots yields an emission with a spectrum characteristic of the dye(s), which is measured using a scanning confocal laser microscope. Monochrome images from the scanner are imported into software in which the images are pseudo-coloured and merged, and combined with information about the DNAs immobilised on the chip. The software outputs an image which shows whether expression of each gene represented on the chip is unchanged, increased (up-regulated) or decreased (down-regulated) relative to a reference sample. In addition, data are accumulated from multiple experiments and can be examined using any number of data mining software tools.
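
Once the scanned images have been reduced to a pair of intensities per spot, a common way of summarising a two-dye comparison is a log₂ ratio of test to reference signal, thresholded into the unchanged/up-regulated/down-regulated calls mentioned above. The Python sketch below is illustrative only: the intensities are invented, the two-fold threshold is arbitrary, and real analyses normalise the data first.

```python
# Hedged sketch of how a two-dye (reference vs. test) comparison is often
# summarised: a log2 ratio per spot, thresholded into expression calls.
import math

spot_intensities = {            # gene -> (reference signal, test signal)
    "GENE1": (1000.0, 4100.0),
    "GENE2": (2000.0, 1950.0),
    "GENE3": (3000.0, 700.0),
}

THRESHOLD = 1.0   # |log2 ratio| > 1 means at least a two-fold change

for gene, (ref, test) in spot_intensities.items():
    ratio = math.log2(test / ref)
    if ratio > THRESHOLD:
        call = "up-regulated"
    elif ratio < -THRESHOLD:
        call = "down-regulated"
    else:
        call = "unchanged"
    print(f"{gene}: log2 ratio = {ratio:+.2f} ({call})")
```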

DNA microarrays have many uses. Apart from the expression profiling to examine the effect of physiological circumstance on gene expression on which we have so far concentrated, hybridisation arrays can be used to:

  • dissect metabolic pathways and signalling networks;
  • establish transcription factor regulatory patterns, target genes and binding sites;
  • compare gene expression in normal tissue with that of diseased tissue, initially to establish which genes are involved in the response to disease and, once that is established, to diagnose disease;
  • identify gene expression of different tissues and different states of cell differentiation to establish tissue-specific and/or differentiation-specific genes;
  • study reaction to specific drugs, agrochemicals, antibiotics or toxins to identify drug targets, side effects, and resistance mechanisms.

The proteome is the complete set of proteins synthesised in the cell under a given set of conditions. The classic methods involved are two-dimensional gel electrophoresis (2DGE) or liquid chromatography for protein separation and mass spectrometry (MS) for protein identification. Although the proteins are true functional entities within the cell, analysis of the proteome is difficult, largely because of slow progress in identifying the proteins that make up the proteome. Nevertheless, continued improvement in technology is steadily increasing the throughput of protein identifications from complex mixtures and permitting quantification of protein expression levels and how they change in different circumstances (Washburn & Yates, 2000; Bhadauria et al., 2007; Rokas, 2009). An important feature that arises from analysis of the proteome is the enormous extent and complexity of the network of interactions among proteins and between proteins and other components of the cell. These networks can be visualised as maps of cellular function, depicting potential interactive complexes and signalling pathways (Legrain, Wojcik & Gauthier, 2001; Tucker, Gera & Uetz, 2001).
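
The logic behind MS-based identification can be caricatured in a few lines: observed peptide masses from a spectrum are compared with the theoretical masses expected for candidate proteins, and the candidate that explains the most observed masses is the best match. The following Python sketch is a deliberately simplified illustration; the protein names, masses and tolerance value are all invented, and real search engines score matches far more carefully.

```python
# Very simplified sketch of peptide-mass-fingerprint identification:
# compare observed peptide masses from MS with theoretical masses computed
# for candidate proteins. All masses here are invented.

theoretical = {                       # candidate protein -> peptide masses (Da)
    "PROTEIN_A": {501.3, 723.4, 1104.6, 1530.8},
    "PROTEIN_B": {488.2, 990.5, 1233.7},
}

observed = {501.3, 1104.6, 1530.8, 640.0}   # masses measured in the spectrum

TOLERANCE = 0.2   # mass tolerance in daltons

def matches(protein_masses, observed_masses, tol):
    """Count observed masses that fall within tolerance of a theoretical mass."""
    return sum(
        any(abs(o - t) <= tol for t in protein_masses) for o in observed_masses
    )

scores = {name: matches(masses, observed, TOLERANCE)
          for name, masses in theoretical.items()}
best = max(scores, key=scores.get)
print(scores, "-> best match:", best)
```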

‘Metabolomics consists of strategies to quantitatively identify cellular metabolites and to understand how trafficking of these biochemical messengers through the metabolic network influences phenotype’ (quoted from Jewett, Hofmann & Nielsen, 2006). Metabolomics is particularly important in fungi because these organisms are widely used for the production of chemicals. The main difficulty in metabolome analysis is not technical, as there are sufficient analytical tools and mathematical strategies available for extensive metabolite analyses. Rather, the indirect relationship between the metabolome and the genome raises conceptual difficulties. The biosynthesis or degradation of a single metabolite may involve many genes, and the metabolite itself may impact on many more. Consequently, the bioinformatics tools and software required must be exceptionally powerful.
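
That many-to-many relationship between metabolites and genes can be sketched very simply. In the hypothetical Python example below (gene and metabolite names are placeholders), each reaction links one gene to the metabolites it consumes and produces; inverting that list shows how a single metabolite can implicate several genes at once.

```python
# Sketch of the many-to-many relationship described above: a single
# metabolite can involve many genes, and a single gene can touch many
# metabolites. Names are hypothetical placeholders.

reactions = [
    # (gene, substrate, product)
    ("geneA", "metabolite_1", "metabolite_2"),
    ("geneB", "metabolite_2", "metabolite_3"),
    ("geneC", "metabolite_2", "metabolite_4"),
    ("geneA", "metabolite_4", "metabolite_5"),
]

genes_per_metabolite = {}
for gene, substrate, product in reactions:
    for metabolite in (substrate, product):
        genes_per_metabolite.setdefault(metabolite, set()).add(gene)

# Every metabolite maps to all genes whose reactions produce or consume it.
for metabolite, genes in sorted(genes_per_metabolite.items()):
    print(metabolite, "->", sorted(genes))
```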

Ultimately, you may think in terms of applying all this knowledge to the creation of something entirely new. That is, to developing a biological system of some form that does not already exist in the biosphere. In the past this was achieved by the evolutionary process of artificial selection, producing crop species (like maize) or domesticated animals (like high milk-yield cattle) that simply could not exist in the wild. The ‘modern’ version of this is called synthetic biology, and with the current passion for applying management definitions to long-standing activities it has been defined as the area of science that applies engineering principles to biological systems to design and build novel biological functions and systems. The potential examples offered by the BBSRC, which is ‘…the main UK public funder of bioscience research…’, include: ‘… the creation of novel systems to generate power, new medical applications, nanoscale biological computers, new approaches to cleaning up dangerous waste or sensitive biosensors for health or security applications...’ (visit http://www.bbsrc.ac.uk/media/news/2009/090904_public_dialogue_opened_on_synthetic_biology.html). Wikipedia provides a more helpful definition of synthetic biology as ‘…the broad redefinition and expansion of biotechnology, with the ultimate goals of being able to design and build engineered biological systems that process information, manipulate chemicals, fabricate materials and structures, produce energy, provide food, and maintain and enhance human health and our environment…’ (visit http://en.wikipedia.org/wiki/Synthetic_biology). To these definitions Kaznessis (2007) adds the crucial rider that synthetic biological engineering is emerging from molecular biology as a distinct discipline based on quantification. And that’s the real defining feature: this is a branch of biology that depends on large-scale computer processing of large amounts of numerical data.
