18.13 Genomic data mining supports the notion that there are different developmental control mechanisms in fungi, animals and plants

The Broad Institute’s collection of fungal genomes has been discussed in Section 18.7 (and visit http://www.broadinstitute.org/). This is just one of a number of data banks around the world in which genomic sequencing data are maintained. As genomes were sequenced and such databases increased in size (the amount of data doubles every three years), it became a major challenge for bioinformatics tools to be able to query the databases effectively. When information is contained in different databases, with different data formats, it is very difficult to use a single query tool to search more than one source of data. Gathering data from a range of sources, transforming it, and extracting meaningful patterns from it is called data mining. Data mining is vital to genomic analysis, but it is also commonly used outside scientific research, in areas such as marketing, commercial stock control, financial services, surveillance, and fraud detection.

All of the sequence databases offer a variety of analytical devices, usually as free downloads, to their users. These include software tools for assembling genome sequences from shotgun sequencing; for visualising and annotating genomes; for identifying ORFs in a DNA sequence by locating stop and start codons; to deduce amino acid sequences; to analyse gene expression; to align cDNA or mRNA sequences to genomic sequences to determine the exon/intron structure; for analyses of multiple sets of genomic data; for comparing protein sequences; to compare mass spectra of peptides and proteins for identification; and many others, often depending on the particular interests of the institution that maintains the sequence database. To view some of these visit the following websites:

The most widely used program for data mining is called BLAST; this is an acronym standing for Basic Local Alignment Search Tool. BLAST, which was devised in 1990 (Altschul et al., 1990), finds regions of local similarity between sequences and can be used to compare nucleotide or protein sequences and calculate the statistical significance of any matches found. BLAST is widely used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. BLAST is an important program that now comes in several variants optimised for particular purposes; it is under continuous improvement and our best advice is to consult the appropriate pages on the NCBI website at http://blast.ncbi.nlm.nih.gov/.
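To make the phrase ‘regions of local similarity’ concrete, here is a minimal sketch in Python of local alignment scoring by the Smith-Waterman dynamic programming method. This is not how BLAST itself works (BLAST uses faster heuristics to approximate this kind of search), and the scoring values for match, mismatch and gap are arbitrary assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman scoring with a simple linear gap penalty.
    The default scores are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always zero
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap        # gap in sequence b
            left = curr[j - 1] + gap  # gap in sequence a
            # Flooring at zero lets an alignment start anywhere,
            # which is what makes the alignment "local".
            score = max(0, diag, up, left)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Two toy sequences that share a short common region.
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

The zero floor in the score is the key design choice: unrelated flanking sequence cannot drag the score down, so a short shared region inside two otherwise dissimilar sequences is still detected.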

Resources Box

Learning data mining

If you want to learn how to search and analyse genomes for yourself, there are two very helpful resources on the Internet:

Our example data mining exercise has already been mentioned in Section 12.14 as a search for gene sequences that provided support for the idea that control mechanisms of fungal multicellular developmental biology are probably very different from those known in animals and plants. In an initial study a quick comparison was made of genomes of the saprotrophic Coprinopsis cinerea and Phanerochaete chrysosporium, and several other Basidiomycota and Ascomycota, using searches for developmental sequences known from the nematode worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster (Moore, Walsh & Robson, 2005).

In this study the fungal genomes were searched for homologues of sequences of the animal signalling mechanisms known as Wnt, Hedgehog, Notch and TGF-β, all of which are considered by animal developmental biologists to be essential, and highly conserved, components of normal development in all animals. None of these sequences were found in the fungal genomes (they also proved to be absent from plants).

Later, a fully comprehensive data mining exercise was completed to search for homologies to sequences assigned to the category ‘development’ in a standard database (Moore & Meškauskas, 2006). The basis for this was the Gene Ontology (GO) database (visit the URL at: http://www.geneontology.org/GO.database.shtml), which lists known genetic sequences (DNA, RNA, protein) grouped by the cellular process to which they contribute. The initial query to the Gene Ontology database was to get all gene sequences that had been assigned to the gene group ‘development’; that is, any sequence stored on the database that had been described by its originator as having something to do with development, and that means any aspect of development in any organism. Then each and every ‘developmental’ sequence retrieved from the GO server was used to search all the genomes (and partial genomes) included in the taxonomic listings used by NCBI databases (http://www.ncbi.nlm.nih.gov/), which at that time comprised:

  • Metazoa (875 genome sequences);
  • Viridiplantae (53 genome sequences);
  • and the entire list included under ‘Fungi’ (141 genome sequences).

To make such a job possible, you need automatically running web agents, which are reusable programming modules that interact with the Internet seeking user-defined goals; for example ‘get the sequence data’, ‘get the taxonomy information’ or ‘get the similarity search results’, etc. The agents were created using an application called Sight, which is a Java-based software package that provides a user-friendly interface to generate and connect web robots for automatic genomic data mining (visit http://bioinformatics.org/jSight/).
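The idea of chaining goal-directed agents can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the Sight software or its interface: every function name below is hypothetical, and the three ‘agents’ are in-memory stand-ins that return toy data rather than contacting any real server.

```python
def run_pipeline(get_sequences, get_taxa, search):
    """Chain three agents: fetch sequences, fetch taxa, search every pair.

    Each argument is a callable standing in for one web agent.
    Returns a list of (sequence_id, taxon, hit_found) tuples.
    """
    results = []
    taxa = list(get_taxa())
    for seq_id, sequence in get_sequences():
        for taxon in taxa:
            results.append((seq_id, taxon, search(sequence, taxon)))
    return results


# Hypothetical stand-in agents with toy data (no network access).
def fake_get_sequences():
    return [("dev_1", "MKTAYIAK"), ("dev_2", "MLSRAVCG")]


def fake_get_taxa():
    return ["Metazoa", "Viridiplantae", "Fungi"]


def fake_search(sequence, taxon):
    # Toy rule for demonstration only: pretend nothing matches in Fungi.
    return taxon != "Fungi"


if __name__ == "__main__":
    for row in run_pipeline(fake_get_sequences, fake_get_taxa, fake_search):
        print(row)
```

Passing the agents in as arguments keeps each one reusable: the same pipeline could be run with a different sequence source or a different search service without changing the loop that connects them.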

The web agents retrieved the appropriate sequences from the GO database, retrieved information about them from other databases and then submitted those sequences to search for similarities in all the genomes recovered from the NCBI databases. In total the web agents accomplished an estimated 590,000 similarity searches across all available genomes for the 552 developmental sequences recovered from the GO database. If you were doing that number of searches yourself by hand you would have to complete about 100 searches a day, doing one search every 15 minutes for 24 hours a day, 7 days a week, and the job would still take about 16 years! That’s why software automation is important.
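The size of this job can be checked with a short calculation. The sketch below assumes one similarity search per developmental sequence per genome, which is our reading of how the estimate of 590,000 arises; the genome counts are those listed above, and the function name is ours.

```python
# Genome counts from the NCBI taxonomic listings quoted above.
GENOMES = {"Metazoa": 875, "Viridiplantae": 53, "Fungi": 141}
QUERY_SEQUENCES = 552  # developmental sequences recovered from GO


def workload(minutes_per_search=15):
    """Return (total searches, searches per day, years of non-stop work)."""
    total_genomes = sum(GENOMES.values())
    # Assumption: every query sequence is searched against every genome.
    total_searches = QUERY_SEQUENCES * total_genomes
    searches_per_day = 24 * 60 // minutes_per_search
    years = total_searches / searches_per_day / 365.25
    return total_searches, searches_per_day, years


if __name__ == "__main__":
    total, per_day, years = workload()
    print(f"{total} searches at {per_day} per day takes {years:.1f} years")
```

One search every 15 minutes is 96 searches a day, slightly fewer than the round figure of 100, so working by hand without a break would take a little over 16 years.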

The results, which were also collected together by an automatic web-agent, showed that:

  • None of the sequences involved in animal or plant multicellular development can be found in the genomes of fungi.
  • No sequences were strictly fungus specific, but 68 occurred only in plants and 239 occurred only in animals.
  • True homology was limited to 78 sequences involved in the architecture of the eukaryotic cell.
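A tally of this kind can be produced by recording, for each query sequence, the set of Kingdoms in which a significant match was found and then counting the categories. The following is a minimal sketch with made-up sequence identifiers and toy results; it is not the data or the software used in the study.

```python
from collections import Counter

ALL_KINGDOMS = frozenset({"animals", "plants", "fungi"})


def classify(hit_kingdoms):
    """Label a query sequence by the Kingdoms in which it has significant hits."""
    kingdoms = frozenset(hit_kingdoms)
    if not kingdoms:
        return "no significant hit"
    if kingdoms == ALL_KINGDOMS:
        return "all three Kingdoms"
    if len(kingdoms) == 1:
        return f"{next(iter(kingdoms))} only"
    return "two Kingdoms"


def tally(results):
    """Count how many query sequences fall into each category."""
    return Counter(classify(kingdoms) for kingdoms in results.values())


if __name__ == "__main__":
    # Hypothetical toy results: sequence id -> Kingdoms with significant hits.
    toy_results = {
        "seq_A": {"animals"},
        "seq_B": {"plants"},
        "seq_C": {"animals", "plants", "fungi"},
        "seq_D": {"animals"},
    }
    for category, count in tally(toy_results).items():
        print(category, count)
```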

It is crucial to the interpretation of data mining results like this that BLAST outputs the statistical significance of any matches it finds. Thus, statements above of the sort ‘none of the sequences…can be found’ really mean that BLAST reported that none of the matches it found was significantly different, statistically, from random sequence similarity. This statistical analysis feature provides the confidence for the statements that in fungi there are no Wnt, Hedgehog, Notch, TGF-β or p53 sequences, all of which are crucial to animals; nor were there any SINA or NAM sequences, which are crucial to plant multicellular development.
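The significance measure that BLAST reports is the expect value (E-value): the number of matches with at least a given score that would be expected by chance in a search of that size. A minimal sketch of the standard Karlin-Altschul formula, E = K·m·n·e^(−λS), is given below. The values used for λ and K are illustrative assumptions of the kind printed in BLAST output for a protein scoring system, the cut-off of 1e-5 is an arbitrary choice for the example, and real BLAST also corrects the sequence lengths for edge effects, which this sketch omits.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score that is independent of the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def is_significant(raw_score, query_len, db_len, lam, k, cutoff=1e-5):
    """True if the match is unlikely to be a random similarity."""
    return expect_value(raw_score, query_len, db_len, lam, k) < cutoff


if __name__ == "__main__":
    LAM, K = 0.267, 0.041  # illustrative parameter values (assumption)
    for score in (40, 100, 200):
        e = expect_value(score, 300, 5_000_000, LAM, K)
        print(score, f"{e:.3g}", is_significant(score, 300, 5_000_000, LAM, K))
```

Because the E-value scales with the size of the database searched, the same raw score can be significant in a search of one small genome and insignificant in a search of a very large collection.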

The overwhelming majority of highly similar matches found in this survey proved to be between sequences involved in basic cell metabolism or essential eukaryotic cell processes: enzymes in common metabolic pathways, many transcription regulators, binding proteins, receptors and membrane proteins.

  • What is lacking is cross-Kingdom similarity in the ‘higher-management’ functions that integrate these ‘nuts and bolts’ of eukaryotic cells to make tissues, organs and organisms.
  • Overall, these studies suggest that there are no resemblances between the crown group of eukaryotic Kingdoms in the ways they control and regulate their developmental processes.

The unique cell biology of filamentous fungi has clearly caused control of their multicellular development to evolve in a radically different fashion from that in animals and plants. At the moment, unfortunately, we are totally ignorant of the way fungi regulate their multicellular development, though we do have the tools to study this.

Updated December 17, 2016