A New Machine Learning Algorithm for Fungal Classification
When you think of fungi, you are likely to wrinkle your nose, picturing the slimy toadstools that coat rotting tree trunks in the forest or the navy blue splotches that appear on old cheese. While fungi may have a bad reputation, they are a diverse and fascinating group of organisms that have a far greater impact on our daily lives than one might expect. Aside from living in forests and in your fridge, fungi can be found in soil, intertwined with plant roots, and even in your gut. Fungi play an important role in our lives by aiding crop growth through redistribution of nutrients in the soil and symbiotic relationships with plant roots. Furthermore, fungi living in our gut, known as our gut mycobiota, help maintain the integrity of the microbial communities that are the secret behind digestion of tougher materials such as the cellulose of plant fibers. The importance of fungi has prompted numerous studies of their diverse forms and efforts to classify them on the tree of life. With millions of different fungal species, many of them microscopic, it is very difficult for fungal specialists (known as mycologists) to rely on traditional classification techniques. To tackle this problem, Dr. Fábio Miranda and colleagues have developed a novel machine learning model to increase the effectiveness of fungal taxonomic classification.
Classifying Microbes
You are likely familiar with the idea of taxonomic classification, the process of placing organisms into groups based on shared characteristics and ancestry. Oftentimes, when looking at organisms that are closely related such as a leopard and a lion, the physical characteristics of each animal are telling of this relationship. However, when looking at organisms under a microscope, it can be almost impossible to classify them based on physical appearance alone. In the case of fungi, many are known to create structures that are composed of more than one species. This only serves to complicate the process of classifying them. Since physical, or morphological, characteristics cannot easily be used to classify microorganisms, scientists resort to the analysis of DNA through genomic techniques.
Genomics techniques used to analyze DNA are thoroughly discussed in the main Computational Biology course, so we will not discuss them at length here. These techniques often rely on computers, which, in comparative analyses, compare the bases of DNA sequences to find similarities. Even with the assistance of computers, complete genomes are extremely long, with the human genome being approximately 3.2 billion base pairs in length! Of course, genome length varies, with some genomes shorter and others longer than the human genome, but, regardless, these lengths can be difficult even for powerful computers to handle.
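To make the idea of comparing sequences concrete, here is a minimal sketch in Python. It assumes the two sequences have already been aligned to the same length, which real genomic tools handle with far more sophisticated methods. The function name `percent_identity` and the example sequences are made up for illustration.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of positions at which two equal-length,
    already-aligned DNA sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the positions where both sequences have the same base.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


# Two short, made-up sequences that differ at a single position.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```

Even this simple comparison touches every position once, which hints at why whole genomes with billions of base pairs are demanding to work with.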
Some microscopic fungi show distinctive structures known as hyphae when they are viewed under a scanning electron microscope. Many others are more difficult to classify based on physical appearance alone. Image credit: (Yuping et al., 2015).
Due to the sheer size of complete genomes and the vast quantities of extraneous genetic material that are of little use in classification, scientists often select specific sections of DNA for comparison. These sequences, known as genetic markers, are often highly conserved, meaning that they mutate more slowly than other genetic sequences. Because they accumulate changes slowly, they can be compared across distantly related species and used to identify the relationships between them more accurately. In bacteria, which are of great interest in genomic studies, the genetic sequence coding for 16S rRNA is generally used for taxonomic classification. Fungi are eukaryotes and carry 18S rRNA rather than 16S rRNA, and this counterpart lacks some of the desirable characteristics that 16S rRNA has in bacteria, so a sequence known as the internal transcribed spacer (ITS) is used for fungal classification instead. By focusing on a smaller portion of the genetic sequence, genomic techniques can work more efficiently, particularly when producing complex phylogenetic trees.
A simplified diagram of the genetic sequence coding for 16S rRNA. The conserved regions are colored in blue while the hypervariable regions are colored in red. The presence of hypervariable regions interspersed among slowly mutating conserved regions makes 16S rRNA ideal for bacterial classification. Image credit: (Lourenco & Welch, 2022).
To classify fungal species within the taxonomic system, the ITS section of the sequenced DNA is identified and compared across species included within the analysis. Armed with a novel machine learning model, Dr. Fábio Miranda and colleagues set out to test the effectiveness of their algorithm against preexisting models. Before diving into Miranda’s work and the principles of the machine learning algorithm, let’s take a closer look at fungi and where they fit in the tree of life.
The Biology: A Tour of the Fungal Kingdom
Think back to the last time that you took a hike in the woods and ran across a fallen tree covered in the characteristic rounded caps of toadstools. Aside from turning away in disgust (or, hopefully, peering more closely in fascination), what did these toadstools remind you of? Of course, the question is naturally subjective, but many might be inclined to say that the mushrooms resembled a type of strange plant.
While it is true that mushrooms are generally stationary and may have a basic morphology reminiscent of a strange plant, that is about where the similarities end. Instead, mushrooms are a type of fungus, a group completely distinct from plants. In fact, evolutionary studies have shown that fungi are more closely related to animals than plants, yet even these two groups diverged around 1 billion years ago. So, what are fungi?
Like plants and animals, fungi are eukaryotic organisms, meaning that their cells consist of compartmentalized organelles and generally have a nucleus which stores genetic information. Fungi range from microscopic, unicellular organisms to large multicellular forms. Although fungi are not plants or animals, they do share some characteristics of each group. Fungi lack chloroplasts, the organelles that allow plants to photosynthesize, and are obligate heterotrophs like animals. With plants, they share vacuoles for nutrient and water storage and a rigid cell wall. Fungi are unique in that their cell wall is composed largely of chitin - whereas the cell wall of plants is made of cellulose - and many forms have a very inconspicuous nucleus that can be difficult to identify beyond the stage of cellular growth.
A view of fungal cells under an optical microscope. The chitin cell wall can be seen along the circumference of the cells. Image credit: Kayla Fratt.
Some fungal species are unicellular, and are generally known as yeasts (yes, included among these is the yeast used in baking bread), but most species are multicellular. When a multicellular fungus begins to grow, it typically forms elongated, strand-like structures known as hyphae. Fungal hyphae consist of chains of cells and grow from the very tip where branching can occur. Hyphae are very small and serve as the building blocks of the larger, visible fungi that we are familiar with. When a large number of hyphae have clustered together, they form a larger structure known as a mycelium. The mycelium can be thought of as the main body of the fungus and often serves the purpose of holding reproductive structures and producing offspring. Many fungi, such as some pathogenic fungi and components of the gut mycobiota, form mycelia that are strictly within the microscopic realm. However, many of the fungi that we might encounter in our daily lives (such as molds or toadstools) are capable of forming mycelia that are large enough to observe with the naked eye.
This elaborate mycelium belongs to the bridal veil stinkhorn. This fungus along with others in the stinkhorn family are well known for sticky secretions and a foul smell that serves to attract insects. Image credit: Wan Hailun.
Visible fungi, such as the bridal veil stinkhorn pictured above, have clearly developed very complex mycelia. While they might be visually pleasing to an observer, what is the true purpose of these large structures? One answer is reproduction. Fungi reproduce through spores, as do some basal groups of plants, and these are often released by a specialized portion of the mycelium. The production of spores can follow many routes, and a single species of fungus often reproduces using different methods depending on environmental conditions. Generally, reproductive patterns can be grouped into asexual and sexual reproduction. Many species of fungi can reproduce asexually, which is done either through the formation of vegetative spores (spores created through mitosis that are genetically identical to the parent fungus) or through mycelial fragmentation, in which a mycelium splits, resulting in two new mycelia. Alternatively, fungi can also reproduce sexually, which has the benefit of mixing genes from two individuals and plays a crucial role in evolution by natural selection. In sexual reproduction, hyphae of two compatible fungi fuse together and undergo a process of meiosis that results in the formation of new, haploid fungal cells.
Turkey tail fungi are commonly found colonizing fallen trees in woodlands across the globe. The hardened, fanlike structures of these fungi contain spores, which are released from the underside to produce new mycelia. Image credit: Kristen Bradley.
Once spores have been generated through either form of reproduction, they must be dispersed to germinate and form a new mycelium. This is where the elaborate structures of fungi come into play. Many fungi, such as the stinkhorn, produce secretions and a foul odor to attract insects. Much as many plants rely on insects to carry pollen, fungi can use insects to disperse spores far and wide. Alternatively, some fungi do not rely on insects but instead rely on abiotic factors such as wind. Wind can also disperse spores over large distances, allowing them to colonize large areas and reducing competition between individuals of the same species.
Now that we’ve had a brief exploration of what fungi are, how they grow, and how they reproduce, let’s look at how Dr. Fábio Miranda’s team made inroads in fungal classification with their machine learning model.
The Findings
Recognizing the dominant focus of genomics studies on bacteria over fungi, Dr. Miranda and colleagues sought to develop a classification model that would aid in expanding scientific knowledge about fungi. They created HiTaC, a Python-based algorithm that uses the ITS genetic marker in DNA sequences to classify fungi.
The authors broke the model’s operation down into four distinct parts. The first step involved decomposing the DNA sequences into their constituent ‘k-mers’. A k-mer is a contiguous stretch of k nucleotides, and the k-mers of a sequence are all of the overlapping stretches of that length found within it. As an example, the 2-mers of a sequence are all of the stretches of two consecutive nucleotides that it contains, the 3-mers are all of the stretches of three, and so on. In the case of the machine learning model, this was done for all possible integer lengths using DNA sequences fed into the algorithm. The second step used the k-mers identified in the first step to construct a matrix containing the k-mers and their frequencies. This particular matrix was based on the training data.
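The first two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not HiTaC's actual code: the function names `kmers` and `kmer_frequency_matrix`, the choice of k = 2, and the toy sequences are all hypothetical.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_frequency_matrix(sequences: list[str], k: int):
    """Count k-mers per sequence and lay the counts out as rows of a matrix.

    Returns (vocabulary, matrix): vocabulary is the sorted list of k-mers seen
    in any sequence, and matrix[i][j] is how often vocabulary[j] occurs in
    sequences[i].
    """
    counts = [Counter(kmers(seq, k)) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for k-mers that a sequence does not contain.
    matrix = [[c[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, matrix


# Toy training sequences (made up; real ITS sequences are far longer).
vocabulary, matrix = kmer_frequency_matrix(["ACGAC", "CGCG"], k=2)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix describes one sequence and each column one k-mer, which gives a classifier a fixed-length numerical description of sequences that may differ in length.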
A highly detailed chart that breaks down the steps taken by the machine learning algorithm when analyzing the genetic data. Image credit: (Miranda et al., 2024).
Following matrix building, the third step involves training the model to improve the internal algorithms used for classification purposes. At each step in the selected classification scheme, a logistic regression classifier searches for patterns in the data to make predictions. In the training stage, these predictions are compared to the known classifications of the fungal sequences, and the model is corrected to improve its accuracy. Once the model can make accurate predictions on the training data, it is given new genetic sequences to classify.
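To show what "comparing predictions to known classifications and making corrections" can look like, here is a from-scratch sketch of a single binary logistic regression classifier trained by gradient descent. It is an illustration only and not the authors' implementation: the function names, learning rate, number of epochs, and the tiny two-feature dataset are all assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability that feature vector x belongs to class 1."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(rows, labels, learning_rate=0.1, epochs=500):
    """Fit weights and a bias by batch gradient descent on the log-loss.

    rows   : list of feature vectors (for example, k-mer frequencies)
    labels : list of known 0/1 classifications, one per row
    """
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Compare the current prediction with the known label.
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Correct the parameters in the direction that reduces the error.
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Made-up k-mer counts for four training sequences from two groups.
rows = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(rows, labels)
print(predict_proba(weights, bias, [2, 0]) > 0.5)
print(predict_proba(weights, bias, [0, 2]) > 0.5)
```

In a hierarchical scheme, one can imagine a classifier like this sitting at each decision point in the taxonomy, though how HiTaC arranges its classifiers is described in the publication rather than here.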
An important aspect of the HiTaC model was the filter the researchers applied. As the algorithm classified a sequence at each taxonomic level, it assigned a confidence score to the prediction made at that level. The researchers chose a threshold value, and the model removed any levels with a confidence score below that threshold.
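The filter can be sketched as follows. This follows the description above literally, dropping each level whose score is below the threshold. The function name `filter_by_confidence`, the threshold of 0.7, and the confidence scores are hypothetical values chosen for illustration.

```python
def filter_by_confidence(predictions, threshold):
    """Drop every taxonomic level whose confidence falls below the threshold.

    predictions : list of (rank, taxon, confidence) tuples, ordered from the
                  broadest rank to the most specific.
    """
    return [(rank, taxon, conf) for rank, taxon, conf in predictions
            if conf >= threshold]


# Made-up confidence scores for a single sequence (illustrative only).
prediction = [
    ("phylum", "Ascomycota", 0.99),
    ("class", "Saccharomycetes", 0.95),
    ("genus", "Saccharomyces", 0.81),
    ("species", "Saccharomyces cerevisiae", 0.42),
]

for rank, taxon, conf in filter_by_confidence(prediction, threshold=0.7):
    print(f"{rank}: {taxon} ({conf:.2f})")
```

In this example the species-level call is discarded, so the sequence is reported only down to the genus. Reporting a less specific but more reliable answer is the trade-off that a filter of this kind makes.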
When the model had undergone a sufficient training phase, it was compared to other algorithms focused on fungal classification. In these comparisons, HiTaC fared very well and even saw improved scores in certain areas when compared to some of the best methods currently used in mycology research.
Figure 7 in the publication showing how HiTaC compared to numerous other machine learning models based on F1-scores. HiTaC’s scores were comparable to some of the best models currently used in research. Image credit: (Miranda et al., 2024).
Several metrics were used to compare the models, including recall, precision, and the F1-score. In machine learning, recall is the fraction of the true members of a class that the model correctly identified, precision is the fraction of the model’s predictions for a class that were correct, and the F1-score is the harmonic mean of the two. Taking all three measures into account, the HiTaC model performed well against other existing algorithms and is a tool that will help deepen our current understanding of the wonderful world of fungi. To learn more about HiTaC, read the official publication on Nature.
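For readers who want to see how these three metrics are computed, here is a minimal sketch for a single class. The function name `precision_recall_f1` and the labels are made up for illustration, and returning 0.0 when a denominator is zero is one common convention rather than the only option.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall, and F1-score for one class of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: the class was predicted and the prediction was right.
    tp = sum(t == positive and p == positive for t, p in pairs)
    # False positives: the class was predicted but the truth was different.
    fp = sum(t != positive and p == positive for t, p in pairs)
    # False negatives: the class was the truth but was not predicted.
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for five sequences from two genera.
truth = ["Candida", "Candida", "Candida", "Aspergillus", "Aspergillus"]
guess = ["Candida", "Candida", "Aspergillus", "Candida", "Aspergillus"]
print(precision_recall_f1(truth, guess, positive="Candida"))
```

Here two of the three "Candida" predictions are correct and two of the three true "Candida" sequences are found, so precision, recall, and the F1-score all come out to two thirds.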