Late in 2025, an AI system named Evo was introduced that had been trained on vast numbers of bacterial genomes. Because related genes tend to cluster together in bacteria, Evo could predict which gene sequences were likely to come next, or even propose entirely new proteins. The same approach struggles with eukaryotic genomes, which are far more complex. Evo's creators recognized this hurdle and set out to develop Evo 2, an open-source AI that can handle the intricacies of all three domains of life: bacteria, archaea, and eukaryotes.

The Challenge of Eukaryotic Genomes
Eukaryotic genomes are notoriously complex, characterized by interrupted coding sequences and scattered regulatory elements. Unlike bacteria, where genes encoding proteins are straightforwardly organized, eukaryotic genes often include non-coding introns, making interpretation difficult. This complexity is compounded by vast stretches of DNA previously labeled as “junk,” which may harbor important regulatory functions.
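To make the idea of an interrupted coding sequence concrete, here is a minimal Python sketch. The gene, the exon coordinates, and the `splice` helper are all invented for illustration; the only biology assumed is the common pattern in which an intron begins with GT and ends with AG.

```python
# Toy illustration: a eukaryotic gene whose coding sequence is split by an intron.
# The sequence and the coordinates below are invented for this example.

def splice(gene: str, exons: list) -> str:
    """Join exon intervals (0-based, end-exclusive) into one coding sequence."""
    return "".join(gene[start:end] for start, end in exons)

# exon 1 = ATGGCC, intron = GTAAGTTTTCAG (begins GT, ends AG), exon 2 = AAATGA
gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
exons = [(0, 6), (18, 24)]

coding = splice(gene, exons)
print(coding)  # ATGGCCAAATGA
```

In a bacterial gene the coding sequence would simply be one contiguous stretch. Here, a tool has to find the intron boundaries correctly before the protein can be read at all, which is why errors in splice-site detection are so costly.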
Traditional tools designed to identify genomic features like splice sites have proven error-prone when applied to lengthy eukaryotic genomes, which can exceed three billion base pairs. Evolutionary comparisons can yield insights, but they have limits too, especially when researchers are looking for the genomic differences that make a species unique.
Harnessing Neural Networks for Genomic Insights
Neural networks excel at recognizing complex patterns and statistical regularities, which makes them well suited to genomic analysis. The challenge lies in obtaining enough data and computational resources to train them effectively. The Evo team met this challenge by developing a convolutional neural network called StripedHyena 2, which underwent two stages of training.
Initially, the system learned to identify critical genomic features using sequences of approximately 8,000 bases. The second stage expanded its scope, providing sequences of one million bases at a time to help the AI identify larger-scale genomic patterns. This two-stage approach allowed Evo 2 to learn both local features and long-range genomic architecture.
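The difference between the two stages is easiest to see in terms of how much sequence the model is shown at once. The sketch below is only an illustration of that windowing idea, not the Evo 2 training pipeline: the `context_windows` helper and the placeholder genome are hypothetical, and the window sizes are the approximate figures quoted above.

```python
from typing import Iterator

def context_windows(sequence: str, window: int) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of at most `window` bases."""
    for start in range(0, len(sequence), window):
        yield sequence[start:start + window]

SHORT_CONTEXT = 8_000      # approximate first-stage window from the article
LONG_CONTEXT = 1_000_000   # second-stage window from the article

genome = "ACGT" * 500_000  # 2,000,000-base placeholder, not real data

short_windows = list(context_windows(genome, SHORT_CONTEXT))
long_windows = list(context_windows(genome, LONG_CONTEXT))
print(len(short_windows), len(long_windows))  # 250 2
```

With short windows, a gene and a regulatory element 100,000 bases away never appear in the same training example. With million-base windows they do, which is what lets a model pick up larger-scale patterns.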
Training with Massive Datasets
Evo 2 was trained on the OpenGenome2 dataset, which encompasses 8.8 trillion bases from across the tree of life: bacteria, archaea, and eukaryotes. Notably, the researchers opted to exclude eukaryotic viruses, citing safety and ethical considerations. Two versions of Evo 2 were created: one with 7 billion parameters trained on 2.4 trillion bases, and a full version with 40 billion parameters trained on the complete dataset.
The logic of Evo 2’s training rests on evolutionary conservation: sequences that matter to an organism tend to be preserved, so they recur across multiple species. Seeing the same sequences in many genomes lets the AI learn which ones are significant and in what contexts they appear. Because it can make zero-shot predictions, without task-specific fine-tuning, the model is versatile and may be able to identify genomic features previously unknown to researchers.
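One common way to use a sequence model zero-shot is to ask how likely it finds a mutated sequence compared with the original. The sketch below illustrates that idea with a deliberately tiny stand-in: a count-based model that predicts each base from the three before it. It is not Evo 2 and does not use any Evo 2 API; the training data, the sequences, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

K = 3  # context length of the toy model (an arbitrary choice)

def train(sequences, k=K):
    """Count how often each base follows each k-base context."""
    context_counts, next_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            context_counts[context] += 1
            next_counts[context, seq[i]] += 1
    return context_counts, next_counts

def log_likelihood(seq, model, k=K):
    """Sum of log P(base | previous k bases), add-one smoothed over A, C, G, T."""
    context_counts, next_counts = model
    total = 0.0
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        p = (next_counts[context, seq[i]] + 1) / (context_counts[context] + 4)
        total += math.log(p)
    return total

def zero_shot_score(reference, variant, model):
    """Negative values mean the variant looks less likely than the reference."""
    return log_likelihood(variant, model) - log_likelihood(reference, model)

# Invented training data: many copies of one "conserved" motif.
model = train(["ATGGCCAAGGTT" * 20])
reference = "ATGGCCAAGGTT"
variant = "ATGGCCTAGGTT"  # single substitution, A -> T at index 6

print(zero_shot_score(reference, variant, model) < 0)  # True
```

Nothing here was trained to recognize harmful mutations. The score falls out of the fact that the variant breaks a pattern the model saw repeatedly, which is the intuition behind scoring variants without fine-tuning.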
Open Access to Genomic Knowledge
In a groundbreaking move, the Evo team made all aspects of Evo 2 publicly accessible, including model parameters, training code, inference code, and the OpenGenome2 dataset. This transparency invites collaboration and innovation among researchers eager to explore the potential of this powerful AI tool.
To further understand what Evo 2 learned, researchers analyzed its internal neural network features. They discovered that Evo 2 effectively recognized protein-coding regions, intron boundaries, and even structural features of proteins, such as alpha helices and beta sheets. The AI also identified mutations that could disrupt coding sequences and recognized non-coding RNA sequences critical for cellular function.
A Versatile Tool for Genomic Evaluation
Evo 2’s ability to recognize eukaryotic genomic features did not compromise its competence with bacterial and archaeal genomes. The system demonstrated a remarkable capacity to adapt its analyses based on the species it was examining, employing the appropriate genetic codes for different evolutionary groups.
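Why the choice of genetic code matters can be shown with a short translation example. The sketch below builds the standard codon table (NCBI translation table 1) and a variant in which TGA encodes tryptophan rather than a stop, as in NCBI table 4, which is used by Mycoplasma among others. The DNA sequence and the `translate` helper are invented for illustration, and the code says nothing about how Evo 2 handles this internally.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}

# Table 4 reads TGA as tryptophan (W) instead of a stop codon.
MYCOPLASMA = {**STANDARD, "TGA": "W"}

def translate(dna: str, code: dict) -> str:
    """Translate complete codons in frame 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(code[codon] for codon in codons)

seq = "ATGTGAAAATAA"  # invented sequence: ATG TGA AAA TAA
print(translate(seq, STANDARD))    # M*K*
print(translate(seq, MYCOPLASMA))  # MWK*
```

Read with the wrong table, the same DNA appears to contain a premature stop, so a model that works across many groups of organisms has to infer which code applies from context.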
Moreover, Evo 2 excelled at identifying splice sites and at assessing mutations in critical genes like BRCA2, in which mutations are linked to cancer. With additional training on known BRCA2 mutations, the system’s performance improved significantly, showcasing its potential as an automated tool for preliminary genome annotation and evaluation.
Exploring the Unknown
Despite Evo 2’s impressive capabilities, questions remain about its potential for generating novel proteins or regulatory sequences. Researchers tested the system’s ability to create regulatory DNA that is active only in specific cell types. The results were promising, with some sequences showing differential activity, but they fell short of the complexity seen in protein design.
As the scientific community begins to explore the applications of Evo 2, it remains to be seen whether the AI can identify genomic features that have not yet been discovered. Genomes may still hold hidden complexities, and Evo 2 offers a new way to look for them.
Future Directions for Evo 2
The tools and concepts underlying Evo 2 open the door to numerous possibilities, including the development of specialized versions of the AI tailored for specific tasks, such as analyzing cancer genomes or annotating newly sequenced genomes. The potential for collaborative research utilizing Evo 2 promises to accelerate advancements in our understanding of genetic codes.
In conclusion, Evo 2 represents a significant leap in the application of AI to genomic research. As researchers harness this powerful tool, they may uncover new insights into the fundamental building blocks of life. The journey has just begun, and the future of genomic exploration with Evo 2 holds exciting possibilities.
- Evo 2 has been trained on trillions of base pairs, enhancing its ability to analyze complex genomes.
- The system excels at identifying genomic features without the need for extensive fine-tuning.
- Its open-source nature encourages collaboration and exploration in the scientific community.
- Evo 2 demonstrates adaptability across different species and genomic contexts.
- Future research may reveal undiscovered genomic features and applications in cancer genomics.
Read more → arstechnica.com
