In the quest for new medicines and materials, scientists have trained computers to design molecules. There was just one problem: the computers kept inventing physical impossibilities. The solution? A new, foolproof language for chemistry called SELFIES.
Imagine an AI that could dream up a revolutionary new drug to treat a devastating disease. Now imagine that same AI spending 99% of its time producing molecular designs that are as physically impossible as a square circle. This was the frustrating reality for chemists using AI in the early 2010s. The problem wasn't the AI's creativity, but the language it used to describe its ideas. This article explores the development of SELFIES, a revolutionary molecular representation that is finally allowing computers to reliably speak the language of chemistry.
At the heart of computational chemistry lies a deceptively simple challenge: how do you represent a complex, three-dimensional molecule as a string of text that a computer can understand?
For decades, the answer was the Simplified Molecular-Input Line-Entry System (SMILES). Developed in the 1980s, SMILES describes a molecule's structure using a sequence of letters, numbers, and symbols, much like a simple chemical sentence [5].
However, SMILES has a critical flaw for machine learning. Its grammar is complex, and it lacks any mechanism to enforce the basic rules of chemistry. When generative AI models create or modify SMILES strings, they often produce syntactically correct but chemically invalid structures [1, 5].
For example, the SMILES string `c1ccccc1` represents benzene: six aromatic carbons closed into a ring by the paired `1` digits. If either digit is lost or altered, the string no longer describes a valid molecule.
An AI might design a molecule where an oxygen atom forms five bonds, even though chemistry dictates it can only form two [1]. In some applications, as many as 99% of AI-generated SMILES strings can be invalid, rendering them useless [1].
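This fragility is easy to reproduce. The sketch below is a rough stdlib-only illustration, not a real cheminformatics validity check: `looks_valid_smiles` and the mutation loop are hypothetical, and the check only catches syntax errors, so it under-counts the true failure rate.

```python
import random

def looks_valid_smiles(s: str) -> bool:
    """Very loose syntax check: balanced parentheses and paired
    ring-closure digits. Real validity also requires chemical
    checks, so this UNDER-counts the true failure rate."""
    depth = 0
    ring_digits = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

random.seed(0)
smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin
alphabet = "CNOcno()=12"

# Apply 1000 random single-character substitutions and count survivors.
mutants = []
for _ in range(1000):
    i = random.randrange(len(smiles))
    mutants.append(smiles[:i] + random.choice(alphabet) + smiles[i + 1:])

survivors = sum(looks_valid_smiles(m) for m in mutants)
print(f"{survivors}/1000 single-character mutants pass even this loose check")
```

Even with a check this forgiving, a large fraction of random edits break the string; a chemistry-aware validator would reject far more.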
SELFIES' core innovation is simple yet profound: every possible SELFIES string, even one generated at random, corresponds to a valid molecule [1, 4].
It borrows concepts from theoretical computer science, specifically formal grammars and finite-state automata [1]. You can think of a SELFIES string as a tiny computer program that a "compiler" translates into a molecular graph.
For example, the SELFIES string `[C][C][C][C][C][C][Ring1][=Branch1]` encodes cyclohexane: six carbon symbols followed by a ring-closure instruction whose length token specifies how far back along the chain the ring bond reaches.
In SMILES, representing a branch or a ring requires symbols at the beginning and end (like parentheses or numbers). If one is missing or mismatched, the string becomes invalid. SELFIES represents these features with a single symbol followed by an explicit length indicator. This localizes the information, eliminating a major source of syntax errors 1 .
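A toy reader makes this locality concrete. The sketch below is an illustration of the "symbol + explicit length" idea only, not the real `selfies` library; the `INDEX` table is a simplified stand-in for the actual overloaded index alphabet.

```python
import re

# Simplified stand-in for SELFIES' index alphabet: the token that follows
# [Ring1] or [Branch1] is reinterpreted as a number. (These values are
# illustrative, not the library's real encoding.)
INDEX = {"[C]": 1, "[Ring1]": 2, "[Ring2]": 3, "[Branch1]": 4, "[=Branch1]": 5}

def tokens_of(selfies: str):
    """Split a SELFIES-style string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies)

def outline(selfies: str):
    """List each instruction; ring/branch lengths are read locally from
    the very next token, so there is no closing delimiter to mismatch."""
    tokens, plan, i = tokens_of(selfies), [], 0
    while i < len(tokens):
        t = tokens[i]
        if t in ("[Ring1]", "[Branch1]") and i + 1 < len(tokens):
            plan.append(f"{t} with length {INDEX.get(tokens[i + 1], 1)}")
            i += 2   # the length token is consumed immediately
        else:
            plan.append(f"atom {t}")
            i += 1
    return plan

print(outline("[C][C][C][C][C][C][Ring1][=Branch1]"))
```

Because the length is read immediately after the ring symbol, no later part of the string can invalidate it, unlike a SMILES ring digit waiting for its partner.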
This is SELFIES' masterstroke. The system has a built-in "memory" of the molecular structure it is building. After adding each new atom, its internal state updates. This state ensures that any subsequent bond or atom assignment respects that atom's known valency: the number of bonds it can physically form. This built-in chemical intuition prevents physical impossibilities from ever being proposed [1].
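The derivation state can be sketched in a few lines. The toy decoder below illustrates the idea only; it is not the real `selfies` decoder, and the valence table and bond-capping rule are deliberate simplifications. It tracks each atom's remaining valence and refuses to over-bond, so any token stream decodes to something chemically plausible.

```python
# Maximum bonds each element can form (a simplified valence table).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def decode(tokens):
    """Build (atom, bond_order_to_previous) pairs from any token stream.
    Bonds are only formed when the previous atom has valence to spare,
    so no output can violate the valence table."""
    atoms = []   # each entry: [element, remaining_valence]
    out = []
    for element in tokens:
        cap = MAX_VALENCE.get(element)
        if cap is None:
            continue              # unknown symbols are skipped, never fatal
        bond = 0
        if atoms and atoms[-1][1] > 0:
            bond = 1              # bond only if the previous atom accepts it
            atoms[-1][1] -= 1
            cap -= 1
        atoms.append([element, cap])
        out.append((element, bond))
    return out

# F has valence 1, so the second F cannot bond to the first; it simply
# starts a new fragment instead of producing an invalid structure.
print(decode(["O", "F", "F", "C"]))
```

The key property mirrors SELFIES itself: there is no input for which `decode` fails; illegal requests are reinterpreted, not rejected.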
| Feature | SMILES | SELFIES |
|---|---|---|
| Core Principle | Line notation of atoms and bonds | Formal grammar (Chomsky type-2) |
| Robustness | High risk of invalid strings | 100% robust; any string is valid |
| Handling Rings & Branches | Non-local (start/end symbols) | Local (symbol + explicit length) |
| Chemical Rules | Not enforced; often violated | Enforced via derivation state |
| Best Use Case | Human-readable notation, legacy databases | Machine learning & generative AI |
To truly appreciate the power of SELFIES, let's examine a clever experiment called STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery) [1].
The STONED procedure is elegantly simple and highlights the power of SELFIES' robustness [1]:
1. Begin with a single, known molecule represented as a SELFIES string (e.g., the stimulant MDMA).
2. Create a large number of new molecular candidates by randomly mutating the original SELFIES string. These mutations can include random insertions, deletions, and replacements of individual SELFIES symbols.
3. Decode every single mutated SELFIES string into a molecule. Thanks to SELFIES' 100% robustness, every decode operation is guaranteed to succeed. The resulting molecules are then filtered for desired properties.
4. Use the most promising mutants as new seeds and repeat the process.
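The shape of this loop can be sketched under some loud simplifications: we mutate token lists directly, and nitrogen count stands in as a placeholder for a real chemistry-aware fitness function. This is an illustration of the procedure's structure, not the published STONED implementation.

```python
import random

# Tiny illustrative alphabet; the real algorithm uses the full SELFIES
# symbol set and chemistry-aware scoring.
ALPHABET = ["[C]", "[N]", "[O]", "[F]"]

def mutate(tokens):
    """Apply one random insertion, deletion, or replacement."""
    t = list(tokens)
    op = random.choice(("insert", "delete", "replace"))
    i = random.randrange(len(t))
    if op == "insert":
        t.insert(i, random.choice(ALPHABET))
    elif op == "delete" and len(t) > 1:
        del t[i]
    else:
        t[i] = random.choice(ALPHABET)
    return t

def fitness(tokens):
    # Placeholder objective: prefer nitrogen-rich candidates.
    return tokens.count("[N]")

random.seed(1)
best = ["[C]"] * 6
for generation in range(20):
    # Every candidate decodes by construction, so all 50 count.
    candidates = [mutate(best) for _ in range(50)]
    best = max(candidates, key=fitness)

print("best fitness:", fitness(best))
```

Note what is absent: no validity filter, no repair step, no hand-crafted mutation rules. That absence, guaranteed by the representation, is the whole point.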
The results were striking. The researchers reported that the STONED algorithm, despite its simplicity, could "perfectly solve many commonly used cheminformatics benchmarks that before were thought to be challenging problems" [1].
> "The limitation in molecular discovery wasn't always the optimization algorithms, but the representation itself. By switching to SELFIES, even simple methods could become powerful tools for discovery."
The STONED experiment is just one example. Across various AI domains, SELFIES has demonstrated significant advantages.
| Model Type | SMILES Performance | SELFIES Performance & Key Advantage |
|---|---|---|
| Combinatorial (STONED) | High failure rate; inefficient exploration [1] | 100% validity enables efficient, massive exploration [1] |
| Genetic Algorithms | Requires complex, hand-crafted mutation rules [1] | Allows arbitrary random mutations; outperforms other models [1] |
| Transformer Models (SELFormer) | Good performance, but limited by invalid strings [3] | Outperforms SMILES and graph methods on tasks like solubility prediction [3] |
| Quantum ML (QK-LSTM) | Lower baseline performance in hybrid models [9] | +5.91% improvement over SMILES when augmented [9] |
SELFIES enables brute-force exploration of chemical space with guaranteed validity, making even simple algorithms powerful discovery tools.
With SELFIES, genetic algorithms can use arbitrary mutations without complex validation rules, significantly improving performance.
Models like SELFormer leverage SELFIES to outperform both SMILES-based and graph-based approaches in property prediction tasks.
While SELFIES is a computational tool, its ultimate goal is to accelerate real-world chemical research and drug discovery.
| Tool / Reagent | Function in the Research Pipeline |
|---|---|
| SELFIES Representation | The core language; provides a robust, machine-readable description of a molecule for AI models [1, 4]. |
| Benchmark Datasets (e.g., MoleculeNet, SIDER) | Standardized collections of molecular data and properties used to train and fairly evaluate AI models [3, 9]. |
| Generative Model (e.g., STONED, VAE, Transformer) | The AI engine that explores chemical space and proposes new molecular structures based on SELFIES strings [1, 3]. |
| High-Throughput Screening Assays | Physical laboratory experiments that test AI-designed molecules for desired activity (e.g., binding to a target protein). |
SELFIES has fundamentally shifted how researchers approach AI-driven molecular design.
By providing a guaranteed-valid representation, SELFIES has not only solved the problem of invalidity but has also unlocked simpler and more efficient discovery algorithms [1]. The field is rapidly advancing with new models like SELFormer, a transformer model specifically designed for SELFIES that outperforms even graph-based methods in predicting properties like a molecule's solubility [3].
Researchers are already exploring the next frontiers, such as using augmented SELFIES to improve model performance and integrating SELFIES with cutting-edge quantum machine learning models [9]. While some groups are proposing new representations beyond SELFIES for even more complex chemical phenomena [6], its role as a robust and accessible standard is secure.
SELFIES has done more than just fix a technical problem; it has given computers a reliable chemical intuition. It is a foundational tool that is helping to bridge the gap between digital imagination and physical reality, bringing us closer to a future where AI can be a true partner in solving some of humanity's most pressing challenges in health and materials science.