Beyond SMILES: How SELFIES is Teaching AI the Language of Molecules

In the quest for new medicines and materials, scientists have trained computers to design molecules. There was just one problem: the computers kept inventing physical impossibilities. The solution? A new, foolproof language for chemistry called SELFIES.

Tags: Molecular Representation · AI · Chemistry · Drug Discovery

Imagine an AI that could dream up a revolutionary new drug to treat a devastating disease. Now imagine that same AI spending 99% of its time producing molecular designs that are as physically impossible as a square circle. This was the frustrating reality for chemists using AI in the early 2010s. The problem wasn't the AI's creativity, but the language it used to describe its ideas. This article explores the development of SELFIES, a robust molecular representation that finally allows computers to reliably speak the language of chemistry.

Why Molecules Need a Machine Language

At the heart of computational chemistry lies a deceptively simple challenge: how do you represent a complex, three-dimensional molecule as a string of text that a computer can understand?

The SMILES Problem

For decades, the answer was the Simplified Molecular-Input Line-Entry System (SMILES). Developed in the 1980s, SMILES describes a molecule's structure using a sequence of letters, numbers, and symbols, much like a simple chemical sentence [5].

However, SMILES has a critical flaw for machine learning. Its grammar is complex, and it lacks any mechanism to enforce the basic rules of chemistry. When generative AI models create or modify SMILES strings, they often produce syntactically correct but chemically invalid structures [1,5].

SMILES Example: Benzene

c1ccccc1

High risk of invalid molecular structures

The Validity Crisis

An AI might design a molecule where an oxygen atom forms five bonds, even though chemistry dictates it can only form two [1]. In some applications, as many as 99% of AI-generated SMILES strings can be invalid, rendering them useless [1].

Common SMILES Issues:
  • Violated chemical valency rules
  • Mismatched ring closures
  • Unbalanced parentheses
  • Syntactically incorrect sequences
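
To see how fragile this syntax is, here is a small stdlib-Python sketch (not a real SMILES parser; genuine validity checking needs a cheminformatics toolkit such as RDKit) that randomly mutates the benzene string and applies only the most lenient syntax test: balanced parentheses and paired ring-closure digits. The character set and mutation operators are illustrative choices, not part of any standard.

```python
import random

# Illustrative character pool, not the full SMILES vocabulary.
SMILES_CHARS = list("CNOcno()=#123")

def syntax_ok(s: str) -> bool:
    """Toy check: balanced parentheses and ring-closure digits in pairs.

    Real SMILES validity also requires valence and aromaticity checks
    (normally done with a toolkit such as RDKit); this sketch only
    catches the purely syntactic failures.
    """
    depth = 0
    digit_counts = {}
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in digit_counts.values())

def mutate(s: str) -> str:
    """Apply one random character edit: replace, insert, or delete."""
    i = random.randrange(len(s))
    op = random.choice(["replace", "insert", "delete"])
    if op == "replace":
        return s[:i] + random.choice(SMILES_CHARS) + s[i + 1:]
    if op == "insert":
        return s[:i] + random.choice(SMILES_CHARS) + s[i:]
    return s[:i] + s[i + 1:]

random.seed(0)
benzene = "c1ccccc1"
mutants = [mutate(benzene) for _ in range(10_000)]
ok = sum(syntax_ok(m) for m in mutants)
print(f"{ok / len(mutants):.0%} of mutants pass even this lenient check")
```

Even this weak test rejects a large share of single-character mutants, and a real chemical-validity check would reject far more. That gap is exactly what SELFIES closes.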

The Grammar of a Foolproof Language

This is where SELFIES, or SELF-referencing Embedded Strings, comes in. Introduced in 2019 by a team led by Mario Krenn and Alán Aspuru-Guzik, SELFIES was designed from the ground up to be a 100% robust molecular representation [1,4].

How SELFIES Works

SELFIES' core innovation is simple yet profound: every possible SELFIES string, even one generated at random, corresponds to a valid molecule [1,4].

It borrows concepts from theoretical computer science, specifically formal grammars and finite-state automata [1]. You can think of a SELFIES string as a tiny computer program that a "compiler" translates into a molecular graph.

SELFIES Example: Benzene

[C][=C][C][=C][C][=C][Ring1][=Branch1]

Guaranteed valid molecular structure

Localizing Non-Local Features

In SMILES, representing a branch or a ring requires symbols at the beginning and end (like parentheses or numbers). If one is missing or mismatched, the string becomes invalid. SELFIES represents these features with a single symbol followed by an explicit length indicator. This localizes the information, eliminating a major source of syntax errors [1].

Encoding Physical Constraints

This is SELFIES' masterstroke. The system has a built-in "memory" of the molecular structure it is building. After adding each new atom, its internal state updates. This state ensures that any subsequent bond or atom assignment respects that atom's known valency, the number of bonds it can physically form. This built-in chemical intuition prevents physical impossibilities from ever being proposed [1].
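
The derivation-state idea can be sketched in a few lines of Python. This toy decoder is not the real SELFIES grammar: the token set, the `("Ring", n)` form, and the valence table are simplifications invented for illustration. It does show the key property, though: every token sequence decodes to some molecule, because each bond is capped by the remaining valence of both atoms, and impossible requests are silently reduced or dropped.

```python
# Toy decoder illustrating SELFIES' derivation-state idea (NOT the real
# SELFIES grammar): each atom tracks its remaining valence, and every
# requested bond is capped by what both endpoints can still accept.

VALENCE = {"C": 4, "N": 3, "O": 2}

def decode(tokens):
    """Decode a token list into (atoms, bonds); never fails.

    Tokens: atom symbols ("C", "N", "O"), "=" (request a double bond to
    the next atom), or ("Ring", n) (bond back n atoms, a local analogue
    of SELFIES' [Ring1] symbol plus its explicit length index).
    Unknown or impossible tokens are simply skipped, mirroring SELFIES'
    robustness.
    """
    atoms, bonds = [], []      # bonds stored as (i, j, order)
    free = []                  # remaining valence per atom
    pending = 1                # bond order requested for the next bond
    for tok in tokens:
        if tok == "=":
            pending = 2
        elif isinstance(tok, tuple) and tok[0] == "Ring" and atoms:
            j = len(atoms) - 1 - tok[1]
            if 0 <= j < len(atoms) - 1 and free[-1] > 0 and free[j] > 0:
                order = min(pending, free[-1], free[j])
                bonds.append((j, len(atoms) - 1, order))
                free[-1] -= order
                free[j] -= order
            pending = 1
        elif tok in VALENCE:
            atoms.append(tok)
            free.append(VALENCE[tok])
            if len(atoms) > 1:
                order = min(pending, free[-1], free[-2])
                if order > 0:
                    bonds.append((len(atoms) - 2, len(atoms) - 1, order))
                    free[-1] -= order
                    free[-2] -= order
            pending = 1
    return atoms, bonds
```

Feeding it a benzene-like sequence such as `["C", "=", "C", "C", "=", "C", "C", "=", "C", ("Ring", 5)]` yields six carbons with alternating bonds and a ring closure, while a nonsense sequence still yields a molecule whose atoms never exceed their valence.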

SMILES vs. SELFIES: A Comparative Analysis

| Feature | SMILES | SELFIES |
| --- | --- | --- |
| Core Principle | Line notation of atoms and bonds | Formal grammar (Chomsky type-2) |
| Robustness | High risk of invalid strings | 100% robust; any string is valid |
| Handling Rings & Branches | Non-local (start/end symbols) | Local (symbol + explicit length) |
| Chemical Rules | Not enforced; often violated | Enforced via derivation state |
| Best Use Case | Human-readable notation, legacy databases | Machine learning & generative AI |

A Deep Dive into a Key Experiment: The STONED Algorithm

To truly appreciate the power of SELFIES, let's examine a clever experiment called STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery) [1].

Methodology: The Power of Randomness

The STONED procedure is elegantly simple and highlights the power of SELFIES' robustness [1]:

Start with a Seed

Begin with a single, known molecule represented as a SELFIES string (e.g., the stimulant MDMA).

Generate Mutants

Create a large number of new molecular candidates by randomly mutating the original SELFIES string. These mutations can include:

  • Random Character Changes: Swapping one symbol in the string for another valid SELFIES symbol.
  • Random Character Insertions: Adding new symbols at random positions.
  • Random Character Deletions: Removing symbols from the string.

Decode and Filter

Decode every single mutated SELFIES string into a molecule. Thanks to SELFIES' 100% robustness, every decode operation is guaranteed to succeed. The resulting molecules are then filtered for desired properties.

Iterate

Use the most promising mutants as new seeds and repeat the process.
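
The loop above can be sketched with a STONED-style mutate-score-select step. This is a schematic, not the published implementation: the real algorithm mutates actual SELFIES strings (for example via the `selfies` Python package) and scores the decoded molecules with cheminformatics tools, whereas here both the token alphabet and the objective (`score`, which just counts nitrogen tokens) are toy stand-ins.

```python
import random

# Hypothetical token alphabet; real STONED mutates actual SELFIES symbols.
ALPHABET = ["C", "N", "O", "=", ("Ring", 3), ("Ring", 4), ("Ring", 5)]

def mutate(tokens):
    """One random edit: replace, insert, or delete a token."""
    tokens = list(tokens)
    op = random.choice(["replace", "insert", "delete"])
    i = random.randrange(len(tokens))
    if op == "replace":
        tokens[i] = random.choice(ALPHABET)
    elif op == "insert":
        tokens.insert(i, random.choice(ALPHABET))
    elif len(tokens) > 1:
        del tokens[i]
    return tokens

def score(tokens):
    """Stand-in objective; real pipelines score the decoded molecule
    (e.g. drug-likeness via RDKit), not the raw token string."""
    return tokens.count("N")  # toy: prefer nitrogen-rich candidates

def stoned_step(seed, n_mutants=200, keep=5):
    """Mutate the seed, score each mutant (decoding always succeeds
    under a robust grammar), and keep the best as next-round seeds."""
    mutants = [mutate(seed) for _ in range(n_mutants)]
    return sorted(mutants, key=score, reverse=True)[:keep]

random.seed(0)
seed = ["C", "C", "N", "C", "=", "C"]
for _ in range(3):                    # three rounds of iteration
    survivors = stoned_step(seed)
    seed = survivors[0]
print("best candidate:", seed)
```

Because every token list is decodable by construction, no mutant is ever wasted; that is precisely the property that makes the full STONED procedure effective.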

Results and Analysis: A Simple Method's Astonishing Success

The results were striking. The researchers reported that the STONED algorithm, despite its simplicity, could "perfectly solve many commonly used cheminformatics benchmarks that before were thought to be challenging problems" [1].

Key Metric: Validity
  • SMILES mutation success: ~1% of mutants yield valid molecules
  • SELFIES mutation success: 100% of mutants yield valid molecules

"The limitation in molecular discovery wasn't always the optimization algorithms, but the representation itself. By switching to SELFIES, even simple methods could become powerful tools for discovery."

Performance Showdown: SELFIES in Action

The STONED experiment is just one example. Across various AI domains, SELFIES has demonstrated significant advantages.

| Model Type | SMILES Performance | SELFIES Performance & Key Advantage |
| --- | --- | --- |
| Combinatorial (STONED) | High failure rate; inefficient exploration [1] | 100% validity enables efficient, massive exploration [1] |
| Genetic Algorithms | Requires complex, hand-crafted mutation rules [1] | Allows arbitrary random mutations; outperforms other models [1] |
| Transformer Models (SELFormer) | Good performance, but limited by invalid strings [3] | Outperforms SMILES and graph methods on tasks like solubility prediction [3] |
| Quantum ML (QK-LSTM) | Lower baseline performance in hybrid models [9] | +5.91% improvement over SMILES when augmented [9] |

Combinatorial Models

SELFIES enables brute-force exploration of chemical space with guaranteed validity, making even simple algorithms powerful discovery tools.

Genetic Algorithms

With SELFIES, genetic algorithms can use arbitrary mutations without complex validation rules, significantly improving performance.

Transformer Models

Models like SELFormer leverage SELFIES to outperform both SMILES-based and graph-based approaches in property prediction tasks.

The Scientist's Toolkit: Research Reagent Solutions

While SELFIES is a computational tool, its ultimate goal is to accelerate real-world chemical research and drug discovery.

| Tool / Reagent | Function in the Research Pipeline |
| --- | --- |
| SELFIES Representation | The core language; provides a robust, machine-readable description of a molecule for AI models [1,4]. |
| Benchmark Datasets (e.g., MoleculeNet, SIDER) | Standardized collections of molecular data and properties used to train and fairly evaluate AI models [3,9]. |
| Generative Model (e.g., STONED, VAE, Transformer) | The AI engine that explores chemical space and proposes new molecular structures based on SELFIES strings [1,3]. |
| High-Throughput Screening Assays | Physical laboratory experiments that test AI-designed molecules for desired activity (e.g., binding to a target protein). |

The Future of Molecular Design

SELFIES has fundamentally shifted how researchers approach AI-driven molecular design.

Current Advancements

By providing a guaranteed-valid representation, SELFIES has not only solved the problem of invalidity but has also unlocked simpler and more efficient discovery algorithms [1]. The field is rapidly advancing with new models like SELFormer, a transformer model specifically designed for SELFIES that outperforms even graph-based methods in predicting properties like a molecule's solubility [3].

Key Developments:
  • SELFormer: Transformer models optimized for SELFIES representation
  • Augmented SELFIES: Enhanced versions for improved model performance
  • Quantum Integration: Combining SELFIES with quantum machine learning

Next Frontiers

Researchers are already exploring the next frontiers, such as using augmented SELFIES to improve model performance and integrating SELFIES with cutting-edge quantum machine learning models [9]. While some researchers are proposing new representations beyond SELFIES for even more complex chemical phenomena [6], its role as a robust and accessible standard is secure.

The Big Picture

SELFIES has done more than just fix a technical problem; it has given computers a reliable chemical intuition. It is a foundational tool that is helping to bridge the gap between digital imagination and physical reality, bringing us closer to a future where AI can be a true partner in solving some of humanity's most pressing challenges in health and materials science.

References