Anthrogen introduces Odyssey, the world's largest and most powerful protein language model.
Abstract
Anthrogen launches Odyssey, a family of protein language models scaled to 102 billion parameters. It is the largest and most powerful biological model ever created.
It allows us to rationally design and optimize proteins toward multi-objective goals: for example, “binds the target,” “has low side effects,” and “is manufacturable at scale.”
Proteins are life's workhorses; being able to rationally design new molecular machines with the same precision we design macroscopic machines and robots is Anthrogen's north star. Odyssey is an important first step in that direction.
We introduce Consensus, a replacement for self-attention that scales more robustly, trains stably, and cuts the cost of generating longer, more intricate proteins.
We train with discrete diffusion, a learning objective that mirrors evolution: forward steps inject mutations; the reverse process learns to select sequences and structures that work together.
We demonstrate that Odyssey learns very data-efficiently. Protein data is scarce; we achieve better performance than competing models with roughly 10x less data.
Odyssey from a bird's eye view
Odyssey is a frontier, multimodal protein model family that learns jointly from sequence, 3D structure, and functional context. It supports conditional generation, editing, and sequence + structure co-design. We scale our production-ready models from 1.2B to 102B parameters.
At input, Odyssey treats proteins as more than strings. Amino acid sequences are used as usual, while 3D shape is turned into compact structure tokens using a finite scalar quantizer (FSQ)—think of it as a simple alphabet for 3D geometry so the model can “read” shapes as easily as letters. Alongside these, we include light-weight functional cues—domain tags, secondary-structure hints, orthologous group labels, or short text descriptors—so the model can reason about what a region does, not just what it looks like. The three streams are embedded separately, then fused, so local sequence patterns and long-range geometric relationships end up in one shared representation.
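To make the structure-token idea concrete, here is a minimal sketch of finite scalar quantization. It is an illustrative toy, not Odyssey's tokenizer: the geometry encoder that would produce the continuous vector z, the number of dimensions, and the per-dimension level counts are all assumptions.

import numpy as np

LEVELS = np.array([8, 8, 8, 5, 5])  # assumed per-dimension quantization levels

def fsq_encode(z: np.ndarray) -> int:
    """Map one residue's continuous geometry feature to a discrete structure token."""
    bounded = np.tanh(z)                                                 # squash each dimension into (-1, 1)
    codes = np.round((bounded + 1.0) / 2.0 * (LEVELS - 1)).astype(int)   # snap to the nearest level
    token = 0
    for code, n_levels in zip(codes, LEVELS):                            # mixed-radix combine -> one token id
        token = int(token * n_levels + code)
    return token

def fsq_decode(token: int) -> np.ndarray:
    """Recover the quantized (tanh-space) vector from a token id."""
    codes = []
    for n_levels in reversed(LEVELS):
        codes.append(token % n_levels)
        token //= n_levels
    codes = np.array(codes[::-1])
    return codes / (LEVELS - 1) * 2.0 - 1.0

# Example: a hypothetical local-geometry embedding becomes one token in the structure alphabet.
z = np.random.randn(len(LEVELS))
tok = fsq_encode(z)
print(tok, fsq_decode(tok))

With these assumed level counts the structure vocabulary has 8·8·8·5·5 = 12,800 tokens, small enough for the model to treat 3D geometry exactly like an amino-acid alphabet.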
The backbone replaces self-attention with Consensus blocks—iterative, locality-aware updates that spread information reliably across the chain and the contact graph. Intuitively: rather than having every residue talk to every other residue at once (which is expensive and noisy), Consensus encourages nearby neighborhoods to agree first, then lets that agreement ripple outward. In practice, this brings two benefits: it scales linearly with sequence length (long proteins stay affordable), and it remains stable to train as models grow. This allowed us to train larger and more accurate models than we could have with standard self-attention. That makes Odyssey practical for real design problems, where long linkers and multi-domain constructs are becoming the norm.
Training and generation are framed as discrete diffusion. During training, we gradually “corrupt” sequence and structure tokens with masking noise—like proposing mutations—and teach the model to reconstruct coherent sequence + coordinates—like selecting what actually works. At inference, we run that learned reverse process to generate or edit designs under user constraints. In plain terms: you can keep a scaffold, fix a motif, mask a loop, add a functional tag, and let Odyssey fill in the rest with sequence and structure that agree with each other.
On top, lightweight heads turn the shared representation into actionable outputs: sequence logits to choose residues, structure tokens / coordinates to recover 3D shape, and alignment heads that score multi-objective goals (e.g., potency, specificity, stability). A typical workflow is iterative: propose candidates (optionally with targeted masks), score and triage with these heads or wet-lab feedback, then re-prompt Odyssey for the next round. The same interface supports both de novo exploration and precise local edits, so teams can move smoothly from ideas to manufacturable designs.
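As a sketch of that iterative workflow, the loop below shows one design round. The client object, its generate and score methods, and the objective names are hypothetical placeholders rather than Odyssey's actual API; the shape of the loop is what matters.

from dataclasses import dataclass

@dataclass
class Candidate:
    sequence: str
    structure_tokens: list[int]
    scores: dict[str, float]

def design_round(client, scaffold: str, masked_positions: list[int], n_samples: int = 64) -> list[Candidate]:
    # 1. Propose: keep the scaffold, mask the chosen positions, let the model fill in the rest.
    proposals = client.generate(sequence=scaffold,
                                masked_positions=masked_positions,
                                num_samples=n_samples)
    # 2. Score: alignment heads (or wet-lab feedback) rate each multi-objective goal.
    candidates = [
        Candidate(p.sequence, p.structure_tokens,
                  client.score(p, objectives=["potency", "specificity", "stability"]))
        for p in proposals
    ]
    # 3. Triage: keep the best candidates and use them as prompts for the next round.
    candidates.sort(key=lambda c: sum(c.scores.values()), reverse=True)
    return candidates[:8]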

Figure 1: Sample reconstructions of the FSQ from a held out dataset. Original in gray, reconstructions in light blue. We can see the FSQ is able to accurately encode structures.

Figure 2: Overall architecture schematic of Odyssey.
Consensus as a replacement for self-attention
Modern protein models often borrow from NLP: treat a sequence as a line of tokens and let self-attention connect any token to any other. That works well for text, where long-range links can “teleport” across a paragraph without disturbing the words in between. Proteins don’t work that way. A protein’s long-range effects travel through 3D geometry constrained by a covalent backbone: when residues i and j come close in space, the span between them must co-adjust. Dependencies are therefore many-bodied and locally cooperative, not arbitrary pairwise jumps across a line.
Consensus builds this simple inductive bias into the backbone. Instead of letting every residue talk to every other at once, Consensus encourages nearby neighborhoods to reach agreement first, then lets that agreement propagate outward over a sparse contact/sequence graph. You can think of it as repeated local “vote-and-average” steps that spread information reliably through the chain and its contacts—mirroring how structural changes actually ripple through a protein.
This local-first design has two practical payoffs. First, it gives us the right shape of reasoning for proteins (and other structured domains): information flows along physically plausible paths, which helps the model coordinate sequence and structure. Second, it changes the compute curve. Global attention scales as O(L²) with sequence length L; Consensus scales linearly, O(L). That makes long constructs, linkers, and multi-domain designs far more affordable to train and sample.
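Read literally, a single Consensus-style update might look like the toy below: each residue averages with its neighbors on the chain/contact graph, and repeating the step lets agreement spread. This is our plain reading of the description above, with gating, normalization, and contact-graph construction left out; with at most k neighbors per residue the cost is O(L·k), i.e. linear in length.

import numpy as np

def consensus_step(h: np.ndarray, neighbors: list[list[int]], alpha: float = 0.5) -> np.ndarray:
    """One local 'vote-and-average' step.

    h:         (L, d) residue representations
    neighbors: neighbors[i] lists the residues adjacent to i along the chain
               and/or the sparse contact graph
    """
    out = np.empty_like(h)
    for i, nbrs in enumerate(neighbors):
        local = h[nbrs] if nbrs else h[i:i + 1]                 # the neighborhood's current "vote"
        out[i] = (1 - alpha) * h[i] + alpha * local.mean(axis=0)
    return out

# Repeated steps let local agreement ripple outward through the chain and its contacts.
L, d = 8, 4
h = np.random.randn(L, d)
chain = [[j for j in (i - 1, i + 1) if 0 <= j < L] for i in range(L)]
for _ in range(3):
    h = consensus_step(h, chain)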
We also see a training benefit as models get large: Consensus is more forgiving to learning-rate choices. In our sweeps, attention’s “good LR zone” narrows with scale and can fail abruptly when you step off the optimal value. Consensus keeps near-optimal performance across a wider LR window, avoiding cliffs and loss spikes. In plain terms: fewer brittle runs, fewer restarts, and faster time-to-useful-models when you don’t have the luxury of exhaustive tuning.
What this implies for practice:
We can scale longer, sooner. Linear complexity lets you extend context (longer proteins, multi-domain constructs) without quadratic cost explosions.
We can train faster and produce more accurate models. Wider LR tolerance lets us run more aggressive LR schedules.
We keep and extend the transformer’s virtues. You preserve the scaling properties that make transformers effective, while swapping in a backbone that’s better matched to geometric, many-body data—not just proteins, but other structured biomolecules as well.
In short, Consensus replaces all-to-all chatter with local agreement that spreads, aligning the compute with the physics and making large, long-context protein design both faster and sturdier.


Figure 3: Learning-rate ablation of Consensus vs. attention. Consensus demonstrates much more robust behavior.


Figure 4: Odyssey demonstrates strong scaling behavior.
Discrete diffusion as a better training objective
Evolution, at its coarsest scale, looks like proposal + selection: random mutations appear, and natural selection keeps the variants that work. Discrete diffusion mirrors this dynamic. In training, we corrupt sequence and structure tokens with masking noise (the “mutations”), and teach the model a reverse-time denoiser that concentrates probability on functionally consistent proteins (the “selection”). Because the model must repair increasingly degraded contexts, it learns coordinated, multi-residue corrections rather than one-off guesses.
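A minimal sketch of one such training step for an absorbing-state ("mask") discrete diffusion model is shown below. The model, the 1/t loss weighting, and the token vocabulary are placeholders, not Odyssey's actual training code.

import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the absorbing [MASK] token

def diffusion_training_step(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: (batch, length) clean sequence + structure tokens."""
    B, L = x0.shape
    # Forward process ("mutations"): sample a noise level t and mask that fraction of positions.
    t = torch.rand(B, 1)
    masked = torch.rand(B, L) < t
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    # Reverse process ("selection"): the denoiser predicts the clean tokens at masked positions.
    logits = model(xt)                                   # (B, L, vocab)
    loss = F.cross_entropy(logits[masked], x0[masked], reduction="none")
    # Weighting by 1/t makes the objective a bound on the joint log-likelihood
    # (a common choice; treat it as an assumption here).
    weight = (1.0 / t).expand(B, L)[masked]
    return (weight * loss).mean()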
How does it perform? In matched comparisons, discrete diffusion delivers better inference-time results than masked language modeling (MLM). Across sizes, diffusion models show lower training perplexities than MLM with complex masking, and lower or comparable training perplexities to MLM with simple masking; at validation, diffusion models outperform their MLM counterparts, while a 1.2B MLM trained with simple masking overfits to its own masking scheme and degrades when evaluated under diffusion's schedule.
A final conceptual difference: diffusion models the joint distribution of the full protein (sequence and structure), whereas MLM optimizes conditionals over the masked subset. That joint view matches our use case—co-design of sequence and structure—and helps the model keep geometry and residues in sync.
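For readers who want that distinction in symbols, the two objectives are usually written as follows (the notation and the weighting w(t) are ours, not necessarily Odyssey's exact loss): MLM maximizes conditionals of the masked subset M given the unmasked remainder, while the absorbing-state diffusion loss, averaged over noise levels t, upper-bounds the negative log-likelihood of the whole sequence-and-structure token string x_0.

\[
\mathcal{L}_{\mathrm{MLM}} = -\,\mathbb{E}_{M}\Big[\sum_{i \in M} \log p_\theta\big(x_i \mid x_{\setminus M}\big)\Big]
\]
\[
\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\;x_t \sim q(x_t \mid x_0)}\Big[\, w(t) \sum_{i :\, x_t^{(i)} = \mathrm{MASK}} -\log p_\theta\big(x_0^{(i)} \mid x_t\big)\Big] \;\geq\; -\log p_\theta(x_0)
\]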


Figure 5: Discrete diffusion delivers much better performance at inference time.
Where do we go from here?
While Odyssey is itself incredibly powerful, it is far from perfect. Models like these must be paired with large-scale biological data alignment, and we look forward to sharing our work in that area in the coming months. We also look forward to sharing more mechanistic and architectural research, particularly on Consensus!
If you are interested in collaborating, reach out! We are always interested in pushing the limits of Odyssey and are opening up our API for early access.
We're also hiring; contact us here!