Promoter-GPT: Writing DNA Instructions with Language Models
If DNA is truly a language, then we should be able to teach transformers how to write it.
For decades, biologists have called DNA “the language of life”—and they weren’t being metaphorical: DNA encodes the instructions that make life possible.
But here’s one thing about languages: once you understand their rules, you can start writing your own sentences.
So what if we could teach a transformer to compose entirely new genetic programs and create novel biological instructions that have never existed in nature?
That’s exactly what we want to do with Promoter-GPT: a decoder-only transformer trained to generate grammatically correct DNA instructions.
In this context, those “instructions” refer to promoters—the DNA regions located upstream of a gene that control whether it is activated or not. Promoters are what make the same gene expressed in the brain but silent in the liver. If we can learn to design these elements, we can establish new rules for controlling gene expression, with possible applications in biotechnology and medicine.
In this notebook, we'll build a decoder-only transformer that learns to generate biologically plausible DNA promoter sequences. We'll cover:
- k-mer tokenization for genomic data
- Custom vocabulary building for DNA
- Training a small GPT-2 from scratch
- Generating novel 200bp promoter sequences
import pandas as pd
import numpy as np
import itertools
from tokenizers import Tokenizer, models, pre_tokenizers, normalizers, trainers
from transformers import PreTrainedTokenizerFast
from transformers import AutoConfig, GPT2LMHeadModel
from torch.utils.data import DataLoader
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler
from torch.nn import CrossEntropyLoss
import torch
from tqdm import tqdm
!wget "https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-024-08070-z/MediaObjects/41586_2024_8070_MOESM4_ESM.txt" -O data.txt
Step 1: Dataset - Loading the Instructions of Life
First, we need to load the instructions: the DNA promoter sequences. For this, we use a pre-compiled dataset containing 200-base-pair promoter regions. We filter it to ensure it contains only 200 bp promoters and keep the chromosome information for downstream splitting.
# Load and filter the dataset
data = (
pd.read_csv("data.txt", sep="\t", usecols=['sequence','chr'])
    .assign(len=lambda df: df['sequence'].str.len())  # compute the sequence length
    .query("len == 200")                              # keep only 200 bp promoters
    .drop(columns='len')                              # drop the helper column
    .reset_index(drop=True)                           # reset the index
)
We split the data not randomly, but by chromosome. This chromosomal splitting ensures that the model must generalize across different genomic contexts rather than memorizing chromosome-specific patterns. We hold out chromosomes 19, 21, and X for validation, and chromosomes 7 and 13 for testing.
# Define chromosomes for each split
val_chroms = {"19", "21", "X"}
test_chroms = {"7", "13"}
# Create boolean masks
val_mask = data['chr'].isin(val_chroms)
test_mask = data['chr'].isin(test_chroms)
train_mask = ~(val_mask | test_mask)
# Split the data (.copy() avoids SettingWithCopyWarning when we add the k-mer column later)
train_data = data.loc[train_mask].copy()
val_data = data.loc[val_mask].copy()
test_data = data.loc[test_mask].copy()
print(f"Dataset split:")
print(f" Training: {len(train_data):,} sequences")
print(f" Validation: {len(val_data):,} sequences")
print(f" Test: {len(test_data):,} sequences")
print(f"\nChromosomal split:")
print(f" Train chromosomes: {sorted(set(data[train_mask]['chr']))}")
print(f" Val chromosomes: {sorted(val_chroms)}")
print(f" Test chromosomes: {sorted(test_chroms)}")
Dataset split:
Training: 640,029 sequences
Validation: 59,697 sequences
Test: 63,958 sequences
Chromosomal split:
Train chromosomes: ['1', '10', '11', '12', '14', '15', '16', '17', '18', '2', '20', '22', '3', '4', '5', '6', '8', '9', 'Y']
Val chromosomes: ['19', '21', 'X']
Test chromosomes: ['13', '7']
Step 2: Tokenization - Breaking DNA into "Words"
If DNA is a language, its letters are the four bases: Adenine, Thymine, Guanine, and Cytosine (A, T, G, C). With just these four symbols, we can compose an enormous number of combinations, and therefore many different instructions.
The challenge with DNA is that while we know the letters, we don't always know the "words." We need a way to segment sequences into meaningful units. This step is crucial because biological function often depends on specific combinations of bases, called motifs.
For example, a raw sequence ATGCGCGCG can be tokenized into overlapping 3-mers (also called k-mers with k=3): ATG, TGC, GCG, CGC, GCG, CGC, GCG
A 200-base sequence thus becomes 198 overlapping 3-mers, transforming raw DNA into a biologically meaningful vocabulary.
def kmerization(seq, k=3):
return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))
kmers = 3
train_data.loc[:, "kmers"] = train_data["sequence"].apply(lambda x: kmerization(x, k=kmers))
val_data.loc[:, "kmers"] = val_data["sequence"].apply(lambda x: kmerization(x, k=kmers))
test_data.loc[:, "kmers"] = test_data["sequence"].apply(lambda x: kmerization(x, k=kmers))
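As a quick sanity check, we can run kmerization on the toy sequence from above and confirm it yields the seven overlapping 3-mers listed earlier:
# Toy example from the text: ATGCGCGCG -> seven overlapping 3-mers
print(kmerization("ATGCGCGCG", k=3))
# ATG TGC GCG CGC GCG CGC GCG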
Since we know in advance the set of "words" (k-mers) that can appear in our sequences, we can directly create a tokenizer with a predefined vocabulary. This approach eliminates the need for the tokenizer to learn tokens from data and ensures consistent representation. We generate all possible k-mers of length 3, add special tokens, and use this complete vocabulary to initialize the tokenizer. The tokenizer has a total vocabulary size of 71: all 64 possible 3-mers plus 7 special tokens.
# Generate all possible k-mers of length `kmers` (set to 3 above)
mers = list(itertools.product(['A','T','G','C'], repeat=kmers))
mers = [(''.join(x)) for x in mers]
# Create vocabulary directly
special_tokens = ["[UNK]", "[PAD]", "[BOS]", "[EOS]", "[CLS]", "[SEP]", "[MASK]"]
vocab = {token: idx for idx, token in enumerate(special_tokens + mers)}
# Create tokenizer with the vocabulary
tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence(
[normalizers.NFD(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
wrapped_tokenizer = PreTrainedTokenizerFast(
tokenizer_object=tokenizer,
unk_token="[UNK]",
bos_token="[BOS]",
eos_token="[EOS]",
pad_token="[PAD]",
cls_token="[CLS]",
sep_token="[SEP]",
mask_token="[MASK]",
)
After creating the tokenizer, we can inspect the k-mers and verify some basic information:
# Print all generated k-mers
print(mers)
# Check the size of the vocabulary
print("Vocabulary size:", len(wrapped_tokenizer))
# Encode a test sequence
test_seq = "ATG CGT TAC"
encoded = wrapped_tokenizer.encode(test_seq)
print("Encoded:", encoded)
print("Decoded:", wrapped_tokenizer.decode(encoded))
['AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATT', 'ATG', 'ATC', 'AGA', 'AGT', 'AGG', 'AGC', 'ACA', 'ACT', 'ACG', 'ACC', 'TAA', 'TAT', 'TAG', 'TAC', 'TTA', 'TTT', 'TTG', 'TTC', 'TGA', 'TGT', 'TGG', 'TGC', 'TCA', 'TCT', 'TCG', 'TCC', 'GAA', 'GAT', 'GAG', 'GAC', 'GTA', 'GTT', 'GTG', 'GTC', 'GGA', 'GGT', 'GGG', 'GGC', 'GCA', 'GCT', 'GCG', 'GCC', 'CAA', 'CAT', 'CAG', 'CAC', 'CTA', 'CTT', 'CTG', 'CTC', 'CGA', 'CGT', 'CGG', 'CGC', 'CCA', 'CCT', 'CCG', 'CCC']
Vocabulary size: 71
Encoded: [13, 64, 26]
Decoded: ATG CGT TAC
Now we tokenize the entire dataset by converting each k-merized sequence into numerical token IDs that the model can process:
train_datat = np.array([wrapped_tokenizer.encode(x) for x in train_data['kmers']])
val_datat = np.array([wrapped_tokenizer.encode(x) for x in val_data['kmers']])
test_datat = np.array([wrapped_tokenizer.encode(x) for x in test_data['kmers']])
print(f"Tokenized shapes:")
print(f" Training: {train_datat.shape}")
print(f" Validation: {val_datat.shape}")
print(f" Test: {test_datat.shape}")
# Example: view first tokenized sequence
print(f"\nFirst training sequence (first 20 tokens):")
print(train_datat[0][:20])
Tokenized shapes:
Training: (640029, 198)
Validation: (59697, 198)
Test: (63958, 198)
First training sequence (first 20 tokens):
[56 11 23 7 9 15 40 13 31 41 17 50 52 60 29 33 50 51 55 7]
Each 200bp promoter sequence becomes a sequence of 198 tokens (one per 3-mer). These token IDs are what we'll feed into the transformer during training. The model will learn to predict the next token given the previous ones—essentially learning the "grammar" of promoter sequences.
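To make the next-token objective concrete, here is a minimal sketch of how inputs and targets line up during training (the same one-position shift that the CE_loss function in Step 4 applies):
# Next-token prediction: token i is used to predict token i+1
example = train_datat[0]
model_inputs = example[:-1]   # what the model sees
targets = example[1:]         # what it has to predict
print(model_inputs[:5], "->", targets[:5])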
Step 3: Build Promoter-GPT
Now we reach the core of our project—building a transformer that "speaks DNA." We adopt the proven GPT-2 architecture but adapt it specifically for genomic sequences. Since our input length is fixed at 198 tokens, we can use a compact decoder-only transformer: just 2 layers and 8 attention heads, giving a total of roughly 0.4 million parameters.
# Build GPT-2 config with custom vocabulary and architecture
gpt_config = {
"vocab_size": len(wrapped_tokenizer),
"n_positions": len(train_datat[0]), # max sequence length
"n_head": 8,
"n_layer": 2,
"n_embd": 128,
}
config = AutoConfig.from_pretrained(
"gpt2",
**gpt_config
)
# Initialize the model
model = GPT2LMHeadModel(config)
# Report model size (in millions of parameters)
num_params = sum(param.numel() for param in model.parameters())
print(f"Initialized GPT-2 ({num_params / 1e6:.1f}M parameters)")
print(f"Model Configuration:")
print(f" Vocabulary size: {config.vocab_size}")
print(f" Context length: {config.n_positions}")
print(f" Hidden dimension: {config.n_embd}")
print(f" Number of layers: {config.n_layer}")
print(f" Attention heads: {config.n_head}")
print(f" Total parameters: {num_params / 1e6:.2f}M")
Initialized GPT-2 (0.4M parameters)
Model Configuration:
Vocabulary size: 71
Context length: 198
Hidden dimension: 128
Number of layers: 2
Attention heads: 8
Total parameters: 0.43M
Step 4: Training Loop
We train with:
- Gradient accumulation (8 steps) to simulate larger batches
- Cosine learning rate schedule with warmup
- Early stopping to prevent overfitting
- Chromosomal validation split to test generalization across genomic contexts
Since Promoter-GPT is a decoder-only model, we train it autoregressively in a self-supervised fashion. During training, we monitor both the loss and the perplexity, which measures, on average, how uncertain the model is about each next token.
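Perplexity is just the exponential of the mean cross-entropy loss, so it can be read as the effective number of k-mers the model is choosing between at each step. For example:
import math
# A mean cross-entropy of ~1.2 nats corresponds to a perplexity of ~3.3,
# i.e. the model is about as uncertain as a choice among ~3 k-mers
print(math.exp(1.2))  # ~3.32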
First, we define some utility functions for training:
def get_grouped_params(model, weight_decay, no_decay=["bias", "LayerNorm.weight"]):
"""Separate parameters that should and shouldn't have weight decay."""
params_with_wd, params_without_wd = [], []
for n, p in model.named_parameters():
if any(nd in n for nd in no_decay):
params_without_wd.append(p)
else:
params_with_wd.append(p)
return [
{"params": params_with_wd, "weight_decay": weight_decay},
{"params": params_without_wd, "weight_decay": 0.0},
]
def CE_loss(inputs, logits):
"""Calculate cross-entropy loss for autoregressive language modeling."""
# Shift so that tokens < n predict n
shift_labels = inputs[..., 1:].contiguous()
shift_logits = logits[..., :-1, :].contiguous()
# Calculate per-token loss
loss_fct = CrossEntropyLoss(reduction='none')
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
# Resize and average loss per sample
loss_per_sample = loss.view(shift_logits.size(0), shift_logits.size(1)).mean(axis=1)
    # Average over the batch
    weighted_loss = loss_per_sample.mean()
return weighted_loss
class EarlyStopper:
"""Stop training when validation loss stops improving."""
def __init__(self, patience=1, min_delta=0):
self.patience = patience
self.min_delta = min_delta
self.counter = 0
self.min_validation_loss = float('inf')
def early_stop(self, validation_loss):
if validation_loss < self.min_validation_loss:
self.min_validation_loss = validation_loss
self.counter = 0
elif validation_loss > (self.min_validation_loss + self.min_delta):
self.counter += 1
if self.counter >= self.patience:
return True
return False
def evaluate(model, eval_dataloader, accelerator):
"""Evaluate model on validation set."""
model.eval()
losses = []
for batch in eval_dataloader:
with torch.no_grad():
outputs = model(batch, labels=batch)
loss = outputs.loss
gathered_loss = accelerator.gather(loss)
losses.append(gathered_loss)
mean_loss = torch.mean(torch.stack(losses))
try:
perplexity = torch.exp(mean_loss)
except OverflowError:
perplexity = float("inf")
model.train()
return mean_loss.item(), perplexity.item()
Training Setup
Now we configure the training components:
# Hyperparameters
batch_size = 128
weight_decay = 0.02
learning_rate = 6e-4
num_train_epochs = 10
gradient_accumulation_steps = 8
eval_steps = 10
# DataLoaders
train_dataloader = DataLoader(
train_datat, batch_size=batch_size, shuffle=True
)
eval_dataloader = DataLoader(
val_datat, batch_size=batch_size, shuffle=False
)
# Optimizer with weight decay
optimizer = AdamW(
get_grouped_params(model, weight_decay),
lr=learning_rate
)
# Setup Accelerator for distributed training
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
model, optimizer, train_dataloader, eval_dataloader
)
# Learning rate scheduler with warmup
# Note: num_training_steps counts batches; the scheduler is stepped once per
# optimizer update (every gradient_accumulation_steps batches), so the cosine
# decay only traverses part of its cycle over training
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch
lr_scheduler = get_scheduler(
name="cosine",
optimizer=optimizer,
num_warmup_steps=1_000,
num_training_steps=num_training_steps,
)
# Early stopping
early_stopper = EarlyStopper(patience=3, min_delta=1e-3)
print("✅ Setup complete — ready to train!")
✅ Setup complete — ready to train!
Finally, we train the model:
train_loss = []
val_loss = []
completed_steps = 0
global_step = 0
model.train()
with tqdm(total=num_training_steps, desc="Training") as pbar:
for epoch in range(num_train_epochs):
for step, batch in enumerate(train_dataloader, start=1):
# Forward pass
outputs = model(batch)
logits = outputs.logits
loss = CE_loss(batch, logits)
train_loss.append(loss.item())
# Backward pass with gradient accumulation
loss_scaled = loss / gradient_accumulation_steps
accelerator.backward(loss_scaled)
# Optimizer step (every N accumulation steps)
if step % gradient_accumulation_steps == 0:
accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
completed_steps += 1
global_step += 1
pbar.update(1)
# Evaluation
if step % (eval_steps * gradient_accumulation_steps) == 0:
eval_loss, perplexity = evaluate(model, eval_dataloader, accelerator)
val_loss.append(eval_loss)
accelerator.print({
"epoch": epoch,
"step": step,
"loss/train": round(loss.item(), 2),
"loss/eval": round(eval_loss, 2),
"perplexity": round(perplexity, 2)
})
# Early stopping check at end of epoch
if len(val_loss) > 1:
if early_stopper.early_stop(val_loss[-1]):
accelerator.print(f"Early stopping at epoch {epoch}")
break
print("✅ Training complete!")
Training: 0%| | 83/50010 [00:17<14:59:25, 1.08s/it]
{'epoch': 0, 'step': 80, 'loss/train': 4.26, 'loss/eval': 4.26, 'perplexity': 70.85}
If you've completed training, your model is ready to use. Otherwise, you can load a pretrained version:
# If you trained the model yourself:
unwrapped_model = accelerator.unwrap_model(model)
# Or load from a pretrained checkpoint:
# unwrapped_model = GPT2LMHeadModel.from_pretrained("adehoffer/promoter-gpt-model")
# unwrapped_model = unwrapped_model.to(accelerator.device)
Now let's evaluate our model on the test set (chromosomes 7 and 13) to see how well it generalizes to completely unseen genomic regions.
# Create test dataloader
test_dataloader = DataLoader(
test_datat, batch_size=batch_size, shuffle=False
)
# Prepare for evaluation
test_dataloader = accelerator.prepare(test_dataloader)
# Evaluate on test set
print("Evaluating on test set...")
test_loss, test_perplexity = evaluate(unwrapped_model, test_dataloader, accelerator)
print(f"\nTest Set Results:")
print(f" Loss: {test_loss:.4f}")
print(f" Perplexity: {test_perplexity:.2f}")
# Compare with validation results
print(f"\nComparison:")
print(f" Validation Loss: {val_loss[-1]:.4f}")
print(f" Test Loss: {test_loss:.4f}")
print(f" Difference: {abs(test_loss - val_loss[-1]):.4f}")
Evaluating on test set...
Test Set Results:
Loss: 1.1975
Perplexity: 3.31
Comparison:
Validation Loss: 1.1491
Test Loss: 1.1975
Difference: 0.0484
Step 5: Generating Novel DNA Sequences
Now comes the exciting part: using our trained model to generate synthetic promoter sequences that have never existed in nature. This is where we see if the model truly learned the "grammar" of DNA.
The generation is autoregressive: we provide a short "seed" sequence (a biological prompt), and the model predicts the next k-mer, adds it to the sequence, then predicts the next one, continuing until we reach 200bp.
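Under the hood, generation is a simple loop: feed the tokens so far, take the logits for the last position, pick the next k-mer, append it, and repeat. A minimal greedy sketch of that loop (for illustration only; the actual call below uses the built-in generate() with temperature and nucleus sampling):
# Illustration only: greedy autoregressive decoding, one k-mer at a time
seed_ids = wrapped_tokenizer.encode(kmerization("ATGG", k=3), return_tensors="pt").to(accelerator.device)
with torch.no_grad():
    while seed_ids.shape[1] < 198:  # 198 tokens correspond to 200 bp with 3-mers
        logits = unwrapped_model(seed_ids).logits[:, -1, :]  # scores for the next k-mer
        next_id = logits.argmax(dim=-1, keepdim=True)        # most likely k-mer (greedy)
        seed_ids = torch.cat([seed_ids, next_id], dim=1)     # append and repeat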
# Set generation parameters
temperature = 1.0 # Controls randomness (higher = more diverse)
top_p = 0.9 # Nucleus sampling threshold
# Start with a biological seed sequence
prompt = kmerization("ATGG", k=3)
input_ids = wrapped_tokenizer.encode(prompt, return_tensors="pt").to(accelerator.device)
# Generate the sequence
desired_length = 200
with torch.no_grad():
output_ids = unwrapped_model.generate(
input_ids,
        max_length=desired_length - 2,  # 198 tokens: 200 bp minus (k-1) for overlapping 3-mers
min_length=desired_length - 2,
do_sample=True,
temperature=temperature,
top_p=top_p,
eos_token_id=wrapped_tokenizer.eos_token_id,
pad_token_id=wrapped_tokenizer.pad_token_id,
)
The model generates k-mers (e.g., "ATG TGC GCG CGC..."), which we need to convert back to a continuous DNA sequence by taking the first base of every k-mer except the last and then appending the final k-mer in full:
def readable(t):
"""Convert space-separated k-mers back to a continuous DNA sequence."""
kmers = t.split()
return ''.join(kmer[0] for kmer in kmers[:-1]) + (kmers[-1] if kmers else '')
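# Quick round-trip check (illustrative): readable() should exactly invert
# the kmerization() helper from Step 2
assert readable(kmerization("ATGCG", k=3)) == "ATGCG"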
# Decode and display the generated sequence
new_dna = readable(wrapped_tokenizer.decode(output_ids[0], skip_special_tokens=True))
print("Generated promoter sequence:")
print(new_dna)
print(f"Length: {len(new_dna)} bp")
Generated promoter sequence:
ATGGTAGCATTTATAAAAATGACTCCCACTACTATCTCATTTTTAATTCATTATTTGCTCTTCTCCTGTATTTCACCACTTAGATTTTTTTCACTGGTTGAACACACATTCAGGTAAGAAAATAATCTGGTGACAATGGATTACCTCACTCTTCTAGTTTTGTTTCCTTTTGACCCTGATGAGAGGAAAATTTATGCTGC
Length: 200 bp
Each generated sequence is like a sentence written in the language of life—composed of patterns, motifs, and regulatory signals that the model has learned from millions of years of evolution encoded in real genomes. These synthetic promoters represent new genetic instructions that have never existed in nature, yet follow the same grammatical rules that govern gene expression. The question now is: are they biologically functional? That's what we explore in the next step.
Step 6: Exploration
Now that our model can generate DNA sequences, we enter the exploration phase. We want to understand what the model has learned: Are the generated sequences biologically plausible? Do they follow the compositional rules of real promoters?
In this step, we validate the sequences on two key aspects:
- GC content – the proportion of guanine (G) and cytosine (C) nucleotides, which influences DNA stability and gene expression. For human promoters, acceptable GC content typically ranges from 40–60%, with many core promoter regions falling in the 45–55% range.
- Sequence motifs – recurring sequence patterns that may correspond to known biological motifs, such as transcription factor binding sites.
Let's generate 100 promoter sequences and analyze their composition:
temperature = 1.0
top_p = 0.9
prompt = kmerization("ATGG", k=3)
input_ids = wrapped_tokenizer.encode(prompt, return_tensors="pt").to(accelerator.device)
# Generate 100 sequences
desired_length = 200
with torch.no_grad():
output_ids = unwrapped_model.generate(
input_ids,
max_length=desired_length - 2,
min_length=desired_length - 2,
do_sample=True,
temperature=temperature,
top_p=top_p,
eos_token_id=wrapped_tokenizer.eos_token_id,
pad_token_id=wrapped_tokenizer.pad_token_id,
num_return_sequences=100,
)
# Convert all generated sequences to readable DNA format
generated_seqs = [
readable(wrapped_tokenizer.decode(seq, skip_special_tokens=True))
for seq in output_ids
]
print(f"Generated {len(generated_seqs)} sequences")
print(f"Example sequence:\n{generated_seqs[0]}")
Generated 100 sequences
Example sequence:
ATGGGTCACTGTGGACCCCACAGGGGTGGGCAGGGCTGGAGCCATGTTCCTGCAGGGAAGGCACTCCCCAGCCAGAGTCAGGGTTGTGTGCAGGGGACCGGGAGATGCAGGGCTCCCAGAGCTGAGGCCCCTTGCCTGGGTCCAGGGGAGGGCCTTCTGGCCCTCTGGGAGCAGCCCAGCAGGCTTGTGCTGAGCTGTCT
# Check output shape
print(f"Generated tensor shape: {output_ids.shape}")
print(f"(batch_size={output_ids.shape[0]}, sequence_length={output_ids.shape[1]})")
Generated tensor shape: torch.Size([100, 198])
(batch_size=100, sequence_length=198)
We successfully generated 100 sequences, each with 198 tokens (corresponding to 200bp promoters).
# Convert all sequences to readable DNA format
list_new_dna = []
for i in range(output_ids.shape[0]):
    list_new_dna.append(readable(wrapped_tokenizer.decode(output_ids[i], skip_special_tokens=True)))
First, let's check if the generated sequences have biologically realistic GC content:
def gc_content(seq):
"""Calculate the percentage of G and C nucleotides in a sequence."""
gc_count = seq.count('G') + seq.count('C')
return gc_count / len(seq) * 100
# Calculate GC% for all generated sequences
gc_values = [gc_content(seq) for seq in list_new_dna]
# Average GC%
average_gc = sum(gc_values) / len(gc_values)
print(f"Average GC content: {average_gc:.2f}%")
Average GC content: 44.37%
The average GC content (about 44%) falls within the broader 40–60% range expected for human promoters, just below the typical 45–55% core-promoter window, suggesting the model has captured the compositional constraints of real promoter sequences.
Next, we extract the most frequent 6-mer motifs. Why 6-mers? Many transcription factor binding sites are 6-8 nucleotides long, making 6-mers a good window size to capture biologically meaningful patterns without being too specific or too general.
from collections import Counter
k = 6
all_kmers = []
# Extract all 6-mers from generated sequences
for seq in list_new_dna:
for i in range(len(seq) - k + 1):
all_kmers.append(seq[i:i+k])
# Count frequencies
kmer_counts = Counter(all_kmers)
# Most common 10 6-mers
top_6mers = kmer_counts.most_common(10)
print("Top 10 most frequent 6-mers:")
for motif, count in top_6mers:
print(f" {motif}: {count}")
Top 10 most frequent 6-mers:
TTTTTT: 101
AAAAAA: 65
AAAAAT: 29
TTTTCT: 27
AAAATA: 25
CTGCTG: 25
TTTCTT: 25
GCTGGG: 25
CCCAGG: 24
AAGAAA: 23
The top motifs (TTTTTT, AAAAAA) are poly-A/T stretches, which are commonly found in promoter regions. These AT-rich sequences enhance DNA flexibility and are often associated with transcription start sites and TATA boxes—key regulatory elements in gene expression. The model has learned to generate these functionally relevant patterns without explicit instruction.
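As a crude follow-up check (a plain string match, not a proper motif scan), we can also count how many of the 100 generated sequences contain the canonical TATA-box hexamer TATAAA:
# Naive string match for the canonical TATA-box hexamer
tata_hits = sum("TATAAA" in seq for seq in list_new_dna)
print(f"Sequences containing TATAAA: {tata_hits}/{len(list_new_dna)}")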
We've successfully trained a transformer to write the language of DNA. Promoter-GPT learned to generate biologically plausible promoter sequences with realistic GC content and functionally relevant motifs—without being explicitly told what makes a promoter work.
What's Next?
Here are some exciting directions to explore:
- Architecture experiments: Try more layers (4, 6, 8) or different k-mer sizes (k=4, k=5, k=6); a vocabulary sketch for k=4 follows this list
- Advanced tokenization: Implement BPE to learn data-driven tokens instead of fixed k-mers
- New genomic regions: Train on coding sequences, enhancers, or silencers
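For the k-mer size experiment mentioned above, only the vocabulary construction changes. A sketch for k=4, reusing special_tokens from Step 2:
# Vocabulary for k=4: 4^4 = 256 k-mers + 7 special tokens = 263 entries
k4_mers = [''.join(p) for p in itertools.product('ATGC', repeat=4)]
vocab_k4 = {tok: i for i, tok in enumerate(special_tokens + k4_mers)}
print(len(vocab_k4))  # 263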
The most exciting question remains: Would these synthetic sequences actually work in a cell?
The next step would be validating them computationally with activity prediction models, and ultimately testing them experimentally in the lab.
If DNA is truly a language, we've shown that transformers can learn its basic grammar. The next challenge is teaching them to write not just grammatically correct sequences, but functionally meaningful ones.
Author:
Adele De Hoffer
PhD Student in Systems and Synthetic Biology at di Bernardo Lab 🧬
TIGEM – Telethon Institute of Genetics and Medicine, Pozzuoli, Italy
SSM – Scuola Superiore Meridionale, Naples, Italy