---
library_name: biomed-multi-omic
license: apache-2.0
tags:
- Biology
- DNA
# datasets:
# - PanglaoDB
# - CELLxGENE
---

# ibm-research/biomed.dna.ref.modernbert.113m.v1

Biomedical foundation models for omics data. This package supports the development of foundation models for scRNA or DNA data.

`biomed-multi-omic` enables development and testing of foundation models for DNA sequences and for RNA expression,
with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface.
`biomed-multi-omic` leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.

- 🧬 A single package for DNA and RNA foundation models. scRNA pretraining on h5ad files or TileDB (e.g. CELLxGENE); DNA pretraining on the human reference genome (GRCh38/hg38) and on a variant-imputed genome based on common SNPs available from the GWAS Catalog and ClinVar datasets.
- 🚀 Leverages the latest open source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
- 📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
- Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.

For details on how the models were trained, please refer to [the BMFM-DNA preprint](https://arxiv.org/abs/2507.05265).

- **Developers:** IBM Research
- **GitHub Repository:** [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic)
- **Paper:** [BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects](https://arxiv.org/abs/2507.05265)
- **Release Date:** Jun 26th, 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Checkpoint

**BMFM-DNA-REF**

The pre-training samples were prepared by extracting DNA sequences of random lengths (between 1kb and 10kb) consecutively from the human reference genome. Sequences were excluded if all of their nucleotides were “N”. To further enrich the diversity of the training set, we repeated the whole-genome random sampling 10 times. For each DNA sequence sample, we also created its reverse complement as a counterpart, leading to a total of 9,982,678 samples that cover the human genome roughly 20 times, or about 60 billion nucleotides.

For full details see section 3.1.1 of [the BMFM-DNA manuscript](https://arxiv.org/abs/2507.05265).
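
The sampling procedure described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`reverse_complement`, `sample_chunks`, `build_samples`) are hypothetical, chunk lengths are assumed to be drawn uniformly, and the final chunk of a chromosome is assumed to be kept even if it is shorter than the minimum length. The manuscript should be treated as the authoritative description.

```python
import random

# Complement table; "N" (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_chunks(chromosome, rng, min_len=1_000, max_len=10_000):
    """Yield consecutive, non-overlapping chunks of random length (one pass)."""
    start = 0
    while start < len(chromosome):
        # Assumption: lengths are uniform on [min_len, max_len], inclusive.
        length = rng.randint(min_len, max_len)
        chunk = chromosome[start:start + length]
        start += length
        # Exclude chunks in which every nucleotide is "N".
        if set(chunk.upper()) == {"N"}:
            continue
        yield chunk


def build_samples(chromosome, n_passes=10, seed=0, min_len=1_000, max_len=10_000):
    """Repeat the random tiling n_passes times and add reverse complements."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_passes):
        for chunk in sample_chunks(chromosome, rng, min_len, max_len):
            samples.append(chunk)
            samples.append(reverse_complement(chunk))
    return samples


if __name__ == "__main__":
    # Toy "chromosome" with an all-N gap; real inputs would be GRCh38 contigs.
    toy = "ACGT" * 50 + "N" * 100 + "GGCC" * 25
    toy_samples = build_samples(toy, n_passes=10, seed=0, min_len=10, max_len=20)
    coverage = sum(len(s) for s in toy_samples) / len(toy)
    print(f"{len(toy_samples)} samples, about {coverage:.1f}x coverage")
```

Ten tiling passes, each doubled by its reverse complement, account for the roughly 20-fold coverage quoted above; coverage falls slightly below 20 wherever all-N chunks are dropped.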

## Usage

Using `biomed.dna.ref.modernbert.113m.v1` requires the codebase [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic).

For installation, please follow the [instructions on GitHub](https://github.com/BiomedSciAI/biomed-multi-omic?tab=readme-ov-file#installation).

## DNA Inference

To get embeddings for DNA sequences, run:

```bash
export INPUT_DIRECTORY=... # path to your DNA sequences files
bmfm-targets-run -cn dna_predict input_directory=$INPUT_DIRECTORY working_dir=/tmp checkpoint=ibm-research/biomed.dna.ref.modernbert.113m.v1
```

For more details, see the [DNA tutorials on GitHub](https://github.com/BiomedSciAI/biomed-multi-omic?tab=readme-ov-file#dna-inference).

## Citation

To cite this model, please cite the following article:

```bibtex
@misc{li2025bmfmdnasnpawarednafoundation,
      title={BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects},
      author={Hongyang Li and Sanjoy Dey and Bum Chul Kwon and Michael Danziger and Michal Rosen-Tzvi and Jianying Hu and James Kozloski and Ching-Huei Tsou and Bharath Dandala and Pablo Meyer},
      year={2025},
      eprint={2507.05265},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN},
      url={https://arxiv.org/abs/2507.05265},
}
```