---
library_name: biomed-multi-omic
license: apache-2.0
tags:
- Biology
- DNA
# datasets:
# - PanglaoDB
# - CELLxGENE
---

# ibm-research/biomed.dna.ref.modernbert.113m.v1

Biomedical foundation models for omics data. This package supports the development of foundation models for scRNA or for DNA data.

`biomed-multi-omic` enables development and testing of foundation models for DNA sequences and for RNA expression, with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface. `biomed-multi-omic` leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.

- 🧬 A single package for DNA and RNA foundation models. scRNA pretraining on h5ad files or TileDB (e.g. CellXGene); DNA pretraining on the human reference genome (GRCh38/hg38) and on a variant-imputed genome based on common SNPs available from the GWAS Catalog and ClinVar datasets.
- 🚀 Leverages the latest open source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
- 📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
- Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.

For details on how the models were trained, please refer to [the BMFM-DNA preprint](https://arxiv.org/abs/2507.05265).
- **Developers:** IBM Research
- **GitHub Repository:** [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic)
- **Paper:** [BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects](https://arxiv.org/abs/2507.05265)
- **Release Date**: Jun 26th, 2025
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Checkpoint

**BMFM-DNA-REF**

The pre-training samples were prepared by extracting DNA sequences of random lengths (between 1kb and 10kb) consecutively from the human reference genome. Sequences were excluded if all nucleotides were "N". To further enrich the diversity of the training set, we repeated the whole-genome random sampling 10 times. For each DNA sequence sample, we also created the reverse complement sequence as its counterpart, leading to a total of 9,982,678 samples that cover the human genome roughly 20 times, or about 60 billion nucleotides. For full details see section 3.1.1 of [the BMFM-DNA manuscript](https://arxiv.org/abs/2507.05265).

## Usage

Using `biomed.dna.ref.modernbert.113m.v1` requires the codebase [https://github.com/BiomedSciAI/biomed-multi-omic](https://github.com/BiomedSciAI/biomed-multi-omic).

For installation, please follow the [instructions on github](https://github.com/BiomedSciAI/biomed-multi-omic?tab=readme-ov-file#installation).

## DNA Inference

To get embeddings for DNA sequences, run:

```bash
export INPUT_DIRECTORY=... # path to your DNA sequences files
bmfm-targets-run -cn dna_predict input_directory=$INPUT_DIRECTORY working_dir=/tmp checkpoint=ibm-research/biomed.dna.ref.modernbert.113m.v1
```

For more details see the [DNA tutorials on github](https://github.com/BiomedSciAI/biomed-multi-omic?tab=readme-ov-file#dna-inference).
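As an aside on the Checkpoint section above, the sample preparation it describes (consecutive chunks of random length, all-"N" chunks dropped, several passes, reverse complements added) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names, the `seed` parameter, the handling of a final chunk shorter than `min_len`, and the use of a plain string for a chromosome are all assumptions made for illustration.

```python
import random

# Complement table; "N" (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def sample_chunks(chrom_seq, rng, min_len=1_000, max_len=10_000):
    """Walk along one sequence, yielding consecutive chunks of random length.

    Assumption: the final chunk may be shorter than min_len and is kept.
    """
    pos = 0
    while pos < len(chrom_seq):
        length = rng.randint(min_len, max_len)  # inclusive bounds
        chunk = chrom_seq[pos:pos + length]
        pos += length
        if set(chunk.upper()) == {"N"}:
            continue  # exclude chunks made up entirely of "N"
        yield chunk


def build_samples(chrom_seq, n_passes=10, min_len=1_000, max_len=10_000, seed=0):
    """Repeat the random chunking n_passes times and add reverse complements."""
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    samples = []
    for _ in range(n_passes):
        for chunk in sample_chunks(chrom_seq, rng, min_len, max_len):
            samples.append(chunk)
            samples.append(reverse_complement(chunk))
    return samples


if __name__ == "__main__":
    # Toy stand-in for a chromosome: an unsequenced gap followed by real bases.
    toy = "N" * 50 + "ACGT" * 100
    samples = build_samples(toy, n_passes=2, min_len=10, max_len=20)
    print(len(samples), "samples; first:", samples[0])
```

Because each pass draws new random chunk lengths, the chunk boundaries differ between passes, which is what makes repeating the whole-genome sampling add diversity rather than duplicates.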
## Citation

To cite the tool, please cite the following article:

```bibtex
@misc{li2025bmfmdnasnpawarednafoundation,
  title={BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects},
  author={Hongyang Li and Sanjoy Dey and Bum Chul Kwon and Michael Danziger and Michal Rosen-Tzvi and Jianying Hu and James Kozloski and Ching-Huei Tsou and Bharath Dandala and Pablo Meyer},
  year={2025},
  eprint={2507.05265},
  archivePrefix={arXiv},
  primaryClass={q-bio.GN},
  url={https://arxiv.org/abs/2507.05265},
}
```