---
title: MVP
emoji: 🏆
colorFrom: blue
colorTo: pink
sdk: streamlit
app_file: app.py
pinned: false
short_description: msms annotation tool
python_version: 3.11.7
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# 🏆 MultiView Projection (MVP) for Spectra Annotation
### Authors
**Yan Zhou Chen, Soha Hassoun**
Department of Computer Science, Tufts University
---
MVP is a framework for **ranking molecular candidates given a spectrum**. This repository provides the official implementation, pretrained models, and utilities for data preparation and training.
---
## 📑 Table of Contents
0. [Quick Test](#quick-test)
1. [Install & Setup](#install--setup)
2. [Data Preparation](#data-prep)
3. [Using the Pretrained Model](#use-our-pretrained-model)
4. [MassSpecGym Data Download](#massspecgym-data-download)
5. [Training from Scratch](#training-from-scratch)
6. [References](#references)
---
## 🚀 Quick Test
Run MVP instantly with our [interactive app](https://huggingface.co/spaces/HassounLab/MVP) for small-scale experiments.
---
## ⚙️ Install & setup
1. Clone the repository: `git clone https://huggingface.co/spaces/HassounLab/MVP/`
2. Install the full environment, or install only the key packages listed below:
```
conda create -n mvp python=3.11
conda activate mvp
pip install -r requirements.txt
```
#### Key packages
- python
- dgl
- pytorch
- rdkit
- pytorch-geometric
- numpy
- scikit-learn
- scipy
- massspecgym
- lightning
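
After installation, a quick check that the key packages can be located may save debugging time later. The sketch below is not part of the repository. The import names in `KEY_MODULES` are assumptions based on common conventions (for example, `torch` for pytorch and `sklearn` for scikit-learn) and may need adjusting for your environment.

```python
import importlib.util

# Assumed mapping from package name to import name. These are guesses based
# on common conventions, not values read from requirements.txt.
KEY_MODULES = {
    "dgl": "dgl",
    "pytorch": "torch",
    "rdkit": "rdkit",
    "pytorch-geometric": "torch_geometric",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "massspecgym": "massspecgym",
    "lightning": "lightning",
}


def find_missing(modules=KEY_MODULES):
    """Return the package names whose import module cannot be located."""
    missing = []
    for package, module in modules.items():
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All key packages were found.")
```

Run it inside the activated `mvp` environment. It only checks that each module can be found, not that the installed versions are compatible.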
---
## 📂 Data prep
We provide sample spectra data and candidates in `data/sample`.
For preprocessing:
1. If using formSpec, compute subformula labels.
2. Run our preprocessing code to obtain the fingerprint and consensus spectra files.
```
# If using formSpec
python subformula_assign/assign_subformulae.py --spec-files ../data/sample/data.tsv --output-dir ../data/sample/subformulae_default --max-formulae 60 --labels-file ../data/sample/data.tsv
python data_preprocess.py --spec_type formSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --subformula_dir_pth ../data/sample/subformulae_default/ --output_dir ../data/sample/
# If using binnedSpec
python data_preprocess.py --spec_type binnedSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --output_dir ../data/sample/
```
We include sample subformula, fingerprint, and consensus spectra data in `data/sample/`.
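
The two preprocessing invocations above differ only in `--spec_type` and in the extra `--subformula_dir_pth` argument that formSpec needs. The following sketch illustrates that difference by assembling the argument list in Python. The helper `build_preprocess_cmd` is hypothetical and not part of the repository; only the script name and flags are taken from the commands above.

```python
import sys


def build_preprocess_cmd(spec_type, dataset_pth, candidates_pth, output_dir,
                         subformula_dir_pth=None):
    """Assemble the data_preprocess.py call for one spectrum representation.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    if spec_type not in ("formSpec", "binnedSpec"):
        raise ValueError(f"unknown spec_type: {spec_type!r}")
    cmd = [
        sys.executable, "data_preprocess.py",
        "--spec_type", spec_type,
        "--dataset_pth", dataset_pth,
        "--candidates_pth", candidates_pth,
    ]
    if spec_type == "formSpec":
        # formSpec additionally needs the directory of subformula labels.
        if subformula_dir_pth is None:
            raise ValueError("formSpec requires subformula_dir_pth")
        cmd += ["--subformula_dir_pth", subformula_dir_pth]
    cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_preprocess_cmd(
        "binnedSpec",
        "../data/sample/data.tsv",
        "../data/sample/candidates_mass.json",
        "../data/sample/",
    )
    # Print the command; pass the list to subprocess.run to execute it.
    print(" ".join(cmd))
```

To execute the command rather than print it, pass the returned list to `subprocess.run(cmd, check=True)` from the directory that contains `data_preprocess.py`.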
## Use our pretrained model
You can use our model, pretrained on MassSpecGym, to rank molecular candidates by providing spectra data and a list of candidates.
After preparing your data, set your dataset paths in `params_binnedSpec.yaml` or `params_formSpec.yaml`, then run the matching command:
```
# If using formSpec
python test.py --param_pth params_formSpec.yaml
# If using binnedSpec
python test.py --param_pth params_binnedSpec.yaml
```
We provide a notebook showing sample result files in `notebooks/demo.ipynb`.
---
## MassSpecGym data download
Our model is trained on the [MassSpecGym dataset](https://github.com/pluskal-lab/MassSpecGym). Follow their instructions to download the spectra and candidate datasets.
You can preprocess the MassSpecGym dataset as described in the data prep section above, or download the preprocessed files as follows:
```
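mkdir -p data/msgym/
cd data/msgym
# Quote the URL so the shell does not interpret "?"; -O sets a clean output filename
wget -O msgym_preprocessed.zip "https://zenodo.org/records/15223987/files/msgym_preprocessed.zip?download=1"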
```
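
If `wget` is not available, the same download can be sketched with the Python standard library. The helpers `download` and `extract` below are hypothetical and not part of the repository, and the internal layout of the archive is not documented here, so inspect the returned member names before pointing the params files at the extracted data.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command above.
URL = "https://zenodo.org/records/15223987/files/msgym_preprocessed.zip?download=1"


def download(url, dest):
    """Download url to dest, skipping the transfer if dest already exists."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive, out_dir):
    """Unpack a zip archive into out_dir and return the member names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        return zf.namelist()
```

A typical call would be `extract(download(URL, "data/msgym/msgym_preprocessed.zip"), "data/msgym")`, run from the repository root.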
## Training from scratch
To train a model from scratch:
1. Prepare the data as described in the data prep section.
2. Modify the configuration in the params file as necessary.
3. Train using the matching command:
```
# If using formSpec
python train.py --param_pth params_formSpec.yaml
# If using binnedSpec
python train.py --param_pth params_binnedSpec.yaml
```
---
## 📚 References
Preprint: [Learning from All Views: A Multiview Contrastive Framework for Metabolite Annotation](https://www.biorxiv.org/content/10.1101/2025.11.12.688047v1)
---
## 📧 Contact
For questions, reach out to: [email protected]