---
title: MVP
emoji: 🏆
colorFrom: blue
colorTo: pink
sdk: streamlit
app_file: app.py
pinned: false
short_description: msms annotation tool
python_version: 3.11.7
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# MultiView Projection (MVP) for Spectra Annotation
### Yan Zhou Chen, Soha Hassoun
#### Department of Computer Science, Tufts University
This repository provides the implementation of MultiView Projection (MVP). MVP can be used to rank a set of molecular candidates given a spectrum.
## Table of Contents
1. [Install & setup]
2. [Data prep]
3. [Use our pretrained model]
4. [MassSpecGym data download]
5. [Training from scratch]
6. [References]
## Install & setup
1. Clone the repository: `git clone <REPO_link>`
2. Install the environment, or install only the key packages listed below:
```
conda env create -f environment.yml
```
#### Key packages
- python
- dgl
- pytorch
- rdkit
- pytorch-geometric
- numpy
- scikit-learn
- scipy
- massspecgym
- lightning
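If you install only the key packages rather than the full environment, a quick check like the one below can confirm that each one is importable. This is a minimal sketch, not part of the repository: the import names (for example `torch` for pytorch and `sklearn` for scikit-learn) are assumed from each package's usual convention, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Package name (as listed above) -> import name.
# The import names are assumptions based on each package's usual convention.
KEY_MODULES = {
    "dgl": "dgl",
    "pytorch": "torch",
    "rdkit": "rdkit",
    "pytorch-geometric": "torch_geometric",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "massspecgym": "massspecgym",
    "lightning": "lightning",
}


def missing_packages(modules=KEY_MODULES):
    """Return the package names whose import module cannot be located."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All key packages found.")
```

The check only locates each module without importing it, so it does not verify versions or GPU support.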
## Data prep
We provide sample spectra data and candidates in `data/sample`.
For preprocessing:
1. If using formSpec, compute subformula labels
2. Run our preprocessing code to obtain the fingerprint and consensus spectra files
```
# If using formSpec
python subformula_assign/assign_subformulae.py --spec-files ../data/sample/data.tsv --output-dir ../data/sample/subformulae_default --max-formulae 60 --labels-file ../data/sample/data.tsv
python data_preprocess.py --spec_type formSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --subformula_dir_pth ../data/sample/subformulae_default/ --output_dir ../data/sample/
# If using binnedSpec
python data_preprocess.py --spec_type binnedSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --output_dir ../data/sample/
```
Sample preprocessed outputs (subformula, fingerprint, and consensus spectra data) are also included in `data/sample/`.
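When preprocessing several datasets, it can help to assemble the `data_preprocess.py` call programmatically. The sketch below uses only the flags shown in the commands above. `preprocess_cmd` is a hypothetical helper, not part of the repository, and it assumes that `--subformula_dir_pth` is needed only for formSpec.

```python
import shlex


def preprocess_cmd(spec_type, dataset_pth, candidates_pth, output_dir,
                   subformula_dir_pth=None):
    """Build the data_preprocess.py argument list for one spectrum type."""
    if spec_type not in ("formSpec", "binnedSpec"):
        raise ValueError(f"unknown spec_type: {spec_type!r}")
    cmd = [
        "python", "data_preprocess.py",
        "--spec_type", spec_type,
        "--dataset_pth", dataset_pth,
        "--candidates_pth", candidates_pth,
    ]
    if spec_type == "formSpec":
        # Assumption: formSpec needs the subformula labels computed beforehand.
        if subformula_dir_pth is None:
            raise ValueError("formSpec requires subformula_dir_pth")
        cmd += ["--subformula_dir_pth", subformula_dir_pth]
    cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    cmd = preprocess_cmd(
        "binnedSpec",
        "../data/sample/data.tsv",
        "../data/sample/candidates_mass.json",
        "../data/sample/",
    )
    # Print the command for inspection instead of running it.
    print(shlex.join(cmd))
```

To execute the command rather than print it, pass the returned list to `subprocess.run(cmd, check=True)` from the directory that contains `data_preprocess.py`.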
## Use our pretrained model
You can use our model pretrained on MassSpecGym to rank molecular candidates: provide the spectra data and a list of candidates.
After preparing your data, update `params_binnedSpec.yaml` or `params_formSpec.yaml` with your dataset paths, then run:
```
# If using formSpec
python test.py --param_pth params_formSpec.yaml
# If using binnedSpec
python test.py --param_pth params_binnedSpec.yaml
```
We provide a notebook showing sample result files in `notebooks/demo.ipynb`.
## MassSpecGym data download
Our model is trained on the [MassSpecGym dataset](https://github.com/pluskal-lab/MassSpecGym). Follow their instructions to download the spectra and candidate datasets.
You can preprocess the MassSpecGym dataset as described in the Data prep section, or download the preprocessed files as follows:
```
mkdir data/msgym/
cd data/msgym
wget
wget
```
## Training from scratch
To train a model from scratch:
1. Prepare data as described in the data prep section
2. Modify the configuration in the params file as necessary
3. Train using the following commands:
```
# If using formSpec
python train.py --param_pth params_formSpec.yaml
# If using binnedSpec
python train.py --param_pth params_binnedSpec.yaml
```
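The train and test commands differ only in the script name and the params file, so a small wrapper can select both. This is a sketch under that assumption. `PARAM_FILES` and `run_cmd` are hypothetical names, not part of the repository.

```python
# Params file for each spectrum type, as named in the commands above.
PARAM_FILES = {
    "formSpec": "params_formSpec.yaml",
    "binnedSpec": "params_binnedSpec.yaml",
}


def run_cmd(mode, spec_type):
    """Build the train.py or test.py argument list for a spectrum type."""
    if mode not in ("train", "test"):
        raise ValueError(f"mode must be 'train' or 'test', got {mode!r}")
    if spec_type not in PARAM_FILES:
        raise ValueError(f"unknown spec_type: {spec_type!r}")
    return ["python", f"{mode}.py", "--param_pth", PARAM_FILES[spec_type]]


if __name__ == "__main__":
    # Show the commands for both spectrum types without running them.
    for spec_type in PARAM_FILES:
        print(" ".join(run_cmd("train", spec_type)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the run and raise an error if the script exits with a nonzero status.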
## References
#### Contact
[email protected]