yzhouchen001 committed on
Commit
78ba665
·
1 Parent(s): 6937578
Files changed (3)
  1. README.md +95 -1
  2. app.py +203 -0
  3. utils_app.py +146 -0
README.md CHANGED
@@ -3,7 +3,7 @@ title: MVP
  emoji: 🏆
  colorFrom: blue
  colorTo: pink
- sdk: gradio
+ sdk: streamlit
  sdk_version: 5.49.1
  app_file: app.py
  pinned: false
@@ -11,3 +11,97 @@ short_description: msms annotation tool
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # MultiView Projection (MVP) for Spectra Annotation
+
+ ### Yan Zhou Chen, Soha Hassoun
+ #### Department of Computer Science, Tufts University
+ This repository provides the implementation of MultiView Projection (MVP). MVP can be used to rank a set of molecular candidates given a spectrum.
+
+ ## Table of Contents
+ 1. [Install & setup]
+ 2. [Data prep]
+ 3. [MassSpecGym data download]
+ 4. [Use our pretrained model]
+ 5. [Training from scratch]
+ 6. [References]
+
+ ## Install & setup
+ 1. Clone the repository: git clone <REPO_link>
+ 2. Install the environment, or install only the key packages (a quick import check is sketched after the list below):
+ ```
+ conda env create -f environment.yml
+ ```
+ #### Key packages
+ - python
+ - dgl
+ - pytorch
+ - rdkit
+ - pytorch-geometric
+ - numpy
+ - scikit-learn
+ - scipy
+ - massspecgym
+ - lightning
+
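+ To verify the installation, the imports below should succeed. This is a minimal sketch; the import names (e.g. `torch_geometric`, `lightning`) are assumptions about how the key packages expose themselves:
+ ```
+ # Quick environment check (Python); import names are assumptions for some packages
+ import torch, dgl, rdkit, numpy, scipy, sklearn
+ import torch_geometric, massspecgym, lightning
+ print("torch", torch.__version__, "| dgl", dgl.__version__, "| rdkit", rdkit.__version__)
+ ```
+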
+ ## Data prep
+ We provide sample spectra data and candidates in `data/sample`.
+ For preprocessing:
+ 1. If using formSpec, compute subformula labels
+ 2. Run our preprocessing code to obtain fingerprints and consensus spectra files
+
+ ```
+ # If using formSpec
+ python subformula_assign/assign_subformulae.py --spec-files ../data/sample/data.tsv --output-dir ../data/sample/subformulae_default --max-formulae 60 --labels-file ../data/sample/data.tsv
+ python data_preprocess.py --spec_type formSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --subformula_dir_pth ../data/sample/subformulae_default/ --output_dir ../data/sample/
+
+ # If using binnedSpec
+ python data_preprocess.py --spec_type binnedSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --output_dir ../data/sample/
+ ```
+ We include sample subformula, fingerprint, and consensus spectra data in `../data/sample/`.
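+
+ The spectra table (`data.tsv`) is tab-separated. A minimal sketch for inspecting it is below; the column list is taken from the preprocessing code used by our web app, and your own table may carry additional columns:
+ ```
+ import pandas as pd
+
+ # Inspect the sample spectra table (a sketch, not part of the pipeline)
+ df = pd.read_csv("../data/sample/data.tsv", sep="\t")
+ print(df.columns.tolist())
+ # expected columns: identifier, formula, adduct, precursor_mz, precursor_formula, mzs, intensities, fold
+ ```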
+
+ ## Use our pretrained model
+ You can use our pretrained model (trained on MassSpecGym) to rank molecular candidates by providing the spectra data and a list of candidates.
+
+ After prepping your data, modify `params_binnedSpec.yaml` or `params_formSpec.yaml` with your dataset paths (a sketch of the relevant keys follows the commands below), then run:
+
+ ```
+ # If using formSpec
+ python test.py --param_pth params_formSpec.yaml
+
+ # If using binnedSpec
+ python test.py --param_pth params_binnedSpec.yaml
+ ```
+
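+ The keys below are the dataset-related entries read by our code; a minimal sketch of editing them programmatically (paths are placeholders, and rewriting the YAML this way drops any comments in the file):
+ ```
+ import yaml
+
+ # Point the params file at your own data (key names taken from the config code)
+ with open("params_formSpec.yaml") as f:
+     params = yaml.safe_load(f)
+ params["dataset_pth"] = "../data/sample/data.tsv"
+ params["candidates_pth"] = "../data/sample/candidates_mass.json"
+ params["subformula_dir_pth"] = "../data/sample/subformulae_default/"
+ with open("params_formSpec.yaml", "w") as f:
+     yaml.safe_dump(params, f)
+ ```
+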
+ We provide a notebook showing sample result files in `notebooks/demo.ipynb`.
+
+ ## MassSpecGym data download
+ Our model is trained on the [MassSpecGym dataset](https://github.com/pluskal-lab/MassSpecGym). Follow their instructions to download the spectra and candidate dataset.
+
+ You can preprocess the MassSpecGym dataset as described in the section above, or download the preprocessed files as follows:
+ ```
+ mkdir data/msgym/
+ cd data/msgym
+ wget
+ wget
+ ```
+ ## Training from scratch
+ To train a model from scratch:
+ 1. Prepare data as described in the data prep section
+ 2. Modify the configuration in the params file as necessary
+ 3. Train using the following:
+ ```
+ # If using formSpec
+ python train.py --param_pth params_formSpec.yaml
+
+ # If using binnedSpec
+ python train.py --param_pth params_binnedSpec.yaml
+ ```
+
+ ## References
+
+
+ #### Contact
+
app.py ADDED
@@ -0,0 +1,203 @@
+ import streamlit as st
+ import pandas as pd
+ import json
+ import tempfile
+ import os
+
+ # ==============================
+ # App Configuration
+ # ==============================
+ st.set_page_config(
+     page_title="MVP",
+     page_icon="",
+     layout="centered"
+ )
+
+ # initialize session state
+ if 'example_mgf' not in st.session_state:
+     st.session_state['example_mgf'] = None
+ if 'example_json' not in st.session_state:
+     st.session_state['example_json'] = None
+
+ # ==============================
+ # Introductory Section
+ # ==============================
+ st.title("MVP Playground")
+
+ st.markdown("""
+ This web app lets you test our trained model on your own data.
+
+ ### 📚 References
+ 🔗 **Paper:** [Read the publication here](https://github.com/HassounLab/MVP)
+ 📦 **Source Code:** [GitHub Repository](https://github.com/HassounLab/MVP)
+
+ ---
+
+ ### 🧠 Available Models
+ We have two models trained on the [MassSpecGym](https://github.com/pluskal-lab/MassSpecGym) training dataset:
+ - **binnedSpec** – trained on binned spectra and does not require formula information.
+ - **formSpec** – our main model trained on spectra with subformula annotations. Requires formula and adduct information.
+
+ ---
+
+ ### ⚙️ Instructions
+
+ 1. **Prepare two input files:**
+    - **Spectra file (.mgf)** – your experimental spectra data.
+    - **Candidates file (.json)** – candidate molecules for each spectrum.
+
+ 2. **Select a model** from the dropdown.
+
+ 3. **Click “Run Prediction”** to start processing.
+    ⚠️ **Note:** For fair usage, the web app limits computation to **1,000 pairs**. Each pair consists of one spectrum and one candidate molecule.
+
+ 4. After processing, you’ll receive a downloadable **CSV file** with your results.
+
+ ---
+
+ ### 📁 Example Input Files
+
+ You can download example files to understand the required format:
+ - [Download sample spectra (MGF)](data/app/data.mgf)
+ - [Download sample candidates (JSON)](data/app/identifier_to_candidates.json)
+
+ Here's an example of the spectra file format (.mgf):
+ ```
+ BEGIN IONS
+ TITLE=example_spectrum
+ PEPMASS=100.0
+ CHARGE=1+
+ FORMULA=C10H12O2 # optional, required for formSpec model
+ ADDUCT=[M+H]+ # optional, required for formSpec model
+ 100.0 1000
+ 101.0 1500
+ 102.0 2000
+ END IONS
+ ```
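+
+ And a minimal example of the candidates file format (.json): each key is a spectrum identifier (the MGF `TITLE`) and each value is a list of candidate molecules. The SMILES strings below are only illustrative; see the sample candidates file for the exact encoding we use:
+ ```
+ {
+     "example_spectrum": ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(C)cc1"],
+     "another_spectrum": ["c1ccccc1O"]
+ }
+ ```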
+ ---
+
+ ### 💡 Tip
+ If you want to process **more than 1,000 pairs**,
+ please **clone the repository** and run it locally with GPU support for faster computation.
+ """)
+
+ # ==============================
+ # File Upload Section
+ # ==============================
+ st.subheader("📤 Upload Your Files")
+
+
+ # --- File uploaders ---
+ mgf_file = st.file_uploader("Upload spectra file (.mgf)", type=["mgf"])
+ json_file = st.file_uploader("Upload candidates file (.json)", type=["json"])
+
+ # --- Example files button ---
+ if st.button("Use Example Files"):
+     with open("data/app/data.mgf", "rb") as f:
+         st.session_state["example_mgf"] = f.read()
+     with open("data/app/identifier_to_candidates.json", "rb") as f:
+         st.session_state["example_json"] = f.read()
+     st.success("✅ Example files loaded!")
+
+ # --- Determine which files to use ---
+ if mgf_file is not None:
+     mgf_bytes = mgf_file.read()
+ elif "example_mgf" in st.session_state:
+     mgf_bytes = st.session_state["example_mgf"]
+ else:
+     mgf_bytes = None
+
+ if json_file is not None:
+     json_bytes = json_file.read()
+ elif "example_json" in st.session_state:
+     json_bytes = st.session_state["example_json"]
+ else:
+     json_bytes = None
+
+ # --- Display results ---
+ if mgf_bytes and json_bytes:
+     st.success("Files are ready to use!")
+ else:
+     st.info("Please upload your files or click 'Use Example Files'.")
+
+
+ # ==============================
+ # Model Selection and Run Button
+ # ==============================
+ model_choice = st.selectbox(
+     "Select model to use:",
+     options=["binnedSpec", "formSpec"]
+ )
+
+ run_button = st.button("🚀 Run Prediction")
+
+ # ==============================
+ # Run Prediction
+ # ==============================
+ if run_button:
+     if not mgf_bytes or not json_bytes:
+         st.error("Please upload both a spectra (.mgf) and candidates (.json) file.")
+     else:
+         with st.spinner("Running predictions... please wait ⏳", show_time=True):
+             # Save uploaded files to temporary paths
+             st.write("Saving files to temporary paths...")
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".mgf") as tmp_mgf:
+                 tmp_mgf.write(mgf_bytes)
+                 mgf_path = tmp_mgf.name
+
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp_json:
+                 tmp_json.write(json_bytes)
+                 candidates_pth = tmp_json.name
+
+             # Check number of pairs in candidates file
+             st.write("Checking number of pairs in candidates file...")
+             with open(candidates_pth, 'r') as f:
+                 candidates_data = json.load(f)
+             total_pairs = sum(len(cands) for cands in candidates_data.values())
+             if total_pairs > 1000:
+                 st.error(f"⚠️ Too many pairs ({total_pairs})! Please limit to 1,000 pairs for the web app.")
+                 st.stop()
+
+             # preprocess spectra
+             st.write("Preprocessing spectra...")
+             from utils_app import preprocess_spectra, setup_config, run_inference
+             dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20)
+
+             if dataset_pth is None:
+                 st.error("Error in preprocessing spectra. Please check your input files.")
+                 if model_choice == "formSpec":
+                     st.info("Make sure that for the 'formSpec' model, each spectrum has 'formula' and 'adduct' metadata.")
+                 st.stop()
+
+             # Prepare model config paths
+             st.write("Preparing model config paths...")
+             params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
+
+             try:
+                 st.write("Running inference...")
+                 run_inference(params)
+             except Exception as e:
+                 st.error(f"Error running model inference: {e}")
+                 st.stop()
+
+             # Convert to CSV
+             st.write("Converting to CSV...")
+             df = pd.read_pickle(params['df_test_path'])
+             csv_path = params['df_test_path'].replace(".pkl", ".csv")
+             df.to_csv(csv_path, index=False)
+
+             st.success(f"✅ Done! Model: {model_choice}")
+             st.download_button(
+                 label="📥 Download Results CSV",
+                 data=open(csv_path, "rb").read(),
+                 file_name=os.path.basename(csv_path),
+                 mime="text/csv"
+             )
+
+             st.info("To run larger datasets or enable GPU acceleration, please clone the repo and run locally.")
+
+ # ==============================
+ # Footer
+ # ==============================
+ st.markdown("---")
utils_app.py ADDED
@@ -0,0 +1,146 @@
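+ # Helper utilities for the MVP Streamlit app:
+ #   preprocess_spectra – convert an .mgf file into the TSV (and, for formSpec, per-spectrum subformula JSON files) the model expects
+ #   setup_config       – load a params YAML and point it at the uploaded data and pretrained checkpoint
+ #   run_inference      – build the data module and run the pretrained model with trainer.test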
+ from matchms.importing import load_from_mgf
+ import yaml
+ import numpy as np
+ from mvp.subformula_assign.utils.spectra_utils import assign_subforms
+ import tempfile
+ import json
+ import os
+ from functools import partial
+ from pytorch_lightning import Trainer
+ from massspecgym.models.base import Stage
+ from mvp.data.data_module import TestDataModule
+ from mvp.data.datasets import ContrastiveDataset
+ from mvp.utils.data import get_spec_featurizer, get_mol_featurizer, get_test_ms_dataset
+ from mvp.utils.models import get_model
+ import pandas as pd
+
+ # check that every spectrum carries the metadata formSpec needs
+ def check_formspec_requirements(spectra):
+     for spec in spectra:
+         if 'formula' not in spec.metadata or 'adduct' not in spec.metadata:
+             return False
+     return True
+
+ # preprocess spectra: build the TSV dataset (and, for formSpec, per-spectrum subformula JSON files)
+ def preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20, dataset_pth=None, subformula_dir=None):
+
+     if dataset_pth is None:
+         dataset_pth = os.path.join(tempfile.gettempdir(), "mvp_data.tsv")
+     if subformula_dir is None:
+         subformula_dir = os.path.join(tempfile.gettempdir(), "mvp_subformulae")
+     os.makedirs(subformula_dir, exist_ok=True)
+
+     # load mgf file
+     spectra = list(load_from_mgf(mgf_path))
+
+     columns = ['identifier', 'formula', 'adduct', 'precursor_mz', 'precursor_formula', 'mzs', 'intensities', 'fold']
+     data = []
+     try:
+         for spec in spectra:
+             identifier = spec.metadata['title']
+             formula = spec.metadata.get('formula', None)
+             adduct = spec.metadata.get('adduct', None)
+             precursor_mz = spec.metadata['precursor_mz']
+             precursor_formula = formula  # technically incorrect, but we don't use it
+             mzs = spec.peaks.mz
+             intensities = spec.peaks.intensities
+
+             if model_choice == "formSpec":
+                 if formula is None or adduct is None:
+                     return None, None
+                 ms = [(m, i) for m, i in zip(mzs, intensities)]
+
+                 # annotate peaks
+                 x = assign_subforms(formula, np.array(ms), adduct, mass_diff_thresh=mass_diff_thresh)
+                 if x['output_tbl'] is None:
+                     continue
+
+                 # save json file
+                 json_file = os.path.join(subformula_dir, f"{identifier}.json")
+                 with open(json_file, 'w') as f:
+                     json.dump(x['output_tbl'], f)
+
+             mzs = ','.join([str(m) for m in mzs])
+             intensities = ','.join([str(i) for i in intensities])
+             data.append([identifier, formula, adduct, precursor_mz, precursor_formula, mzs, intensities, 'test'])
+
+         df = pd.DataFrame(data, columns=columns)
+         df.to_csv(dataset_pth, sep='\t', index=False)
+
+         return dataset_pth, subformula_dir
+     except Exception as e:
+         # signal failure to the caller; the app surfaces a user-facing error
+         return None, None
+
+ def setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir):
+
+     if model_choice == "binnedSpec":
+         param_file = "mvp/params_binnedSpec.yaml"
+         checkpoint_path = "pretrained_models/msgym_binnedSpec.ckpt"
+     elif model_choice == "formSpec":
+         param_file = "mvp/params_formSpec.yaml"
+         checkpoint_path = "pretrained_models/msgym_formSpec.ckpt"
+
+     # load yaml
+     with open(param_file, 'r') as f:
+         params = yaml.safe_load(f)
+
+     params['dataset_pth'] = dataset_pth
+     params['candidates_pth'] = candidates_pth
+     params['subformula_dir_pth'] = subformula_dir
+     params['experiment_dir'] = tempfile.mkdtemp()
+     params['checkpoint_pth'] = checkpoint_path
+     params['df_test_path'] = os.path.join(params['experiment_dir'], f"results_{model_choice}.pkl")
+
+     return params
+
+
+ def run_inference(params):
+
+     # Load dataset
+     spec_featurizer = get_spec_featurizer(params['spectra_view'], params)
+     mol_featurizer = get_mol_featurizer(params['molecule_view'], params)
+     dataset = get_test_ms_dataset(params['spectra_view'], params['molecule_view'], spec_featurizer, mol_featurizer, params, external_test=True)
+
+     # Init data module
+     collate_fn = partial(ContrastiveDataset.collate_fn, spec_enc=params['spec_enc'], spectra_view=params['spectra_view'], stage=Stage.TEST)
+     data_module = TestDataModule(
+         dataset=dataset,
+         collate_fn=collate_fn,
+         split_pth=params['split_pth'],
+         batch_size=params['batch_size'],
+         num_workers=params['num_workers']
+     )
+
+     model = get_model(params['model'], params)
+     print(model.hparams)
+     model.df_test_path = params['df_test_path']
+     model.external_test = True
+     model.hparams['use_fp'] = False
+     model.hparams["contr_views"] = [['spec_enc', 'mol_enc']]
+     model.hparams['use_cons_spec'] = False
+
+     # Init trainer
+     trainer = Trainer(
+         accelerator=params['accelerator'],
+         devices=params['devices'],
+         default_root_dir=params['experiment_dir']
+     )
+
+     # Prepare data module to test
+     data_module.prepare_data()
+     data_module.setup(stage="test")
+
+     # Test
+     trainer.test(model, datamodule=data_module)
+
+ if __name__ == "__main__":
+
+     # test run
+     mgf_path = "data/app/data.mgf"
+     model_choice = "formSpec"
+     candidates_pth = "data/app/identifier_to_candidates.json"
+     mass_diff_thresh = 20
+     dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=mass_diff_thresh)
+     params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
+     print(params)
+     run_inference(params)