Commit
·
78ba665
1
Parent(s):
6937578
st app
- README.md +95 -1
- app.py +203 -0
- utils_app.py +146 -0
README.md
CHANGED
|
@@ -3,7 +3,7 @@ title: MVP
|
|
| 3 |
emoji: 🏆
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: pink
|
| 6 |
-
sdk:
|
| 7 |
sdk_version: 5.49.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
|
@@ -11,3 +11,97 @@ short_description: msms annotation tool
|
|
| 11 |
---
|
| 12 |
|
| 13 |
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
| 3 |
emoji: 🏆
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: pink
|
| 6 |
+
sdk: streamlit
|
| 7 |
sdk_version: 5.49.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
|
|
|
| 11 |
---
|
| 12 |
|
| 13 |
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
| 14 |
+
|
| 15 |
+
# MultiView Projection (MVP) for Spectra Annotation
|
| 16 |
+
|
| 17 |
+
### Yan Zhou Chen, Soha Hassoun
|
| 18 |
+
#### Department of Computer Science, Tufts University
|
| 19 |
+
This repository provides the implementation of MultiView Projection (MVP). MVP can be used to rank a set of molecular candidates given a spectrum.
|
| 20 |
+
|
| 21 |
+
## Table of Contents
|
| 22 |
+
1. [Install & setup]
|
| 23 |
+
2. [Data prep]
|
| 24 |
+
3. [MassSpecGym data download]
|
| 25 |
+
4. [Use our pretrained model]
|
| 26 |
+
5. [Training from scratch]
|
| 27 |
+
6. [References]
|
| 28 |
+
|
| 29 |
+
## Install & setup
|
| 30 |
+
1. Clone the repository: git clone <REPO_link>
|
| 31 |
+
2. Install the environment or only the key packages:
|
| 32 |
+
```
|
| 33 |
+
conda env create -f environment.yml
|
| 34 |
+
```
|
| 35 |
+
#### Key packages
|
| 36 |
+
- python
|
| 37 |
+
- dgl
|
| 38 |
+
- pytorch
|
| 39 |
+
- rdkit
|
| 40 |
+
- pytorch-geometric
|
| 41 |
+
- numpy
|
| 42 |
+
- scikit-learn
|
| 43 |
+
- scipy
|
| 44 |
+
- massspecgym
|
| 45 |
+
- lightning
|
| 46 |
+
|
| 47 |
+
## Data prep
|
| 48 |
+
We provide sample spectra data and candidates in `data/sample`.
|
| 49 |
+
For preprocessing:
|
| 50 |
+
1. If using formSpec, compute subformula labels
|
| 51 |
+
2. Run our preprocessing code to obtain fingerprints and consensus spectra files
|
| 52 |
+
|
| 53 |
+
```
|
| 54 |
+
# If using formSpec
|
| 55 |
+
python subformula_assign/assign_subformulae.py --spec-files ../data/sample/data.tsv --output-dir ../data/sample/subformulae_default --max-formulae 60 --labels-file ../data/sample/data.tsv
|
| 56 |
+
python data_preprocess.py --spec_type formSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --subformula_dir_pth ../data/sample/subformulae_default/ --output_dir ../data/sample/
|
| 57 |
+
|
| 58 |
+
# If using binnedSpec
|
| 59 |
+
python data_preprocess.py --spec_type binnedSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --output_dir ../data/sample/
|
| 60 |
+
|
| 61 |
+
```
|
| 62 |
+
We include sample subformula, fingerprint, and consensus spectra data in `../data/sample/`.
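
To give a feel for what a binned spectrum is, here is a minimal sketch that sums peak intensities into fixed-width m/z bins and scales the result to a maximum of 1. It is not the featurizer used in this repository: the function name `bin_spectrum` and the `bin_width` and `max_mz` defaults are assumptions chosen for illustration only.

```python
def bin_spectrum(mzs, intensities, bin_width=1.0, max_mz=1000.0):
    """Sum peak intensities into fixed-width m/z bins (illustrative only).

    bin_width and max_mz are hypothetical defaults, not values taken
    from the MVP parameter files.
    """
    n_bins = int(max_mz / bin_width)
    binned = [0.0] * n_bins
    for mz, intensity in zip(mzs, intensities):
        idx = int(mz / bin_width)
        # Peaks outside the [0, max_mz) range are ignored.
        if 0 <= idx < n_bins:
            binned[idx] += intensity
    # Scale so the most intense bin equals 1.0 (skipped for empty spectra).
    top = max(binned) if binned else 0.0
    if top > 0:
        binned = [value / top for value in binned]
    return binned


vector = bin_spectrum([100.0, 100.4, 101.0], [1000, 500, 1500],
                      bin_width=1.0, max_mz=200.0)
```

Here the two peaks near m/z 100 fall into the same bin, so their intensities are added before scaling.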
|
| 63 |
+
|
| 64 |
+
## Use our pretrained model
|
| 65 |
+
You can use our pretrained model (on MassSpecGym) to rank molecular candidates by providing the spectra data and a list of candidates.
|
| 66 |
+
|
| 67 |
+
After preparing your data, update `params_binnedSpec.yaml` or `params_formSpec.yaml` with your dataset paths, then run:
|
| 68 |
+
|
| 69 |
+
```
|
| 70 |
+
# If using formSpec
|
| 71 |
+
python test.py --param_pth params_formSpec.yaml
|
| 72 |
+
|
| 73 |
+
# If using binnedSpec
|
| 74 |
+
python test.py --param_pth params_binnedSpec.yaml
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
We provide a notebook showing sample result files in `notebooks/demo.ipynb`
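
Ranking candidates for a spectrum can be pictured as scoring each candidate against the spectrum and sorting by score. The sketch below assumes that the spectrum and the candidate molecules have already been mapped to embedding vectors and that cosine similarity is the score; both are assumptions for illustration, and `cosine` and `rank_candidates` are hypothetical helpers rather than functions from this repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(spec_embedding, candidate_embeddings):
    """Return (candidate, score) pairs sorted from best to worst.

    candidate_embeddings maps a candidate name to its embedding vector.
    """
    scored = [
        (name, cosine(spec_embedding, emb))
        for name, emb in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates(
    [1.0, 0.0],
    {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]},
)
```

In this toy example the candidate whose vector points in the same direction as the spectrum vector is ranked first.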
|
| 78 |
+
|
| 79 |
+
## MassSpecGym data download
|
| 80 |
+
Our model is trained on the [MassSpecGym dataset](https://github.com/pluskal-lab/MassSpecGym). Follow their instructions to download the spectra and candidate datasets.
|
| 81 |
+
|
| 82 |
+
You can preprocess the MassSpecGym dataset as described in the Data prep section, or download the preprocessed files as follows:
|
| 83 |
+
```
|
| 84 |
+
mkdir data/msgym/
|
| 85 |
+
cd data/msgym
|
| 86 |
+
wget
|
| 87 |
+
wget
|
| 88 |
+
```
|
| 89 |
+
## Training from scratch
|
| 90 |
+
To train a model from scratch:
|
| 91 |
+
1. Prepare data as described in the data prep section
|
| 92 |
+
2. Modify the configuration in the params file as necessary
|
| 93 |
+
3. Train using the following commands:
|
| 94 |
+
```
|
| 95 |
+
# If using formSpec
|
| 96 |
+
python train.py --param_pth params_formSpec.yaml
|
| 97 |
+
|
| 98 |
+
# If using binnedSpec
|
| 99 |
+
python train.py --param_pth params_binnedSpec.yaml
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
## References
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
#### Contact
|
| 106 | |
| 107 |
+
=======
|
app.py
ADDED
|
@@ -0,0 +1,203 @@
|
| 1 |
+
import streamlit as st
|
| 2 |
+
import pandas as pd
|
| 3 |
+
import json
|
| 4 |
+
import tempfile
|
| 5 |
+
import os
|
| 6 |
+
|
| 7 |
+
# ==============================
|
| 8 |
+
# App Configuration
|
| 9 |
+
# ==============================
|
| 10 |
+
st.set_page_config(
|
| 11 |
+
page_title="MVP",
|
| 12 |
+
page_icon="",
|
| 13 |
+
layout="centered"
|
| 14 |
+
)
|
| 15 |
+
|
| 16 |
+
# initialize session state
|
| 17 |
+
if 'example_mgf' not in st.session_state:
|
| 18 |
+
st.session_state['example_mgf'] = None
|
| 19 |
+
if 'example_json' not in st.session_state:
|
| 20 |
+
st.session_state['example_json'] = None
|
| 21 |
+
|
| 22 |
+
# ==============================
|
| 23 |
+
# Introductory Section
|
| 24 |
+
# ==============================
|
| 25 |
+
st.title("MVP Playground")
|
| 26 |
+
|
| 27 |
+
st.markdown("""
|
| 28 |
+
This web app lets you test our trained model on your own data.
|
| 29 |
+
|
| 30 |
+
### 📚 References
|
| 31 |
+
🔗 **Paper:** [Read the publication here](https://github.com/HassounLab/MVP)
|
| 32 |
+
📦 **Source Code:** [GitHub Repository](https://github.com/HassounLab/MVP)
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
### 🧠 Available Models
|
| 37 |
+
We have two models trained on the [MassSpecGym](https://github.com/pluskal-lab/MassSpecGym) training dataset:
|
| 38 |
+
- **binnedSpec** – trained on binned spectra and does not require formula information.
|
| 39 |
+
- **formSpec** – our main model trained on spectra with subformula annotations. Requires formula and adduct information.
|
| 40 |
+
|
| 41 |
+
---
|
| 42 |
+
|
| 43 |
+
### ⚙️ Instructions
|
| 44 |
+
|
| 45 |
+
1. **Prepare two input files:**
|
| 46 |
+
- **Spectra file (.mgf)** – your experimental spectra data.
|
| 47 |
+
- **Candidates file (.json)** – candidate molecules for each spectrum.
|
| 48 |
+
|
| 49 |
+
2. **Select a model** from the dropdown.
|
| 50 |
+
|
| 51 |
+
3. **Click “Run Prediction”** to start processing.
|
| 52 |
+
⚠️ **Note:** For fair usage, the web app limits computation to **1,000 pairs**. Each pair consists of one spectrum and one candidate molecule.
|
| 53 |
+
|
| 54 |
+
4. After processing, you’ll receive a downloadable **CSV file** with your results.
|
| 55 |
+
|
| 56 |
+
---
|
| 57 |
+
|
| 58 |
+
### 📁 Example Input Files
|
| 59 |
+
|
| 60 |
+
You can download example files to understand the required format:
|
| 61 |
+
- [Download sample spectra (MGF)](data/app/data.mgf)
|
| 62 |
+
- [Download sample candidates (JSON)](data/app/identifier_to_candidates.json)
|
| 63 |
+
|
| 64 |
+
Here's an example of the spectra file format (.mgf):
|
| 65 |
+
```
|
| 66 |
+
BEGIN IONS
|
| 67 |
+
TITLE=example_spectrum
|
| 68 |
+
PEPMASS=100.0
|
| 69 |
+
CHARGE=1+
|
| 70 |
+
FORMULA=C10H12O2 # optional, required for formSpec model
|
| 71 |
+
ADDUCT=[M+H]+ # optional, required for formSpec model
|
| 72 |
+
100.0 1000
|
| 73 |
+
101.0 1500
|
| 74 |
+
102.0 2000
|
| 75 |
+
END IONS
|
| 76 |
+
```
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
### 💡 Tip
|
| 80 |
+
If you want to process **more than 1,000 pairs**,
|
| 81 |
+
please **clone the repository** and run it locally with GPU support for faster computation.
|
| 82 |
+
""")
|
| 83 |
+
|
| 84 |
+
# ==============================
|
| 85 |
+
# File Upload Section
|
| 86 |
+
# ==============================
|
| 87 |
+
st.subheader("📤 Upload Your Files")
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
# --- File uploaders ---
|
| 91 |
+
mgf_file = st.file_uploader("Upload spectra file (.mgf)", type=["mgf"])
|
| 92 |
+
json_file = st.file_uploader("Upload candidates file (.json)", type=["json"])
|
| 93 |
+
|
| 94 |
+
# --- Example files button ---
|
| 95 |
+
if st.button("Use Example Files"):
|
| 96 |
+
with open("data/app/data.mgf", "rb") as f:
|
| 97 |
+
st.session_state["example_mgf"] = f.read()
|
| 98 |
+
with open("data/app/identifier_to_candidates.json", "rb") as f:
|
| 99 |
+
st.session_state["example_json"] = f.read()
|
| 100 |
+
st.success("✅ Example files loaded!")
|
| 101 |
+
|
| 102 |
+
# --- Determine which files to use ---
|
| 103 |
+
if mgf_file is not None:
|
| 104 |
+
mgf_bytes = mgf_file.read()
|
| 105 |
+
elif "example_mgf" in st.session_state:
|
| 106 |
+
mgf_bytes = st.session_state["example_mgf"]
|
| 107 |
+
else:
|
| 108 |
+
mgf_bytes = None
|
| 109 |
+
|
| 110 |
+
if json_file is not None:
|
| 111 |
+
json_bytes = json_file.read()
|
| 112 |
+
elif "example_json" in st.session_state:
|
| 113 |
+
json_bytes = st.session_state["example_json"]
|
| 114 |
+
else:
|
| 115 |
+
json_bytes = None
|
| 116 |
+
|
| 117 |
+
# --- Display results ---
|
| 118 |
+
if mgf_bytes and json_bytes:
|
| 119 |
+
st.success("Files are ready to use!")
|
| 120 |
+
else:
|
| 121 |
+
st.info("Please upload your files or 'Use Example Files'.")
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
# ==============================
|
| 125 |
+
# Model Selection and Run Button
|
| 126 |
+
# ==============================
|
| 127 |
+
model_choice = st.selectbox(
|
| 128 |
+
"Select model to use:",
|
| 129 |
+
options=["binnedSpec", "formSpec"]
|
| 130 |
+
)
|
| 131 |
+
|
| 132 |
+
run_button = st.button("🚀 Run Prediction")
|
| 133 |
+
|
| 134 |
+
# ==============================
|
| 135 |
+
# Run Prediction
|
| 136 |
+
# ==============================
|
| 137 |
+
if run_button:
|
| 138 |
+
if not mgf_bytes or not json_bytes:
|
| 139 |
+
st.error("Please upload both a spectra (.mgf) and candidates (.json) file.")
|
| 140 |
+
else:
|
| 141 |
+
with st.spinner("Running predictions... please wait ⏳", show_time=True):
|
| 142 |
+
# Save uploaded files to temporary paths
|
| 143 |
+
st.write("Saving files to temporary paths...")
|
| 144 |
+
with tempfile.NamedTemporaryFile(delete=False, suffix=".mgf") as tmp_mgf:
|
| 145 |
+
tmp_mgf.write(mgf_bytes)
|
| 146 |
+
mgf_path = tmp_mgf.name
|
| 147 |
+
|
| 148 |
+
with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp_json:
|
| 149 |
+
tmp_json.write(json_bytes)
|
| 150 |
+
candidates_pth = tmp_json.name
|
| 151 |
+
|
| 152 |
+
# Check number of pairs in candidates file
|
| 153 |
+
st.write("Checking number of pairs in candidates file...")
|
| 154 |
+
with open(candidates_pth, 'r') as f:
|
| 155 |
+
candidates_data = json.load(f)
|
| 156 |
+
total_pairs = sum(len(cands) for cands in candidates_data.values())
|
| 157 |
+
if total_pairs > 1000:
|
| 158 |
+
st.error(f"⚠️ Too many pairs ({total_pairs})! Please limit to 1,000 pairs for the web app.")
|
| 159 |
+
st.stop()
|
| 160 |
+
|
| 161 |
+
# preprocess spectra
|
| 162 |
+
st.write("Preprocessing spectra...")
|
| 163 |
+
from utils_app import preprocess_spectra, setup_config, run_inference
|
| 164 |
+
dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20)
|
| 165 |
+
|
| 166 |
+
if dataset_pth is None:
|
| 167 |
+
st.error("Error in preprocessing spectra. Please check your input files.")
|
| 168 |
+
if model_choice == "formSpec":
|
| 169 |
+
st.info("Make sure that for 'formSpec' model, each spectrum has 'formula' and 'adduct' metadata.")
|
| 170 |
+
st.stop()
|
| 171 |
+
|
| 172 |
+
# Prepare model config paths
|
| 173 |
+
st.write("Preparing model config paths...")
|
| 174 |
+
params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
|
| 175 |
+
|
| 176 |
+
try:
|
| 177 |
+
st.write("Running inference...")
|
| 178 |
+
run_inference(params)
|
| 179 |
+
except Exception as e:
|
| 180 |
+
st.error(f"Error running model inference: {e}")
|
| 181 |
+
st.stop()
|
| 182 |
+
|
| 183 |
+
# Convert to CSV
|
| 184 |
+
st.write("Converting to CSV...")
|
| 185 |
+
df = pd.read_pickle(params['df_test_path'])
|
| 186 |
+
csv_path = params['df_test_path'].replace(".pkl", ".csv")
|
| 187 |
+
df.to_csv(csv_path, index=False)
|
| 188 |
+
|
| 189 |
+
st.success(f"✅ Done! Model: {model_choice}")
|
| 190 |
+
st.download_button(
|
| 191 |
+
label="📥 Download Results CSV",
|
| 192 |
+
data=df.to_csv(index=False).encode("utf-8"),
|
| 193 |
+
file_name=os.path.basename(csv_path),
|
| 194 |
+
mime="text/csv"
|
| 195 |
+
)
|
| 196 |
+
|
| 197 |
+
st.info("To run larger datasets or enable GPU acceleration, please clone the repo and run locally.")
|
| 198 |
+
|
| 199 |
+
# ==============================
|
| 200 |
+
# Footer
|
| 201 |
+
# ==============================
|
| 202 |
+
st.markdown("---")
|
| 203 |
+
|
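
The pair-limit check in `app.py` can be tried outside Streamlit with a small standalone sketch. It follows the same counting rule as the app (the sum of the candidate-list lengths over all identifiers) and the stated limit of 1,000 pairs. The helper names `count_pairs` and `check_pair_limit` are hypothetical, and the use of SMILES strings as candidate entries is an assumption about the JSON contents.

```python
import json

MAX_PAIRS = 1000  # limit stated in the web app instructions


def count_pairs(candidates):
    """Count (spectrum, candidate) pairs in an identifier -> list mapping."""
    return sum(len(cands) for cands in candidates.values())


def check_pair_limit(json_text, max_pairs=MAX_PAIRS):
    """Return (total_pairs, within_limit) for a candidates JSON string."""
    candidates = json.loads(json_text)
    total = count_pairs(candidates)
    return total, total <= max_pairs


# Candidate entries are assumed to be SMILES strings for illustration.
example = '{"spec_1": ["CCO", "CC"], "spec_2": ["C"]}'
total, ok = check_pair_limit(example)
```

Running the check before uploading avoids a rejected run when the file exceeds the limit.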
utils_app.py
ADDED
|
@@ -0,0 +1,146 @@
|
| 1 |
+
from matchms.importing import load_from_mgf
|
| 2 |
+
import yaml
|
| 3 |
+
import numpy as np
|
| 4 |
+
from mvp.subformula_assign.utils.spectra_utils import assign_subforms
|
| 5 |
+
import tempfile
|
| 6 |
+
import json
|
| 7 |
+
import os
|
| 8 |
+
from functools import partial
|
| 9 |
+
from pytorch_lightning import Trainer
|
| 10 |
+
from massspecgym.models.base import Stage
|
| 11 |
+
from mvp.data.data_module import TestDataModule
|
| 12 |
+
from mvp.data.datasets import ContrastiveDataset
|
| 13 |
+
from mvp.utils.data import get_spec_featurizer, get_mol_featurizer, get_test_ms_dataset
|
| 14 |
+
from mvp.utils.models import get_model
|
| 15 |
+
import pandas as pd
|
| 16 |
+
|
| 17 |
+
# check formspec requirements
|
| 18 |
+
def check_formspec_requirements(spectra):
|
| 19 |
+
for spec in spectra:
|
| 20 |
+
if 'formula' not in spec.metadata or 'adduct' not in spec.metadata:
|
| 21 |
+
return False
|
| 22 |
+
return True
|
| 23 |
+
|
| 24 |
+
# preprocess spectra
|
| 25 |
+
def preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20, dataset_pth=None, subformula_dir=None):
|
| 26 |
+
|
| 27 |
+
if dataset_pth is None:
|
| 28 |
+
dataset_pth = os.path.join(tempfile.gettempdir(), f"mvp_data.tsv")
|
| 29 |
+
if subformula_dir is None:
|
| 30 |
+
subformula_dir = os.path.join(tempfile.gettempdir(), f"mvp_subformulae")
|
| 31 |
+
os.makedirs(subformula_dir, exist_ok=True)
|
| 32 |
+
|
| 33 |
+
# load mgf file
|
| 34 |
+
spectra = list(load_from_mgf(mgf_path))
|
| 35 |
+
|
| 36 |
+
columns = ['identifier', 'formula', 'adduct', 'precursor_mz', 'precursor_formula', 'mzs', 'intensities', 'fold']
|
| 37 |
+
data = []
|
| 38 |
+
try:
|
| 39 |
+
for spec in spectra:
|
| 40 |
+
identifier = spec.metadata['title']
|
| 41 |
+
formula = spec.metadata.get('formula', None)
|
| 42 |
+
adduct = spec.metadata.get('adduct', None)
|
| 43 |
+
precursor_mz = spec.metadata['precursor_mz']
|
| 44 |
+
precursor_formula = spec.metadata.get('formula', None) # technically incorrect, but we don't use it; .get avoids a KeyError when binnedSpec input has no formula
|
| 45 |
+
mzs = spec.peaks.mz
|
| 46 |
+
intensities = spec.peaks.intensities
|
| 47 |
+
|
| 48 |
+
if model_choice == "formSpec":
|
| 49 |
+
if formula is None or adduct is None:
|
| 50 |
+
return None, None
|
| 51 |
+
ms = [(m, i) for m, i in zip(mzs, intensities)]
|
| 52 |
+
|
| 53 |
+
# annotate peaks
|
| 54 |
+
x = assign_subforms(formula, np.array(ms), adduct, mass_diff_thresh=mass_diff_thresh)
|
| 55 |
+
if x['output_tbl'] is None:
|
| 56 |
+
continue
|
| 57 |
+
|
| 58 |
+
# save json file
|
| 59 |
+
json_file = os.path.join(subformula_dir, f"{identifier}.json")
|
| 60 |
+
with open(json_file, 'w') as f:
|
| 61 |
+
json.dump(x['output_tbl'], f)
|
| 62 |
+
|
| 63 |
+
mzs = ','.join([str(m) for m in mzs])
|
| 64 |
+
intensities = ','.join([str(i) for i in intensities])
|
| 65 |
+
data.append([identifier, formula, adduct, precursor_mz, precursor_formula, mzs, intensities, 'test'])
|
| 66 |
+
|
| 67 |
+
df = pd.DataFrame(data, columns=columns)
|
| 68 |
+
df.to_csv(dataset_pth, sep='\t', index=False)
|
| 69 |
+
|
| 70 |
+
return dataset_pth, subformula_dir
|
| 71 |
+
except Exception as e:
|
| 72 |
+
return None, None
|
| 73 |
+
|
| 74 |
+
def setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir):
|
| 75 |
+
|
| 76 |
+
if model_choice == "binnedSpec":
|
| 77 |
+
param_file = f"mvp/params_binnedSpec.yaml"
|
| 78 |
+
checkpoint_path = f"pretrained_models/msgym_binnedSpec.ckpt"
|
| 79 |
+
elif model_choice == "formSpec":
|
| 80 |
+
param_file = f"mvp/params_formSpec.yaml"
|
| 81 |
+
checkpoint_path = f"pretrained_models/msgym_formSpec.ckpt"
|
| 82 |
+
|
| 83 |
+
# load yaml
|
| 84 |
+
with open(param_file, 'r') as f:
|
| 85 |
+
params = yaml.safe_load(f)
|
| 86 |
+
|
| 87 |
+
params['dataset_pth'] = dataset_pth
|
| 88 |
+
params['candidates_pth'] = candidates_pth
|
| 89 |
+
params['subformula_dir_pth'] = subformula_dir
|
| 90 |
+
params['experiment_dir'] = tempfile.mkdtemp()
|
| 91 |
+
params['checkpoint_pth'] = checkpoint_path
|
| 92 |
+
params['df_test_path'] = os.path.join(params['experiment_dir'], f"results_{model_choice}.pkl")
|
| 93 |
+
|
| 94 |
+
return params
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
def run_inference(params):
|
| 98 |
+
|
| 99 |
+
# Load dataset
|
| 100 |
+
spec_featurizer = get_spec_featurizer(params['spectra_view'], params)
|
| 101 |
+
mol_featurizer = get_mol_featurizer(params['molecule_view'], params)
|
| 102 |
+
dataset = get_test_ms_dataset(params['spectra_view'], params['molecule_view'], spec_featurizer, mol_featurizer, params, external_test=True)
|
| 103 |
+
|
| 104 |
+
# Init data module
|
| 105 |
+
collate_fn = partial(ContrastiveDataset.collate_fn, spec_enc=params['spec_enc'], spectra_view=params['spectra_view'], stage=Stage.TEST)
|
| 106 |
+
data_module = TestDataModule(
|
| 107 |
+
dataset=dataset,
|
| 108 |
+
collate_fn=collate_fn,
|
| 109 |
+
split_pth=params['split_pth'],
|
| 110 |
+
batch_size=params['batch_size'],
|
| 111 |
+
num_workers=params['num_workers']
|
| 112 |
+
)
|
| 113 |
+
|
| 114 |
+
model = get_model(params['model'], params)
|
| 115 |
+
print(model.hparams)
|
| 116 |
+
model.df_test_path = params['df_test_path']
|
| 117 |
+
model.external_test = True
|
| 118 |
+
model.hparams['use_fp'] = False
|
| 119 |
+
model.hparams["contr_views"] = [['spec_enc', 'mol_enc']]
|
| 120 |
+
model.hparams['use_cons_spec'] = False
|
| 121 |
+
|
| 122 |
+
# Init trainer
|
| 123 |
+
trainer = Trainer(
|
| 124 |
+
accelerator=params['accelerator'],
|
| 125 |
+
devices=params['devices'],
|
| 126 |
+
default_root_dir=params['experiment_dir']
|
| 127 |
+
)
|
| 128 |
+
|
| 129 |
+
# Prepare data module to test
|
| 130 |
+
data_module.prepare_data()
|
| 131 |
+
data_module.setup(stage="test")
|
| 132 |
+
|
| 133 |
+
# Test
|
| 134 |
+
trainer.test(model, datamodule=data_module)
|
| 135 |
+
|
| 136 |
+
if __name__ == "__main__":
|
| 137 |
+
|
| 138 |
+
# test run
|
| 139 |
+
mgf_path = "data/app/data.mgf"
|
| 140 |
+
model_choice = "formSpec"
|
| 141 |
+
candidates_pth = "data/app/identifier_to_candidates.json"
|
| 142 |
+
mass_diff_thresh = 20
|
| 143 |
+
dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=mass_diff_thresh)
|
| 144 |
+
params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
|
| 145 |
+
print(params)
|
| 146 |
+
run_inference(params)
|
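
To check an input file before running the app, the sketch below parses the MGF layout shown in the app instructions and tests for the `formula` and `adduct` fields that the formSpec model needs. It is a simplified, dependency-free stand-in: `utils_app.py` itself reads spectra with `matchms.importing.load_from_mgf`, and the names `parse_mgf` and `has_formspec_metadata` are hypothetical.

```python
def parse_mgf(text):
    """Parse a minimal MGF string into a list of spectra (illustrative only)."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"metadata": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Metadata line such as TITLE=example_spectrum
                key, value = line.split("=", 1)
                current["metadata"][key.strip().lower()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                parts = line.split()
                if len(parts) >= 2:
                    current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


def has_formspec_metadata(spectrum):
    """True if the spectrum carries both fields required by formSpec."""
    meta = spectrum["metadata"]
    return "formula" in meta and "adduct" in meta


example = """BEGIN IONS
TITLE=example_spectrum
PEPMASS=100.0
CHARGE=1+
FORMULA=C10H12O2
ADDUCT=[M+H]+
100.0 1000
101.0 1500
102.0 2000
END IONS
"""
spectra = parse_mgf(example)
```

A spectrum without `FORMULA` or `ADDUCT` would fail `has_formspec_metadata`, which mirrors the condition under which the app asks for the missing metadata.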