yzhouchen001 committed on
Commit
78ba665
·
1 Parent(s): 6937578
Files changed (3)
  1. README.md +95 -1
  2. app.py +203 -0
  3. utils_app.py +146 -0
README.md CHANGED
@@ -3,7 +3,7 @@ title: MVP
  emoji: 🏆
  colorFrom: blue
  colorTo: pink
- sdk: gradio
+ sdk: streamlit
  sdk_version: 5.49.1
  app_file: app.py
  pinned: false
@@ -11,3 +11,97 @@ short_description: msms annotation tool
  ---

  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ # MultiView Projection (MVP) for Spectra Annotation
+
+ ### Yan Zhou Chen, Soha Hassoun
+ #### Department of Computer Science, Tufts University
+ This repository provides the implementation of MultiView Projection (MVP). MVP can be used to rank a set of molecular candidates given a spectrum.
+
+ ## Table of Contents
+ 1. [Install & setup]
+ 2. [Data prep]
+ 3. [MassSpecGym data download]
+ 4. [Use our pretrained model]
+ 5. [Training from scratch]
+ 6. [References]
+
+ ## Install & setup
+ 1. Clone the repository: git clone <REPO_link>
+ 2. Install the environment, or install only the key packages (a quick import check is sketched after the list below):
+ ```
+ conda env create -f environment.yml
+ ```
+ #### Key packages
+ - python
+ - dgl
+ - pytorch
+ - rdkit
+ - pytorch-geometric
+ - numpy
+ - scikit-learn
+ - scipy
+ - massspecgym
+ - lightning
+
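+ To verify the installation, the imports below should succeed. This is a minimal sketch; the import names (e.g. `torch_geometric`, `lightning`) are assumptions about how the key packages expose themselves:
+ ```
+ # Quick environment check (Python); import names are assumptions for some packages
+ import torch, dgl, rdkit, numpy, scipy, sklearn
+ import torch_geometric, massspecgym, lightning
+ print("torch", torch.__version__, "| dgl", dgl.__version__, "| rdkit", rdkit.__version__)
+ ```
+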
+ ## Data prep
+ We provide sample spectra data and candidates in `data/sample`.
+ For preprocessing:
+ 1. If using formSpec, compute subformula labels
+ 2. Run our preprocessing code to obtain fingerprints and consensus spectra files
+
+ ```
+ # If using formSpec
+ python subformula_assign/assign_subformulae.py --spec-files ../data/sample/data.tsv --output-dir ../data/sample/subformulae_default --max-formulae 60 --labels-file ../data/sample/data.tsv
+ python data_preprocess.py --spec_type formSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --subformula_dir_pth ../data/sample/subformulae_default/ --output_dir ../data/sample/
+
+ # If using binnedSpec
+ python data_preprocess.py --spec_type binnedSpec --dataset_pth ../data/sample/data.tsv --candidates_pth ../data/sample/candidates_mass.json --output_dir ../data/sample/
+ ```
+ We include sample subformula, fingerprint, and consensus spectra data in `../data/sample/`.
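+
+ The spectra table (`data.tsv`) is tab-separated. A minimal sketch for inspecting it is below; the column list is taken from the preprocessing code used by our web app, and your own table may carry additional columns:
+ ```
+ import pandas as pd
+
+ # Inspect the sample spectra table (a sketch, not part of the pipeline)
+ df = pd.read_csv("../data/sample/data.tsv", sep="\t")
+ print(df.columns.tolist())
+ # expected columns: identifier, formula, adduct, precursor_mz, precursor_formula, mzs, intensities, fold
+ ```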
+
+ ## Use our pretrained model
+ You can use our pretrained model (trained on MassSpecGym) to rank molecular candidates by providing the spectra data and a list of candidates.
+
+ After prepping your data, modify `params_binnedSpec.yaml` or `params_formSpec.yaml` with your dataset paths (a sketch of the relevant keys follows the commands below), then run:
+
+ ```
+ # If using formSpec
+ python test.py --param_pth params_formSpec.yaml
+
+ # If using binnedSpec
+ python test.py --param_pth params_binnedSpec.yaml
+ ```
+
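+ The keys below are the dataset-related entries read by our code; a minimal sketch of editing them programmatically (paths are placeholders, and rewriting the YAML this way drops any comments in the file):
+ ```
+ import yaml
+
+ # Point the params file at your own data (key names taken from the config code)
+ with open("params_formSpec.yaml") as f:
+     params = yaml.safe_load(f)
+ params["dataset_pth"] = "../data/sample/data.tsv"
+ params["candidates_pth"] = "../data/sample/candidates_mass.json"
+ params["subformula_dir_pth"] = "../data/sample/subformulae_default/"
+ with open("params_formSpec.yaml", "w") as f:
+     yaml.safe_dump(params, f)
+ ```
+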
+ We provide a notebook showing sample result files in `notebooks/demo.ipynb`.
+
+ ## MassSpecGym data download
+ Our model is trained on the [MassSpecGym dataset](https://github.com/pluskal-lab/MassSpecGym). Follow their instructions to download the spectra and candidate dataset.
+
+ You can preprocess the MassSpecGym dataset as described in the section above, or download the preprocessed files as follows:
+ ```
+ mkdir data/msgym/
+ cd data/msgym
+ wget
+ wget
+ ```
+ ## Training from scratch
+ To train a model from scratch:
+ 1. Prepare data as described in the data prep section
+ 2. Modify the configuration in the params file as necessary
+ 3. Train using the following:
+ ```
+ # If using formSpec
+ python train.py --param_pth params_formSpec.yaml
+
+ # If using binnedSpec
+ python train.py --param_pth params_binnedSpec.yaml
+ ```
+
+ ## References
+
+
+ #### Contact
+
app.py ADDED
@@ -0,0 +1,203 @@
+ import streamlit as st
+ import pandas as pd
+ import json
+ import tempfile
+ import os
+
+ # ==============================
+ # App Configuration
+ # ==============================
+ st.set_page_config(
+     page_title="MVP",
+     page_icon="",
+     layout="centered"
+ )
+
+ # initialize session state
+ if 'example_mgf' not in st.session_state:
+     st.session_state['example_mgf'] = None
+ if 'example_json' not in st.session_state:
+     st.session_state['example_json'] = None
+
+ # ==============================
+ # Introductory Section
+ # ==============================
+ st.title("MVP Playground")
+
+ st.markdown("""
+ This web app lets you test our trained model on your own data.
+
+ ### 📚 References
+ 🔗 **Paper:** [Read the publication here](https://github.com/HassounLab/MVP)
+ 📦 **Source Code:** [GitHub Repository](https://github.com/HassounLab/MVP)
+
+ ---
+
+ ### 🧠 Available Models
+ We have two models trained on the [MassSpecGym](https://github.com/pluskal-lab/MassSpecGym) training dataset:
+ - **binnedSpec** – trained on binned spectra and does not require formula information.
+ - **formSpec** – our main model trained on spectra with subformula annotations. Requires formula and adduct information.
+
+ ---
+
+ ### ⚙️ Instructions
+
+ 1. **Prepare two input files:**
+    - **Spectra file (.mgf)** – your experimental spectra data.
+    - **Candidates file (.json)** – candidate molecules for each spectrum.
+
+ 2. **Select a model** from the dropdown.
+
+ 3. **Click “Run Prediction”** to start processing.
+    ⚠️ **Note:** For fair usage, the web app limits computation to **1,000 pairs**. Each pair consists of one spectrum and one candidate molecule.
+
+ 4. After processing, you’ll receive a downloadable **CSV file** with your results.
+
+ ---
+
+ ### 📁 Example Input Files
+
+ You can download example files to understand the required format:
+ - [Download sample spectra (MGF)](data/app/data.mgf)
+ - [Download sample candidates (JSON)](data/app/identifier_to_candidates.json)
+
+ Here's an example of the spectra file format (.mgf):
+ ```
+ BEGIN IONS
+ TITLE=example_spectrum
+ PEPMASS=100.0
+ CHARGE=1+
+ FORMULA=C10H12O2 # optional, required for formSpec model
+ ADDUCT=[M+H]+ # optional, required for formSpec model
+ 100.0 1000
+ 101.0 1500
+ 102.0 2000
+ END IONS
+ ```
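+
+ And a minimal example of the candidates file format (.json): each key is a spectrum identifier (the MGF `TITLE`) and each value is a list of candidate molecules. The SMILES strings below are only illustrative; see the sample candidates file for the exact encoding we use:
+ ```
+ {
+     "example_spectrum": ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(C)cc1"],
+     "another_spectrum": ["c1ccccc1O"]
+ }
+ ```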
+ ---
+
+ ### 💡 Tip
+ If you want to process **more than 1,000 pairs**,
+ please **clone the repository** and run it locally with GPU support for faster computation.
+ """)
+
+ # ==============================
+ # File Upload Section
+ # ==============================
+ st.subheader("📤 Upload Your Files")
+
+
+ # --- File uploaders ---
+ mgf_file = st.file_uploader("Upload spectra file (.mgf)", type=["mgf"])
+ json_file = st.file_uploader("Upload candidates file (.json)", type=["json"])
+
+ # --- Example files button ---
+ if st.button("Use Example Files"):
+     with open("data/app/data.mgf", "rb") as f:
+         st.session_state["example_mgf"] = f.read()
+     with open("data/app/identifier_to_candidates.json", "rb") as f:
+         st.session_state["example_json"] = f.read()
+     st.success("✅ Example files loaded!")
+
+ # --- Determine which files to use ---
+ if mgf_file is not None:
+     mgf_bytes = mgf_file.read()
+ elif "example_mgf" in st.session_state:
+     mgf_bytes = st.session_state["example_mgf"]
+ else:
+     mgf_bytes = None
+
+ if json_file is not None:
+     json_bytes = json_file.read()
+ elif "example_json" in st.session_state:
+     json_bytes = st.session_state["example_json"]
+ else:
+     json_bytes = None
+
+ # --- Display results ---
+ if mgf_bytes and json_bytes:
+     st.success("Files are ready to use!")
+ else:
+     st.info("Please upload your files or click 'Use Example Files'.")
+
+
+ # ==============================
+ # Model Selection and Run Button
+ # ==============================
+ model_choice = st.selectbox(
+     "Select model to use:",
+     options=["binnedSpec", "formSpec"]
+ )
+
+ run_button = st.button("🚀 Run Prediction")
+
+ # ==============================
+ # Run Prediction
+ # ==============================
+ if run_button:
+     if not mgf_bytes or not json_bytes:
+         st.error("Please upload both a spectra (.mgf) and candidates (.json) file.")
+     else:
+         with st.spinner("Running predictions... please wait ⏳", show_time=True):
+             # Save uploaded files to temporary paths
+             st.write("Saving files to temporary paths...")
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".mgf") as tmp_mgf:
+                 tmp_mgf.write(mgf_bytes)
+                 mgf_path = tmp_mgf.name
+
+             with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp_json:
+                 tmp_json.write(json_bytes)
+                 candidates_pth = tmp_json.name
+
+             # Check number of pairs in candidates file
+             st.write("Checking number of pairs in candidates file...")
+             with open(candidates_pth, 'r') as f:
+                 candidates_data = json.load(f)
+             total_pairs = sum(len(cands) for cands in candidates_data.values())
+             if total_pairs > 1000:
+                 st.error(f"⚠️ Too many pairs ({total_pairs})! Please limit to 1,000 pairs for the web app.")
+                 st.stop()
+
+             # preprocess spectra
+             st.write("Preprocessing spectra...")
+             from utils_app import preprocess_spectra, setup_config, run_inference
+             dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20)
+
+             if dataset_pth is None:
+                 st.error("Error in preprocessing spectra. Please check your input files.")
+                 if model_choice == "formSpec":
+                     st.info("Make sure that for the 'formSpec' model, each spectrum has 'formula' and 'adduct' metadata.")
+                 st.stop()
+
+             # Prepare model config paths
+             st.write("Preparing model config paths...")
+             params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
+
+             try:
+                 st.write("Running inference...")
+                 run_inference(params)
+             except Exception as e:
+                 st.error(f"Error running model inference: {e}")
+                 st.stop()
+
+             # Convert to CSV
+             st.write("Converting to CSV...")
+             df = pd.read_pickle(params['df_test_path'])
+             csv_path = params['df_test_path'].replace(".pkl", ".csv")
+             df.to_csv(csv_path, index=False)
+
+             st.success(f"✅ Done! Model: {model_choice}")
+             st.download_button(
+                 label="📥 Download Results CSV",
+                 data=open(csv_path, "rb").read(),
+                 file_name=os.path.basename(csv_path),
+                 mime="text/csv"
+             )
+
+             st.info("To run larger datasets or enable GPU acceleration, please clone the repo and run locally.")
+
+ # ==============================
+ # Footer
+ # ==============================
+ st.markdown("---")
utils_app.py ADDED
@@ -0,0 +1,146 @@
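+ # Helper utilities for the MVP Streamlit app:
+ #   preprocess_spectra – convert an .mgf file into the TSV (and, for formSpec, per-spectrum subformula JSON files) the model expects
+ #   setup_config       – load a params YAML and point it at the uploaded data and pretrained checkpoint
+ #   run_inference      – build the data module and run the pretrained model with trainer.test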
+ from matchms.importing import load_from_mgf
+ import yaml
+ import numpy as np
+ from mvp.subformula_assign.utils.spectra_utils import assign_subforms
+ import tempfile
+ import json
+ import os
+ from functools import partial
+ from pytorch_lightning import Trainer
+ from massspecgym.models.base import Stage
+ from mvp.data.data_module import TestDataModule
+ from mvp.data.datasets import ContrastiveDataset
+ from mvp.utils.data import get_spec_featurizer, get_mol_featurizer, get_test_ms_dataset
+ from mvp.utils.models import get_model
+ import pandas as pd
+
+ # check that every spectrum carries the metadata formSpec needs
+ def check_formspec_requirements(spectra):
+     for spec in spectra:
+         if 'formula' not in spec.metadata or 'adduct' not in spec.metadata:
+             return False
+     return True
+
+ # preprocess spectra: build the TSV dataset (and, for formSpec, per-spectrum subformula JSON files)
+ def preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=20, dataset_pth=None, subformula_dir=None):
+
+     if dataset_pth is None:
+         dataset_pth = os.path.join(tempfile.gettempdir(), "mvp_data.tsv")
+     if subformula_dir is None:
+         subformula_dir = os.path.join(tempfile.gettempdir(), "mvp_subformulae")
+     os.makedirs(subformula_dir, exist_ok=True)
+
+     # load mgf file
+     spectra = list(load_from_mgf(mgf_path))
+
+     columns = ['identifier', 'formula', 'adduct', 'precursor_mz', 'precursor_formula', 'mzs', 'intensities', 'fold']
+     data = []
+     try:
+         for spec in spectra:
+             identifier = spec.metadata['title']
+             formula = spec.metadata.get('formula', None)
+             adduct = spec.metadata.get('adduct', None)
+             precursor_mz = spec.metadata['precursor_mz']
+             precursor_formula = formula  # technically incorrect, but we don't use it
+             mzs = spec.peaks.mz
+             intensities = spec.peaks.intensities
+
+             if model_choice == "formSpec":
+                 if formula is None or adduct is None:
+                     return None, None
+                 ms = [(m, i) for m, i in zip(mzs, intensities)]
+
+                 # annotate peaks
+                 x = assign_subforms(formula, np.array(ms), adduct, mass_diff_thresh=mass_diff_thresh)
+                 if x['output_tbl'] is None:
+                     continue
+
+                 # save json file
+                 json_file = os.path.join(subformula_dir, f"{identifier}.json")
+                 with open(json_file, 'w') as f:
+                     json.dump(x['output_tbl'], f)
+
+             mzs = ','.join([str(m) for m in mzs])
+             intensities = ','.join([str(i) for i in intensities])
+             data.append([identifier, formula, adduct, precursor_mz, precursor_formula, mzs, intensities, 'test'])
+
+         df = pd.DataFrame(data, columns=columns)
+         df.to_csv(dataset_pth, sep='\t', index=False)
+
+         return dataset_pth, subformula_dir
+     except Exception as e:
+         # signal failure to the caller; the app surfaces a user-facing error
+         return None, None
+
+ def setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir):
+
+     if model_choice == "binnedSpec":
+         param_file = "mvp/params_binnedSpec.yaml"
+         checkpoint_path = "pretrained_models/msgym_binnedSpec.ckpt"
+     elif model_choice == "formSpec":
+         param_file = "mvp/params_formSpec.yaml"
+         checkpoint_path = "pretrained_models/msgym_formSpec.ckpt"
+
+     # load yaml
+     with open(param_file, 'r') as f:
+         params = yaml.safe_load(f)
+
+     params['dataset_pth'] = dataset_pth
+     params['candidates_pth'] = candidates_pth
+     params['subformula_dir_pth'] = subformula_dir
+     params['experiment_dir'] = tempfile.mkdtemp()
+     params['checkpoint_pth'] = checkpoint_path
+     params['df_test_path'] = os.path.join(params['experiment_dir'], f"results_{model_choice}.pkl")
+
+     return params
+
+
+ def run_inference(params):
+
+     # Load dataset
+     spec_featurizer = get_spec_featurizer(params['spectra_view'], params)
+     mol_featurizer = get_mol_featurizer(params['molecule_view'], params)
+     dataset = get_test_ms_dataset(params['spectra_view'], params['molecule_view'], spec_featurizer, mol_featurizer, params, external_test=True)
+
+     # Init data module
+     collate_fn = partial(ContrastiveDataset.collate_fn, spec_enc=params['spec_enc'], spectra_view=params['spectra_view'], stage=Stage.TEST)
+     data_module = TestDataModule(
+         dataset=dataset,
+         collate_fn=collate_fn,
+         split_pth=params['split_pth'],
+         batch_size=params['batch_size'],
+         num_workers=params['num_workers']
+     )
+
+     model = get_model(params['model'], params)
+     print(model.hparams)
+     model.df_test_path = params['df_test_path']
+     model.external_test = True
+     model.hparams['use_fp'] = False
+     model.hparams["contr_views"] = [['spec_enc', 'mol_enc']]
+     model.hparams['use_cons_spec'] = False
+
+     # Init trainer
+     trainer = Trainer(
+         accelerator=params['accelerator'],
+         devices=params['devices'],
+         default_root_dir=params['experiment_dir']
+     )
+
+     # Prepare data module to test
+     data_module.prepare_data()
+     data_module.setup(stage="test")
+
+     # Test
+     trainer.test(model, datamodule=data_module)
+
+ if __name__ == "__main__":
+
+     # test run
+     mgf_path = "data/app/data.mgf"
+     model_choice = "formSpec"
+     candidates_pth = "data/app/identifier_to_candidates.json"
+     mass_diff_thresh = 20
+     dataset_pth, subformula_dir = preprocess_spectra(mgf_path, model_choice, mass_diff_thresh=mass_diff_thresh)
+     params = setup_config(model_choice, dataset_pth, candidates_pth, subformula_dir)
+     print(params)
+     run_inference(params)