Commit 2d4adf8 (verified) · bilalsm · 1 parent: 0420e76
Upload README.md with huggingface_hub

Files changed (1): README.md added (+192 lines)

---
license: cc-by-nc-nd-4.0
tags:
- mass-spectrometry
- molecular-formula
- dissolved-organic-matter
- knn
- scikit-learn
library_name: sklearn
pipeline_tag: feature-extraction
---

# DOM Formula Assignment using K-Nearest Neighbors

![Model Type](https://img.shields.io/badge/Model-KNN-blue)
![Data](https://img.shields.io/badge/Data-FT--ICR_MS-green)
![License](https://img.shields.io/badge/License-CC_BY_NC_ND_4-yellow)
[![GitHub](https://img.shields.io/badge/GitHub-pcdslab/dom--formula--assignment--using--ml-blue?logo=github)](https://github.com/pcdslab/dom-formula-assignment-using-ml)

**A Machine Learning Approach to Enhanced Molecular Formula Assignment in Fulvic Acid DOM Mass Spectra**

> **Paper**: Under review

---

## Abstract
Dissolved organic matter (DOM) is a critical component of aquatic ecosystems, with the fulvic acid fraction (FA-DOM) exhibiting high mobility and ready bioavailability to microbial communities. Understanding its molecular composition is vital, but the heterogeneity of the material, with a vast number of diverse compounds, makes this task challenging. Existing methods often struggle with incomplete formula assignment or reduced coverage, highlighting the need for a better approach. In this study, we developed a machine learning approach using the k-nearest neighbors (KNN) algorithm to predict molecular formulas from ultra-high-resolution mass spectrometry data. The model was trained on chemical formulas assigned to multiple DOM samples acquired on 7 Tesla (7T) and 21 Tesla (21T) Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) systems, and tested on an independent 9.4 T FT-ICR MS fulvic acid dataset. A synthetic dataset of plausible elemental combinations (C, H, O, N, S) was also generated to enhance generalization. Our approach achieved a 99.9% assignment rate on the labeled test set and assigned 13,605 formulas to unlabeled peaks, compared with 5,914 assigned by the existing approach, an improvement of up to 2.3X in formula assignment coverage.

![Architecture](Architecture.png)

---

## Model Variants

### Single Models (8 variants)
Trained on individual datasets (7T or 21T FT-ICR MS data):

| Data Source | K | Metric | Variant Name |
|-------------|---|--------|--------------|
| 7T | 1 | Euclidean | `knn_7T_k1_euclidean` |
| 7T | 1 | Manhattan | `knn_7T_k1_manhattan` |
| 7T | 3 | Euclidean | `knn_7T_k3_euclidean` |
| 7T | 3 | Manhattan | `knn_7T_k3_manhattan` |
| 21T | 1 | Euclidean | `knn_21T_k1_euclidean` |
| 21T | 1 | Manhattan | `knn_21T_k1_manhattan` |
| 21T | 3 | Euclidean | `knn_21T_k3_euclidean` |
| 21T | 3 | Manhattan | `knn_21T_k3_manhattan` |

### Ensemble Models (8 variants)
Each ensemble combines multiple sub-models trained on different data versions; a hedged sketch of one possible combination rule follows the table:

| Data Source | K | Metric | Variant Name | Sub-models |
|-------------|---|--------|--------------|------------|
| **7T-21T** | 1 | Euclidean | `knn_7T21T_k1_euclidean_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 1 | Manhattan | `knn_7T21T_k1_manhattan_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 3 | Euclidean | `knn_7T21T_k3_euclidean_ensemble` | 2 (ver2+ver3) |
| **7T-21T** | 3 | Manhattan | `knn_7T21T_k3_manhattan_ensemble` | 2 (ver2+ver3) |
| **Synthetic** | 1 | Euclidean | `knn_Synthetic_k1_euclidean_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 1 | Manhattan | `knn_Synthetic_k1_manhattan_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 3 | Euclidean | `knn_Synthetic_k3_euclidean_ensemble` | 3 (ver2+ver3+synth) |
| **Synthetic** | 3 | Manhattan | `knn_Synthetic_k3_manhattan_ensemble` | 3 (ver2+ver3+synth) |
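
How the released ensembles actually merge their sub-models is defined by this repository's custom modeling code (loaded with `trust_remote_code=True`); the sketch below is only a minimal illustration, assuming each sub-model is a fitted scikit-learn `KNeighborsClassifier` over mass-to-formula pairs and that the ensemble keeps, for each peak, the prediction whose nearest training mass is closest. The `ensemble_predict` helper and the toy sub-models are hypothetical and not part of the released API; other merge rules (e.g., majority voting) are equally plausible, so consult the GitHub repository for the exact implementation.

```python
# Illustrative sketch only: not the repository's actual ensemble code.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def ensemble_predict(sub_models, masses):
    """Keep, per peak, the sub-model prediction whose nearest training mass is closest."""
    masses = np.asarray(masses, dtype=float).reshape(-1, 1)
    preds, dists = [], []
    for knn in sub_models:
        dist, _ = knn.kneighbors(masses, n_neighbors=1)  # distance to nearest training mass
        dists.append(dist.ravel())
        preds.append(knn.predict(masses))
    dists = np.vstack(dists)           # shape: (n_sub_models, n_peaks)
    preds = np.vstack(preds)
    winner = np.argmin(dists, axis=0)  # closest sub-model for each peak
    return preds[winner, np.arange(masses.shape[0])]

# Toy sub-models standing in for the ver2/ver3 training sets
m1 = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(
    [[245.1234], [387.2156]], ["C12H15O6", "C20H31O8"])
m2 = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(
    [[245.1300], [512.3478]], ["C12H15O6", "C28H48O9"])
print(ensemble_predict([m1, m2], [245.1234, 512.3478]))  # ['C12H15O6' 'C28H48O9']
```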

---

## Performance

Results on the combined test sets (Suwannee River Fulvic Acid + Pahokee River Fulvic Acid + others); how the assignment rate is derived from the other columns is spelled out in the snippet after the table:

| Model | True Predictions | New Assignments | False Predictions | **Assignment Rate** |
|-------|-----------------|-----------------|-------------------|---------------------|
| **Synthetic (K=1, Euclidean)** | 2,623 | 1,423 | 1 | **99.975%** |
| **Synthetic (K=1, Manhattan)** | 2,623 | 1,423 | 1 | **99.975%** |
| **Synthetic (K=3, Euclidean)** | 2,631 | 1,415 | 1 | **99.975%** |
| **Synthetic (K=3, Manhattan)** | 2,631 | 1,415 | 1 | **99.975%** |
| **7T-21T (K=1, Euclidean)** | 3,851 | 8 | 188 | **95.355%** |
| **7T-21T (K=1, Manhattan)** | 3,851 | 8 | 188 | **95.355%** |
| **7T-21T (K=3, Euclidean)** | 3,846 | 10 | 191 | **95.280%** |
| **7T-21T (K=3, Manhattan)** | 3,846 | 10 | 191 | **95.280%** |
| 21T (K=1, Euclidean) | 3,835 | 10 | 202 | 95.009% |
| 21T (K=1, Manhattan) | 3,835 | 10 | 202 | 95.009% |
| 21T (K=3, Euclidean) | 3,831 | 11 | 205 | 94.935% |
| 21T (K=3, Manhattan) | 3,831 | 11 | 205 | 94.935% |
| 7T (K=1, Euclidean) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=1, Manhattan) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=3, Euclidean) | 3,201 | 6 | 840 | 79.244% |
| 7T (K=3, Manhattan) | 3,201 | 6 | 840 | 79.244% |
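
The assignment rates above follow directly from the other columns: every row covers the same 4,047 test peaks, and the rate equals (true predictions + new assignments) / 4,047. A quick check in plain Python (no project code required), using the 7T-21T (K=1) row:

```python
# Reproduce the reported 95.355% assignment rate for the 7T-21T (K=1) row.
true_preds, new_assignments, false_preds = 3851, 8, 188
total_peaks = true_preds + new_assignments + false_preds  # 4,047 peaks in the combined test set
assignment_rate = (true_preds + new_assignments) / total_peaks
print(f"{assignment_rate:.3%}")  # 95.355%
```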

**Key Findings**:
- **Synthetic models** achieve the highest assignment rate (99.975%) and make many new predictions (1,423 novel formulas)
- **7T-21T ensemble models** provide the best performance for real DOM samples (95.4%, with only 8 new assignments)
- **Recommended for most users**: the 7T-21T ensemble (K=1) offers the best balance of accuracy and confidence

---

## Quick Start

### Installation

```bash
pip install transformers huggingface_hub joblib scikit-learn pandas
```

(`pandas` is included for the batch-prediction example below.)

### Load Default Model

```python
from transformers import AutoModel
import numpy as np

# Load best model (7T-21T, K=1, Euclidean)
model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-knn",
    trust_remote_code=True
)

# Prepare mass data
masses = np.array([[245.1234], [387.2156], [512.3478]])

# Get formula predictions
predictions = model(masses)
print(predictions)
# Output: ['C12H15O6' 'C20H31O8' 'C28H48O9']
```

### Load Specific Variant

```python
# Load 21T model with K=1 and Euclidean distance
model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-knn",
    data_source="21T",
    k_neighbors=1,
    metric="euclidean",
    trust_remote_code=True
)

# Load 7T-21T ensemble (automatically loads 2 sub-models)
model = AutoModel.from_pretrained(
    "SaeedLab/dom-formula-assignment-using-knn",
    data_source="7T-21T",
    k_neighbors=1,
    metric="euclidean",
    trust_remote_code=True
)
```

### Batch Prediction

```python
import pandas as pd

# Load your peak list
peaks = pd.read_csv("my_peaks.csv")
masses = peaks['m/z'].values.reshape(-1, 1)

# Predict formulas
formulas = model(masses)

# Add to dataframe
peaks['formula'] = formulas
peaks.to_csv("annotated_peaks.csv", index=False)
```
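
A common follow-up after batch assignment is to check each peak's measured m/z against the exact mass of the predicted formula. The sketch below is generic post-processing, not part of this model's API: it assumes the predicted strings are plain CHONS formulas (e.g., `C12H15O6`), reuses the `annotated_peaks.csv` / `m/z` / `formula` names from the example above, and leaves any ionization-mode correction (for instance, accounting for [M-H]- ions) to the user.

```python
import re
import pandas as pd

# Monoisotopic masses of the elements the model covers (C, H, O, N, S).
EXACT_MASS = {"C": 12.0, "H": 1.00782503, "O": 15.9949146, "N": 14.0030740, "S": 31.9720707}

def formula_mass(formula: str) -> float:
    """Exact (monoisotopic) mass of a formula string such as 'C12H15O6'."""
    return sum(EXACT_MASS[element] * int(count or 1)
               for element, count in re.findall(r"([CHONS])(\d*)", formula))

# Compare measured m/z with the calculated mass of each assigned formula.
peaks = pd.read_csv("annotated_peaks.csv")
calculated = peaks["formula"].map(formula_mass)
peaks["mass_error_ppm"] = (peaks["m/z"] - calculated) / calculated * 1e6
print(peaks[["m/z", "formula", "mass_error_ppm"]].head())
```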

---

## Model Selection Guide

The table summarizes which variant to load for common scenarios; a sketch translating these picks into `from_pretrained` arguments follows the table.

| Use Case | Recommended Model | Why? |
|----------|-------------------|------|
| **Real DOM samples (best overall)** | 7T-21T ensemble (K=1) | Highest verified accuracy (95.4%), minimal new assignments |
| **Maximum assignment rate** | Synthetic ensemble (K=1) | 99.98% assignment rate (note: makes many novel predictions) |
| **21T data only** | 21T (K=1, Euclidean) | Optimized for 21T instrument data |
| **7T data only** | 7T (K=1, Euclidean) | Optimized for 7T instrument data |
| **Synthetic/simulated data** | Synthetic ensemble | Trained on computationally generated formulas |
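
To turn the table into code, the snippet below maps each use case onto the `data_source`, `k_neighbors`, and `metric` arguments used in the Quick Start. The `USE_CASES` dictionary and `load_for_use_case` helper are illustrative conveniences, not part of the released repository, and the `"7T"` / `"Synthetic"` values are inferred from the variant names rather than shown in the examples above, so check the repository's configuration for the exact accepted strings.

```python
from transformers import AutoModel

# Convenience mapping from the selection guide to from_pretrained keyword arguments.
# Values not shown in the Quick Start ("7T", "Synthetic") are assumptions inferred
# from the variant names; verify them against the repository's custom code.
USE_CASES = {
    "real_dom_samples":    dict(data_source="7T-21T",    k_neighbors=1, metric="euclidean"),
    "max_assignment_rate": dict(data_source="Synthetic", k_neighbors=1, metric="euclidean"),
    "21T_data_only":       dict(data_source="21T",       k_neighbors=1, metric="euclidean"),
    "7T_data_only":        dict(data_source="7T",        k_neighbors=1, metric="euclidean"),
}

def load_for_use_case(use_case: str):
    return AutoModel.from_pretrained(
        "SaeedLab/dom-formula-assignment-using-knn",
        trust_remote_code=True,
        **USE_CASES[use_case],
    )

model = load_for_use_case("real_dom_samples")  # 7T-21T ensemble, K=1, Euclidean
```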

---

## License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.

---

## Contact

For any additional questions or comments, contact Fahad Saeed ([email protected]).

---