esunAI committed on
Commit 7411921 · verified · 1 Parent(s): fcc74a5

Add missing file: upload_to_huggingface.py

Files changed (1)
  1. src/upload_to_huggingface.py +448 -0
src/upload_to_huggingface.py ADDED
@@ -0,0 +1,448 @@
+ #!/usr/bin/env python3
+ """
+ Upload FlowFinal model components to Hugging Face Hub.
+ """
+
+ import os
+ from huggingface_hub import HfApi, upload_file, upload_folder
+ import shutil
+ from datetime import datetime
+ import json
+
+ def create_model_card():
+     """Create a comprehensive model card for FlowFinal."""
+     model_card = """---
+ license: mit
+ tags:
+ - protein-generation
+ - antimicrobial-peptides
+ - flow-matching
+ - protein-design
+ - esm
+ - amp
+ library_name: pytorch
+ ---
+
+ # FlowFinal: AMP Flow Matching Model
+
+ FlowFinal is a state-of-the-art flow matching model for generating antimicrobial peptides (AMPs). The model uses continuous normalizing flows to generate protein sequences in the ESM-2 embedding space.
+
+ ## Model Description
+
+ - **Model Type**: Flow Matching for Protein Generation
+ - **Domain**: Antimicrobial Peptide (AMP) Generation
+ - **Base Model**: ESM-2 (650M parameters)
+ - **Architecture**: Transformer-based flow matching with classifier-free guidance (CFG)
+ - **Training Data**: Curated AMP dataset with ~7K sequences
+
+ ## Key Features
+
+ - **Classifier-Free Guidance (CFG)**: Enables controlled generation with different conditioning strengths
+ - **ESM-2 Integration**: Leverages pre-trained protein language model embeddings
+ - **Compression Architecture**: Efficient 16x compression of ESM-2 embeddings (1280 → 80 dimensions)
+ - **Multiple CFG Scales**: Support for no conditioning (0.0), weak (3.0), strong (7.5), and very strong (15.0) guidance
+
+ ## Model Components
+
+ ### Core Architecture
+ - `final_flow_model.py`: Main flow matching model implementation
+ - `compressor_with_embeddings.py`: Embedding compression/decompression modules
+ - `final_sequence_decoder.py`: ESM-2 embedding to sequence decoder
+
+ ### Trained Weights
+ - `final_compressor_model.pth`: Trained compressor (315MB)
+ - `final_decompressor_model.pth`: Trained decompressor (158MB)
+ - `amp_flow_model_final_optimized.pth`: Main flow model checkpoint
+
+ ### Generated Samples (Today's Results)
+ - Generated AMP sequences with different CFG scales
+ - HMD-AMP validation results showing 8.8% AMP prediction rate
+
+ ## Performance Results
+
+ ### HMD-AMP Validation (80 sequences tested)
+ - **Total AMPs Predicted**: 7/80 (8.8%)
+ - **By CFG Configuration**:
+   - No CFG: 1/20 (5.0%)
+   - Weak CFG: 2/20 (10.0%)
+   - Strong CFG: 4/20 (20.0%) ← Best performance
+   - Very Strong CFG: 0/20 (0.0%)
+
+ ### Best Performing Sequences
+ 1. `ILVLVLARRIVGVIVAKVVLYAIVRSVVAAAKSISAVTVAKVTVFFQTTA` (No CFG)
+ 2. `EDLSKAKAELQRYLLLSEIVSAFTALTRFYVVLTKIFQIRVKLIAVGQIL` (Weak CFG)
+ 3. `IKLSRIAGIIVKRIRVASGDAQRLITASIGFTLSVVLAARFITIILGIVI` (Strong CFG)
+
+ ## Usage
+
+ ```python
+ from generate_amps import AMPGenerator
+
+ # Initialize generator
+ generator = AMPGenerator(
+     model_path="amp_flow_model_final_optimized.pth",
+     device='cuda'
+ )
+
+ # Generate AMP samples
+ samples = generator.generate_amps(
+     num_samples=20,
+     num_steps=25,
+     cfg_scale=7.5  # Strong CFG recommended
+ )
+ ```
+
+ ## Training Details
+
+ - **Optimizer**: AdamW with cosine annealing
+ - **Learning Rate**: 4e-4 (final)
+ - **Epochs**: 2000
+ - **Final Loss**: 1.318
+ - **Training Time**: 2.3 hours on H100
+ - **Dataset Size**: 6,983 samples
+
+ ## Files Structure
+
+ ```
+ FlowFinal/
+ ├── models/
+ │   ├── final_compressor_model.pth
+ │   ├── final_decompressor_model.pth
+ │   └── amp_flow_model_final_optimized.pth
+ ├── generated_samples/
+ │   ├── generated_sequences_20250829.fasta
+ │   └── hmd_amp_detailed_results.csv
+ ├── src/
+ │   ├── final_flow_model.py
+ │   ├── compressor_with_embeddings.py
+ │   ├── final_sequence_decoder.py
+ │   └── generate_amps.py
+ └── README.md
+ ```
+
+ ## Citation
+
+ If you use FlowFinal in your research, please cite:
+
+ ```bibtex
+ @misc{flowfinal2025,
+   title={FlowFinal: Flow Matching for Antimicrobial Peptide Generation},
+   author={Edward Sun},
+   year={2025},
+   url={https://huggingface.co/esunAI/FlowFinal}
+ }
+ ```
+
+ ## License
+
+ This model is released under the MIT License.
+ """
+     return model_card
+
+ def main():
+     print("🚀 Starting comprehensive upload to Hugging Face Hub...")
+
+     # Initialize API
+     api = HfApi()
+     repo_id = "esunAI/FlowFinal"
+     today = "20250829"
+
+     # Create model card
+     print("📝 Creating model card...")
+     model_card = create_model_card()
+     # Write as UTF-8 so the arrows and box-drawing characters in the card
+     # do not depend on the platform's default encoding
+     with open("README.md", "w", encoding="utf-8") as f:
+         f.write(model_card)
+
+     # Upload model card
+     print("📤 Uploading model card...")
+     upload_file(
+         path_or_fileobj="README.md",
+         path_in_repo="README.md",
+         repo_id=repo_id,
+         commit_message="Add comprehensive model card"
+     )
+
+     # Upload main model components
+     print("📤 Uploading main model files...")
+     model_files = [
+         "final_flow_model.py",
+         "compressor_with_embeddings.py",
+         "final_sequence_decoder.py",
+         "generate_amps.py",
+         "amp_flow_training_single_gpu_full_data.py",
+         "cfg_dataset.py",
+         "decode_and_test_sequences.py"
+     ]
+
+     for file in model_files:
+         if os.path.exists(file):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file,
+                 path_in_repo=f"src/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload trained model weights
+     print("📤 Uploading model weights...")
+     weight_files = [
+         ("final_compressor_model.pth", "models/final_compressor_model.pth"),
+         ("final_decompressor_model.pth", "models/final_decompressor_model.pth"),
+         ("normalization_stats.pt", "models/normalization_stats.pt")
+     ]
+
+     for local_file, repo_path in weight_files:
+         if os.path.exists(local_file):
+             print(f" Uploading {local_file} -> {repo_path}...")
+             upload_file(
+                 path_or_fileobj=local_file,
+                 path_in_repo=repo_path,
+                 repo_id=repo_id,
+                 commit_message=f"Add {local_file}"
+             )
+
+     # Upload ALL flow model checkpoints from today
+     print("📤 Uploading flow model checkpoints...")
+     checkpoint_files = [
+         ("/data2/edwardsun/flow_checkpoints/amp_flow_model_final_optimized.pth", "models/amp_flow_model_final_optimized.pth"),
+         ("/data2/edwardsun/flow_checkpoints/amp_flow_model_best_optimized.pth", "models/amp_flow_model_best_optimized.pth"),
+         ("/data2/edwardsun/flow_checkpoints/amp_flow_model_best_optimized_20250829_RETRAINED.pth", "models/amp_flow_model_best_optimized_20250829_RETRAINED.pth")
+     ]
+
+     for checkpoint_path, repo_path in checkpoint_files:
+         if os.path.exists(checkpoint_path):
+             print(f" Uploading {os.path.basename(checkpoint_path)}...")
+             upload_file(
+                 path_or_fileobj=checkpoint_path,
+                 path_in_repo=repo_path,
+                 repo_id=repo_id,
+                 commit_message=f"Add {os.path.basename(checkpoint_path)}"
+             )
+
+     # Upload paper and documentation files
+     print("📤 Uploading paper and documentation files...")
+     paper_files = [
+         "paper_results.tex",
+         "supplementary_data.tex",
+         "latex_tables.tex"
+     ]
+
+     for file in paper_files:
+         if os.path.exists(file):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file,
+                 path_in_repo=f"paper/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload training logs
+     print("📤 Uploading training logs...")
+     log_files = [
+         "fresh_training_aug29.log",
+         "h100_maximized_training.log",
+         "training_output_h100_max.log",
+         "training_output.log",
+         "launch_full_data_training.sh"
+     ]
+
+     for file in log_files:
+         if os.path.exists(file):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file,
+                 path_in_repo=f"training_logs/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload datasets
+     print("📤 Uploading datasets...")
+     dataset_files = [
+         ("all_peptides_data.json", "datasets/all_peptides_data.json"),
+         ("combined_final.fasta", "datasets/combined_final.fasta"),
+         ("cfgdata.fasta", "datasets/cfgdata.fasta"),
+         ("uniprotkb_AND_reviewed_true_AND_model_o_2025_08_29.fasta", "datasets/uniprotkb_reviewed_proteins.fasta")
+     ]
+
+     for local_file, repo_path in dataset_files:
+         if os.path.exists(local_file):
+             print(f" Uploading {local_file}...")
+             upload_file(
+                 path_or_fileobj=local_file,
+                 path_in_repo=repo_path,
+                 repo_id=repo_id,
+                 commit_message=f"Add {local_file}"
+             )
+
+     # Upload today's results and analysis
+     print("📤 Uploading today's results and analysis...")
+     result_files = [
+         "generated_sequences_20250829_144923.fasta",
+         "hmd_amp_detailed_results.csv",
+         "hmd_amp_cfg_analysis.csv",
+         "complete_amp_results.csv",
+         "summary_statistics.csv"
+     ]
+
+     for file in result_files:
+         if os.path.exists(file):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file,
+                 path_in_repo=f"results/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload today's raw embeddings
+     print("📤 Uploading today's raw embeddings...")
+     embedding_dir = "/data2/edwardsun/generated_samples"
+
+     embedding_files = [
+         f"generated_amps_best_model_no_cfg_{today}.pt",
+         f"generated_amps_best_model_weak_cfg_{today}.pt",
+         f"generated_amps_best_model_strong_cfg_{today}.pt",
+         f"generated_amps_best_model_very_strong_cfg_{today}.pt"
+     ]
+
+     for file in embedding_files:
+         file_path = os.path.join(embedding_dir, file)
+         if os.path.exists(file_path):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file_path,
+                 path_in_repo=f"generated_samples/embeddings/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload decoded sequences from today
+     print("📤 Uploading decoded sequences from today...")
+     decoded_dir = "/data2/edwardsun/decoded_sequences"
+     decoded_files = [
+         f"decoded_sequences_no_cfg_00_{today}.txt",
+         f"decoded_sequences_weak_cfg_30_{today}.txt",
+         f"decoded_sequences_strong_cfg_75_{today}.txt",
+         f"decoded_sequences_very_strong_cfg_150_{today}.txt"
+     ]
+
+     for file in decoded_files:
+         file_path = os.path.join(decoded_dir, file)
+         if os.path.exists(file_path):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file_path,
+                 path_in_repo=f"generated_samples/decoded_sequences/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload APEX analysis results from today
+     print("📤 Uploading APEX analysis results...")
+     apex_dir = "/data2/edwardsun/apex_results"
+     apex_files = [
+         f"apex_results_no_cfg_00_{today}.json",
+         f"apex_results_weak_cfg_30_{today}.json",
+         f"apex_results_strong_cfg_75_{today}.json",
+         f"apex_results_very_strong_cfg_150_{today}.json",
+         f"apex_results_all_cfg_comparison_{today}.json",
+         f"mic_summary_{today}.json"
+     ]
+
+     for file in apex_files:
+         file_path = os.path.join(apex_dir, file)
+         if os.path.exists(file_path):
+             print(f" Uploading {file}...")
+             upload_file(
+                 path_or_fileobj=file_path,
+                 path_in_repo=f"analysis/apex_results/{file}",
+                 repo_id=repo_id,
+                 commit_message=f"Add {file}"
+             )
+
+     # Upload additional dataset file from data2
+     print("📤 Uploading additional dataset files...")
+     additional_dataset_path = "/data2/edwardsun/decoded_sequences/all_dataset_peptides_sequences.txt"
+     if os.path.exists(additional_dataset_path):
+         print(" Uploading all_dataset_peptides_sequences.txt...")
+         upload_file(
+             path_or_fileobj=additional_dataset_path,
+             path_in_repo="datasets/all_dataset_peptides_sequences.txt",
+             repo_id=repo_id,
+             commit_message="Add complete dataset sequences"
+         )
+
+     # Create comprehensive summary
+     print("📤 Creating comprehensive summary...")
+
+     # Count the files in each category that exist locally (the ones uploaded above)
+     uploaded_files = {
+         "model_components": len([f for f in model_files if os.path.exists(f)]),
+         "weight_files": len([f for f, _ in weight_files if os.path.exists(f)]),
+         "checkpoints": len([f for f, _ in checkpoint_files if os.path.exists(f)]),
+         "paper_files": len([f for f in paper_files if os.path.exists(f)]),
+         "training_logs": len([f for f in log_files if os.path.exists(f)]),
+         "datasets": len([f for f, _ in dataset_files if os.path.exists(f)]),
+         "results": len([f for f in result_files if os.path.exists(f)]),
+         "embeddings": len([f for f in embedding_files if os.path.exists(os.path.join(embedding_dir, f))]),
+         "decoded_sequences": len([f for f in decoded_files if os.path.exists(os.path.join(decoded_dir, f))]),
+         "apex_results": len([f for f in apex_files if os.path.exists(os.path.join(apex_dir, f))])
+     }
+
+     summary = {
+         "model_name": "FlowFinal",
+         "upload_date": datetime.now().isoformat(),
+         "training_date": today,
+         "total_sequences_generated": 80,
+         "hmd_amp_predictions": 7,
+         "hmd_amp_rate": 8.8,
+         "best_cfg_configuration": "strong_cfg (20% AMP rate)",
+         "training_details": {
+             "epochs": 2000,
+             "final_loss": 1.318,
+             "training_time": "2.3 hours",
+             "hardware": "H100",
+             "dataset_size": 6983
+         },
+         "uploaded_files": uploaded_files,
+         "total_files_uploaded": sum(uploaded_files.values()),
+         "repository_structure": {
+             "src/": "Main model implementation files",
+             "models/": "Trained model weights and checkpoints",
+             "paper/": "LaTeX files and paper documentation",
+             "training_logs/": "Complete training logs and scripts",
+             "datasets/": "Training datasets and protein sequences",
+             "results/": "Generated sequences and validation results",
+             "generated_samples/": "Raw embeddings and decoded sequences",
+             "analysis/": "APEX antimicrobial activity analysis"
+         }
+     }
+
+     with open("comprehensive_summary.json", "w") as f:
+         json.dump(summary, f, indent=2)
+
+     upload_file(
+         path_or_fileobj="comprehensive_summary.json",
+         path_in_repo="comprehensive_summary.json",
+         repo_id=repo_id,
+         commit_message="Add comprehensive model and results summary"
+     )
+
+     print("✅ Comprehensive upload complete!")
+     print(f"🌐 Your complete FlowFinal repository is now available at: https://huggingface.co/{repo_id}")
+     print("\n📊 Upload Summary:")
+     for category, count in uploaded_files.items():
+         print(f" - {category.replace('_', ' ').title()}: {count} files")
+     print(f" - Total files uploaded: {sum(uploaded_files.values())} files")
+     print("\n🎯 Key Results:")
+     print(" - Generated 80 sequences with different CFG scales")
+     print(" - HMD-AMP validated 7 sequences as AMPs (8.8% success rate)")
+     print(" - Strong CFG (7.5) performed best with 20% AMP rate")
+     print(" - Complete training logs, datasets, and analysis included")
+     print(" - Ready for final paper submission!")
+
+ if __name__ == "__main__":
+     main()
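
The script repeats the same check-then-upload loop for each file category. A minimal sketch of how that pattern could be factored into two helpers is shown below. `plan_uploads` and `upload_planned` are hypothetical names that do not appear in the commit, and the sketch assumes `api` is any object with an `upload_file` method that accepts the same keyword arguments the script already passes (for example a `huggingface_hub.HfApi` instance).

```python
import os
from typing import Iterable, List, Tuple


def plan_uploads(names: Iterable[str], repo_prefix: str, local_dir: str = "") -> List[Tuple[str, str]]:
    """Return (local_path, path_in_repo) pairs for the files that exist locally.

    Hypothetical helper: it mirrors the `if os.path.exists(...)` guard that the
    script applies before every upload.
    """
    plan = []
    for name in names:
        # os.path.join("", name) returns name unchanged, so local_dir is optional.
        local_path = os.path.join(local_dir, name)
        if os.path.exists(local_path):
            plan.append((local_path, f"{repo_prefix}/{os.path.basename(name)}"))
    return plan


def upload_planned(api, plan: List[Tuple[str, str]], repo_id: str) -> int:
    """Upload every planned file via api.upload_file and return how many were sent."""
    for local_path, path_in_repo in plan:
        print(f"  Uploading {local_path} -> {path_in_repo}...")
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
            commit_message=f"Add {os.path.basename(local_path)}",
        )
    return len(plan)
```

Under these assumptions, the training-log section would reduce to `upload_planned(api, plan_uploads(log_files, "training_logs"), repo_id)`, and the returned counts could feed the `uploaded_files` summary directly instead of re-checking `os.path.exists` for every list.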