monsoon-nlp/tomatotomato-gLM2-150M-v0.1

monsoon-nlp committed on Sep 26

Commit 03eb735 · verified · 1 parent: c500885

Update README.md

Files changed (1)

README.md +1 -0

@@ -9,6 +9,7 @@ base_model: tattabio/gLM2_150M
 ## tomatotomato-gLM2-150M-v0.1
 TomatoTomato (pronounced "to-may-to, to-mah-to") is a finetune of [gLM2_150M](https://huggingface.co/tattabio/gLM2_150M) on a new technique in tokenizing pangenomes.
+Tokenizing two genomes at once means if one aligned sequence is AAAA and the other is ACGT, we output four tokens representing that variance. The base model is TattaBio's gLM2, repurposing and adding to its vocabulary.
 The training data is one sequence representing the differences between two tomato genomes:
 - Heinz 1706, NCBI's sequence [GCF_000188115.5](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000188115.5/)
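
The line added in this commit describes tokenizing two aligned genomes at once. A minimal sketch of that idea follows, assuming one token per aligned column. The function name `pair_tokens` and the `A|C` token spelling are hypothetical and chosen only for illustration; the model's actual vocabulary (repurposed and extended from gLM2) is not shown in this diff.

```python
def pair_tokens(ref: str, alt: str) -> list[str]:
    """Emit one token per aligned column of two equal-length sequences."""
    if len(ref) != len(alt):
        raise ValueError("aligned sequences must have the same length")
    tokens = []
    for a, b in zip(ref.upper(), alt.upper()):
        # Hypothetical naming: a matching column keeps the base itself,
        # a differing column becomes a combined "A|C"-style token.
        tokens.append(a if a == b else f"{a}|{b}")
    return tokens


# The README's example: AAAA aligned against ACGT yields four tokens.
print(pair_tokens("AAAA", "ACGT"))  # ['A', 'A|C', 'A|G', 'A|T']
```

Under this assumption every column maps to exactly one token, so the output length equals the alignment length, which matches the "four tokens" in the README example.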