monsoon-nlp committed on
Commit 03eb735 · verified · 1 Parent(s): c500885

Update README.md

  1. README.md +1 -0
README.md CHANGED
@@ -9,6 +9,7 @@ base_model: tattabio/gLM2_150M
  ## tomatotomato-gLM2-150M-v0.1

  TomatoTomato (pronounced "to-may-to, to-mah-to") is a finetune of [gLM2_150M](https://huggingface.co/tattabio/gLM2_150M) on a new technique in tokenizing pangenomes.
+ Tokenizing two genomes at once means if one aligned sequence is AAAA and the other is ACGT, we output four tokens representing that variance. The base model is TattaBio's gLM2, repurposing and adding to its vocabulary.

  The training data is one sequence representing the differences between two tomato genomes:
  - Heinz 1706, NCBI's sequence [GCF_000188115.5](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000188115.5/)
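
The line added in this commit describes emitting one token per aligned position across two genomes. A minimal Python sketch of that idea follows. The function names and the `A/C` token format are hypothetical, chosen only for illustration; the README does not state the model's actual token spellings or how its vocabulary was extended.

```python
from typing import List

BASES = "ACGT"


def tokenize_pair(ref: str, alt: str) -> List[str]:
    """Emit one token per aligned position of two equal-length sequences.

    Hypothetical scheme: a position where both genomes agree keeps the
    single-base token, and a position where they differ becomes a
    combined "ref/alt" token.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    tokens = []
    for a, b in zip(ref.upper(), alt.upper()):
        tokens.append(a if a == b else f"{a}/{b}")
    return tokens


def pair_vocab(bases: str = BASES) -> List[str]:
    """List every token the hypothetical scheme can emit for these bases."""
    single = list(bases)
    mixed = [f"{a}/{b}" for a in bases for b in bases if a != b]
    return single + mixed


# The README's example: AAAA aligned against ACGT gives four tokens.
example = tokenize_pair("AAAA", "ACGT")
print(example)
print(len(pair_vocab()))
```

Under this assumed scheme, the four plain base tokens could be reused from an existing vocabulary while the twelve mismatch tokens would be new entries, which is one way to read "repurposing and adding to its vocabulary". Gaps and ambiguous bases are not handled here.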