Update README.md
README.md
CHANGED
|
@@ -9,6 +9,7 @@ base_model: tattabio/gLM2_150M
| 9 |
## tomatotomato-gLM2-150M-v0.1
|
| 10 |
|
| 11 |
TomatoTomato (pronounced "to-may-to, to-mah-to") is a finetune of [gLM2_150M](https://huggingface.co/tattabio/gLM2_150M) on a new technique in tokenizing pangenomes.
|
| 12 |
+
Tokenizing two genomes at once means that if one aligned sequence is AAAA and the other is ACGT, the output is four tokens representing the variation between them. The base model is TattaBio's gLM2; this finetune repurposes parts of its vocabulary and adds to it.
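
The paired tokenization described above can be illustrated with a minimal sketch. The README does not specify the actual token format, so everything here is an assumption for illustration: one token per aligned column, a hypothetical `ref/alt` naming scheme such as `A/C`, and the hypothetical helper names `pair_tokens` and `build_pair_vocab`. It is not the model's real tokenizer.

```python
from itertools import product


def pair_tokens(ref: str, alt: str) -> list[str]:
    """Emit one token per aligned column of two equal-length sequences.

    The "X/Y" token spelling is an assumption made for this sketch; the
    model card does not state how paired tokens are actually named.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [f"{r}/{a}" for r, a in zip(ref.upper(), alt.upper())]


def build_pair_vocab(alphabet: str = "ACGT") -> list[str]:
    """List every possible pair token for the given alphabet (hypothetical)."""
    return [f"{r}/{a}" for r, a in product(alphabet, repeat=2)]


if __name__ == "__main__":
    # The example from the text: AAAA aligned against ACGT gives four tokens.
    print(pair_tokens("AAAA", "ACGT"))  # ['A/A', 'A/C', 'A/G', 'A/T']
    print(len(build_pair_vocab()))      # 16 pair tokens for a 4-letter alphabet
```

Under these assumptions, a four-letter alphabet needs 16 pair tokens, which gives a rough sense of why the base vocabulary would need to be repurposed or extended.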
|
| 13 |
|
| 14 |
The training data is one sequence representing the differences between two tomato genomes:
|
| 15 |
- Heinz 1706, NCBI's sequence [GCF_000188115.5](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000188115.5/)
|