Thanks for the thoughtful comment! For now, I'm of the opinion that SaaS embedding APIs are cheap enough that even a large dataset can be re-vectorised. For example, for the 143k chunks the cost was somewhere between $6 and $30 (from memory). That's every High Court judgment up to 2023 in Australia. Personally I treat the vectors themselves as essentially disposable, since better models come out every month or so. I know not everyone shares that mindset, and for ultimate control you'd definitely want to go local.
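A rough back-of-envelope sketch of why re-embedding stays cheap. The per-chunk token count and per-million-token prices below are illustrative assumptions, not figures from the post:

```python
# Back-of-envelope re-embedding cost estimate.
# Assumptions (hypothetical, not from the post):
#   - ~500 tokens per chunk
#   - typical SaaS embedding pricing of $0.02-$0.20 per 1M tokens
chunks = 143_000
tokens_per_chunk = 500                      # assumption; varies by corpus
total_tokens = chunks * tokens_per_chunk    # 71.5M tokens

for price_per_million in (0.02, 0.10, 0.20):  # USD per 1M tokens
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"${price_per_million}/M tokens -> ${cost:.2f}")
```

At those assumed rates the whole corpus re-embeds for single-digit to low-double-digit dollars, which is the same ballpark as the $6 to $30 range above.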
Adrian Lucas Malec
adlumal
Replied to their post (about 1 month ago):
I benchmarked embedding APIs for speed, compared local vs hosted models, and tuned USearch for sub-millisecond retrieval on 143k chunks using only CPU. The post walks through the results, trade-offs, and what I learned about embedding API terms of service.
The main motivation for using USearch is that CPU compute is cheap and easy to scale.
Blog post: https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents