Thanks for the thoughtful comment! For now, I'm of the opinion that SaaS embedding APIs are cheap enough that even a large dataset can be re-vectorised. For example, for the 143k chunks the cost was somewhere between $6 and $30 (from memory). That's every High Court judgment up to 2023 in Australia. Personally I treat the vectors themselves as essentially disposable, since better models come out every month or so. I know not everyone shares that mindset, and for ultimate control you'd definitely want to go local.
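A rough back-of-envelope sketch of why re-embedding stays cheap. The per-chunk token count and per-million-token prices below are illustrative assumptions, not figures from the post:

```python
# Back-of-envelope re-embedding cost estimate.
# Assumptions (hypothetical, not from the post):
#   - ~500 tokens per chunk
#   - typical SaaS embedding pricing of $0.02-$0.20 per 1M tokens
chunks = 143_000
tokens_per_chunk = 500                      # assumption; varies by corpus
total_tokens = chunks * tokens_per_chunk    # 71.5M tokens

for price_per_million in (0.02, 0.10, 0.20):  # USD per 1M tokens
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"${price_per_million}/M tokens -> ${cost:.2f}")
```

At those assumed rates the whole corpus re-embeds for single-digit to low-double-digit dollars, which is the same ballpark as the $6 to $30 range above.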
Adrian Lucas Malec
adlumal
Replied to their post (about 1 month ago):
I benchmarked embedding APIs for speed, compared local vs hosted models, and tuned USearch for sub-millisecond retrieval on 143k chunks using only CPU. The post walks through the results, trade-offs, and what I learned about embedding API terms of service.
The main motivation for using USearch is that CPU compute is cheap and easy to scale.
Blog post: https://huggingface.co/blog/adlumal/lightning-fast-vector-search-for-legal-documents