---
title: Fineweb-edu-fortified Semantic Search Demo
emoji: 📚
sdk: gradio
sdk_version: 4.41.0
app_file: app.py
pinned: false
datasets:
- airtrain-ai/fineweb-edu-fortified
- HuggingFaceFW/fineweb-edu
models:
- TaylorAI/bge-micro
license: apache-2.0
---
# Semantic Search on Fineweb-edu-fortified sample
This demo performs semantic search on one crawl ({{CRAWL_DUMP}}) from Fineweb-edu-fortified.
It is intended to illustrate the contents of
[fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
and
[fineweb-edu-fortified](https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified).
To explore Fineweb-edu-fortified further, you can view automatic clustering, embedding
projections, and more for a 500k-row sample using
[this Airtrain dashboard](https://app.airtrain.ai/dataset/c232b33f-4f4a-49a7-ba55-8167a5f433da/null/1/0).
The embeddings are the ones present in the dataset itself, and the same embedding model
is used to embed your search phrase. The search returns the 15 rows whose embedding
vectors are closest to the embedding of the search phrase.
The search data is lazily loaded, so shortly after
the space is launched it may not yet have the full corpus of text from that crawl available
for search. Refer to 'Rows searched' to see how many rows were searched to retrieve the results.
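
The nearest-neighbor lookup described above can be illustrated with a minimal sketch. This is not the code in `app.py`. It assumes cosine similarity as the closeness measure (this README does not state which metric the app uses), and the names `cosine_similarity` and `top_k_rows` are hypothetical. The toy 2-dimensional vectors stand in for real embeddings.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_rows(query_embedding, row_embeddings, k=15):
    """Return (row_index, similarity) pairs for the k closest rows, best first.

    Assumption: "closest" means highest cosine similarity.
    """
    scored = (
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(row_embeddings)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for the dataset's precomputed embeddings.
rows = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]  # stands in for the embedded search phrase

results = top_k_rows(query, rows, k=2)
rows_searched = len(rows)  # analogous to the 'Rows searched' figure
print(results, rows_searched)
```

With lazy loading, `rows` would hold only the rows loaded so far, which is why the number of rows searched can grow after the space starts.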