| title: Fineweb-edu-fortified Semantic Search Demo | |
| emoji: π | |
| sdk: gradio | |
| sdk_version: 4.41.0 | |
| app_file: app.py | |
| pinned: false | |
| datasets: | |
| - airtrain-ai/fineweb-edu-fortified | |
| - HuggingFaceFW/fineweb-edu | |
| models: | |
| - TaylorAI/bge-micro | |
| license: apache-2.0 | |
| # Semantic Search on Fineweb-edu-fortified sample | |
| This Space performs semantic search on one crawl ({{CRAWL_DUMP}}) from Fineweb-edu-fortified. | |
| It is intended to illustrate the contents of | |
| [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | |
| and | |
| [fineweb-edu-fortified](https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified). | |
| To explore Fineweb-edu-fortified further, you can view automatic clustering, embedding | |
| projections, and more for a 500k row sample using | |
| [this Airtrain dashboard](https://app.airtrain.ai/dataset/c232b33f-4f4a-49a7-ba55-8167a5f433da/null/1/0). | |
| The embeddings are the ones present in the dataset itself, and the same embedding model | |
| is used to embed your search phrase. The search returns the 15 rows whose embedding | |
| vectors are closest to the embedding of the search phrase. | |
| The search data is lazily loaded, so shortly after the Space is launched, the full corpus | |
| of text from that crawl may not yet be available for search. Refer to 'Rows searched' to | |
| see how many rows were searched to retrieve the results. | |
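
The retrieval step described above (embed the search phrase, then return the 15 rows with the closest embedding vectors among whatever has been loaded so far) can be sketched roughly as follows. This is a minimal illustration, not the Space's actual `app.py`: the function name `top_k_rows`, the random stand-in embeddings, and the choice of cosine similarity as the closeness measure are all assumptions, since the README does not state which metric is used.

```python
import numpy as np

TOP_K = 15  # the README states that 15 rows are returned


def top_k_rows(query_vec, row_vecs, k=TOP_K):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Assumption: cosine similarity is the closeness measure; the README
    does not say which metric the Space uses.
    """
    row_vecs = np.asarray(row_vecs, dtype=np.float64)
    query_vec = np.asarray(query_vec, dtype=np.float64)
    if row_vecs.shape[0] == 0:
        # Nothing has been loaded yet, so there is nothing to return.
        return np.array([], dtype=int), np.array([], dtype=np.float64)
    # Normalize so that a dot product equals cosine similarity.
    rows_unit = row_vecs / np.linalg.norm(row_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = rows_unit @ query_unit
    # With lazy loading, fewer than k rows may be available.
    k = min(k, scores.shape[0])
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Hypothetical stand-in data: 100 loaded rows with 8-dimensional embeddings.
rng = np.random.default_rng(0)
loaded = rng.normal(size=(100, 8))
query = loaded[42] + 0.01 * rng.normal(size=8)  # a query very close to row 42

indices, scores = top_k_rows(query, loaded)
rows_searched = loaded.shape[0]  # analogous to the 'Rows searched' figure
```

Because only the rows loaded so far are passed in, the same query can return different results as more of the crawl becomes available, which is why the number of rows searched is worth checking alongside the results.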