---
title: A virtual Catalan grandparent
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
pinned: false
python_version: 3.12.10
license: mit
short_description: A virtual Catalan grandparent
---

# A virtual Catalan grandparent

This is the repository for the project "A virtual Catalan grandparent", created as part of the course "Natural Language Processing" of the "Master's degree in Machine Learning and Cybersecurity for Internet Connected Systems" at UPC-EPSEM.

The project consists of a "virtual grandparent": an application that wraps a pre-trained transformer model to retrieve the most fitting Catalan proverb for the user's input.

A public Gradio demo of the project is available on [Hugging Face Spaces](https://huggingface.co/spaces/pauhmolins/virtual-catalan-grandparent). To run the demo locally, follow the instructions below.

> **DISCLAIMER**: the demo may not be available at all times, as it is hosted on a free plan.

> The project is based on models trained as part of [Projecte Aina](https://huggingface.co/projecte-aina).

## Repository structure

- `app.py`: Main script to run the Gradio web demo application (this is what's running on Hugging Face Spaces).
- `src/`: Contains the source code of the project.
  - `customlogger.py`: Custom logger used by the modules in the project. Can be adjusted to change the logging level and format.
  - `datasets.py`: Functions to load and manage the datasets, and to generate the text representation of the proverbs.
  - `models.py`: Contains the class that wraps the pre-trained transformer model and manages embedding tasks.
  - `indexes.py`: Wrapper for the FAISS library, used to create and manage the FAISS index of the dataset.
  - `commons.py`: Contains the common functionality used by both the demo and the tests.
  - `tests.py`: Contains the code used to set up, run and record the tests of different hyperparameters for the system.
- `datasets/`: Contains the JSON files that correspond to the datasets used in the project.
- `tests_runs/`: Contains the results of the tests run to select the best-performing hyperparameters for the system.

> The datasets can also be accessed publicly at Hugging Face: [catalan-proverbs](https://huggingface.co/datasets/pauhmolins/catalan-proverbs) and [catalan-proverbs-prompts](https://huggingface.co/datasets/pauhmolins/catalan-proverbs-prompts).

## How to run the project

First, create a Python virtual environment (preferably Python `>3.12.0`) and install the required packages:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Then, run the `app.py` script to start the application:

```bash
python app.py
```

This starts a Gradio web application that runs locally by default. You can enable public sharing by adjusting the `SHARE` parameter in that script.

Running the `app.py` script for the first time also generates a FAISS index of the dataset and saves it to the file `proverbs.index` for faster loading in the future.

## Running tests

To run the tests, use the `tests.py` script. It evaluates the performance of the system for the hyperparameter combinations configured in the script. Note that if no filtering of the hyperparameters is set, the script runs a test for every combination of hyperparameters, which can take a long time to finish.

```bash
python src/tests.py
```

The results of the runs executed during the development of the project are stored in the `tests_runs` folder. The results are stored in a JSON file with all the relevant information about the test, including the hyperparameters used and the results of the evaluation metrics.
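
To give a feel for why an unfiltered run can take a long time, the sketch below expands a small hyperparameter grid into every combination and then applies a filter. It is only an illustration: the hyperparameter names, their values, the `expand_grid` helper and the layout of the JSON record are all assumptions made for this example, not the ones used in `src/tests.py`.

```python
import itertools
import json

# Hypothetical hyperparameter grid: these names and values are illustrative
# assumptions, not the ones defined in src/tests.py.
GRID = {
    "model": ["model-a", "model-b"],
    "text_representation": ["proverb_only", "proverb_and_meaning"],
    "top_k": [1, 5, 10],
}


def expand_grid(grid, keep=None):
    """Return one dict per hyperparameter combination, optionally filtered."""
    names = list(grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*grid.values())
    ]
    if keep is not None:
        # Keep only the combinations accepted by the filter function.
        combos = [combo for combo in combos if keep(combo)]
    return combos


all_runs = expand_grid(GRID)
filtered_runs = expand_grid(
    GRID,
    keep=lambda c: c["model"] == "model-a" and c["top_k"] <= 5,
)

print(len(all_runs))       # 2 * 2 * 3 = 12 combinations
print(len(filtered_runs))  # 1 * 2 * 2 = 4 combinations

# A JSON record for one run. The field names are assumptions, and the
# metrics are left empty because no evaluation is performed here.
record = {"hyperparameters": filtered_runs[0], "metrics": {}}
print(json.dumps(record, indent=2))
```

The number of runs is the product of the number of values per hyperparameter, so each extra option multiplies the total. Filtering on even one or two hyperparameters is usually enough to keep a run manageable.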