Types of Question Answering
- Extractive question answering (encoder-only models, e.g. BERT)
  - posing questions about a document and identifying the answers as spans of text in the document itself (see the pipeline sketch below)
- Generative question answering (encoder-decoder models, e.g. T5/BART)
  - open-ended questions, which need to synthesize information
- Retrieval-based / community question answering
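A minimal sketch of the extractive case using the transformers question-answering pipeline; the checkpoint name is only an example of a multilingual model with a QA head, not a project decision:

```python
from transformers import pipeline

# Extractive QA: the answer is a span copied verbatim from the context.
# The model name is just an example multilingual SQuAD-style checkpoint.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

context = "Budapest is the capital and most populous city of Hungary."
result = qa(question="What is the capital of Hungary?", context=context)

# The pipeline returns the span text plus its character offsets in the context.
print(result["answer"], result["start"], result["end"])
```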
First approach - translate a dataset, then fine-tune a model
!Not really feasible: it would require a lot of human review to correctly determine the answer start tokens after translation (see the sketch below)
1. Translate an English QA dataset into Hungarian
   - SQuAD - reading comprehension based on Wikipedia articles
   - ~100,000 question/answer pairs
2. Fine-tune a model and evaluate it on this dataset
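The alignment problem behind the warning: SQuAD stores each answer as its text plus a character offset (answer_start) in the context, and translation does not preserve those offsets, so they must be re-located and often manually verified. A sketch of the issue with made-up example sentences:

```python
# A SQuAD record ties the answer to a character offset in the context.
original = {
    "context": "Budapest is the capital of Hungary.",
    "question": "What is the capital of Hungary?",
    "answers": {"text": ["Budapest"], "answer_start": [0]},
}

# After translating context/question/answer independently, the old offset is
# meaningless; the best automatic fallback is searching for the translated
# answer string inside the translated context.
translated_context = "Magyarország fővárosa Budapest."
translated_answer = "Budapest"

new_start = translated_context.find(translated_answer)
if new_start == -1:
    # Translation changed inflection, word order, or wording entirely:
    # this is where costly human annotation becomes unavoidable.
    print("answer span not found, needs manual annotation")
else:
    print({"text": [translated_answer], "answer_start": [new_start]})
```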
Second approach - fine-tune a multilingual model
!The MQA format differs from SQuAD, so ModelForQuestionAnswering cannot be used directly (see the sketch below)
1. Use a Hungarian dataset
   - MQA - multilingual QA pairs parsed from Common Crawl
   - FAQ - 878,385 pairs (2,415 domains)
   - CQA - 27,639 pairs (171 domains)
2. Fine-tune and evaluate a model on this dataset
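Why the warning above: MQA records are FAQ/CQA question–answer pairs without a context paragraph, so there is no span to extract. A minimal sketch of the mismatch; the record layout is only illustrative, not the exact MQA schema (check the dataset card for the real field names):

```python
# Illustrative MQA-style record: a question with candidate answers,
# but no context paragraph and no answer_start offset.
mqa_record = {
    "question": "Hogyan tudok jegyet venni?",  # "How can I buy a ticket?"
    "answers": ["Jegyet a pénztárban vagy online vehet."],
}

# SQuAD-style extractive heads predict start/end tokens inside a context,
# which MQA simply does not have. Instead the data can be flattened into
# (question, answer) text pairs for a retrieval / sentence-similarity setup.
pairs = [(mqa_record["question"], answer) for answer in mqa_record["answers"]]
print(pairs)
```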
Possible steps:
- Use an existing pre-trained Hungarian/Romanian/multilingual model to generate embeddings
- Select a model:
  - multilingual models that include Hungarian:
    - distiluse-base-multilingual-cased-v2 (400MB)
    - paraphrase-multilingual-MiniLM-L12-v2 (400MB) - fastest
    - paraphrase-multilingual-mpnet-base-v2 (900MB) - best performing
  - huBERT
- Select a dataset:
  - the MQA Hungarian subset
  - Hungarian Wikipedia pages, split into paragraphs
  - DBpedia shortened abstracts, ~500,000 entries
- Pre-compute embeddings for all answers/paragraphs
- Compute the embedding of the incoming query
- Compare the similarity between the query embedding and the precomputed ones
- Return the top-3 answers/questions (see the retrieval sketch after this list)
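A compact sketch of this retrieval pipeline with sentence-transformers, using one of the candidate models listed above; the corpus here is a stand-in for the pre-computed MQA answers or Wikipedia/DBpedia paragraphs:

```python
from sentence_transformers import SentenceTransformer, util

# One of the candidate models above; MiniLM is the fastest of the three.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Stand-in corpus: in the real setup these would be MQA answers or
# Hungarian Wikipedia / DBpedia abstract paragraphs.
corpus = [
    "Budapest Magyarország fővárosa és legnépesebb városa.",
    "A Balaton Közép-Európa legnagyobb tava.",
    "A magyar forint Magyarország hivatalos pénzneme.",
]

# Pre-compute corpus embeddings once and reuse them for every query.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Embed the incoming query and return the top-3 most similar paragraphs.
query = "Mi Magyarország fővárosa?"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```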
Alternative steps:
- Train a sentence transformer on the Hungarian/Romanian subsets (see the training sketch below)
- Use the trained sentence transformer to generate embeddings
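A hedged sketch of the alternative: training a sentence transformer on question–answer pairs (e.g. from the MQA Hungarian subset) with in-batch negatives. The huBERT checkpoint id and the placeholder pairs are assumptions for illustration only:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Example base encoder: huBERT (Hungarian BERT); any HU/RO/multilingual
# encoder from the Hub could be substituted here.
word_embedding_model = models.Transformer("SZTAKI-HLT/hubert-base-cc", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Placeholder (question, answer) pairs -- in practice these would come from
# the MQA Hungarian/Romanian subsets.
train_examples = [
    InputExample(texts=["Mi Magyarország fővárosa?", "Budapest Magyarország fővárosa."]),
    InputExample(texts=["Mikor nyit a bolt?", "A bolt reggel 8-kor nyit."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other answers in a batch as negatives,
# which suits question-answer pair data without explicit negative labels.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("hu-qa-sentence-transformer")
```

The saved model can then replace the off-the-shelf multilingual checkpoint in the retrieval sketch above.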