hagsaeng committed (verified) · Commit cc241f5 · Parent: b0b40bc

Create README.md

Files changed (1): README.md (+74)
# Model Card: Question Classification using LoRA with Quantization

## Model Overview

This model is a fine-tuned version of [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) designed to classify text into two categories: **QUESTION** or **NOT_QUESTION**. It was fine-tuned on a custom dataset that combines the **SQuAD** dataset (containing questions) and the **GLUE SST-2** dataset (containing general non-question sentences).

### Model Architecture

- Base Model: `google/gemma-2b-it`
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with k-bit quantization (4-bit NF4).
- Configurations:
  - Quantization: 4-bit quantization using `BitsAndBytesConfig`
  - Adapter (LoRA) settings:
    - Rank: 64
    - LoRA Alpha: 32
    - Dropout: 0.05
    - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`

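The settings above can be sketched as configuration objects using the standard `transformers` and `peft` APIs. This is a minimal illustration, not the exact training script; in particular, the compute dtype and `task_type` are assumptions not stated in this card.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, matching the configuration listed above.
# bnb_4bit_compute_dtype is an assumption (common choice, not stated in the card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings from the card; task_type is an assumption.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

These objects would typically be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.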
## Dataset

The model was trained using a combination of two datasets:
- **SQuAD v1.1** (question dataset)
- **GLUE SST-2** (non-question dataset)

Each dataset was preprocessed to carry one of two labels:
- **QUESTION**: for SQuAD questions
- **NOT_QUESTION**: for non-question sentences from GLUE SST-2

### Data Preprocessing

- A random removal probability (`P_remove = 0.3`) was applied to strip the question mark (`?`) from some questions, so the model cannot rely on punctuation alone; this improves robustness.
- Both datasets were balanced with an equal number of samples (`N = 100` each for training and testing).

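The question-mark removal step could be sketched as follows. This is an assumed reconstruction of the preprocessing described above; the function name and seed are illustrative.

```python
import random

P_REMOVE = 0.3  # probability of stripping a trailing question mark

def strip_question_mark(text: str, rng: random.Random) -> str:
    """With probability P_REMOVE, drop a trailing '?' so the model
    cannot detect questions from punctuation alone."""
    if text.endswith("?") and rng.random() < P_REMOVE:
        return text[:-1]
    return text

rng = random.Random(0)  # fixed seed for reproducibility
questions = ["What is the capital of France?", "Who wrote Hamlet?", "How does LoRA work?"]
processed = [strip_question_mark(q, rng) for q in questions]
```

Sentences without a trailing `?` (the SST-2 side) pass through unchanged.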
## Model Performance

- **Metrics Evaluated**:
  - Accuracy
  - F1 Score
  - Precision
  - Recall
- These metrics were computed on a balanced test set containing both question and non-question examples.

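For reference, these four metrics can be computed for a binary labeling (1 = QUESTION, 0 = NOT_QUESTION) as below. This is a generic sketch of the standard definitions, not the card's evaluation script.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In practice the same numbers are often obtained from `sklearn.metrics` (`accuracy_score`, `precision_recall_fscore_support`).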
## How to Use

You can use this model to classify whether a given text is a question or not:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForSequenceClassification.from_pretrained("your_model_name")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()

label = "QUESTION" if prediction == 1 else "NOT_QUESTION"
print(f"Predicted Label: {label}")
```

## Limitations

- The model was trained on English data only, so it may not perform well in other languages.
- Because it was fine-tuned on specific datasets (SQuAD and GLUE SST-2), performance may vary on out-of-domain data.
- The model assumes well-formed input sentences; performance may degrade on informal or very short text.

## Intended Use

This model is intended for text classification tasks where distinguishing questions from non-questions is needed. Potential use cases include:
- Improving chatbot or virtual assistant interactions.
- Enhancing query detection for search engines.

## License

This model follows the same license as [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it). Please refer to the original license for any usage restrictions.