hagsaeng committed (verified) · Commit cc241f5 · Parent: b0b40bc

Create README.md

Files changed (1): README.md (+74)
# Model Card: Question Classification using LoRA with Quantization

## Model Overview

This model is a fine-tuned version of [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) designed to classify text into two categories: **QUESTION** or **NOT_QUESTION**. It was fine-tuned on a custom dataset that combines the **SQuAD** dataset (containing questions) and the **GLUE SST-2** dataset (containing general non-question sentences).

### Model Architecture

- Base Model: `google/gemma-2b-it`
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with k-bit quantization (4-bit NF4).
- Configurations:
  - Quantization: 4-bit quantization using `BitsAndBytesConfig`
  - Adapter (LoRA) settings:
    - Rank: 64
    - LoRA Alpha: 32
    - Dropout: 0.05
    - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`

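The settings above can be sketched as configuration objects using the standard `transformers` and `peft` APIs. This is a minimal illustration, not the exact training script; in particular, the compute dtype and `task_type` are assumptions not stated in this card.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, matching the configuration listed above.
# bnb_4bit_compute_dtype is an assumption (common choice, not stated in the card).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings from the card; task_type is an assumption.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

These objects would typically be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)` and `peft.get_peft_model(model, lora_config)` respectively.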
## Dataset

The model was trained using a combination of two datasets:
- **SQuAD v1.1** (question dataset)
- **GLUE SST-2** (non-question dataset)

Each dataset was preprocessed to carry one of two labels:
- **QUESTION**: for SQuAD questions
- **NOT_QUESTION**: for non-question sentences from GLUE SST-2

### Data Preprocessing

- A random removal probability (`P_remove = 0.3`) was applied to strip the question mark (`?`) from some questions, so the model cannot rely on punctuation alone; this improves robustness.
- Both datasets were balanced with an equal number of samples (`N = 100` each for training and testing).

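The question-mark removal step could be sketched as follows. This is an assumed reconstruction of the preprocessing described above; the function name and seed are illustrative.

```python
import random

P_REMOVE = 0.3  # probability of stripping a trailing question mark

def strip_question_mark(text: str, rng: random.Random) -> str:
    """With probability P_REMOVE, drop a trailing '?' so the model
    cannot detect questions from punctuation alone."""
    if text.endswith("?") and rng.random() < P_REMOVE:
        return text[:-1]
    return text

rng = random.Random(0)  # fixed seed for reproducibility
questions = ["What is the capital of France?", "Who wrote Hamlet?", "How does LoRA work?"]
processed = [strip_question_mark(q, rng) for q in questions]
```

Sentences without a trailing `?` (the SST-2 side) pass through unchanged.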
## Model Performance

- **Metrics Evaluated**:
  - Accuracy
  - F1 Score
  - Precision
  - Recall
- These metrics were computed on a balanced test set containing both question and non-question examples.

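For reference, these four metrics can be computed for a binary labeling (1 = QUESTION, 0 = NOT_QUESTION) as below. This is a generic sketch of the standard definitions, not the card's evaluation script.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In practice the same numbers are often obtained from `sklearn.metrics` (`accuracy_score`, `precision_recall_fscore_support`).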
## How to Use

You can use this model to classify whether a given text is a question or not:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForSequenceClassification.from_pretrained("your_model_name")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()

label = "QUESTION" if prediction == 1 else "NOT_QUESTION"
print(f"Predicted Label: {label}")
```

## Limitations

- The model was trained on English data only, so it may not perform well in other languages.
- Because it was fine-tuned on specific datasets (SQuAD and GLUE SST-2), performance may vary on out-of-domain data.
- The model assumes well-formed input sentences; performance may degrade on informal or very short text.

## Intended Use

This model is intended for text classification tasks where distinguishing questions from non-questions is needed. Potential use cases include:
- Improving chatbot or virtual assistant interactions.
- Enhancing query detection for search engines.

## License

This model follows the same license as [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it). Please refer to the original license for any usage restrictions.