---
language: en
license: gemma
tags:
- text-classification
- question-classification
- LoRA
- quantization
datasets:
- squad
- glue
model_name: question-classification-lora-quant
base_model: google/gemma-2b-it
widget:
- text: "What is the capital of France?"
- text: "This is a beautiful day."
metrics:
- accuracy
- f1
- precision
- recall
---

# Model Card: Question Classification using LoRA with Quantization

## Model Overview

This model is a fine-tuned version of [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) that classifies text into two categories: **QUESTION** or **NOT_QUESTION**. It was fine-tuned on a custom dataset combining the **SQuAD** dataset (questions) and the **GLUE SST-2** dataset (general non-question sentences).

### Model Architecture

- Base Model: `google/gemma-2b-it`
- Fine-tuning Method: LoRA (Low-Rank Adaptation) on a 4-bit (NF4) quantized base model.
- Configuration:
  - Quantization: 4-bit via `BitsAndBytesConfig`
  - Adapter (LoRA) settings (see the sketch below):
    - Rank: 64
    - LoRA Alpha: 32
    - Dropout: 0.05
    - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
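
The configuration above corresponds to a setup along the following lines with `transformers` and `peft`. This is a minimal sketch, not the exact training script; in particular, `num_labels=2` and the compute dtype are assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: dtype not stated in this card
)

# LoRA adapter settings as listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",
)

base_model = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-2b-it",
    num_labels=2,  # QUESTION / NOT_QUESTION (assumption)
    quantization_config=bnb_config,
)
model = get_peft_model(base_model, lora_config)
```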

## Dataset

The model was trained on a combination of two datasets:

- **SQuAD v1.1** (questions)
- **GLUE SST-2** (non-questions)

Each example was labeled during preprocessing:

- **QUESTION**: questions from SQuAD
- **NOT_QUESTION**: non-question sentences from GLUE SST-2

### Data Preprocessing

- With probability `P_remove = 0.3`, the question mark (`?`) was stripped from a question, so that the model does not learn to rely on punctuation alone.
- The two classes were balanced, with an equal number of samples (`N = 100`) for training and testing. A sketch of this preprocessing follows.
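
A minimal sketch of the preprocessing, assuming the Hugging Face `datasets` library; the random seed, column handling, and exact sampling are assumptions:

```python
import random
from datasets import load_dataset

random.seed(0)   # assumption: seed not stated in this card
P_REMOVE = 0.3   # probability of stripping the trailing "?"
N = 100          # samples per class

squad = load_dataset("squad", split=f"train[:{N}]")
sst2 = load_dataset("glue", "sst2", split=f"train[:{N}]")

def make_question(example):
    text = example["question"].strip()
    # Randomly drop the question mark so the model cannot rely on it alone
    if text.endswith("?") and random.random() < P_REMOVE:
        text = text[:-1]
    return {"text": text, "label": "QUESTION"}

def make_statement(example):
    return {"text": example["sentence"].strip(), "label": "NOT_QUESTION"}

examples = [make_question(ex) for ex in squad] + [make_statement(ex) for ex in sst2]
random.shuffle(examples)
```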

## Model Performance

The following metrics were computed on a balanced test set containing both question and non-question examples (see the sketch below):

- Accuracy
- F1 Score
- Precision
- Recall
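
The metrics can be computed with scikit-learn, for example; this sketch assumes label id `1` stands for QUESTION:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # y_true / y_pred are lists of 0/1 label ids (assumption: 1 = QUESTION)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```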

## How to Use

You can use this model to classify whether a given text is a question or not:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Replace "your_model_name" with this repository's id on the Hub
tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForSequenceClassification.from_pretrained("your_model_name")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=1).item()

# Label id 1 corresponds to QUESTION, 0 to NOT_QUESTION
label = "QUESTION" if prediction == 1 else "NOT_QUESTION"
print(f"Predicted Label: {label}")
```
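
If the repository ships only the LoRA adapter rather than merged weights (this card does not say which), the adapter can instead be loaded on top of the base model with `peft`, for example:

```python
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# num_labels=2 and the repo layout are assumptions; adjust to the actual checkpoint
base = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-2b-it", num_labels=2
)
model = PeftModel.from_pretrained(base, "your_model_name")  # adapter repo id
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
```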

## Limitations

- The model was trained on English data only, so it may not perform well on other languages.
- Because it was fine-tuned on specific datasets (SQuAD and GLUE SST-2), performance may vary on out-of-domain text.
- The model assumes well-formed input sentences, so performance may degrade on informal or very short text.

## Intended Use

This model is intended for text classification tasks that require distinguishing questions from non-questions. Potential use cases include:

- Improving chatbot or virtual assistant interactions.
- Enhancing query detection for search engines.
## License

This model follows the same license as [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it). Please refer to the base model's license for any usage restrictions.