---
license: apache-2.0
datasets:
- yentinglin/TaiwanChat
language:
- zh
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
pipeline_tag: text-generation
---

# SmolLM2‑135M‑Instruct‑TaiwanChat

SmolLM2‑135M‑Instruct fine‑tuned on the TaiwanChat dataset, optimized for multi‑turn conversational AI in Traditional Chinese.

---

## Model Description

- **Base model:** `HuggingFaceTB/SmolLM2-135M-Instruct`
- **Fine‑tuned on:** `yentinglin/TaiwanChat` (first 3,000 training samples)
- **Task:** Instruction‑following chat in Traditional Chinese (Mandarin, Taiwan)
- **Framework:** Hugging Face Transformers [`Trainer`]
- **Precision:**
  - BF16 on Intel XPU (if available)
  - FP16 on CUDA (if available)
  - Falls back to CPU otherwise
- **Memory optimizations:** Gradient checkpointing enabled

---

## How to Use

### 1. Install dependencies

```bash
pip install transformers datasets accelerate
```

### 2. Load & Generate

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "Luigi/SmolLM2-135M-Instruct-TaiwanChat"

# 1) Select device (prefer XPU, then CUDA, then CPU)
device_str = "cpu"
if hasattr(torch, "xpu") and torch.xpu.is_available():
    device_str = "xpu"
elif torch.cuda.is_available():
    device_str = "cuda"

# 2) Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device_str)

# 3) Set up the HF pipeline (recent transformers releases accept a
#    device string such as "cpu", "cuda", or "xpu" directly)
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device_str,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.8,
)

# 4) Inference example
prompt = "請問台北今天的天氣如何?"  # "What is the weather like in Taipei today?"
result = generator(prompt)
print(result[0]["generated_text"])
```

---

## Training Script

All training logic is contained in `SmolLM2-135M-Instruct-TaiwanChat.py`.

**Key settings** (hard‑coded at the top of the script):

```python
PROJECT_NAME = "SmolLM2-135M-Instruct-TaiwanChat"
BASE_MODEL_ID = "HuggingFaceTB/SmolLM2-135M-Instruct"
DATASET_ID = "yentinglin/TaiwanChat"
N_SAMPLES = 3000
MAX_LEN = 512
```

**Trainer hyperparameters:**

- `per_device_train_batch_size=4`
- `learning_rate=5e-5`
- `num_train_epochs=3`
- `fp16` on CUDA, `bf16` on XPU
- `logging_steps=1000`
- `save_steps=5000`
- `gradient_checkpointing=True`
- `push_to_hub=True`

### Run training

```bash
python SmolLM2-135M-Instruct-TaiwanChat.py
```

The script will:

1. Auto‑detect and select **XPU**, **CUDA**, or **CPU**.
2. Load & preprocess the first 3,000 samples from TaiwanChat.
3. Fine‑tune the model with the chosen precision and logging settings.
4. Save the fine‑tuned model & tokenizer under `./SmolLM2-135M-Instruct-TaiwanChat`.
5. Push the checkpoint to `huggingface.co/Luigi/SmolLM2-135M-Instruct-TaiwanChat`.

---

## Limitations

- Trained on only a 3,000‑sample subset, so it may underperform on out‑of‑domain queries.
- The training script has no separate validation or evaluation loop.
- Generated responses may be incorrect or inconsistent; always verify them before production use.

---

## License

- **Code** is released under the [Apache 2.0 License](LICENSE).
- **Training data** (and any model weights derived from it) are licensed under [CC BY‑NC 4.0](LICENSE-CC-BY-NC-4.0), non‑commercial use only.

You may use and modify the code for any purpose, but any use of the dataset, or of models trained on it, must comply with the CC BY‑NC 4.0 terms.
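---

## Multi‑turn Chat Example

Since the model targets multi‑turn conversation, chat history is best rendered through the tokenizer's chat template rather than concatenated as raw text. Below is a minimal sketch, assuming the fine‑tuned checkpoint inherits the chat template of the base `SmolLM2-135M-Instruct` tokenizer; the conversation content is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Luigi/SmolLM2-135M-Instruct-TaiwanChat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Multi-turn history as a list of role/content messages (illustrative)
messages = [
    {"role": "user", "content": "請介紹一下台北101。"},  # "Please introduce Taipei 101."
    {"role": "assistant", "content": "台北101是位於台北市信義區的摩天大樓。"},
    {"role": "user", "content": "它有多高?"},  # "How tall is it?"
]

# Render the history with the tokenizer's chat template and append
# the generation prompt for the assistant's next turn
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
    )

# Decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```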
## Citation

If you use this model, please cite:

```bibtex
@misc{SmolLM2TaiwanChat2025,
  title        = {SmolLM2-135M-Instruct-TaiwanChat},
  author       = {Luigi Liu},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Luigi/SmolLM2-135M-Instruct-TaiwanChat}}
}
```
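---

## Appendix: Trainer Configuration Sketch

For reference, the Trainer hyperparameters listed above correspond roughly to a configuration like the one below. This is a sketch reconstructed from the settings documented in this card, not the actual contents of `SmolLM2-135M-Instruct-TaiwanChat.py`; consult the script for the authoritative setup.

```python
import torch
from transformers import TrainingArguments

use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
use_cuda = torch.cuda.is_available()

args = TrainingArguments(
    output_dir="./SmolLM2-135M-Instruct-TaiwanChat",
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    bf16=use_xpu,                   # BF16 on Intel XPU
    fp16=use_cuda and not use_xpu,  # FP16 on CUDA
    logging_steps=1000,
    save_steps=5000,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="Luigi/SmolLM2-135M-Instruct-TaiwanChat",
)

# The Trainer would then be built with the model and the tokenized dataset
# (hypothetical names, defined elsewhere in the script):
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_dataset)
# trainer.train()
```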