# xlm-roberta-ja-massive-intent
Fine-tuned xlm-roberta-base for Japanese intent classification on the MASSIVE dataset (ja-JP). The model predicts one of 60 intent classes from short utterances (e.g., assistant commands).
- Task: multi-class text classification (intent)
- Language: Japanese (multilingual base)
- License: CC BY 4.0
## Usage

Using the Transformers `pipeline`:

```python
from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-ja-massive-intent")
clf("今日の天気を教えて")  # "Tell me today's weather"
```
Using `from_pretrained`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-ja-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
```
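A minimal inference sketch with the loaded model and tokenizer; the utterance is only illustrative, and the printed label name comes from the model's `id2label` mapping:

```python
import torch

# Tokenize one utterance with the same settings used in preprocessing (see below)
inputs = tok("明日の7時にアラームをセットして", return_tensors="pt",
             truncation=True, max_length=256)  # "Set an alarm for 7 tomorrow"

# Forward pass without gradients, then pick the highest-scoring intent
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])
```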
## Data

- Dataset: AmazonScience/massive (ja-JP)
- Label space: 60 intents
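For reference, a sketch of loading the Japanese split with the `datasets` library; field names follow the MASSIVE dataset card, and depending on your `datasets` version the script-based loader may also require `trust_remote_code=True`:

```python
from datasets import load_dataset

# Load the ja-JP configuration of MASSIVE
massive_ja = load_dataset("AmazonScience/massive", "ja-JP")

print(massive_ja["train"][0]["utt"])                     # raw utterance text
print(massive_ja["train"].features["intent"].names[:5])  # first few of the 60 intent names
```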
## Preprocessing

- Tokenizer: xlm-roberta-base (fast)
- Settings: max_length=256, truncation=True
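A sketch of applying these settings to the dataset, reusing `tok` from the usage example and the `massive_ja` splits loaded above; the `utt` and `intent` column names are taken from the MASSIVE schema:

```python
def tokenize_batch(batch):
    # Truncate utterances to max_length=256, as listed above
    return tok(batch["utt"], truncation=True, max_length=256)

encoded = massive_ja.map(tokenize_batch, batched=True)
encoded = encoded.rename_column("intent", "labels")  # Trainer expects a "labels" column
```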
## Training
- Epochs: 3
- Learning rate: 2e-5
- Warmup ratio: 0.06
- Weight decay: 0.01
- Batch sizes: train/eval = 16
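A sketch of a `TrainingArguments` configuration matching the hyperparameters above; the output directory and any settings not listed (e.g., evaluation strategy, seed) are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-ja-massive-intent",  # assumed output path
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)
```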
## Evaluation
Validation set metrics:
- Accuracy: 0.8431
- F1: 0.8321
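The F1 averaging method is not recorded on this card; the sketch below computes both metrics with the `evaluate` library, using macro averaging as an assumption:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
    }
```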
## Intended Use & Limitations
- Intended for Japanese assistant/chatbot intent recognition.
- Out-of-domain utterances and colloquial expressions not present in MASSIVE may degrade performance.
- Always validate on your target domain before use.
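For example, a quick spot check that reuses `clf` from the usage section; the utterances and expected label names below are placeholders for your own labeled in-domain samples:

```python
samples = [
    ("3時にアラームをかけて", "alarm_set"),        # "Set an alarm for 3 o'clock" (placeholder label)
    ("今週の予定を読み上げて", "calendar_query"),  # "Read out this week's schedule" (placeholder label)
]
hits = sum(clf(text)[0]["label"] == expected for text, expected in samples)
print(f"{hits}/{len(samples)} correct on the spot-check set")
```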
## Attribution & Licenses
- License: CC BY 4.0
- When using or redistributing this fine-tuned model (or its weights), please credit the original authors, link to this model card, include the license (CC BY 4.0), and indicate if any changes were made.
- Base model: xlm-roberta-base by Meta AI (MIT License)
- Model card: https://huggingface.co/xlm-roberta-base
- Dataset: MASSIVE (ja-JP) by Amazon Science (CC BY 4.0)
- Dataset card: https://huggingface.co/datasets/AmazonScience/massive
This model modifies the base model by fine-tuning on the above dataset.
## Base Model Citation
Please cite the following when using the XLM-R base model:
```bibtex
@article{DBLP:journals/corr/abs-1911-02116,
  author     = {Alexis Conneau and
                Kartikay Khandelwal and
                Naman Goyal and
                Vishrav Chaudhary and
                Guillaume Wenzek and
                Francisco Guzm{\'{a}}n and
                Edouard Grave and
                Myle Ott and
                Luke Zettlemoyer and
                Veselin Stoyanov},
  title      = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal    = {CoRR},
  volume     = {abs/1911.02116},
  year       = {2019},
  url        = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint     = {1911.02116},
  timestamp  = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
## Dataset Citation
Please cite the following papers when using the MASSIVE dataset.
```bibtex
@misc{fitzgerald2022massive,
  title         = {MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
  author        = {Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
  year          = {2022},
  eprint        = {2204.08582},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
  title     = "{SLURP}: A Spoken Language Understanding Resource Package",
  author    = "Bastianelli, Emanuele and
               Vanzo, Andrea and
               Swietojanski, Pawel and
               Rieser, Verena",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month     = nov,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2020.emnlp-main.588",
  doi       = "10.18653/v1/2020.emnlp-main.588",
  pages     = "7252--7262",
  abstract  = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}
```