xlm-roberta-ja-massive-intent

Fine-tuned xlm-roberta-base for Japanese intent classification on the MASSIVE dataset (ja-JP). The model predicts one of 60 intent classes from short utterances (e.g., assistant commands).

  • Task: multi-class text classification (intent)
  • Language: Japanese (multilingual base)
  • License: CC BY 4.0

Usage

Using the Transformers pipeline:

from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-ja-massive-intent")
clf("今日の天気を教えて")

Using from_pretrained directly:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-ja-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
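
A minimal inference sketch (the example utterance is illustrative; label names come from the model's id2label mapping):

import torch

# Tokenize a sample utterance and pick the highest-scoring intent.
inputs = tok("明日の7時にアラームをセットして", return_tensors="pt", truncation=True, max_length=256)  # "Set an alarm for 7 tomorrow"
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])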

Data

  • Dataset: AmazonScience/massive
  • Label space: 60 intents
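
The Japanese split can be loaded with the datasets library; a sketch, assuming the standard MASSIVE column names utt and intent:

from datasets import load_dataset

# Load the Japanese (ja-JP) locale of MASSIVE; "intent" holds the 60-way label.
ds = load_dataset("AmazonScience/massive", "ja-JP")
print(ds["train"][0]["utt"], ds["train"][0]["intent"])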

Preprocessing

  • Tokenizer: xlm-roberta-base (fast)
  • Settings: max_length=256, truncation=True
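
A sketch of the corresponding tokenization step (continuing from the snippets above; the utt column name is an assumption):

def preprocess(batch):
    # Apply the settings above: truncate utterances to 256 tokens.
    return tok(batch["utt"], truncation=True, max_length=256)

tokenized = ds.map(preprocess, batched=True)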

Training

  • Epochs: 3
  • Learning rate: 2e-5
  • Warmup ratio: 0.06
  • Weight decay: 0.01
  • Batch sizes: train/eval = 16
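
These hyperparameters roughly correspond to the following TrainingArguments; a sketch, not the exact training script:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-ja-massive-intent",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)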

Evaluation

Validation set metrics:

  • Accuracy: 0.8431
  • F1: 0.8321
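
A sketch of how such metrics can be computed in a Trainer run with scikit-learn; the F1 averaging mode is an assumption, since it is not specified above:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="macro" is an assumption; the card does not state the averaging mode.
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="macro")}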

Intended Use & Limitations

  • Intended for Japanese assistant/chatbot intent recognition.
  • Out-of-domain utterances and colloquial expressions not present in MASSIVE may degrade performance.
  • Always validate on your target domain before use.

Attribution & Licenses

This model was created by fine-tuning the xlm-roberta-base model on the dataset above.

Base Model Citation

Please cite the following when using the XLM-R base model:

@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dataset Citation

Please cite the following papers when using the MASSIVE dataset.

@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}