xlm-roberta-ja-massive-intent

Fine-tuned xlm-roberta-base for Japanese intent classification on the MASSIVE dataset (ja-JP). The model predicts one of 60 intent classes from short utterances (e.g., assistant commands).

  • Task: multi-class text classification (intent)
  • Language: Japanese (multilingual base)
  • License: CC BY 4.0

Usage

Using the Transformers pipeline:

from transformers import pipeline

clf = pipeline("text-classification", model="takehika/xlm-roberta-ja-massive-intent")
clf("今日の天気を教えて")

Using from_pretrained directly:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "takehika/xlm-roberta-ja-massive-intent"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
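
A minimal inference sketch (the example utterance is illustrative; label names come from the model's id2label mapping):

import torch

# Tokenize a sample utterance and pick the highest-scoring intent.
inputs = tok("明日の7時にアラームをセットして", return_tensors="pt", truncation=True, max_length=256)  # "Set an alarm for 7 tomorrow"
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label[pred_id])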

Data

  • Dataset: AmazonScience/massive
  • Label space: 60 intents
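
The Japanese split can be loaded with the datasets library; a sketch, assuming the standard MASSIVE column names utt and intent:

from datasets import load_dataset

# Load the Japanese (ja-JP) locale of MASSIVE; "intent" holds the 60-way label.
ds = load_dataset("AmazonScience/massive", "ja-JP")
print(ds["train"][0]["utt"], ds["train"][0]["intent"])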

Preprocessing

  • Tokenizer: xlm-roberta-base (fast)
  • Settings: max_length=256, truncation=True
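
A sketch of the corresponding tokenization step (continuing from the snippets above; the utt column name is an assumption):

def preprocess(batch):
    # Apply the settings above: truncate utterances to 256 tokens.
    return tok(batch["utt"], truncation=True, max_length=256)

tokenized = ds.map(preprocess, batched=True)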

Training

  • Epochs: 3
  • Learning rate: 2e-5
  • Warmup ratio: 0.06
  • Weight decay: 0.01
  • Batch sizes: train/eval = 16
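
These hyperparameters roughly correspond to the following TrainingArguments; a sketch, not the exact training script:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xlm-roberta-ja-massive-intent",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)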

Evaluation

Validation set metrics:

  • Accuracy: 0.8431
  • F1: 0.8321
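
A sketch of how such metrics can be computed in a Trainer run with scikit-learn; the F1 averaging mode is an assumption, since it is not specified above:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="macro" is an assumption; the card does not state the averaging mode.
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="macro")}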

Intended Use & Limitations

  • Intended for Japanese assistant/chatbot intent recognition.
  • Out-of-domain utterances and colloquial expressions not present in MASSIVE may degrade performance.
  • Always validate on your target domain before use.

Attribution & Licenses

This model was created by fine-tuning the xlm-roberta-base model on the dataset above.

Base Model Citation

Please cite the following when using the XLM-R base model:

@article{DBLP:journals/corr/abs-1911-02116,
  author    = {Alexis Conneau and
               Kartikay Khandelwal and
               Naman Goyal and
               Vishrav Chaudhary and
               Guillaume Wenzek and
               Francisco Guzm{\'{a}}n and
               Edouard Grave and
               Myle Ott and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  journal   = {CoRR},
  volume    = {abs/1911.02116},
  year      = {2019},
  url       = {http://arxiv.org/abs/1911.02116},
  eprinttype = {arXiv},
  eprint    = {1911.02116},
  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dataset Citation

Please cite the following papers when using the MASSIVE dataset.

@misc{fitzgerald2022massive,
      title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
      author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
      year={2022},
      eprint={2204.08582},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{bastianelli-etal-2020-slurp,
    title = "{SLURP}: A Spoken Language Understanding Resource Package",
    author = "Bastianelli, Emanuele  and
      Vanzo, Andrea  and
      Swietojanski, Pawel  and
      Rieser, Verena",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.588",
    doi = "10.18653/v1/2020.emnlp-main.588",
    pages = "7252--7262",
    abstract = "Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp."
}