Chonky_mmbert_small_multilingual_v1

Chonky is a transformer model that segments text into meaningful semantic chunks. It can be used in RAG systems.

🆕 Now multilingual!

Model Description

The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.

⚠️ This model was fine-tuned on sequences of length 1024 (by default, mmBERT supports sequence lengths up to 8192).
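If you call the model directly on texts longer than the fine-tuning length, one option is to process the token sequence in overlapping windows of at most 1024 tokens. Below is a minimal sketch of the window arithmetic only. The function name `window_spans` and the overlap of 128 tokens are assumptions made for illustration, not part of this model or of the chonky library, which may already handle long inputs on its own.

```python
def window_spans(n_tokens, max_len=1024, overlap=128):
    """Return (start, end) token index pairs that cover range(n_tokens).

    Hypothetical helper: each window holds at most `max_len` tokens and
    consecutive windows share `overlap` tokens. Both defaults are
    illustrative assumptions, not values taken from the model card.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    spans = []
    step = max_len - overlap  # how far each new window advances
    start = 0
    while start < n_tokens:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # the last window reached the end of the input
            break
        start += step
    return spans


if __name__ == "__main__":
    # A 2500-token document is covered by three overlapping windows.
    for span in window_spans(2500):
        print(span)
```

Separator predictions that fall inside an overlap region would still need to be merged, for example by keeping the prediction from the window where the token is farther from the window edge. That merging step is left out here.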

How to use

I've made a small Python library for this model: chonky

Here is the usage:

from chonky import ParagraphSplitter

# on the first run it will download the transformer model
splitter = ParagraphSplitter(
  model_id="mirth/chonky_mmbert_small_multilingual_1",
  device="cpu"
)

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
  print(chunk)
  print("--")

Sample Output:

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep
--
. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--

You can also use this model with the standard NER pipeline:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "mirth/chonky_mmbert_small_multilingual_1"

tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)

id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

pipe(text)

Sample output:

[{'entity_group': 'separator',
  'score': np.float32(0.66304857),
  'word': ' deep',
  'start': 332,
  'end': 337}]
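The pipeline returns only the separator positions, not the chunks themselves. A minimal sketch of turning that output into chunks is to cut the text at each `end` offset. The helper name `split_on_separators` is hypothetical, and cutting exactly at `end` is an assumption; it appears consistent with the library's sample output above, where the second chunk starts with the period that follows "deep".

```python
def split_on_separators(text, entities):
    """Cut `text` right after every predicted separator span.

    `entities` is expected to look like the output of the "ner" pipeline
    with aggregation: a list of dicts with "entity_group" and "end" keys.
    Hypothetical helper, not part of transformers or chonky.
    """
    cut_points = sorted(
        e["end"] for e in entities if e["entity_group"] == "separator"
    )

    chunks = []
    start = 0
    for end in cut_points:
        # Skip duplicate or out-of-range offsets.
        if start < end <= len(text):
            chunks.append(text[start:end])
            start = end
    if start < len(text):
        chunks.append(text[start:])  # trailing text after the last separator
    return chunks


if __name__ == "__main__":
    demo_text = "First paragraph. Second paragraph."
    demo_entities = [
        {"entity_group": "separator", "score": 0.9, "word": " paragraph",
         "start": 5, "end": 15},
    ]
    for chunk in split_on_separators(demo_text, demo_entities):
        print(repr(chunk))
```

With the real pipeline you would call `split_on_separators(text, pipe(text))`. You may also want to strip leading punctuation or whitespace from each chunk, depending on how the chunks are used downstream.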

Training Data

The model was trained to split paragraphs from the minipile, bookcorpus, and Project Gutenberg datasets.

Metrics

Token-based F1 score.
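The exact evaluation script is not described here, so the following is only a sketch of the standard definition: F1 computed over per-token labels, treating "separator" (label 1) as the positive class. The function name `token_f1` is hypothetical.

```python
def token_f1(true_labels, pred_labels, positive=1):
    """F1 of the positive class over aligned per-token label sequences.

    Assumes binary labels as in id2label above (0 = "O", 1 = "separator").
    Returns 0.0 when there are no true positives.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [0, 1, 0, 0, 1, 0]
    pred = [0, 1, 0, 1, 0, 0]
    # One separator found, one missed, one spurious.
    print(token_f1(gold, pred))
```

Because separators are rare compared to "O" tokens, scoring only the positive class keeps the metric from being dominated by the easy negative tokens.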

Project Gutenberg validation:

| Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chonky_mmbert_small_multilingual_1 🆕 | 0.88 | 0.78 | 0.91 | 0.93 | 0.86 | 0.81 | 0.81 | 0.88 | 0.97 | 0.91 | 0.11 |
| chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 |
| chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
| chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
| Number of val tokens | 1m | 1m | 1m | 1m | 1m | 1m | 38k | 1m | 24k | 1m | 132k |

Various English datasets:

| Model | bookcorpus | en_judgements | paul_graham | 20_newsgroups |
|---|---|---|---|---|
| chonky_modernbert_large_1 | 0.79 | 0.29 | 0.69 | 0.17 |
| chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
| chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
| chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |

Hardware

The model was fine-tuned on a single H100 for several hours.
