chonky_mmbert_small_multilingual_1
Chonky is a transformer model that intelligently segments text into meaningful semantic chunks. This model can be used in RAG systems.
🆕 Now multilingual!
Model Description
The model processes text and divides it into semantically coherent segments. These chunks can then be fed into embedding-based retrieval systems or language models as part of a RAG pipeline.
⚠️ This model was fine-tuned on sequences of length 1024 (by default mmBERT supports sequence lengths up to 8192). If your documents are longer, see the windowing sketch after the usage example below.
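As a quick illustration of that pipeline, here is a minimal chunk-and-embed sketch using the chonky library described below; the sentence-transformers embedder is an illustrative assumption, not part of this project:

from chonky import ParagraphSplitter
from sentence_transformers import SentenceTransformer

splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu",
)
# Any sentence-embedding model works here; this one is only an example.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

document = "..."  # your source text
chunks = list(splitter(document))
vectors = embedder.encode(chunks)  # one vector per chunk, ready for a vector index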
How to use
I've made a small Python library for this model: chonky
Here is the usage:
from chonky import ParagraphSplitter

# On the first run this will download the transformer model weights.
splitter = ParagraphSplitter(
    model_id="mirth/chonky_mmbert_small_multilingual_1",
    device="cpu",
)
text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)

for chunk in splitter(text):
    print(chunk)
    print("--")
Sample output:
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep
--
. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
--
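Because of the 1024-token fine-tuning length noted above, very long documents are best pre-split into overlapping token windows before chunking. Below is a minimal sketch reusing splitter and text from the example above; the window/stride logic is my own and assumes the model ships a fast tokenizer with offset-mapping support:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mirth/chonky_mmbert_small_multilingual_1")

def iter_windows(text, tokenizer, max_tokens=1024, stride=896):
    # Character offsets map each token window back to a span of the original text.
    offsets = tokenizer(text, add_special_tokens=False,
                        return_offsets_mapping=True)["offset_mapping"]
    for i in range(0, len(offsets), stride):
        span = offsets[i : i + max_tokens]
        yield text[span[0][0] : span[-1][1]]

# stride < max_tokens keeps some overlap so paragraph boundaries are less
# likely to be cut; expect near-duplicate chunks at window edges.
for window in iter_windows(text, tokenizer):
    for chunk in splitter(window):
        print(chunk)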
You can also use this model with the standard NER pipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model_name = "mirth/chonky_mmbert_small_multilingual_1"
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=1024)
id2label = {
    0: "O",
    1: "separator",
}
label2id = {
    "O": 0,
    "separator": 1,
}

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = (
    "Before college the two main things I worked on, outside of school, were writing and programming. "
    "I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. "
    "My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. "
    "The first programs I tried writing were on the IBM 1401 that our school district used for what was then called 'data processing.' "
    "This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, "
    "and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — "
    "CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights."
)
pipe(text)
Sample output:
[{'entity_group': 'separator',
  'score': np.float32(0.66304857),
  'word': ' deep',
  'start': 332,
  'end': 337}]
Training Data
The model was trained to split paragraphs from the minipile, bookcorpus, and Project Gutenberg datasets.
Metrics
Token-based F1-score.
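That is, each token carries a binary label (separator vs. O) and F1 is computed over the positive class. A toy illustration with scikit-learn; the labels below are made up, and the actual evaluation script is not shown here:

from sklearn.metrics import f1_score

# One binary label per token: 1 = "separator", 0 = "O" (toy values).
y_true = [0, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 0]

print(f1_score(y_true, y_pred))  # F1 on the "separator" class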
Project Gutenberg validation:
Model | de | en | es | fr | it | nl | pl | pt | ru | sv | zh |
---|---|---|---|---|---|---|---|---|---|---|---|
chonky_mmbert_small_multilingual_1 🆕 | 0.88 | 0.78 | 0.91 | 0.93 | 0.86 | 0.81 | 0.81 | 0.88 | 0.97 | 0.91 | 0.11 |
chonky_modernbert_large_1 | 0.53 | 0.43 | 0.48 | 0.51 | 0.56 | 0.21 | 0.65 | 0.53 | 0.87 | 0.51 | 0.33 |
chonky_modernbert_base_1 | 0.42 | 0.38 | 0.34 | 0.4 | 0.33 | 0.22 | 0.41 | 0.35 | 0.27 | 0.31 | 0.26 |
chonky_distilbert_base_uncased_1 | 0.19 | 0.3 | 0.17 | 0.2 | 0.18 | 0.04 | 0.27 | 0.21 | 0.22 | 0.19 | 0.15 |
Number of val tokens | 1M | 1M | 1M | 1M | 1M | 1M | 38K | 1M | 24K | 1M | 132K |
Various English datasets:
Model | bookcorpus | en_judgements | paul_graham | 20_newsgroups |
---|---|---|---|---|
chonky_modernbert_large_1 | 0.79 | 0.29 | 0.69 | 0.17 |
chonky_modernbert_base_1 | 0.72 | 0.08 | 0.63 | 0.15 |
chonky_distilbert_base_uncased_1 | 0.69 | 0.05 | 0.52 | 0.15 |
chonky_mmbert_small_multilingual_1 🆕 | 0.72 | 0.2 | 0.56 | 0.13 |
Hardware
The model was fine-tuned on a single H100 for several hours.
Base model
jhu-clsp/mmBERT-small