# Korean Neural Sparse Encoder v22.0
Korean neural sparse encoder - a SPLADE-based model for OpenSearch Neural Sparse search
## Model Description
This model is a SPLADE-based sparse encoder fine-tuned for Korean text, specifically optimized for:
- Legal domain terminology
- Medical domain terminology
- General Korean synonym expansion
## v22.0 Improvements (over v21.4)
- InfoNCE Contrastive Loss: in-batch negatives for more discriminative representations (a sketch follows this list)
- Temperature Annealing: 0.07 → 0.05 → 0.03 for progressively sharper discrimination
- Expanded Training Data: 840,859 total triplets across 3 phases
- Curriculum Learning: 3-phase training with the InfoNCE weight increased at each phase
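The card does not ship the training code; below is a minimal sketch of InfoNCE with in-batch negatives, assuming dot-product similarity between query and document sparse vectors (the function name and shapes are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_reprs, doc_reprs, temperature=0.07):
    """InfoNCE with in-batch negatives.

    query_reprs, doc_reprs: (batch, vocab_size) sparse vectors, where row i
    of doc_reprs is the positive for query i and every other row in the
    batch serves as a negative.
    """
    sims = query_reprs @ doc_reprs.T / temperature  # (batch, batch) similarities
    labels = torch.arange(sims.size(0), device=sims.device)  # positives on the diagonal
    return F.cross_entropy(sims, labels)
```

Lowering the temperature (0.07 → 0.05 → 0.03) sharpens the softmax over the similarity matrix, so near-miss negatives are penalized more heavily as training progresses.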
## Training Results
| Metric | v21.4 | v22.0 |
|---|---|---|
| Training Recall@1 | - | 99.87% |
| Training MRR | - | 0.9994 |
| General Terms Recall | 78.7% | 81.5% |
| Garbage Outputs | 0/5 | 0/5 |
## Training Phases
| Phase | Epochs | Data Size (triplets) | Temperature | InfoNCE Weight |
|---|---|---|---|---|
| Phase 1 (Single-term) | 1-10 | 66,685 | 0.07 | 1.0 |
| Phase 2 (Balanced) | 11-20 | 224,177 | 0.05 | 1.5 |
| Phase 3 (Full) | 21-30 | 549,997 | 0.03 | 2.0 |
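The training loop itself is not published either; purely as an illustration, the schedule above could drive the loss as follows (`PHASES`, `phase_for`, and `base_loss` are hypothetical names, and `info_nce_loss` refers to the sketch earlier in this card):

```python
# Hypothetical curriculum schedule mirroring the table above.
PHASES = [
    {"last_epoch": 10, "temperature": 0.07, "infonce_weight": 1.0},  # Phase 1
    {"last_epoch": 20, "temperature": 0.05, "infonce_weight": 1.5},  # Phase 2
    {"last_epoch": 30, "temperature": 0.03, "infonce_weight": 2.0},  # Phase 3
]

def phase_for(epoch):
    # First phase whose epoch range still contains this epoch.
    return next(p for p in PHASES if epoch <= p["last_epoch"])

# Inside the (hypothetical) training loop:
#   cfg = phase_for(epoch)
#   loss = base_loss + cfg["infonce_weight"] * info_nce_loss(q, d, cfg["temperature"])
```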
## Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
import torch.nn as nn

# Load model
tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")

# Encode text into a vocabulary-sized sparse vector
def encode(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits  # (batch, seq_len, vocab_size)
    # SPLADE activation: log-saturated ReLU over the MLM logits
    token_scores = torch.log1p(nn.functional.relu(logits))
    # Zero out padding positions, then max-pool over the sequence dimension
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    sparse_repr = (token_scores * mask).max(dim=1).values[0]
    return sparse_repr

# Example
sparse = encode("당뇨병 치료 방법")  # "diabetes treatment methods"
top_values, top_indices = sparse.topk(10)
for idx, val in zip(top_indices, top_values):
    print(f"{tokenizer.decode([idx])}: {val:.4f}")
```
## Example Expansions
| Query | Top Expansions |
|---|---|
| 손해배상 (damage compensation) | 손해, 피해, 배상, 손실, 보상, 소송, 위자료 |
| 인공지능 (artificial intelligence) | AI, 지능, 컴퓨터, IT, 로봇, 알고리즘 |
| 당뇨병 (diabetes) | 당뇨, 혈당, 인슐린, 비만, 콜레스테롤 |
| 계약서 (contract) | 계약, 약정, 협약, 합의, 계약금, 약관 |
## OpenSearch Integration
This model is designed to work with OpenSearch Neural Sparse Search. See the OpenSearch documentation for integration details.
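As an illustrative example rather than part of this card, a deployed sparse model can be queried through OpenSearch's `neural_sparse` query clause (OpenSearch 2.11+); the host, index name, field name, and model id below are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Assumes documents were ingested with their sparse embeddings stored in a
# rank_features field (here called "passage_embedding") and that this model
# is deployed in the cluster under some model id.
body = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "당뇨병 치료 방법",
                "model_id": "<your-deployed-model-id>",
            }
        }
    }
}
response = client.search(index="my-korean-index", body=body)
```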
## Base Model
- Base: skt/A.X-Encoder-base
- Parameters: 149,372,240
- Vocabulary: 50,000 tokens
- Max Length: 64 tokens
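These figures are easy to sanity-check locally (this snippet is not part of the original card):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")
print(config.vocab_size)  # expected: 50000

model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")
print(sum(p.numel() for p in model.parameters()))  # expected: 149,372,240
```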
## Citation
```bibtex
@misc{korean-neural-sparse-v22.0,
  author    = {Sewoong Lee},
  title     = {Korean Neural Sparse Encoder v22.0},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/sewoong/korean-neural-sparse-encoder-v21.4}
}
```