Korean Neural Sparse Encoder v22.0

ํ•œ๊ตญ์–ด ์‹ ๊ฒฝ๋ง ํฌ์†Œ ์ธ์ฝ”๋” - OpenSearch Neural Sparse ๊ฒ€์ƒ‰์„ ์œ„ํ•œ SPLADE ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

Model Description

This model is a SPLADE-based sparse encoder fine-tuned for Korean text, specifically optimized for:

  • Legal domain terminology
  • Medical domain terminology
  • General Korean synonym expansion

v22.0 Improvements (over v21.4)

  • InfoNCE Contrastive Loss: In-batch negatives for more discriminative representations (a minimal sketch follows this list)
  • Temperature Annealing: 0.07 โ†’ 0.05 โ†’ 0.03 for progressively sharper discrimination
  • Expanded Training Data: 840,859 total triplets across 3 phases
  • Curriculum Learning: 3-phase training with dynamic InfoNCE weight increase
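
The card does not ship the training code, so the following is a minimal sketch of InfoNCE with in-batch negatives, assuming dot-product scoring between query and document sparse vectors; all names are illustrative.

import torch
import torch.nn.functional as F

def info_nce_loss(query_reprs, doc_reprs, temperature=0.07):
    # In-batch InfoNCE: the positive for query i is the document at index i;
    # every other document in the batch serves as a negative.
    scores = query_reprs @ doc_reprs.T / temperature  # (batch, batch) dot products
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

Lowering the temperature (0.07 → 0.05 → 0.03 across phases) sharpens the softmax over these scores, which is the "progressively sharper discrimination" described above.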

Training Results

Metric                 v21.4   v22.0
Training Recall@1      -       99.87%
Training MRR           -       0.9994
General Terms Recall   78.7%   81.5%
Garbage Outputs        0/5     0/5

Training Phases

Phase                  Epochs   Data Size   Temperature   InfoNCE Weight
Phase 1 (Single-term)  1-10     66,685      0.07          1.0
Phase 2 (Balanced)     11-20    224,177     0.05          1.5
Phase 3 (Full)         21-30    549,997     0.03          2.0
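
Read as a schedule, the curriculum maps each epoch to a (temperature, InfoNCE weight) pair. An illustrative sketch derived from the table above, not the actual training loop:

# Curriculum schedule derived from the phase table (illustrative only)
PHASES = [
    # (last_epoch, temperature, infonce_weight)
    (10, 0.07, 1.0),  # Phase 1: single-term triplets
    (20, 0.05, 1.5),  # Phase 2: balanced mix
    (30, 0.03, 2.0),  # Phase 3: full data
]

def schedule(epoch):
    # Return (temperature, infonce_weight) for a 1-indexed epoch.
    for last_epoch, temperature, weight in PHASES:
        if epoch <= last_epoch:
            return temperature, weight
    return PHASES[-1][1:]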

Usage

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")
model = AutoModelForMaskedLM.from_pretrained("sewoong/korean-neural-sparse-encoder-v21.4")
model.eval()

# Encode text into a vocabulary-sized sparse vector (SPLADE-style)
def encode(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, seq_len, vocab_size)
        # SPLADE activation: log(1 + ReLU(logits)), masked, then max-pooled over the sequence
        token_scores = torch.log1p(torch.relu(logits))
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        sparse_repr = (token_scores * mask).max(dim=1).values[0]  # first (only) batch element
    return sparse_repr

# Example: inspect the top-weighted vocabulary terms for a query
sparse = encode("당뇨병 치료 방법")  # "diabetes treatment methods"
top_values, top_indices = sparse.topk(10)
for idx, val in zip(top_indices, top_values):
    print(f"{tokenizer.decode([idx.item()])}: {val.item():.4f}")

Example Expansions

Query                               Top Expansions
손해배상 (damages)                  손해, 피해, 배상, 손실, 보상, 소송, 위자료
인공지능 (artificial intelligence)  AI, 지능, 컴퓨터, IT, 로봇, 알고리즘
당뇨병 (diabetes)                   당뇨, 혈당, 인슐린, 비만, 콜레스테롤
계약서 (contract)                   계약, 약정, 협약, 합의, 계약금, 약관

OpenSearch Integration

This model is designed to work with OpenSearch Neural Sparse Search. See the OpenSearch documentation for integration details.
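
As an illustration only (the index name, field name, and model ID below are placeholders, and the model must first be registered and deployed via the OpenSearch ML Commons plugin), a neural sparse query through the opensearch-py client might look like this:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Hypothetical index, field, and deployed-model ID
response = client.search(
    index="my-index",
    body={
        "query": {
            "neural_sparse": {
                "passage_embedding": {
                    "query_text": "당뇨병 치료 방법",
                    "model_id": "<deployed-model-id>",
                }
            }
        }
    },
)
print(response["hits"]["hits"])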

Base Model

  • Base: skt/A.X-Encoder-base
  • Parameters: 149,372,240
  • Vocabulary: 50,000 tokens
  • Max Length: 64 tokens

Citation

@misc{korean-neural-sparse-v22.0,
  author = {Sewoong Lee},
  title = {Korean Neural Sparse Encoder v22.0},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/sewoong/korean-neural-sparse-encoder-v21.4}
}