Model Description
Comp4Cls is a retrieval-augmented classification framework that uses entity-centric semantic compression to turn long scientific/technical documents into short, task-focused representations for both retrieval and labeling. Documents (papers, patents, and R&D reports) are first compressed into structured summaries that preserve discriminative signals (e.g., core concepts, methods, problems, findings), embedded, and stored in a vector DB. At inference, a query is compressed the same way, nearest neighbors are retrieved, and a small LLM assigns the final class label using the compressed evidence.
The end-to-end workflow (Phase 1: compression + indexing; Phase 2: retrieval + classification) is illustrated in the framework diagram on page 2. Experiments on a large bilingual corpus with hierarchical, multi-label taxonomies show that a 4B-scale Comp4Cls matches or outperforms 8B–14B models, especially in fine-grained categories, while cutting token usage and compute. Moderate compression (often ~20% of entities) preserves retrieval fidelity and boosts downstream F1, enabling lightweight, low-latency deployment in production pipelines. See Table II on page 8 (compression vs. length), Figure 6 on page 9 (retrieval quality under compression), and Figure 7 on page 10 (accuracy vs. larger LLMs).
Framework Diagram
Figure 1. Overview of the **Comp4Cls** framework. The system operates in two phases: (i) documents with predefined class labels are semantically compressed, embedded, and stored in a vector database; (ii) when a new query arrives, it is compressed and used to retrieve the top-$k$ most similar documents from the vector store. The large language model (LLM) then determines the final class label based on the retrieved context. Finally, the compressed query and its assigned label are stored back into the database, enabling downstream services such as document categorization, semantic search, and TL;DR summarization.
Key Features
Entity-centric Semantic Compression Two-stage prompting (entity extraction → selective rewriting) produces concise, structured summaries that retain label-relevant semantics while removing redundancy. The compressor exposes an explicit compression ratio to match accuracy/latency budgets.
Retrieval-Augmented Classification (RAG) with Short Contexts Operates on compressed texts for both the query and neighbors, reducing context length and enabling broader top-k without "lost-in-the-middle" degradation.
Small-Model, Big-Model Performance With ~20% compression, a 4B backbone achieves or exceeds the accuracy of 8B–14B models across domains and taxonomy levels.
Demonstrated Efficiency Gains Compression reduces input tokens by ~50% on average while maintaining semantic similarity; retrieval accuracy remains near full-text levels.
Scales to Real-World, Heterogeneous Corpora Trained/evaluated on large bilingual datasets spanning papers, patents, and R&D reports with hierarchical, multi-label taxonomies; robust under domain shift and taxonomy changes.
Production-minded Latency/Throughput Shorter prompts cut classification-stage latency; compression allows higher top-k (≈20–30) before context saturation.
Vector DB-Ready Artifacts Outputs compressed texts + embeddings that plug into standard ANN indices (e.g., HNSW) for high-throughput retrieval in enterprise knowledge systems (see the indexing sketch below).
Beyond Classification The compressed representations support downstream semantic search, TL;DR summaries, and knowledge organization tasks out of the box.
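As referenced above, here is a minimal indexing sketch for compressed-text embeddings. The encoder choice, the `hnswlib` index parameters, and the sample texts are illustrative assumptions, not part of the released pipeline:

```
# Minimal sketch: embed compressed texts and index them with HNSW (hnswlib).
# Encoder, index parameters, and texts below are illustrative assumptions.
import numpy as np
import hnswlib
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
compressed_texts = [
    "Porous carbon hosts for Li-S cathodes...",
    "Conductive polymer binder for sulfur cathode...",
]
embeddings = encoder.encode(compressed_texts, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(compressed_texts), ef_construction=200, M=16)
index.add_items(embeddings, np.arange(len(compressed_texts)))
index.set_ef(50)  # query-time recall/latency trade-off

query_vec = encoder.encode(["lithium-sulfur cathode design"], normalize_embeddings=True)
neighbor_ids, distances = index.knn_query(query_vec, k=2)
```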
Comp4Cls: Full Usage Guide with vLLM
This guide shows how to run all three stages of Comp4Cls with vLLM:
1) Entity Extraction → 2) Compression → 3) Classification.
It uses the exact prompt templates for each stage and a minimal vLLM wrapper. Replace the model name with your fine-tuned repo if needed.
0) Install & Setup
pip install vllm "transformers>=4.44" accelerate einops huggingface-hub
1) Minimal Inference Primitives
import os, re, json
from typing import Optional, List, Dict
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
# ----------------------
# Config
# ----------------------
MODEL_NAME = "comp4cls/comp4cls-4B"
# Generation params (Stage-3 uses stop at </answer>)
GEN_COMMON = SamplingParams(
temperature=0.2,
top_p=0.8,
repetition_penalty=1.1,
frequency_penalty=0.1,
presence_penalty=0.1,
max_tokens=2048,
)
# Classification reuses the common settings but stops generation at </answer>
GEN_CLASSIFICATION = SamplingParams(
    temperature=GEN_COMMON.temperature,
    top_p=GEN_COMMON.top_p,
    repetition_penalty=GEN_COMMON.repetition_penalty,
    frequency_penalty=GEN_COMMON.frequency_penalty,
    presence_penalty=GEN_COMMON.presence_penalty,
    max_tokens=GEN_COMMON.max_tokens,
    stop=["</answer>"],
)
# ----------------------
# Load tokenizer & model
# ----------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
model=MODEL_NAME,
trust_remote_code=True,
tensor_parallel_size=1,
gpu_memory_utilization=0.95,
max_model_len=30000,
max_num_seqs=64,
)
# ----------------------
# Helpers
# ----------------------
def apply_chat_template(prompt: str, enable_thinking: bool=False) -> str:
"""Wrap raw prompt with the model's chat template."""
messages = [{"role": "user", "content": prompt}]
return tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=enable_thinking,
)
def generate_text(prompt: str, params: SamplingParams) -> str:
"""Single-pass generation with vLLM."""
formatted = apply_chat_template(prompt, enable_thinking=False)
out = llm.generate([formatted], params)
text = out[0].outputs[0].text
return text
def parse_json_object(text: str) -> dict:
"""Extract the first top-level JSON object from text and parse it."""
start = text.find("{")
end = text.rfind("}") + 1
if start == -1 or end == 0:
raise ValueError("No JSON object detected in model output.")
return json.loads(text[start:end])
def parse_answer_ids(text: str) -> Optional[List[Dict[str, int]]]:
"""Extract class IDs from <answer> ... </answer> block: [{'class_id': 123}, ...]."""
try:
m = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
if not m:
return None
body = m.group(1).strip()
if body.lower() == "none":
return []
body = body.strip().strip('[]')
classes = []
for mm in re.finditer(r'\((\d+)\)', body):
classes.append({"class_id": int(mm.group(1))})
if not classes and body:
parts = [x.strip() for x in body.split(",")]
for p in parts:
if p.isdigit():
classes.append({"class_id": int(p)})
return classes if classes else []
except Exception:
return None
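Quick sanity checks for the two helpers (the inputs below are illustrative, not from the original guide):

```
# parse_json_object tolerates prose around the JSON object
assert parse_json_object('Sure! {"response": "short summary"} Done.') == {"response": "short summary"}

# parse_answer_ids accepts both "(id)" and bare-integer list formats, plus the None case
assert parse_answer_ids("<answer>[(101), (202)]</answer>") == [{"class_id": 101}, {"class_id": 202}]
assert parse_answer_ids("<answer>[101, 202]</answer>") == [{"class_id": 101}, {"class_id": 202}]
assert parse_answer_ids("<answer>None</answer>") == []
```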
2) Stage 1: Entity Extraction
Prompt (exact as provided):
prompt_template_entity_extraction = """You are tasked with extracting keywords from scientific literature abstracts based on their domain classification.
Extract keywords that appear EXACTLY in the given abstract and organize them into 7 predefined keyword types.
Instructions:
1. Read the provided abstract and domain classification carefully
2. Extract keywords/phrases that appear verbatim in the abstract
3. Organize each keyword into the most appropriate keyword type
4. Each keyword should be assigned to only one type
5. Focus on meaningful technical terms, not common words
6. Return results in JSON format
Keyword Types for Organization:
1. core_concepts: Central theories, main ideas, or fundamental concepts that define the research
2. methodologies: Research methods, experimental techniques, analytical approaches, or procedural strategies
3. subjects_problems: Research subjects, target problems, phenomena under investigation, or challenges being addressed
4. findings_impacts: Key discoveries, results, outcomes, implications, or impacts of the research
5. theoretical_framework: Underlying theories, models, principles, or conceptual foundations
6. quantitative_metrics: Numerical values, measurements, statistics, percentages, or any quantifiable data
7. contextual_background: Historical context, motivation, prior work references, or situational background
Guidelines:
- Extract only words/phrases that exist exactly in the abstract
- Prefer technical terms over generic academic vocabulary
- Include both single words and meaningful phrases
- For quantitative metrics, include the complete value with units
- Ensure keywords are relevant to the domain classification Output must be in JSON format with all 7 keyword types as keys.
Example output format: {{ "core_concepts": ["CEST MRI", "thermally activated delayed fluorescence", "blue phosphorescent organic light-emitting diodes"], "methodologies": ["synthesized", "subspace-based spectral signal decomposition", "sphere formation assay"], "subjects_problems": ["z-spectrum analysis", "cancer stem cells", "charge balance"], "findings_impacts": ["high quantum efficiency", "inhibits mobility", "record high"], "theoretical_framework": ["saturation transfer phenomena", "energy transfer", "structure-property relationship"], "quantitative_metrics": ["Above 30%", "24.2%", "70-110 GHz", "40-80 μM"], "contextual_background": ["drug resistance", "alternative to conventional", "for molecular MRI"] }}
Extract keywords from the following scientific literature:
Abstract: {abstract}
Return the keywords organized by their types in JSON format with all 7 keyword types.
"""
# Example input (replace with your real abstract)
abstract = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts..."
entity_prompt = prompt_template_entity_extraction.format(abstract=abstract)
entity_output = generate_text(entity_prompt, GEN_COMMON)
entities = parse_json_object(entity_output) # dict with 7 keys
print(json.dumps(entities, indent=2, ensure_ascii=False))
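The model may occasionally drop a category from its JSON output; a small defensive guard (an assumption for robustness, not part of the original guide) keeps downstream code stable:

```
# Ensure all 7 keyword types are present, defaulting missing ones to empty lists
KEYWORD_TYPES = [
    "core_concepts", "methodologies", "subjects_problems", "findings_impacts",
    "theoretical_framework", "quantitative_metrics", "contextual_background",
]
entities = {k: entities.get(k, []) for k in KEYWORD_TYPES}
```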
3) Stage 2: Compression
Prompt (exact as provided):
prompt_template_compression = """You are a scientific document summarizer specializing in category-driven summarization.
Task: Create a concise summary using ONLY {max_items} items from the provided semantic categories (out of {total_items} total items).
Requirements:
- Write the summary in the same language as the original text
- Select the {max_items} most relevant items that align with the original text
- Use content from the original text ONLY when it directly supports these categories
- The summary should read as if the original text was written to illustrate the semantic categories
- Maintain scientific accuracy and use precise terminology
- Ensure logical flow and coherence between concepts
Input:
- Original Text: {text}
- Semantic Categories (in order of priority): {categories}
CRITICAL: You MUST output ONLY a valid JSON object in exactly this format:
{{"response": "Your concise summary here"}}
Do not include any text before or after the JSON object. The summary should be a single continuous text without line breaks.
Output Format (example):
{{"response": "This research focuses on developing novel battery materials using advanced synthesis methods, achieving significant improvements in energy density and cycle stability through optimized electrode design."}}
"""
# Choose how many items you want to keep
max_items = 10
categories = json.dumps(entities, ensure_ascii=False)  # pass the extracted items per type, not just the type names
total_items = sum(len(v) for v in entities.values())
compression_prompt = prompt_template_compression.format(
max_items=max_items,
total_items=total_items,
text=abstract,
categories=categories,
)
compression_output = generate_text(compression_prompt, GEN_COMMON)
compressed = parse_json_object(compression_output)["response"]
print("Compressed summary:", compressed)
4) Stage 3: Classification (Patent-focused)
Prompt (exact as provided):
prompt_template_classification = """You are a text classification expert specializing in patent documents.
You are given a JSON record for a target patent and a set of Retrieved Similar Items.
Your task is to assign one or more class labels to a given target patent using the provided examples as guidance.
---
**Step-by-Step Instructions:**
1. **Analyze Target and Retrieved Examples:**
- Review each example, paying attention to the class label and how the text reflects it.
- Focus on technical innovation, claims, and patent-specific terminology.
2. **Similarity Scoring (1–5):**
For each Retrieved Similar Item, score along three dimensions and sum to 1–5:
- Domain (0–2):
- 2: Same primary technology field
- 1: Closely related technology
- 0: Unrelated
- Innovation Type (0–2):
- 2: Same type of innovation (e.g., device, method, composition)
- 1: Partial overlap in innovation approach
- 0: Different innovation type
- Application/Material (0–1):
- 1: Shares key technical terms or entities
- 0: Different application/material
3. **Total Score → Similarity Label:**
- 5: Fully similar (Domain=2 + Innovation=2 + Application=1)
- 4: Mostly similar (sum = 4)
- 3: Partially similar (sum = 3)
- 2: Little similarity (sum = 2)
- 1: Irrelevant (sum = 0 or 1)
4. **Make a Classification Decision:**
- Based on all retrieved items, assign the most appropriate class ID(s) to the target.
---
**Response Format:**
1. **Chain-of-Thought** (between `<begin_of_thought>` and `<end_of_thought>`):
- Summarize the target's core innovation, claims, and technical field.
- For each Retrieved Similar Item, analyze its similarity and assign score.
- Conclude with overall comparison.
2. **Final Answer:**
- Provide classification with brief justification.
- Output ONLY the list of class id values.
**Use exactly this structure and STOP immediately after </answer>:**
```
<begin_of_thought>
<p>Target patent analysis... </p>
<p>Reference[Item ID=...], [Similarity=...], judgment text</p>
...
<end_of_thought>
<solution>Overall evaluation=...</solution>
<answer>[Class_label_ID_1, Class_label_ID_2, ...]</answer>
```
**CRITICAL: Your response MUST end with </answer>. Do not add any text after the closing </answer> tag.**
---
**Special Condition:**
- If Total Score ≤ 2:
- `<solution>`: Cannot determine answer
- `<answer>`: None
- Otherwise:
- `<solution>`: Overall evaluation=...
- `<answer>`: [<Class_label_ID_1>, <Class_label_ID_2>, ...]
---
**Input Data:**
- Target ID: {target_id}
- Target Text: {target_text}
- Retrieved Similar Items (Top {retrieved_count}):
{retrieved_items_text}
---
"""
# Example retrieved neighbors (use COMPRESSED text for better accuracy/latency)
retrieved = [
{"id": "US-AAA", "label": "H01M10/0525", "text": "Porous carbon hosts for Li-S cathodes..."},
{"id": "US-BBB", "label": "H01M4/13", "text": "Conductive polymer binder for sulfur cathode..."},
]
retrieved_items_text = "\n".join(
f"- ID: {r['id']}\n Label: {r.get('label','')}\n Text: {r['text']}" for r in retrieved
)
classification_prompt = prompt_template_classification.format(
target_id="TARGET-1",
target_text=compressed, # classify on compressed text
retrieved_count=len(retrieved),
retrieved_items_text=retrieved_items_text,
)
# Use stop at </answer> for clean termination
cls_text = generate_text(classification_prompt, GEN_CLASSIFICATION)
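# vLLM excludes the stop string from the returned text by default, so re-append it for the parser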
if '</answer>' not in cls_text and '<answer>' in cls_text:
cls_text += '</answer>'
print(cls_text)
parsed_ids = parse_answer_ids(cls_text)
print("parsed:", parsed_ids)
5) End-to-End Helper (Optional)
def comp4cls_pipeline(abstract: str, retrieve_fn, k: int = 10) -> dict:
"""
:param abstract: raw document text
:param retrieve_fn: function(query_text, k) -> list of dicts [{id, label, text}, ...]
:param k: top-k neighbors
:return: {"entities": {...}, "compressed": "...", "classification_raw": "...", "parsed_ids": [...]}
"""
# Stage 1: Entities
ent_prompt = prompt_template_entity_extraction.format(abstract=abstract)
ent_text = generate_text(ent_prompt, GEN_COMMON)
entities = parse_json_object(ent_text)
# Stage 2: Compression
max_items = 10
    categories = json.dumps(entities, ensure_ascii=False)  # pass extracted items, not just type names
total_items = sum(len(v) for v in entities.values())
comp_prompt = prompt_template_compression.format(
max_items=max_items, total_items=total_items, text=abstract, categories=categories
)
comp_text = generate_text(comp_prompt, GEN_COMMON)
compressed = parse_json_object(comp_text)["response"]
# Stage 3: Retrieval + Classification
neighbors = retrieve_fn(compressed, k=k) # [{"id","label","text"}, ...]
retrieved_items_text = "\n".join(
f"- ID: {r['id']}\n Label: {r.get('label','')}\n Text: {r['text']}" for r in neighbors
)
cls_prompt = prompt_template_classification.format(
target_id="TARGET-1",
target_text=compressed,
retrieved_count=len(neighbors),
retrieved_items_text=retrieved_items_text,
)
cls_raw = generate_text(cls_prompt, GEN_CLASSIFICATION)
if '</answer>' not in cls_raw and '<answer>' in cls_raw:
cls_raw += '</answer>'
parsed = parse_answer_ids(cls_raw)
return {"entities": entities, "compressed": compressed, "classification_raw": cls_raw, "parsed_ids": parsed}
6) Notes
- Stage-1/2 prompts demand strict JSON. The helper `parse_json_object` extracts the first valid JSON block.
- For Stage-3, keep `stop=["</answer>"]` to avoid over-generation and simplify parsing.
- Swap `MODEL_NAME` for your fine-tuned repo (e.g., `gsjang/lim-4b-1-0826`) if desired.
- Retrieval should use compressed texts for both query and neighbors.
Citation
If you use Comp4Cls in your work, please cite:
@inproceedings{lim2026comp4cls,
author = {Lim, Chanuk},
title = {Comp4Cls: Semantic Compression for Enhanced Retrieval-Augmented Classification of Real-World Scientific and Technical Documents},
booktitle = {ICDE 2026 (submitted)},
year = {2026},
}
Acknowledgements
- Korea Institute of Science and Technology Information (KISTI): This research was supported in 2025 under project K25L1M1C1, as part of the development of KONI (KISTI Open Neural Intelligence), a large language model specialized for science and technology.
- National Supercomputing Center (KISTI): We gratefully acknowledge the computational resources and technical support provided by the National Supercomputing Center.