---
library_name: transformers
datasets:
  - WebOrganizer/FormatAnnotations-Llama-3.1-8B
  - WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
  - Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier
The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages. The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned on the following training data:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

## All Domain Classifiers
- WebOrganizer/FormatClassifier ← you are here!
- WebOrganizer/FormatClassifier-NoURL
- WebOrganizer/TopicClassifier
- WebOrganizer/TopicClassifier-NoURL

## Usage
This classifier expects inputs in the following format:
```
{url}
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """http://www.example.com
How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```
Applying a softmax to the model's logits yields a probability distribution over the following 24 categories (in label order; see also `id2label` and `label2id` in the model config, and the mapping sketch after this list):
- Academic Writing
- Content Listing
- Creative Writing
- Customer Support
- Comment Section
- FAQ
- Truncated
- Knowledge Article
- Legal Notices
- Listicle
- News Article
- Nonfiction Writing
- About (Org.)
- News (Org.)
- About (Pers.)
- Personal Blog
- Product Page
- Q&A Forum
- Spam / Ads
- Structured Data
- Documentation
- Audio Transcript
- Tutorial
- User Review

The full definitions of the categories can be found in the taxonomy config.
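
For convenience, you can map the predicted index back to a category name via `id2label` in the model config. A minimal sketch, reusing `model` and `probs` from the usage example above:

```python
# Map the predicted class index to its category name via the model config.
pred_id = probs.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "Truncated"

# Top-3 categories with their probabilities:
top = probs[0].topk(3)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```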
## Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers and loading the model as follows:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
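```

For batched inference with these settings, padding and truncation are handled by the tokenizer as usual. A short sketch, reusing the `tokenizer` from the usage example (the batch contents below are an illustrative assumption, not from the original card):

```python
# Hypothetical batch of documents, each in the "{url}\n{text}" input format.
web_pages = [
    "http://www.example.com\nHow to make a good sandwich? [Click here to read article]",
    "http://docs.example.org\nAPI reference for the widget module.",
]

inputs = tokenizer(web_pages, return_tensors="pt", padding=True, truncation=True)
with torch.inference_mode():  # no gradients needed at inference time
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs.argmax(dim=-1))  # one predicted category index per document
```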
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```