---
library_name: transformers
datasets:
  - WebOrganizer/FormatAnnotations-Llama-3.1-8B
  - WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
  - Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier
The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages. The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned on the following training data:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

## All Domain Classifiers
- WebOrganizer/FormatClassifier ← you are here!
- WebOrganizer/FormatClassifier-NoURL
- WebOrganizer/TopicClassifier
- WebOrganizer/TopicClassifier-NoURL

## Usage
This classifier expects inputs in the following format:
```
{url}
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """http://www.example.com
How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```
Applying a softmax to the model's logits yields a probability distribution over the following 24 categories (in label order; see also `id2label` and `label2id` in the model config, and the mapping sketch after this list):
- Academic Writing
- Content Listing
- Creative Writing
- Customer Support
- Comment Section
- FAQ
- Truncated
- Knowledge Article
- Legal Notices
- Listicle
- News Article
- Nonfiction Writing
- About (Org.)
- News (Org.)
- About (Pers.)
- Personal Blog
- Product Page
- Q&A Forum
- Spam / Ads
- Structured Data
- Documentation
- Audio Transcript
- Tutorial
- User Review

The full definitions of the categories can be found in the taxonomy config.
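
For convenience, you can map the predicted index back to a category name via `id2label` in the model config. A minimal sketch, reusing `model` and `probs` from the usage example above:

```python
# Map the predicted class index to its category name via the model config.
pred_id = probs.argmax(dim=-1).item()
print(model.config.id2label[pred_id])  # e.g. "Truncated"

# Top-3 categories with their probabilities:
top = probs[0].topk(3)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```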
## Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers and loading the model as follows:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
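```

For batched inference with these settings, padding and truncation are handled by the tokenizer as usual. A short sketch, reusing the `tokenizer` from the usage example (the batch contents below are an illustrative assumption, not from the original card):

```python
# Hypothetical batch of documents, each in the "{url}\n{text}" input format.
web_pages = [
    "http://www.example.com\nHow to make a good sandwich? [Click here to read article]",
    "http://docs.example.org\nAPI reference for the widget module.",
]

inputs = tokenizer(web_pages, return_tensors="pt", padding=True, truncation=True)
with torch.inference_mode():  # no gradients needed at inference time
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs.argmax(dim=-1))  # one predicted category index per document
```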
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```