A collection of pre-training dataset samples at sizes of 10M, 100M, and 1B tokens. Ideal for quick experimentation and ablations.
Asankhaya Sharma PRO
codelion
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
Reacted to their post with ❤️ 1 day ago
Perplexity released a dataset (BrowseSafe) and benchmark to catch and prevent malicious prompt-injection instructions in real time.
We trained a prompt injection classifier on BrowseSafe using adaptive-classifier with ModernBERT-base embeddings.
74.9% F1 on detecting prompt injection in web content.
Model -> https://huggingface.co/adaptive-classifier/browsesafe
Dataset -> https://huggingface.co/datasets/perplexity-ai/browsesafe-bench
Repo -> https://github.com/codelion/adaptive-classifier
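For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score is computed for a binary detector. The labels and predictions below are made up for illustration and are not BrowseSafe data; the post does not say whether the 74.9% figure is binary, macro, or weighted F1, so the binary form with label 1 meaning "injection" is an assumption, and `f1_score` here is a hand-written helper, not part of adaptive-classifier.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: the harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 1 = prompt injection, 0 = benign content.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and F1 is 0.75. F1 penalizes missed injections and false alarms alike, which makes it more informative than raw accuracy when injected content is rare.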