---
language:
- en
library_name: transformers
pipeline_tag: summarization
license: apache-2.0
tags:
- chemistry
- scientific-summarization
- distilbart
- abstractive
- tldr
- knowledge-graphs
datasets:
- Bocklitz-Lab/lit2vec-tldr-bart-dataset
model-index:
- name: lit2vec-tldr-bart
  results:
  - task:
      name: Summarization
      type: summarization
    dataset:
      name: Lit2Vec TL;DR Chemistry Dataset
      type: Bocklitz-Lab/lit2vec-tldr-bart-dataset
      split: test
      size: 1001
    metrics:
    - type: rouge1
      value: 56.11
    - type: rouge2
      value: 30.78
    - type: rougeLsum
      value: 45.43
---

# lit2vec-tldr-bart (DistilBART fine-tuned for chemistry TL;DRs)

**lit2vec-tldr-bart** is a DistilBART model fine-tuned on **19,992** CC-BY-licensed chemistry abstracts to produce **concise TL;DR-style summaries** following a methods → results → significance structure. It is designed for scientific **abstractive summarization**, **semantic indexing**, and **knowledge-graph population** in chemistry and related fields.

- **Base model:** `sshleifer/distilbart-cnn-12-6`
- **Training data:** [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
- **Max input length:** 1024 tokens
- **Target length:** ~128 tokens

---

## 🧪 Evaluation (held-out test)

| Split | ROUGE-1 | ROUGE-2 | ROUGE-Lsum |
|------:|--------:|--------:|-----------:|
| Test | **56.11** | **30.78** | **45.43** |

> Validation ROUGE-Lsum: 46.05
> Metrics computed with `evaluate`'s `rouge` (NLTK sentence segmentation, `use_stemmer=True`).

---

## 🚀 Quickstart

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

repo = "Bocklitz-Lab/lit2vec-tldr-bart"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
gen = GenerationConfig.from_pretrained(repo)  # loads the default decoding params

text = "Proton exchange membrane fuel cells convert chemical energy into electricity..."
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(**inputs, generation_config=gen)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```

### Batch inference (PyTorch)

```python
# Reuses `tok`, `model`, and `gen` from the Quickstart above.
texts = [
    "Abstract 1 ...",
    "Abstract 2 ...",
]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)
out = model.generate(**batch, generation_config=gen)
summaries = tok.batch_decode(out, skip_special_tokens=True)
```

---

## 🔧 Default decoding (saved in `generation_config.json`)

These are the defaults saved with the model (you can override them at `generate()` time):

```json
{
  "max_length": 142,
  "min_length": 56,
  "early_stopping": true,
  "num_beams": 4,
  "length_penalty": 2.0,
  "no_repeat_ngram_size": 3,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2
}
```

---

## 📊 Training details

* **Base:** `sshleifer/distilbart-cnn-12-6` (distilled BART)
* **Data:** 19,992 CC-BY chemistry abstracts with TL;DR summaries
* **Splits:** train=17,992 / val=999 / test=1,001
* **Max lengths:** input 1024, target 128
* **Optimizer:** AdamW, **lr=2e-5**
* **Batching:** per-device train/eval batch size 4, `gradient_accumulation_steps=4`
* **Epochs:** 5
* **Precision:** fp16 (when CUDA is available)
* **Hardware:** single NVIDIA RTX 3090
* **Seed:** 42
* **Libraries:** 🤗 Transformers + Datasets, `evaluate` for ROUGE, NLTK for sentence splitting

---

## ✅ Intended use

* TL;DR abstractive summaries for **chemistry** and adjacent domains (materials science, chemical engineering, environmental science).
* **Semantic indexing**, **IR reranking**, and **knowledge-graph** ingestion where concise method/result statements are helpful.

### Limitations & risks

* May **hallucinate** details not present in the abstract (typical of abstractive models).
* Not a substitute for expert judgment; avoid using summaries as sole evidence for scientific claims.
* Trained on CC-BY English abstracts; performance may degrade on other domains and languages.

---

## 📦 Files

This repository includes:

* `config.json`, `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, merges/vocab files as applicable
* `generation_config.json` (decoding defaults)

---

## 🔁 Reproducibility

* Dataset: [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
* Recommended preprocessing: truncate inputs at 1024 tokens and targets at 128 tokens.
* ROUGE evaluation: `evaluate.load("rouge")`, NLTK sentence tokenization, `use_stemmer=True`; a minimal evaluation sketch is included at the end of this card.

---

## 📚 Citation

If you use this model or dataset, please cite:

```bibtex
@software{lit2vec_tldr_bart_2025,
  title  = {lit2vec-tldr-bart: DistilBART fine-tuned for chemistry TL;DR summarization},
  author = {Bocklitz Lab},
  year   = {2025},
  url    = {https://huggingface.co/Bocklitz-Lab/lit2vec-tldr-bart},
  note   = {Model trained on CC-BY chemistry abstracts; dataset at Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```

Dataset:

```bibtex
@dataset{lit2vec_tldr_dataset_2025,
  title  = {Lit2Vec TL;DR Chemistry Dataset},
  author = {Bocklitz Lab},
  year   = {2025},
  url    = {https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```

---

## 📝 License

* **Model weights & code:** Apache-2.0
* **Dataset:** CC BY 4.0 (attribution in per-record metadata)

---

## 🙌 Acknowledgements

* Base model: DistilBART (`sshleifer/distilbart-cnn-12-6`)
* Licensing and open-access links curated from publisher/aggregator sources; dataset restricted to **CC-BY** content.
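---

## 🧮 Appendix: ROUGE evaluation sketch

The snippet below is a minimal sketch of the evaluation protocol described under *Reproducibility*: generate summaries with the saved decoding defaults, newline-separate sentences with NLTK, and compute ROUGE with `use_stemmer=True`. The dataset column names `"text"` and `"summary"` are assumptions, not confirmed by this card; check the dataset card for the actual field names.

```python
import nltk
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

nltk.download("punkt")  # sentence tokenizer used for ROUGE-Lsum segmentation

repo = "Bocklitz-Lab/lit2vec-tldr-bart"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

# ASSUMPTION: the test split exposes "text" (abstract) and "summary" (TL;DR) columns.
ds = load_dataset("Bocklitz-Lab/lit2vec-tldr-bart-dataset", split="test")
rouge = evaluate.load("rouge")

def split_sentences(texts):
    # rougeLsum expects sentences separated by newlines
    return ["\n".join(nltk.sent_tokenize(t)) for t in texts]

preds, refs = [], []
for example in ds:
    inputs = tok(example["text"], return_tensors="pt", truncation=True, max_length=1024)
    out = model.generate(**inputs)  # uses the repo's generation_config defaults
    preds.append(tok.decode(out[0], skip_special_tokens=True))
    refs.append(example["summary"])

scores = rouge.compute(
    predictions=split_sentences(preds),
    references=split_sentences(refs),
    use_stemmer=True,
)
# Recent `evaluate` versions return plain floats in [0, 1]; scale to percentages.
print({k: round(v * 100, 2) for k, v in scores.items()})
```

Scores are reported as percentages to match the table above; exact numbers may differ slightly with library versions and decoding settings.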