Upload folder using huggingface_hub

- LICENSE +2 -2
- README.md +336 -0
- configuration_qianfanvl_chat.py +1 -1
- modeling_qianfanvl_chat.py +1 -1
LICENSE (CHANGED)

@@ -6,7 +6,7 @@ Composite License: MIT (for Original Contributions) + Llama 3.1 Community License
 
 MIT License
 
-Copyright (c) 2025
+Copyright (c) 2025 Baidu
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -146,6 +146,6 @@ exclusive jurisdiction of any dispute arising out of this Agreement.
 
 
 === Scope Clarification (Non‑operative summary) ===
-- Section A (MIT) covers only the Project’s original contributions authored by
+- Section A (MIT) covers only the Project’s original contributions authored by Baidu.
 - Section B (Llama 3.1 Community License) governs any included Llama Materials and any derivatives thereof (e.g., fine‑tuned weights).
 - In the event of any conflict, the applicable license for the relevant component controls (MIT for original contributions; Llama 3.1 for Llama Materials).

README.md (ADDED)
---
license: other
license_link: LICENSE
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
---

# Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale | Document Understanding & OCR Enhancement | Chain-of-Thought Reasoning Support

## Model Description

Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.

### Model Variants

| Model              | Parameters | Context Length | CoT Support | Best For                                    |
| ------------------ | ---------- | -------------- | ----------- | ------------------------------------------- |
| **Qianfan-VL-3B**  | 3B         | 32k            | ❌          | Edge deployment, real-time OCR              |
| **Qianfan-VL-8B**  | 8B         | 32k            | ✅          | Server-side general scenarios, fine-tuning  |
| **Qianfan-VL-70B** | 70B        | 32k            | ✅          | Complex reasoning, data synthesis           |

### Architecture

- **Language Model**:
  - Qianfan-VL-3B: Based on Qwen2.5-3B
  - Qianfan-VL-8B/70B: Based on the Llama 3.1 architecture
  - Enhanced with a 3T multilingual corpus
- **Vision Encoder**: InternViT-based, supports dynamic patching up to 4K resolution
- **Cross-modal Fusion**: MLP adapter for efficient vision-language bridging (see the illustrative sketch below)
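
To make the fusion step concrete, here is a minimal, illustrative sketch of this kind of adapter: ViT patch features from one image tile are spatially merged (pixel-shuffle style) and then projected by an MLP into the language model's embedding space. The class name, feature dimensions, and 2x2 merge factor below are assumptions chosen for illustration; they are not taken from this repository's code, so refer to the technical report and the model files for the actual configuration.

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Illustrative MLP adapter mapping ViT patch features to LLM token embeddings.

    All sizes (vit_hidden=1024, llm_hidden=4096, 2x2 merge) are assumptions for
    illustration only, not the model's actual configuration.
    """

    def __init__(self, vit_hidden=1024, llm_hidden=4096, downsample=0.5):
        super().__init__()
        self.scale = int(1 / downsample)          # merge scale x scale patches into one token
        in_dim = vit_hidden * self.scale * self.scale
        self.mlp = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, vit_feats):                 # vit_feats: (batch, num_patches, vit_hidden)
        b, n, c = vit_feats.shape
        h = w = int(n ** 0.5)                     # assume a square patch grid per tile
        x = vit_feats.view(b, h, w, c)
        # pixel-shuffle-style merge: concatenate each scale x scale block of patch features
        x = x.view(b, h // self.scale, self.scale, w // self.scale, self.scale, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (h // self.scale) * (w // self.scale), c * self.scale * self.scale)
        return self.mlp(x)                        # (batch, fewer visual tokens, llm_hidden)

# A 448x448 tile split into a 32x32 patch grid gives 1024 patch features,
# which the 2x2 merge reduces to 256 visual tokens in the LLM embedding space.
tokens = VisionToTextProjector()(torch.randn(1, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```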

## Key Capabilities

### 🔍 OCR & Document Understanding

- **Full-Scenario OCR**: Handwriting, formulas, natural scenes, cards/documents
- **Document Intelligence**: Layout analysis, table parsing, chart understanding, document Q&A
- **High Precision**: Strong results on OCR and document benchmarks (see Benchmark Performance below)

### 🧮 Chain-of-Thought Reasoning (8B & 70B)

- Complex chart analysis and reasoning
- Mathematical problem-solving with step-by-step derivation
- Visual reasoning and logical inference
- Statistical computation and trend prediction

### 📊 Benchmark Performance

#### General Vision-Language Benchmarks

| Benchmark       | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| --------------- | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| A-Bench_VAL     | 75.65         | 75.72         | **78.1**       | 75.86         | 75.86          | 76.49         | **79.22**      |
| CCBench         | 66.86         | 70.39         | **80.98**      | 77.84         | 70.78          | 57.65         | 73.73          |
| SEEDBench_IMG   | 76.55         | 78.02         | **79.13**      | 77.0          | 77.52          | 76.98         | 78.34          |
| SEEDBench2_Plus | 67.59         | 70.97         | **73.17**      | 69.52         | 68.47          | 70.93         | 73.25          |
| MMVet           | 48.17         | 53.21         | 67.34          | **80.28**     | 78.9           | 70.64         | 75.69          |
| MMMU_VAL        | 46.44         | 47.11         | 58.33          | 56.11         | **60.78**      | 51.0          | **65.78**      |
| ScienceQA_TEST  | 95.19         | 97.62         | **98.76**      | 97.97         | 97.17          | 85.47         | 92.51          |
| ScienceQA_VAL   | 93.85         | 97.62         | **98.81**      | **97.81**     | 95.14          | 83.59         | 91.32          |
| MMT-Bench_VAL   | 62.23         | 63.22         | **71.06**      | 65.17         | 63.67          | 61.4          | 69.49          |
| MTVQA_TEST      | 26.5          | 30.14         | **32.18**      | 30.3          | 27.62          | 29.08         | **31.48**      |
| BLINK           | 49.97         | 56.81         | **59.44**      | 55.87         | 51.87          | 54.55         | **63.02**      |
| MMStar          | 57.93         | 64.07         | **69.47**      | 68.4          | 66.07          | 61.53         | 66.0           |
| RealWorldQA     | 65.75         | 70.59         | 71.63          | 71.11         | **74.25**      | 69.28         | **73.86**      |
| Q-Bench1_VAL    | 73.51         | 75.25         | 77.46          | 75.99         | **77.99**      | **78.1**      | **79.93**      |
| POPE            | 85.08         | 86.06         | 88.97          | **90.59**     | 88.87          | 85.97         | 83.35          |
| RefCOCO (Avg)   | 85.94         | 89.37         | **91.01**      | 89.65         | **91.40**      | 86.56         | 90.25          |

#### OCR & Document Understanding

| Benchmark    | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ------------ | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | ------------- | -------------- |
| OCRBench     | 831           | 854           | 873            | **881**       | 847            | 810           | **883**       | 874            |
| AI2D_TEST    | 81.38         | **85.07**     | **87.23**      | **85.07**     | 83.55          | 77.07         | 80.472        | 83.84          |
| OCRVQA_TEST  | 66.15         | 68.98         | **74.06**      | 39.03         | 35.58          | 69.24         | **71.02**     | 66.8           |
| TextVQA_VAL  | 80.11         | 82.13         | **84.48**      | 82.15         | 83.52          | 79.09         | **84.962**    | 83.26          |
| DocVQA_VAL   | 90.85         | 93.54         | 94.75          | 92.04         | 83.82          | 92.71         | **94.91**     | **95.75**      |
| ChartQA_TEST | 81.79         | **87.72**     | **89.6**       | 85.76         | 82.04          | 83.4          | 86.68         | 87.16          |

#### Mathematical Reasoning

| Benchmark         | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ----------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| Mathvista-mini    | 69.19         | **78.6**       | 69.5          | 70.1           | 67.2          | 73.9           |
| Mathvision        | 32.82         | **50.29**      | 29.61         | 34.8           | 25.95         | 39.34          |
| Mathverse         | 48.4          | **61.04**      | 43.68         | 49.26          | 44.21         | 55.18          |
| ChartQA Pro       | 50.43         | **52**         | 37.32         | 44.43          | 43.73         | 45.3           |
| HallusionBench    | 51.72         | **54.52**      | 49.2          | 40.2           | 47.9          | 49.9           |
| InHouse Dataset A | 59.87         | **71.78**      | 40.64         | 41.47          | 45.58         | 57.2           |
| InHouse Dataset B | 61.33         | **75.6**       | 36.25         | 42.65          | 30.62         | 59.68          |

## Quick Start

### Installation

```bash
pip install transformers accelerate torch torchvision pillow einops
```

### Using Transformers

```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (columns x rows) within the tile budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image, matching the model's dtype and device
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).to(model.device)

# Inference
prompt = "<image>请识别图中所有文字"  # "Recognize all text in the image"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
```
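
The `chat()` helper can also be used for multi-turn conversation. The sketch below continues the snippet above (it reuses `model`, `tokenizer`, and `load_image` defined there) and assumes an InternVL-style interface with `history`/`return_history` arguments, as suggested by the `InternVLChatModel` override used in the vLLM section below; the exact argument names may differ in this repository's implementation, so treat this as a sketch rather than a guaranteed API.

```python
# Continues the snippet above: `model`, `tokenizer`, and `load_image` are reused.
# Assumption: chat() follows the InternVL-style interface and accepts
# `history` / `return_history`; check the repository code if these differ.
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).to(model.device)

question = "<image>Describe the layout of this document."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=question,
    generation_config={"max_new_tokens": 512},
    history=None,
    return_history=True,
)

# Follow-up turn: no new <image> tag is needed because the image is already in the history.
follow_up = "List any dates that appear in the text."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=follow_up,
    generation_config={"max_new_tokens": 512},
    history=history,
    return_history=True,
)
print(response)
```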

### Using vLLM

You can deploy Qianfan-VL using vLLM's official Docker image for high-performance inference with an OpenAI-compatible API:

#### Start vLLM Service

```bash
docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
```

#### Call the API

```bash
curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'
```

Or use Python with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"  # "Describe this image"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
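
For local images that are not publicly hosted, the OpenAI-compatible endpoint also accepts images embedded as base64 data URLs, so no public URL is required. A minimal sketch follows; the file path and the English prompt are placeholders, and it assumes the vLLM service started above is running on port 8000.

```python
import base64
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# Encode a local image as a data URL (the path is a placeholder).
with open("./example/scene_ocr.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "<image>Recognize all text in the image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```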

## Training Details

### Four-Stage Progressive Training

1. **Cross-modal Alignment** (100B tokens): Establishes vision-language connections
2. **General Knowledge Injection** (3.5T tokens): Builds strong foundational capabilities
3. **Domain Enhancement** (300B tokens): Specialized OCR and reasoning capabilities
4. **Post-training** (1B tokens): Instruction following and preference alignment

### Infrastructure

- Trained on 5000+ Baidu Kunlun chips
- Single-task parallel training across 5000 chips
- 90%+ scaling efficiency for large-scale distributed training
- Communication-computation fusion technology

## Model Card

- **Developed by**: Baidu AI Cloud Qianfan Team
- **Model type**: Vision-Language Transformer
- **Languages**: Multilingual (primarily English and Chinese)
- **License**: Composite: MIT for original contributions, Llama 3.1 Community License for Llama Materials (see the LICENSE file)
- **Base Architecture**: See the Architecture section above and the technical report

## Citation

If you use Qianfan-VL in your research, please cite:

```bibtex
@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}
```

## Contact

For more information and API access, visit the [Baidu Qianfan Platform](https://qianfan.cloud.baidu.com/).

## Acknowledgments

This model series combines general multimodal capabilities with domain-specific enhancements for real-world enterprise applications.
configuration_qianfanvl_chat.py (CHANGED)

@@ -1,4 +1,4 @@
-# Copyright (c) 2025
+# Copyright (c) 2025 Baidu
 # Licensed under the MIT License. See LICENSE file in the project root for full license information.
 import copy
modeling_qianfanvl_chat.py (CHANGED)

@@ -1,4 +1,4 @@
-# Copyright (c) 2025
+# Copyright (c) 2025 Baidu
 # Licensed under the MIT License. See LICENSE file in the project root for full license information.
 import warnings
 from typing import List, Optional, Tuple, Union