Upload folder using huggingface_hub

- LICENSE +2 -2
- README.md +336 -0
- configuration_qianfanvl_chat.py +1 -1
- modeling_qianfanvl_chat.py +1 -1
LICENSE (CHANGED)

@@ -6,7 +6,7 @@ Composite License: MIT (for Original Contributions) + Llama 3.1 Community License
 
 MIT License
 
-Copyright (c) 2025
+Copyright (c) 2025 Baidu
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -146,6 +146,6 @@ exclusive jurisdiction of any dispute arising out of this Agreement.
 
 
 === Scope Clarification (Non‑operative summary) ===
-- Section A (MIT) covers only the Project’s original contributions authored by
+- Section A (MIT) covers only the Project’s original contributions authored by Baidu.
 - Section B (Llama 3.1 Community License) governs any included Llama Materials and any derivatives thereof (e.g., fine‑tuned weights).
 - In the event of any conflict, the applicable license for the relevant component controls (MIT for original contributions; Llama 3.1 for Llama Materials).

README.md (ADDED)
---
license: other
license_link: LICENSE
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
---

# Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

Domain Capability Enhancement through Continuous Pre-training | 3B to 70B Parameter Scale | Document Understanding & OCR Enhancement | Chain-of-Thought Reasoning Support

## Model Description

Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.

### Model Variants

| Model              | Parameters | Context Length | CoT Support | Best For                                    |
| ------------------ | ---------- | -------------- | ----------- | ------------------------------------------- |
| **Qianfan-VL-3B**  | 3B         | 32k            | ❌          | Edge deployment, real-time OCR              |
| **Qianfan-VL-8B**  | 8B         | 32k            | ✅          | Server-side general scenarios, fine-tuning  |
| **Qianfan-VL-70B** | 70B        | 32k            | ✅          | Complex reasoning, data synthesis           |

### Architecture

- **Language Model**:
  - Qianfan-VL-3B: Based on Qwen2.5-3B
  - Qianfan-VL-8B/70B: Based on the Llama 3.1 architecture
  - Enhanced with a 3T multilingual corpus
- **Vision Encoder**: InternViT-based, supports dynamic patching up to 4K resolution
- **Cross-modal Fusion**: MLP adapter for efficient vision-language bridging (see the illustrative sketch below)
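
To make the fusion step concrete, here is a minimal, illustrative sketch of this kind of adapter: ViT patch features from one image tile are spatially merged (pixel-shuffle style) and then projected by an MLP into the language model's embedding space. The class name, feature dimensions, and 2x2 merge factor below are assumptions chosen for illustration; they are not taken from this repository's code, so refer to the technical report and the model files for the actual configuration.

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Illustrative MLP adapter mapping ViT patch features to LLM token embeddings.

    All sizes (vit_hidden=1024, llm_hidden=4096, 2x2 merge) are assumptions for
    illustration only, not the model's actual configuration.
    """

    def __init__(self, vit_hidden=1024, llm_hidden=4096, downsample=0.5):
        super().__init__()
        self.scale = int(1 / downsample)          # merge scale x scale patches into one token
        in_dim = vit_hidden * self.scale * self.scale
        self.mlp = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, vit_feats):                 # vit_feats: (batch, num_patches, vit_hidden)
        b, n, c = vit_feats.shape
        h = w = int(n ** 0.5)                     # assume a square patch grid per tile
        x = vit_feats.view(b, h, w, c)
        # pixel-shuffle-style merge: concatenate each scale x scale block of patch features
        x = x.view(b, h // self.scale, self.scale, w // self.scale, self.scale, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (h // self.scale) * (w // self.scale), c * self.scale * self.scale)
        return self.mlp(x)                        # (batch, fewer visual tokens, llm_hidden)

# A 448x448 tile split into a 32x32 patch grid gives 1024 patch features,
# which the 2x2 merge reduces to 256 visual tokens in the LLM embedding space.
tokens = VisionToTextProjector()(torch.randn(1, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 256, 4096])
```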

## Key Capabilities

### 🔍 OCR & Document Understanding

- **Full-Scenario OCR**: Handwriting, formulas, natural scenes, cards/documents
- **Document Intelligence**: Layout analysis, table parsing, chart understanding, document Q&A
- **High Precision**: Strong results on OCR and document benchmarks (see Benchmark Performance below)

### 🧮 Chain-of-Thought Reasoning (8B & 70B)

- Complex chart analysis and reasoning
- Mathematical problem-solving with step-by-step derivation
- Visual reasoning and logical inference
- Statistical computation and trend prediction

### 📊 Benchmark Performance

#### General Vision-Language Benchmarks

| Benchmark       | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| --------------- | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| A-Bench_VAL     | 75.65         | 75.72         | **78.1**       | 75.86         | 75.86          | 76.49         | **79.22**      |
| CCBench         | 66.86         | 70.39         | **80.98**      | 77.84         | 70.78          | 57.65         | 73.73          |
| SEEDBench_IMG   | 76.55         | 78.02         | **79.13**      | 77.0          | 77.52          | 76.98         | 78.34          |
| SEEDBench2_Plus | 67.59         | 70.97         | **73.17**      | 69.52         | 68.47          | 70.93         | 73.25          |
| MMVet           | 48.17         | 53.21         | 67.34          | **80.28**     | 78.9           | 70.64         | 75.69          |
| MMMU_VAL        | 46.44         | 47.11         | 58.33          | 56.11         | **60.78**      | 51.0          | **65.78**      |
| ScienceQA_TEST  | 95.19         | 97.62         | **98.76**      | 97.97         | 97.17          | 85.47         | 92.51          |
| ScienceQA_VAL   | 93.85         | 97.62         | **98.81**      | **97.81**     | 95.14          | 83.59         | 91.32          |
| MMT-Bench_VAL   | 62.23         | 63.22         | **71.06**      | 65.17         | 63.67          | 61.4          | 69.49          |
| MTVQA_TEST      | 26.5          | 30.14         | **32.18**      | 30.3          | 27.62          | 29.08         | **31.48**      |
| BLINK           | 49.97         | 56.81         | **59.44**      | 55.87         | 51.87          | 54.55         | **63.02**      |
| MMStar          | 57.93         | 64.07         | **69.47**      | 68.4          | 66.07          | 61.53         | 66.0           |
| RealWorldQA     | 65.75         | 70.59         | 71.63          | 71.11         | **74.25**      | 69.28         | **73.86**      |
| Q-Bench1_VAL    | 73.51         | 75.25         | 77.46          | 75.99         | **77.99**      | **78.1**      | **79.93**      |
| POPE            | 85.08         | 86.06         | 88.97          | **90.59**     | 88.87          | 85.97         | 83.35          |
| RefCOCO (Avg)   | 85.94         | 89.37         | **91.01**      | 89.65         | **91.40**      | 86.56         | 90.25          |

#### OCR & Document Understanding

| Benchmark    | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ------------ | ------------- | ------------- | -------------- | ------------- | -------------- | ------------- | ------------- | -------------- |
| OCRBench     | 831           | 854           | 873            | **881**       | 847            | 810           | **883**       | 874            |
| AI2D_TEST    | 81.38         | **85.07**     | **87.23**      | **85.07**     | 83.55          | 77.07         | 80.472        | 83.84          |
| OCRVQA_TEST  | 66.15         | 68.98         | **74.06**      | 39.03         | 35.58          | 69.24         | **71.02**     | 66.8           |
| TextVQA_VAL  | 80.11         | 82.13         | **84.48**      | 82.15         | 83.52          | 79.09         | **84.962**    | 83.26          |
| DocVQA_VAL   | 90.85         | 93.54         | 94.75          | 92.04         | 83.82          | 92.71         | **94.91**     | **95.75**      |
| ChartQA_TEST | 81.79         | **87.72**     | **89.6**       | 85.76         | 82.04          | 83.4          | 86.68         | 87.16          |

#### Mathematical Reasoning

| Benchmark         | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
| ----------------- | ------------- | -------------- | ------------- | -------------- | ------------- | -------------- |
| Mathvista-mini    | 69.19         | **78.6**       | 69.5          | 70.1           | 67.2          | 73.9           |
| Mathvision        | 32.82         | **50.29**      | 29.61         | 34.8           | 25.95         | 39.34          |
| Mathverse         | 48.4          | **61.04**      | 43.68         | 49.26          | 44.21         | 55.18          |
| ChartQA Pro       | 50.43         | **52**         | 37.32         | 44.43          | 43.73         | 45.3           |
| HallusionBench    | 51.72         | **54.52**      | 49.2          | 40.2           | 47.9          | 49.9           |
| InHouse Dataset A | 59.87         | **71.78**      | 40.64         | 41.47          | 45.58         | 57.2           |
| InHouse Dataset B | 61.33         | **75.6**       | 36.25         | 42.65          | 30.62         | 59.68          |

## Quick Start

### Installation

```bash
pip install transformers accelerate torch torchvision pillow einops
```

### Using Transformers

```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate tile grids (columns x rows) within the tile budget
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image, matching the model's dtype and device
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).to(model.device)

# Inference
prompt = "<image>请识别图中所有文字"  # "Recognize all text in the image"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
```
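
The `chat()` helper can also be used for multi-turn conversation. The sketch below continues the snippet above (it reuses `model`, `tokenizer`, and `load_image` defined there) and assumes an InternVL-style interface with `history`/`return_history` arguments, as suggested by the `InternVLChatModel` override used in the vLLM section below; the exact argument names may differ in this repository's implementation, so treat this as a sketch rather than a guaranteed API.

```python
# Continues the snippet above: `model`, `tokenizer`, and `load_image` are reused.
# Assumption: chat() follows the InternVL-style interface and accepts
# `history` / `return_history`; check the repository code if these differ.
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16).to(model.device)

question = "<image>Describe the layout of this document."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=question,
    generation_config={"max_new_tokens": 512},
    history=None,
    return_history=True,
)

# Follow-up turn: no new <image> tag is needed because the image is already in the history.
follow_up = "List any dates that appear in the text."
response, history = model.chat(
    tokenizer,
    pixel_values=pixel_values,
    question=follow_up,
    generation_config={"max_new_tokens": 512},
    history=history,
    return_history=True,
)
print(response)
```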

### Using vLLM

You can deploy Qianfan-VL using vLLM's official Docker image for high-performance inference with an OpenAI-compatible API:

#### Start vLLM Service

```bash
docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
```

#### Call the API

```bash
curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'
```

Or use Python with the OpenAI SDK:

```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"  # "Describe this image"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
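
For local images that are not publicly hosted, the OpenAI-compatible endpoint also accepts images embedded as base64 data URLs, so no public URL is required. A minimal sketch follows; the file path and the English prompt are placeholders, and it assumes the vLLM service started above is running on port 8000.

```python
import base64
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# Encode a local image as a data URL (the path is a placeholder).
with open("./example/scene_ocr.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "<image>Recognize all text in the image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```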

## Training Details

### Four-Stage Progressive Training

1. **Cross-modal Alignment** (100B tokens): Establishes vision-language connections
2. **General Knowledge Injection** (3.5T tokens): Builds strong foundational capabilities
3. **Domain Enhancement** (300B tokens): Specialized OCR and reasoning capabilities
4. **Post-training** (1B tokens): Instruction following and preference alignment

### Infrastructure

- Trained on 5000+ Baidu Kunlun chips
- Single-task parallel training across 5000 chips
- 90%+ scaling efficiency for large-scale distributed training
- Communication-computation fusion technology

## Model Card

- **Developed by**: Baidu AI Cloud Qianfan Team
- **Model type**: Vision-Language Transformer
- **Languages**: Multilingual (primarily English and Chinese)
- **License**: Composite: MIT for original contributions, Llama 3.1 Community License for Llama Materials (see the LICENSE file)
- **Base Architecture**: See the Architecture section above and the technical report

## Citation

If you use Qianfan-VL in your research, please cite:

```bibtex
@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}
```

## Contact

For more information and API access, visit the [Baidu Qianfan Platform](https://qianfan.cloud.baidu.com/).

## Acknowledgments

This model series combines general multimodal capabilities with domain-specific enhancements for real-world enterprise applications.
configuration_qianfanvl_chat.py (CHANGED)

@@ -1,4 +1,4 @@
-# Copyright (c) 2025
+# Copyright (c) 2025 Baidu
 # Licensed under the MIT License. See LICENSE file in the project root for full license information.
 import copy
modeling_qianfanvl_chat.py (CHANGED)

@@ -1,4 +1,4 @@
-# Copyright (c) 2025
+# Copyright (c) 2025 Baidu
 # Licensed under the MIT License. See LICENSE file in the project root for full license information.
 import warnings
 from typing import List, Optional, Tuple, Union