---
language:
- en
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- text-to-video
- diffusion
- video-diffusion
- camera-control
- lora
- wan
- wan22
- fp8
- quantized
- gguf
base_model: wan22
base_model_relation: quantized
inference: true
model-index:
- name: WAN 2.2 FP8/GGUF - I2V/T2V Models
  results:
  - task:
      type: image-to-video
      name: Image-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
  - task:
      type: text-to-video
      name: Text-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
---

# WAN 2.2 FP8 - Image-to-Video and Text-to-Video Models

High-quality image-to-video (I2V) and text-to-video (T2V) generation models in FP8 and GGUF quantized formats, with camera-control and enhancement LoRAs for memory-efficient deployment.

## Model Description

WAN 2.2 FP8 is a 14-billion-parameter diffusion-based video generation model, quantized to FP8 and GGUF formats for efficient deployment on consumer-grade hardware. This repository contains the FP8 and GGUF variants, which deliver near-FP16 quality with significantly reduced VRAM requirements.

**Key Features**:

- 14B-parameter diffusion-based architecture
- FP8 and GGUF quantized formats for memory efficiency (~50% smaller than FP16)
- Dedicated VAE for video latent encoding/decoding
- Extensive LoRA ecosystem for camera control (v2) and visual enhancement
- Support for both high-noise (creative) and low-noise (faithful) generation modes
- Text-to-video and image-to-video capabilities

**Model Statistics**:

- **Total Repository Size**: ~89GB
- **Model Architecture**: Diffusion-based video generation
- **Supported Formats**: `.safetensors` (FP8), `.gguf` (Q4/Q8 quantized)
- **Parameters**: 14 billion
- **Precision**: FP8 E4M3FN and GGUF Q4/Q8 quantization
- **Input**: Text prompts and/or images
- **Output**: Video sequences (typically 16-24 frames)

## How to Get Started with the Model

Quick-start example for image-to-video generation with FP8 (paths and the base pipeline identifier are placeholders):

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load your input image
input_image = Image.open("your_image.jpg")

# Load pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "base-model-path",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

# Load I2V weights (FP8 for balanced performance) into the denoiser
# (.safetensors files are read with safetensors, not torch.load)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

pipe.to("cuda")

# Generate video
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames
```

For detailed usage examples including camera control, GGUF models, and LoRA combinations, see the [Usage](#usage) section below.
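FP8 tensors require a recent PyTorch build, and native FP8 acceleration is only available on newer GPUs. The following is a minimal, illustrative helper (not part of the original example) for choosing a safe `torch_dtype` before loading the pipeline; the compute-capability threshold is an assumption based on Ada/Hopper-class hardware:

```python
import torch

def pick_dtype() -> torch.dtype:
    """Choose FP8 only when the PyTorch build and GPU are likely to support it."""
    if not torch.cuda.is_available():
        return torch.float32
    has_fp8 = hasattr(torch, "float8_e4m3fn")     # PyTorch 2.1+ exposes FP8 dtypes
    major, minor = torch.cuda.get_device_capability()
    if has_fp8 and (major, minor) >= (8, 9):      # Ada (8.9) and Hopper (9.0) have FP8 units
        return torch.float8_e4m3fn
    return torch.float16                          # safe fallback for older GPUs

print(pick_dtype())
```

The returned dtype can be passed as `torch_dtype` in the quick-start example above.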
## Directory Structure

```
wan22-fp8/
├── diffusion_models/wan/   # FP8 and GGUF quantized I2V and T2V models
├── loras/wan/              # Camera control (v2), action, and enhancement LoRAs
└── vae/wan/                # Video VAE for latent encoding/decoding
```

## Models

### Base Diffusion Models

Located in `diffusion_models/wan/`

#### Text-to-Video (T2V) Models (FP8)

| Model | Precision | Size | VRAM Required | Use Case |
|-------|-----------|------|---------------|----------|
| `wan22-t2v-high-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, high noise schedule |
| `wan22-t2v-low-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, low noise schedule |

#### Image-to-Video (I2V) Models

**FP8 Precision** (Balanced Quality/Performance):

| Model | Size | VRAM Required | Description |
|-------|------|---------------|-------------|
| `wan22-i2v-high-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Creative generation, higher variance |
| `wan22-i2v-low-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Faithful reproduction, consistent results |

**GGUF Quantized** (Memory Efficient):

| Model | Size | VRAM Required | Quantization | Description |
|-------|------|---------------|--------------|-------------|
| `wan22-i2v-a14b-highnoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, high-noise |
| `wan22-i2v-a14b-lownoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, low-noise |
| `wan22-i2v-a14b-gguf-a14b-high.gguf` | 15GB | 16GB+ | Q8 | Higher-precision quantization |

### Video VAE

Located in `vae/wan/`

- **File**: `wan22-vae.safetensors`
- **Size**: 1.4GB
- **Purpose**: Video latent encoder/decoder for compressing video frames

### Enhancement LoRAs

Located in `loras/wan/`

#### Camera Control LoRAs (v2 - Enhanced)

| LoRA | Size | Description | Prompt Examples |
|------|------|-------------|-----------------|
| `wan22-camera-rotation-rank16-v2.safetensors` | 293MB | Rotating camera movements | "rotating camera", "camera circles around subject" |
| `wan22-camera-arcshot-rank16-v2-high.safetensors` | 293MB | Cinematic arc shots | "arc shot", "curved camera movement" |
| `wan22-camera-drone-rank16-v2.safetensors` | 293MB | Aerial drone perspectives | "aerial view", "drone shot", "bird's eye view" |
| `wan22-camera-adr1a-v1.safetensors` | 293MB | Advanced camera control | Custom camera trajectories |
| `wan22-camera-earthzoomout.safetensors` | 293MB | Earth zoom-out effects | "zooming out from earth", "planet zoom" |

#### Visual Enhancement LoRAs

| LoRA | Size | Purpose | Effect |
|------|------|---------|--------|
| `wan22-face-naturalizer.safetensors` | 586MB | Face enhancement | More natural-looking facial movements |
| `wan22-light-volumetric.safetensors` | 293MB | Lighting effects | Volumetric lighting, god rays, atmospheric effects |
| `wan22-light-cinematicflare-i2v-low.safetensors` | 293MB | Lens flare effects | Cinematic lens flares and light blooms for I2V |
| `wan22-upscale-realismboost-t2v-14b.safetensors` | 293MB | Quality boost | Enhanced realism for T2V generation |

#### Action LoRAs

| LoRA | Size | Action Type | Application |
|------|------|-------------|-------------|
| `wan22-action-wink-i2v-v1-low-noise.safetensors` | 147MB | Facial actions | Controlled winking animations |

## Usage

### Basic Image-to-Video Generation (FP8)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights into the denoiser
# (.safetensors files are read with safetensors, not torch.load)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)
```

### Text-to-Video Generation (FP8)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch

# Load T2V pipeline
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights into the denoiser
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-t2v-low-noise-14b-fp8-scaled.safetensors"
))

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from text
video = pipe(
    prompt="a cat walking through a garden, high quality, cinematic",
    num_inference_steps=50,
    num_frames=16
).frames
```

### Using Camera Control LoRAs

```python
# After loading the base pipeline, add camera control
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-rotation-rank16-v2.safetensors"
)

# Generate with camera movement
video = pipe(
    image=input_image,
    prompt="rotating camera around a sculpture",
    num_inference_steps=50
).frames
```

### Combining Multiple LoRAs

```python
# Load multiple LoRAs with different weights
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors",
    adapter_name="camera_drone"
)
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-volumetric.safetensors",
    adapter_name="volumetric_light"
)

# Set LoRA weights
pipe.set_adapters(["camera_drone", "volumetric_light"], adapter_weights=[0.8, 0.6])

# Generate with combined effects
video = pipe(
    image=input_image,
    prompt="aerial drone shot with volumetric lighting at sunset",
    num_inference_steps=50
).frames
```

### Using the Cinematic Flare LoRA

```python
# Load cinematic flare LoRA for I2V
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-cinematicflare-i2v-low.safetensors"
)

# Generate with lens flare effects
video = pipe(
    image=input_image,
    prompt="cinematic lens flare, light bloom, professional cinematography",
    num_inference_steps=50
).frames
```

## Model Selection Guide

### Precision Trade-offs (This Repository)

**FP8 Models** (available in this repo):

- ✅ 50% smaller than FP16 (14GB vs 27GB)
- ✅ Minimal quality loss compared to FP16
- ✅ Faster inference on GPUs with FP8 tensor cores
- ✅ Balanced quality/performance
- ❌ Requires 16GB+ VRAM
- 🎯 Use for: Production deployment, most users, balanced quality

**GGUF Q4_K_S** (available in this repo):

- ✅ Smallest size (8.2GB)
- ✅ Works on 12GB VRAM GPUs
- ✅ Fastest inference
- ❌ More quality degradation than FP8
- 🎯 Use for: Memory-constrained systems, rapid prototyping, testing (see the loading sketch below)

**GGUF Q8** (available in this repo):

- ✅ Medium size (15GB)
- ✅ Better quality than Q4
- ✅ Works on 16GB VRAM GPUs
- 🎯 Use for: A balance between Q4 and FP8 quality

**FP16 Models** (not in this repo):

- See the separate wan22-fp16 repository for full-precision variants
- 27GB per model, requires 24GB+ VRAM
- Maximum quality for research and archival use
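The GGUF checkpoints listed above are not loaded with `DiffusionPipeline.from_pretrained`. Recent diffusers releases can read single-file GGUF weights through `GGUFQuantizationConfig`; the sketch below assumes such a build, and the `WanTransformer3DModel` class name is an assumption — substitute whichever denoiser class your diffusers version (or a community GGUF loader such as ComfyUI) expects:

```python
import torch
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel  # class name assumed

# Load a Q4_K_S checkpoint; weights are dequantized to bfloat16 for compute
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-a14b-lownoise-q4-k-s.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# The resulting module can then stand in for the FP8 denoiser in the pipeline.
```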
### Noise Schedule Selection

**High-Noise Models**:

- More creative interpretation
- Better for abstract or stylized content
- Higher variance in outputs

**Low-Noise Models**:

- More faithful to the input/prompt
- Better for realistic content
- More consistent results

## Hardware Requirements

| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|------------|--------------|------------------|--------------|
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V GGUF Q4 | 12GB | 16GB+ | RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB |
| I2V GGUF Q8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |

**System Requirements**:

- CUDA 11.8+ or 12.1+
- PyTorch 2.1+ (with FP8 support)
- diffusers 0.20+
- 89GB free disk space (full repository)
- 32GB+ system RAM recommended

## Performance Tips

1. **Memory Optimization**:
   - Start with GGUF Q4 models on 12GB GPUs
   - Use FP8 models on 16GB+ GPUs (best quality/VRAM balance)
   - Enable `torch.cuda.amp` mixed precision if needed
   - Use gradient checkpointing if fine-tuning

2. **Quality Optimization**:
   - FP8 provides the best quality in this repository
   - Combine multiple LoRAs at weights of 0.6-0.8
   - Experiment with both high- and low-noise variants
   - For maximum quality, use the FP16 models from the wan22-fp16 repository

3. **Speed Optimization**:
   - Use GGUF Q4 quantized models for rapid prototyping (fastest)
   - FP8 models perform well on RTX 40-series GPUs with FP8 tensor cores
   - Reduce `num_inference_steps` to 20-30 for testing
   - Enable xformers attention: `pipe.enable_xformers_memory_efficient_attention()`

4. **GPU-Specific Tips**:
   - **RTX 40 series**: FP8 models perform excellently with native support
   - **RTX 30 series**: FP8 is still faster than FP16; use it on 16GB+ cards
   - **12GB GPUs**: Use GGUF Q4 models exclusively
   - **16GB GPUs**: Choose between FP8 and GGUF Q8 based on quality needs

## Prompting Guidelines

### Camera Movement Prompts

- **Rotation**: "rotating camera", "camera circles around", "360-degree view", "orbital camera"
- **Arc Shot**: "arc shot", "curved camera movement", "sweeping motion", "cinematic arc"
- **Drone**: "aerial view", "drone shot", "bird's eye view", "flying camera", "overhead shot"
- **Zoom**: "zooming out", "zoom in on subject", "dolly zoom"

### Enhancement Prompts

- **Volumetric Lighting**: "volumetric light rays", "god rays", "atmospheric lighting", "light shafts"
- **Cinematic Flare**: "lens flare", "cinematic bloom", "light bloom", "flare effects"
- **Face Naturalizer**: Use with portrait videos for more realistic facial expressions and movements

## File Formats

- **`.safetensors`**: Secure tensor format, recommended for most use cases
- **`.gguf`**: Quantized format for memory-constrained environments

## Intended Uses

### Direct Use

WAN 2.2 is designed for:

- **Content Creation**: Generate videos from text descriptions or images for creative projects, advertising, and entertainment
- **Prototyping**: Rapid video concept visualization for storyboarding and pre-production
- **Research**: Academic research in video generation, diffusion models, and controllable video synthesis
- **Application Development**: Building video generation features into applications and services

### Downstream Use

- Fine-tuning on domain-specific video datasets
- Integration with video editing pipelines
- Custom LoRA development for specialized camera movements or visual effects (a minimal adapter sketch follows this list)
- Video dataset augmentation and synthetic data generation
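For the custom-LoRA use case above, a minimal sketch of attaching trainable LoRA layers with `peft` before fine-tuning. The target module names are typical attention projections and are an assumption, not confirmed WAN 2.2 layer names; `pipe` is the pipeline loaded in the Usage section:

```python
from peft import LoraConfig

# Hypothetical rank-16 adapter; adjust target_modules to the actual WAN 2.2 layer names
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

pipe.unet.requires_grad_(False)     # freeze the base weights
pipe.unet.add_adapter(lora_config)  # diffusers models expose add_adapter when peft is installed

trainable = sum(p.numel() for p in pipe.unet.parameters() if p.requires_grad)
print(f"LoRA parameters to train: {trainable:,}")
```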
### Out-of-Scope Use

The model should NOT be used for:

- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content that violates copyright or intellectual property rights
- Generating content intended to harass, abuse, or discriminate
- Creating videos for illegal purposes or activities

## Bias, Risks, and Limitations

### Known Limitations

**Technical Limitations**:

- **Temporal Consistency**: May produce flickering or inconsistent motion in long sequences
- **Fine Details**: Small objects or intricate textures may lack detail or consistency
- **Physical Realism**: Generated physics may not always follow real-world rules (gravity, momentum, etc.)
- **Text Rendering**: Cannot reliably render readable text within generated videos
- **Face Quality**: Faces may show artifacts or unnatural movements (mitigated by the face-naturalizer LoRA)
- **Memory Requirements**: High VRAM requirements limit accessibility (12-32GB depending on precision)

**Content Limitations**:

- Training data biases may affect representation of diverse demographics, cultures, and scenarios
- May struggle with uncommon objects, rare scenarios, or niche content
- Camera control may not always precisely match intended movements
- Generated content may reflect biases present in training data

### Risks and Mitigations

**Misuse Risks**:

- **Deepfakes and Misinformation**: The model could be used to create deceptive content
  - *Mitigation*: Implement content authentication, watermarking, and usage monitoring
- **Copyright Infringement**: May generate content similar to copyrighted material
  - *Mitigation*: Avoid training on copyrighted data; implement content filtering
- **Harmful Content**: Could generate disturbing or inappropriate content
  - *Mitigation*: Implement safety filters, content moderation, and responsible-use guidelines

**Ethical Considerations**:

- Users should obtain appropriate permissions before generating videos of identifiable individuals
- Generated content should be clearly labeled as AI-generated to prevent deception
- Consider the environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights

### Recommendations

- Implement content moderation and safety filters in production deployments
- Add visible or invisible watermarks to identify AI-generated content
- Provide clear disclaimers that content is AI-generated
- Monitor for misuse and implement usage policies
- Consider accessibility trade-offs when selecting model precision
- Validate outputs for unintended biases or harmful content before distribution

## Training Details

### Training Data

Training data details are not publicly available. Video diffusion models of this kind are typically trained on:

- Large-scale video datasets with diverse content
- Text-video pairs for caption conditioning
- Image-video pairs for image-to-video tasks

**Note**: Specific training dataset information should be obtained from the original model authors.
### Training Procedure

**Training Hyperparameters** (typical for models of this scale):

- Architecture: Diffusion transformer with 14B parameters
- Precision formats: FP16, FP8, GGUF quantization
- Video VAE: Separate encoder/decoder for latent compression
- LoRA adapters: Rank 16 to rank 64 for camera control

**Noise Schedules**:

- **High-noise models**: Greater noise variance for creative generation
- **Low-noise models**: Lower noise variance for faithful reproduction

### Compute Infrastructure

**Inference Requirements (This Repository)**:

- **FP8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090, RTX 4070 Ti Super)
- **GGUF Q4**: 12-16GB VRAM (NVIDIA RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB)
- **GGUF Q8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090)

## Environmental Impact

Video generation models require significant computational resources. This FP8/GGUF repository provides more efficient alternatives:

- **Model Size**: 89GB total (FP8 + GGUF variants + LoRAs)
- **Inference Power**: 100-350W depending on GPU and model precision
- **Carbon Footprint**: Varies by energy source and usage patterns
- **Efficiency**: ~40% VRAM reduction vs FP16, enabling use on consumer GPUs

**Recommendations for Reducing Impact**:

- Use GGUF Q4 quantized models for maximum efficiency (8.2GB vs 27GB FP16)
- FP8 models provide an excellent quality/efficiency balance
- Batch process multiple requests to amortize overhead
- Use energy-efficient hardware (RTX 40 series with tensor cores)
- Use renewable energy sources when possible
- Consider carbon offsets for production deployments

## License

Please check the original WAN 2.2 model repository for specific license terms and usage restrictions. This repository uses the "other" license tag pending clarification of the original license.

## Citation

If you use WAN 2.2 in your research or applications, please cite the original model repository.

**BibTeX** (template):

```bibtex
@misc{wan22,
  title={WAN 2.2: Image-to-Video and Text-to-Video Generation},
  author={[Original Authors]},
  year={2024},
  howpublished={\url{https://huggingface.co/[original-repo]}},
}
```

## Model Card Authors

This model card was created by the repository maintainer based on available model information and standard Hugging Face model card guidelines.

## Model Card Contact

For questions about this model card or repository, please open an issue in the repository or contact the original model authors.

## Troubleshooting

**Out of Memory Errors**:

- Switch from FP8 to GGUF Q4 quantized models (12GB VRAM)
- Switch from GGUF Q8 to Q4 if still out of memory
- Reduce `num_frames` (try 8 or 12 instead of 16)
- Reduce batch size to 1
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential CPU offload: `pipe.enable_sequential_cpu_offload()`
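A minimal sketch combining the memory-saving options listed above, assuming the pipeline from the Usage section is already loaded (offloading requires `accelerate`):

```python
# Trade speed for VRAM: keep only the active sub-model on the GPU
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()  # even lower peak VRAM, but much slower

video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=30,   # fewer steps while iterating
    num_frames=8,             # shorter clip, lower memory
).frames
```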
**Quality Issues**:

- Try both high-noise and low-noise variants
- If using GGUF Q4, try FP8 for better quality (requires 16GB+ VRAM)
- If using FP8 and maximum quality is needed, see the wan22-fp16 repository
- Adjust LoRA weights (0.5-1.0 range)
- Increase `num_inference_steps` (50-100)

**Slow Generation**:

- GGUF Q4 models are fastest for rapid iteration
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 20-30 for testing
- FP8 performs best on RTX 40-series GPUs with native support

**GGUF Model Loading Issues**:

- Ensure you are using a GGUF-compatible loader
- GGUF models may require specific diffusers versions
- Check llama.cpp or GGUF-specific loading documentation

## Support

For issues, questions, or contributions, please refer to the main Hugging Face model repository.

## Related Repositories

- **wan22-fp16**: Full-precision FP16 variants (27GB per model, maximum quality)
- **wan21-fp8**: WAN 2.1 FP8 models (camera control v1, I2V only)
- **wan21-fp16**: WAN 2.1 FP16 models (camera control v1, I2V only)

## Summary

This repository contains WAN 2.2 models optimized for deployment on consumer-grade hardware through FP8 and GGUF quantization:

- **89GB total** (vs 142GB for the full-precision variants)
- **FP8 models**: 14GB each, excellent quality/VRAM balance
- **GGUF Q4 models**: 8.2GB each, maximum memory efficiency
- **Camera Control v2**: Enhanced camera LoRAs vs v1 in WAN 2.1
- **10 Enhancement LoRAs**: Camera control (5), lighting (2), face enhancement (1), quality boost (1), actions (1)
- **Both I2V and T2V**: Image-to-video and text-to-video capabilities

**Recommended for**: Production deployment, consumer GPUs (12GB+), balanced quality/performance needs

---

**Last Updated**: October 2024
**Model Version**: WAN 2.2 FP8/GGUF
**Repository Type**: Quantized Model Weights Storage
**Repository Size**: ~89GB