---
language:
- en
license: other
license_name: wan-license
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-generation
- image-to-video
- text-to-video
- diffusion
- video-diffusion
- camera-control
- lora
- wan
- wan22
- fp8
- quantized
- gguf
base_model: wan22
base_model_relation: quantized
inference: true
model-index:
- name: WAN 2.2 FP8/GGUF - I2V/T2V Models
  results:
  - task:
      type: image-to-video
      name: Image-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
  - task:
      type: text-to-video
      name: Text-to-Video Generation
    metrics:
    - name: Inference Steps
      type: steps
      value: 50
      verified: false
    - name: VRAM Usage (FP8)
      type: memory_gb
      value: 16
      verified: false
---

# WAN 2.2 FP8 - Image-to-Video and Text-to-Video Models

High-quality image-to-video (I2V) and text-to-video (T2V) generation models in FP8 and GGUF quantized formats, with camera-control and enhancement LoRAs for memory-efficient deployment.

## Model Description

WAN 2.2 FP8 is a 14-billion-parameter diffusion-based video generation model, quantized to FP8 and GGUF formats for efficient deployment on consumer-grade hardware. This repository contains the FP8 and GGUF variants, which deliver near-FP16 quality with significantly reduced VRAM requirements.

**Key Features**:

- 14B-parameter diffusion-based architecture
- FP8 and GGUF quantized formats for memory efficiency (~50% smaller than FP16)
- Dedicated VAE for video latent encoding/decoding
- Extensive LoRA ecosystem for camera control (v2) and visual enhancement
- Support for both high-noise (creative) and low-noise (faithful) generation modes
- Text-to-video and image-to-video capabilities

**Model Statistics**:

- **Total Repository Size**: ~89GB
- **Model Architecture**: Diffusion-based video generation
- **Supported Formats**: `.safetensors` (FP8), `.gguf` (Q4/Q8 quantized)
- **Parameters**: 14 billion
- **Precision**: FP8 E4M3FN and GGUF Q4/Q8 quantization
- **Input**: Text prompts and/or images
- **Output**: Video sequences (typically 16-24 frames)

## How to Get Started with the Model

Quick-start example for image-to-video generation with FP8 (paths and the base pipeline identifier are placeholders):

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load your input image
input_image = Image.open("your_image.jpg")

# Load pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "base-model-path",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

# Load I2V weights (FP8 for balanced performance) into the denoiser
# (.safetensors files are read with safetensors, not torch.load)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

pipe.to("cuda")

# Generate video
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames
```

For detailed usage examples including camera control, GGUF models, and LoRA combinations, see the [Usage](#usage) section below.
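FP8 tensors require a recent PyTorch build, and native FP8 acceleration is only available on newer GPUs. The following is a minimal, illustrative helper (not part of the original example) for choosing a safe `torch_dtype` before loading the pipeline; the compute-capability threshold is an assumption based on Ada/Hopper-class hardware:

```python
import torch

def pick_dtype() -> torch.dtype:
    """Choose FP8 only when the PyTorch build and GPU are likely to support it."""
    if not torch.cuda.is_available():
        return torch.float32
    has_fp8 = hasattr(torch, "float8_e4m3fn")     # PyTorch 2.1+ exposes FP8 dtypes
    major, minor = torch.cuda.get_device_capability()
    if has_fp8 and (major, minor) >= (8, 9):      # Ada (8.9) and Hopper (9.0) have FP8 units
        return torch.float8_e4m3fn
    return torch.float16                          # safe fallback for older GPUs

print(pick_dtype())
```

The returned dtype can be passed as `torch_dtype` in the quick-start example above.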
## Directory Structure

```
wan22-fp8/
├── diffusion_models/wan/   # FP8 and GGUF quantized I2V and T2V models
├── loras/wan/              # Camera control (v2), action, and enhancement LoRAs
└── vae/wan/                # Video VAE for latent encoding/decoding
```

## Models

### Base Diffusion Models

Located in `diffusion_models/wan/`

#### Text-to-Video (T2V) Models (FP8)

| Model | Precision | Size | VRAM Required | Use Case |
|-------|-----------|------|---------------|----------|
| `wan22-t2v-high-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, high noise schedule |
| `wan22-t2v-low-noise-14b-fp8-scaled.safetensors` | FP8 | 14GB | 16GB+ | General T2V, low noise schedule |

#### Image-to-Video (I2V) Models

**FP8 Precision** (Balanced Quality/Performance):

| Model | Size | VRAM Required | Description |
|-------|------|---------------|-------------|
| `wan22-i2v-high-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Creative generation, higher variance |
| `wan22-i2v-low-noise-14b-fp8-scaled.safetensors` | 14GB | 16GB+ | Faithful reproduction, consistent results |

**GGUF Quantized** (Memory Efficient):

| Model | Size | VRAM Required | Quantization | Description |
|-------|------|---------------|--------------|-------------|
| `wan22-i2v-a14b-highnoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, high-noise |
| `wan22-i2v-a14b-lownoise-q4-k-s.gguf` | 8.2GB | 12GB+ | Q4_K_S | Most memory efficient, low-noise |
| `wan22-i2v-a14b-gguf-a14b-high.gguf` | 15GB | 16GB+ | Q8 | Higher-precision quantization |

### Video VAE

Located in `vae/wan/`

- **File**: `wan22-vae.safetensors`
- **Size**: 1.4GB
- **Purpose**: Video latent encoder/decoder for compressing video frames

### Enhancement LoRAs

Located in `loras/wan/`

#### Camera Control LoRAs (v2 - Enhanced)

| LoRA | Size | Description | Prompt Examples |
|------|------|-------------|-----------------|
| `wan22-camera-rotation-rank16-v2.safetensors` | 293MB | Rotating camera movements | "rotating camera", "camera circles around subject" |
| `wan22-camera-arcshot-rank16-v2-high.safetensors` | 293MB | Cinematic arc shots | "arc shot", "curved camera movement" |
| `wan22-camera-drone-rank16-v2.safetensors` | 293MB | Aerial drone perspectives | "aerial view", "drone shot", "bird's eye view" |
| `wan22-camera-adr1a-v1.safetensors` | 293MB | Advanced camera control | Custom camera trajectories |
| `wan22-camera-earthzoomout.safetensors` | 293MB | Earth zoom-out effects | "zooming out from earth", "planet zoom" |

#### Visual Enhancement LoRAs

| LoRA | Size | Purpose | Effect |
|------|------|---------|--------|
| `wan22-face-naturalizer.safetensors` | 586MB | Face enhancement | More natural-looking facial movements |
| `wan22-light-volumetric.safetensors` | 293MB | Lighting effects | Volumetric lighting, god rays, atmospheric effects |
| `wan22-light-cinematicflare-i2v-low.safetensors` | 293MB | Lens flare effects | Cinematic lens flares and light blooms for I2V |
| `wan22-upscale-realismboost-t2v-14b.safetensors` | 293MB | Quality boost | Enhanced realism for T2V generation |

#### Action LoRAs

| LoRA | Size | Action Type | Application |
|------|------|-------------|-------------|
| `wan22-action-wink-i2v-v1-low-noise.safetensors` | 147MB | Facial actions | Controlled winking animations |

## Usage

### Basic Image-to-Video Generation (FP8)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch
from PIL import Image

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Load I2V pipeline with FP8 support
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 I2V weights into the denoiser
# (.safetensors files are read with safetensors, not torch.load)
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-high-noise-14b-fp8-scaled.safetensors"
))

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from image
video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=50,
    num_frames=16
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video, "output.mp4", fps=8)
```

### Text-to-Video Generation (FP8)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from safetensors.torch import load_file
import torch

# Load T2V pipeline
pipe = DiffusionPipeline.from_pretrained(
    "path-to-base-model",
    torch_dtype=torch.float8_e4m3fn
)

# Load WAN 2.2 FP8 T2V weights into the denoiser
pipe.unet.load_state_dict(load_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-t2v-low-noise-14b-fp8-scaled.safetensors"
))

# Load WAN 2.2 VAE
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan22-fp8/vae/wan/wan22-vae.safetensors"
)

pipe.to("cuda")

# Generate video from text
video = pipe(
    prompt="a cat walking through a garden, high quality, cinematic",
    num_inference_steps=50,
    num_frames=16
).frames
```

### Using Camera Control LoRAs

```python
# After loading the base pipeline, add camera control
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-rotation-rank16-v2.safetensors"
)

# Generate with camera movement
video = pipe(
    image=input_image,
    prompt="rotating camera around a sculpture",
    num_inference_steps=50
).frames
```

### Combining Multiple LoRAs

```python
# Load multiple LoRAs with different weights
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-camera-drone-rank16-v2.safetensors",
    adapter_name="camera_drone"
)
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-volumetric.safetensors",
    adapter_name="volumetric_light"
)

# Set LoRA weights
pipe.set_adapters(["camera_drone", "volumetric_light"], adapter_weights=[0.8, 0.6])

# Generate with combined effects
video = pipe(
    image=input_image,
    prompt="aerial drone shot with volumetric lighting at sunset",
    num_inference_steps=50
).frames
```

### Using the Cinematic Flare LoRA

```python
# Load cinematic flare LoRA for I2V
pipe.load_lora_weights(
    "E:/huggingface/wan22-fp8/loras/wan/wan22-light-cinematicflare-i2v-low.safetensors"
)

# Generate with lens flare effects
video = pipe(
    image=input_image,
    prompt="cinematic lens flare, light bloom, professional cinematography",
    num_inference_steps=50
).frames
```

## Model Selection Guide

### Precision Trade-offs (This Repository)

**FP8 Models** (available in this repo):

- ✅ 50% smaller than FP16 (14GB vs 27GB)
- ✅ Minimal quality loss compared to FP16
- ✅ Faster inference on GPUs with FP8 tensor cores
- ✅ Balanced quality/performance
- ❌ Requires 16GB+ VRAM
- 🎯 Use for: Production deployment, most users, balanced quality

**GGUF Q4_K_S** (available in this repo):

- ✅ Smallest size (8.2GB)
- ✅ Works on 12GB VRAM GPUs
- ✅ Fastest inference
- ❌ More quality degradation than FP8
- 🎯 Use for: Memory-constrained systems, rapid prototyping, testing (see the loading sketch below)

**GGUF Q8** (available in this repo):

- ✅ Medium size (15GB)
- ✅ Better quality than Q4
- ✅ Works on 16GB VRAM GPUs
- 🎯 Use for: A balance between Q4 and FP8 quality

**FP16 Models** (not in this repo):

- See the separate wan22-fp16 repository for full-precision variants
- 27GB per model, requires 24GB+ VRAM
- Maximum quality for research and archival use
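The GGUF checkpoints listed above are not loaded with `DiffusionPipeline.from_pretrained`. Recent diffusers releases can read single-file GGUF weights through `GGUFQuantizationConfig`; the sketch below assumes such a build, and the `WanTransformer3DModel` class name is an assumption — substitute whichever denoiser class your diffusers version (or a community GGUF loader such as ComfyUI) expects:

```python
import torch
from diffusers import GGUFQuantizationConfig, WanTransformer3DModel  # class name assumed

# Load a Q4_K_S checkpoint; weights are dequantized to bfloat16 for compute
transformer = WanTransformer3DModel.from_single_file(
    "E:/huggingface/wan22-fp8/diffusion_models/wan/wan22-i2v-a14b-lownoise-q4-k-s.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# The resulting module can then stand in for the FP8 denoiser in the pipeline.
```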
### Noise Schedule Selection

**High-Noise Models**:

- More creative interpretation
- Better for abstract or stylized content
- Higher variance in outputs

**Low-Noise Models**:

- More faithful to the input/prompt
- Better for realistic content
- More consistent results

## Hardware Requirements

| Model Type | Minimum VRAM | Recommended VRAM | GPU Examples |
|------------|--------------|------------------|--------------|
| I2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090, RTX 4070 Ti Super |
| I2V GGUF Q4 | 12GB | 16GB+ | RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB |
| I2V GGUF Q8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |
| T2V FP8 | 16GB | 20GB+ | RTX 4080, RTX 3090 |

**System Requirements**:

- CUDA 11.8+ or 12.1+
- PyTorch 2.1+ (with FP8 support)
- diffusers 0.20+
- 89GB free disk space (full repository)
- 32GB+ system RAM recommended

## Performance Tips

1. **Memory Optimization**:
   - Start with GGUF Q4 models on 12GB GPUs
   - Use FP8 models on 16GB+ GPUs (best quality/VRAM balance)
   - Enable `torch.cuda.amp` mixed precision if needed
   - Use gradient checkpointing if fine-tuning

2. **Quality Optimization**:
   - FP8 provides the best quality in this repository
   - Combine multiple LoRAs at weights of 0.6-0.8
   - Experiment with both high- and low-noise variants
   - For maximum quality, use the FP16 models from the wan22-fp16 repository

3. **Speed Optimization**:
   - Use GGUF Q4 quantized models for rapid prototyping (fastest)
   - FP8 models perform well on RTX 40-series GPUs with FP8 tensor cores
   - Reduce `num_inference_steps` to 20-30 for testing
   - Enable xformers attention: `pipe.enable_xformers_memory_efficient_attention()`

4. **GPU-Specific Tips**:
   - **RTX 40 series**: FP8 models perform excellently with native support
   - **RTX 30 series**: FP8 is still faster than FP16; use it on 16GB+ cards
   - **12GB GPUs**: Use GGUF Q4 models exclusively
   - **16GB GPUs**: Choose between FP8 and GGUF Q8 based on quality needs

## Prompting Guidelines

### Camera Movement Prompts

- **Rotation**: "rotating camera", "camera circles around", "360-degree view", "orbital camera"
- **Arc Shot**: "arc shot", "curved camera movement", "sweeping motion", "cinematic arc"
- **Drone**: "aerial view", "drone shot", "bird's eye view", "flying camera", "overhead shot"
- **Zoom**: "zooming out", "zoom in on subject", "dolly zoom"

### Enhancement Prompts

- **Volumetric Lighting**: "volumetric light rays", "god rays", "atmospheric lighting", "light shafts"
- **Cinematic Flare**: "lens flare", "cinematic bloom", "light bloom", "flare effects"
- **Face Naturalizer**: Use with portrait videos for more realistic facial expressions and movements

## File Formats

- **`.safetensors`**: Secure tensor format, recommended for most use cases
- **`.gguf`**: Quantized format for memory-constrained environments

## Intended Uses

### Direct Use

WAN 2.2 is designed for:

- **Content Creation**: Generate videos from text descriptions or images for creative projects, advertising, and entertainment
- **Prototyping**: Rapid video concept visualization for storyboarding and pre-production
- **Research**: Academic research in video generation, diffusion models, and controllable video synthesis
- **Application Development**: Building video generation features into applications and services

### Downstream Use

- Fine-tuning on domain-specific video datasets
- Integration with video editing pipelines
- Custom LoRA development for specialized camera movements or visual effects (a minimal adapter sketch follows this list)
- Video dataset augmentation and synthetic data generation
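For the custom-LoRA use case above, a minimal sketch of attaching trainable LoRA layers with `peft` before fine-tuning. The target module names are typical attention projections and are an assumption, not confirmed WAN 2.2 layer names; `pipe` is the pipeline loaded in the Usage section:

```python
from peft import LoraConfig

# Hypothetical rank-16 adapter; adjust target_modules to the actual WAN 2.2 layer names
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

pipe.unet.requires_grad_(False)     # freeze the base weights
pipe.unet.add_adapter(lora_config)  # diffusers models expose add_adapter when peft is installed

trainable = sum(p.numel() for p in pipe.unet.parameters() if p.requires_grad)
print(f"LoRA parameters to train: {trainable:,}")
```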
### Out-of-Scope Use

The model should NOT be used for:

- Generating deceptive, harmful, or misleading video content
- Creating deepfakes or non-consensual content of individuals
- Producing content that violates copyright or intellectual property rights
- Generating content intended to harass, abuse, or discriminate
- Creating videos for illegal purposes or activities

## Bias, Risks, and Limitations

### Known Limitations

**Technical Limitations**:

- **Temporal Consistency**: May produce flickering or inconsistent motion in long sequences
- **Fine Details**: Small objects or intricate textures may lack detail or consistency
- **Physical Realism**: Generated physics may not always follow real-world rules (gravity, momentum, etc.)
- **Text Rendering**: Cannot reliably render readable text within generated videos
- **Face Quality**: Faces may show artifacts or unnatural movements (mitigated by the face-naturalizer LoRA)
- **Memory Requirements**: High VRAM requirements limit accessibility (12-32GB depending on precision)

**Content Limitations**:

- Training data biases may affect representation of diverse demographics, cultures, and scenarios
- May struggle with uncommon objects, rare scenarios, or niche content
- Camera control may not always precisely match intended movements
- Generated content may reflect biases present in training data

### Risks and Mitigations

**Misuse Risks**:

- **Deepfakes and Misinformation**: The model could be used to create deceptive content
  - *Mitigation*: Implement content authentication, watermarking, and usage monitoring
- **Copyright Infringement**: May generate content similar to copyrighted material
  - *Mitigation*: Avoid training on copyrighted data; implement content filtering
- **Harmful Content**: Could generate disturbing or inappropriate content
  - *Mitigation*: Implement safety filters, content moderation, and responsible-use guidelines

**Ethical Considerations**:

- Users should obtain appropriate permissions before generating videos of identifiable individuals
- Generated content should be clearly labeled as AI-generated to prevent deception
- Consider the environmental impact of compute-intensive inference
- Respect privacy, consent, and intellectual property rights

### Recommendations

- Implement content moderation and safety filters in production deployments
- Add visible or invisible watermarks to identify AI-generated content
- Provide clear disclaimers that content is AI-generated
- Monitor for misuse and implement usage policies
- Consider accessibility trade-offs when selecting model precision
- Validate outputs for unintended biases or harmful content before distribution

## Training Details

### Training Data

Training data details are not publicly available. Video diffusion models of this kind are typically trained on:

- Large-scale video datasets with diverse content
- Text-video pairs for caption conditioning
- Image-video pairs for image-to-video tasks

**Note**: Specific training dataset information should be obtained from the original model authors.
### Training Procedure

**Training Hyperparameters** (typical for models of this scale):

- Architecture: Diffusion transformer with 14B parameters
- Precision formats: FP16, FP8, GGUF quantization
- Video VAE: Separate encoder/decoder for latent compression
- LoRA adapters: Rank 16 to rank 64 for camera control

**Noise Schedules**:

- **High-noise models**: Greater noise variance for creative generation
- **Low-noise models**: Lower noise variance for faithful reproduction

### Compute Infrastructure

**Inference Requirements (This Repository)**:

- **FP8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090, RTX 4070 Ti Super)
- **GGUF Q4**: 12-16GB VRAM (NVIDIA RTX 4070 Ti, RTX 3080, RTX 4060 Ti 16GB)
- **GGUF Q8**: 16-20GB VRAM (NVIDIA RTX 4080, RTX 3090)

## Environmental Impact

Video generation models require significant computational resources. This FP8/GGUF repository provides more efficient alternatives:

- **Model Size**: 89GB total (FP8 + GGUF variants + LoRAs)
- **Inference Power**: 100-350W depending on GPU and model precision
- **Carbon Footprint**: Varies by energy source and usage patterns
- **Efficiency**: ~40% VRAM reduction vs FP16, enabling use on consumer GPUs

**Recommendations for Reducing Impact**:

- Use GGUF Q4 quantized models for maximum efficiency (8.2GB vs 27GB FP16)
- FP8 models provide an excellent quality/efficiency balance
- Batch process multiple requests to amortize overhead
- Use energy-efficient hardware (RTX 40 series with tensor cores)
- Use renewable energy sources when possible
- Consider carbon offsets for production deployments

## License

Please check the original WAN 2.2 model repository for specific license terms and usage restrictions. This repository uses the "other" license tag pending clarification of the original license.

## Citation

If you use WAN 2.2 in your research or applications, please cite the original model repository.

**BibTeX** (template):

```bibtex
@misc{wan22,
  title={WAN 2.2: Image-to-Video and Text-to-Video Generation},
  author={[Original Authors]},
  year={2024},
  howpublished={\url{https://huggingface.co/[original-repo]}},
}
```

## Model Card Authors

This model card was created by the repository maintainer based on available model information and standard Hugging Face model card guidelines.

## Model Card Contact

For questions about this model card or repository, please open an issue in the repository or contact the original model authors.

## Troubleshooting

**Out of Memory Errors**:

- Switch from FP8 to GGUF Q4 quantized models (12GB VRAM)
- Switch from GGUF Q8 to Q4 if still out of memory
- Reduce `num_frames` (try 8 or 12 instead of 16)
- Reduce batch size to 1
- Enable CPU offloading: `pipe.enable_model_cpu_offload()`
- Enable sequential CPU offload: `pipe.enable_sequential_cpu_offload()`
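A minimal sketch combining the memory-saving options listed above, assuming the pipeline from the Usage section is already loaded (offloading requires `accelerate`):

```python
# Trade speed for VRAM: keep only the active sub-model on the GPU
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()  # even lower peak VRAM, but much slower

video = pipe(
    image=input_image,
    prompt="cinematic shot, high quality",
    num_inference_steps=30,   # fewer steps while iterating
    num_frames=8,             # shorter clip, lower memory
).frames
```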
**Quality Issues**:

- Try both high-noise and low-noise variants
- If using GGUF Q4, try FP8 for better quality (requires 16GB+ VRAM)
- If using FP8 and maximum quality is needed, see the wan22-fp16 repository
- Adjust LoRA weights (0.5-1.0 range)
- Increase `num_inference_steps` (50-100)

**Slow Generation**:

- GGUF Q4 models are fastest for rapid iteration
- Enable xformers: `pipe.enable_xformers_memory_efficient_attention()`
- Reduce inference steps to 20-30 for testing
- FP8 performs best on RTX 40-series GPUs with native support

**GGUF Model Loading Issues**:

- Ensure you are using a GGUF-compatible loader
- GGUF models may require specific diffusers versions
- Check llama.cpp or GGUF-specific loading documentation

## Support

For issues, questions, or contributions, please refer to the main Hugging Face model repository.

## Related Repositories

- **wan22-fp16**: Full-precision FP16 variants (27GB per model, maximum quality)
- **wan21-fp8**: WAN 2.1 FP8 models (camera control v1, I2V only)
- **wan21-fp16**: WAN 2.1 FP16 models (camera control v1, I2V only)

## Summary

This repository contains WAN 2.2 models optimized for deployment on consumer-grade hardware through FP8 and GGUF quantization:

- **89GB total** (vs 142GB for the full-precision variants)
- **FP8 models**: 14GB each, excellent quality/VRAM balance
- **GGUF Q4 models**: 8.2GB each, maximum memory efficiency
- **Camera Control v2**: Enhanced camera LoRAs vs v1 in WAN 2.1
- **10 Enhancement LoRAs**: Camera control (5), lighting (2), face enhancement (1), quality boost (1), actions (1)
- **Both I2V and T2V**: Image-to-video and text-to-video capabilities

**Recommended for**: Production deployment, consumer GPUs (12GB+), balanced quality/performance needs

---

**Last Updated**: October 2024
**Model Version**: WAN 2.2 FP8/GGUF
**Repository Type**: Quantized Model Weights Storage
**Repository Size**: ~89GB