Nthuku-Fast: Efficient Multimodal Vision-Language Model

NOTE: This model is not fully trained. The architecture is correct, but the weights are essentially random rather than learned from data, so it behaves like someone who has never seen language being asked to write. Help with training it intensively on the datasets listed here would be very welcome; only a few of them have been used so far, in Google Colab, which has limited storage, which is why the model is not fully trained.

This dataset is also included: HuggingFaceH4/stack-exchange-preferences

βš™οΈ Best Strategy

  1. Pretrain on large-scale datasets: → LAION-5B, Conceptual Captions, SBU.
  2. Align & fine-tune on instruction datasets: → LLaVA, InstructBLIP, ShareGPT4V.
  3. Specialize for the task (QA, OCR, etc.): → add VQA, DocVQA, GQA, TextVQA (see the data-loading sketch below).
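As a rough illustration, the staged strategy above could be wired up with the Hugging Face datasets library in streaming mode (useful under Colab's storage limits). This is only a sketch: the dataset identifiers and splits below are assumptions to be checked against the Hub, not a confirmed training recipe.

from datasets import load_dataset

# Stage 1: large-scale image-text pairs, streamed so nothing is stored on disk
# (dataset IDs are assumptions -- verify them on the Hugging Face Hub first)
pretrain_ds = load_dataset("conceptual_captions", split="train", streaming=True)

# Stage 2: instruction / preference data for alignment and fine-tuning
instruct_ds = load_dataset("HuggingFaceH4/stack-exchange-preferences", split="train", streaming=True)

# Preview a couple of streamed samples to confirm the column layout
for example in pretrain_ds.take(2):
    print(example.keys())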

Model Description

Nthuku-Fast is a lightweight, efficient multimodal model designed for fast vision-to-text generation. It combines:

  • Mixture of Experts (MoE) architecture with 8 experts and top-2 routing
  • Grouped Query Attention (GQA) for a 4x smaller KV cache
  • Depthwise separable convolutions for efficient vision processing
  • Speed-focused design inspired by xAI's Grok Code Fast 1

Key Statistics

  • Total Parameters: 81,393,027
  • Active Parameters: 53,035,395 (~53.0M)
  • Efficiency Ratio: 65.2%
  • Vision Encoder: 4 layers, 6 attention heads
  • Text Decoder: 4 layers, 6 attention heads

Architecture Highlights

Mixture of Experts (MoE)

  • 8 experts with top-2 routing
  • 8x model capacity with only 25% active compute
  • Automatic load balancing for uniform expert utilization
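For illustration only, here is a minimal PyTorch sketch of top-2 routing over 8 experts. The layer sizes and module names are invented for this example and are not the model's actual internals.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTop2MoE(nn.Module):
    """Toy mixture-of-experts layer: 8 experts, top-2 routing (illustrative only)."""
    def __init__(self, dim=384, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):                      # x: (batch, seq, dim)
        scores = self.router(x)                # (batch, seq, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        # Inefficient but clear: each token's output is a weighted sum of its top-2 experts
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)   # tokens routed to expert e
                out = out + mask * weights[..., k:k+1] * expert(x)
        return out

moe = TinyTop2MoE()
print(moe(torch.randn(1, 4, 384)).shape)       # torch.Size([1, 4, 384])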

Grouped Query Attention (GQA)

  • 8 query heads, 2 key-value heads
  • 4x smaller KV cache = 4x faster inference
  • Maintains quality while reducing memory
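A minimal sketch of the idea behind GQA, with 2 key-value heads shared across 8 query heads. The head dimension and sequence length are illustrative, not taken from the model config.

import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 16, 48
n_q_heads, n_kv_heads = 8, 2              # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # KV cache is 4x smaller than with 8 KV heads
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand each KV head to cover its group of query heads before attention
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)   # (1, 8, 16, 48)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 16, 48])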

Efficient Vision Processing

  • Depthwise separable patch embeddings
  • 8-9x fewer parameters than standard convolutions
  • 224×224 images → 196 patches (14×14)
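As a sanity check on the quoted saving, factoring a standard 3×3 convolution into depthwise + pointwise layers gives roughly an 8-9x parameter reduction. The channel count below is illustrative; the encoder's actual patch-embedding layer may differ.

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

ch, k = 384, 3   # illustrative channel count and kernel size

standard = nn.Conv2d(ch, ch, kernel_size=k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=k, padding=1, groups=ch),  # depthwise: one spatial filter per channel
    nn.Conv2d(ch, ch, kernel_size=1),                        # pointwise: 1x1 channel mixing
)

print(n_params(standard) / n_params(separable))   # roughly 8.8x fewer parameters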

Usage

# -*- coding: utf-8 -*-
"""nthuku-fast interactive inference script"""

from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import torch
import os

# Step 1: Load model and tokenizer
model_name = "Qybera/nthuku-fast-1.5"
print(f"Loading model '{model_name}'...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    trust_remote_code=True,
    torch_dtype=torch.float32  # Use float32 for stability
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()  # Set to evaluation mode

print(f" Model loaded successfully on {device.upper()}")
print("\n  WARNING: This model has RANDOM WEIGHTS (not trained)!")
print("   Expected behavior: Gibberish outputs")
print("   To get coherent responses: Train the model on your dataset first")
print("   This script demonstrates the MODEL ARCHITECTURE, not trained inference\n")

# Step 2: Define inference function
def predict_answer(prompt, image_path=None):
    """
    Performs inference with optional image input.
    """
    try:
        # Prepare image if provided
        if image_path and os.path.exists(image_path):
            from torchvision import transforms
            image = Image.open(image_path).convert("RGB")
            
            # Standard image preprocessing (224x224 for vision encoder)
            transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
            ])
            pixel_values = transform(image).unsqueeze(0).to(device)
            
            # Encode image through vision encoder
            with torch.no_grad():
                # Vision encoder returns (output, attention_weights)
                vision_output = model.model.vision_encoder(pixel_values)
                if isinstance(vision_output, tuple):
                    vision_features = vision_output[0]  # Get only the output tensor
                else:
                    vision_features = vision_output
                
                # Project to common dimension
                vision_features = model.model.vision_projection(vision_features)
            
            print(f" Image processed: {vision_features.shape}")
            
        else:
            # Text-only: Create proper dummy vision features
            # Must match the output shape of vision_projection
            batch_size = 1
            seq_len = 196  # 14x14 patches
            hidden_dim = 384  # projection_dim from config
            
            vision_features = torch.zeros(batch_size, seq_len, hidden_dim, device=device)
            
            if image_path and not os.path.exists(image_path):
                print(f"[WARNING] Image not found: {image_path}, using text-only mode")

        # Tokenize prompt
        inputs = tokenizer(prompt, return_tensors="pt", max_length=256, truncation=True)  # match the 256-token context length
        input_ids = inputs.input_ids.to(device)
        
        print(f" Prompt tokens: {input_ids.shape[1]}")
        
        # Simple greedy generation (safer than sampling for untrained model)
        with torch.no_grad():
            generated_ids = input_ids.clone()
            max_new_tokens = 50
            
            for step in range(max_new_tokens):
                # Get embeddings
                embeddings = model.model.text_decoder.embedding(generated_ids)
                
                # Pass through decoder layers with vision features
                hidden_states = embeddings
                for layer in model.model.text_decoder.layers:
                    hidden_states, _ = layer(hidden_states, vision_features)
                
                # Get logits
                hidden_states = model.model.text_decoder.layer_norm(hidden_states)
                logits = model.model.text_decoder.lm_head(hidden_states)
                
                # Greedy decoding (most probable token)
                next_token_logits = logits[:, -1, :]
                next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
                
                # Stop if EOS token
                if next_token.item() == tokenizer.eos_token_id:
                    break
                
                # Append token
                generated_ids = torch.cat([generated_ids, next_token], dim=1)
                
                # Progress indicator
                if step % 10 == 0:
                    print(f" Generated {step} tokens...")
        
        # Decode output
        response = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
        return response
        
    except Exception as e:
        print(f"[ERROR] Error during inference: {str(e)}")
        import traceback
        traceback.print_exc()
        return None


# Step 3: Interactive CLI
if __name__ == "__main__":
    print("\n===  NTHUKU-FAST AI Inference ===")
    while True:
        prompt = input("\n Enter your text prompt (or type 'exit' to quit): ").strip()
        if prompt.lower() == "exit":
            print(" Exiting...")
            break

        use_image = input(" Does your prompt involve an image? (yes/no): ").strip().lower()
        image_path = None
        if use_image == "yes":
            image_path = input(" Enter image file path: ").strip()

        print("\n Running inference...\n")
        result = predict_answer(prompt, image_path)

        if result:
            print(" Model Response:")
            print(result)
        else:
            print("[ERROR] No result (check your inputs or image path).")

Training Details

  • Optimizer: AdamW with cosine scheduling
  • Mixed Precision: Automatic mixed precision (AMP) for speed
  • Gradient Accumulation: Configurable for large effective batch sizes
  • Load Balancing: Auxiliary loss for expert utilization
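A hedged sketch of how these pieces could fit together in a single training step. It assumes `model` is loaded as in the Usage section; `dataloader` and the `aux_loss` output field are assumed names for illustration, not confirmed parts of the released code.

import torch

# Assumes `model` (loaded as in the Usage section) and a `dataloader` yielding model inputs.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
scaler = torch.cuda.amp.GradScaler()
accum_steps, aux_weight = 8, 0.01   # effective batch size = accum_steps x per-step batch size

model.train()
for step, batch in enumerate(dataloader):
    with torch.cuda.amp.autocast():                     # mixed-precision forward pass
        out = model(**batch)
        # language-modeling loss plus the MoE load-balancing auxiliary loss (assumed field name)
        loss = (out.loss + aux_weight * out.aux_loss) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:                   # gradient-accumulation boundary
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        scheduler.step()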

Performance Characteristics

  • Speed: ~20 tokens/sec on modern GPUs
  • Memory: Low KV cache due to GQA
  • Scalability: MoE allows easy scaling of model capacity
  • Efficiency: 25% compute usage vs full model activation
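A back-of-the-envelope check on the KV-cache saving; the layer count, head dimension, and fp16 storage are illustrative assumptions rather than the actual config.

# Rough KV-cache footprint per generated token, in bytes.
layers, head_dim, bytes_per_val = 4, 64, 2   # fp16 = 2 bytes per value

def kv_bytes_per_token(n_kv_heads):
    return 2 * layers * n_kv_heads * head_dim * bytes_per_val   # 2 = keys + values

print(kv_bytes_per_token(8))   # 8 KV heads (standard MHA): 8192 bytes/token
print(kv_bytes_per_token(2))   # 2 KV heads (GQA):          2048 bytes/token -> 4x smaller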

Applications

  • 🖼️ Image Captioning: Fast, accurate scene descriptions
  • 🤖 Vision-Language AI: Multimodal chatbots and assistants
  • 📱 Edge Deployment: Mobile and embedded applications
  • ⚡ Real-time Systems: Low-latency vision understanding

Model Architecture

Input Image (224×224×3)
    ↓
Patch Embedding (14×14 patches)
    ↓
Vision Encoder (6 layers + MoE)
    ↓
Cross-Attention Projection
    ↓
Text Decoder (6 layers + MoE)
    ↓
Text Generation Output

Technical Specifications

  • Vision Input: 224×224 RGB images
  • Patch Size: 16×16 pixels
  • Context Length: 256 tokens
  • Vocabulary: GPT-2 tokenizer (50,257 tokens)
  • Audio Support: Wav2Vec2 integration (optional)
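A quick sanity check on the patch and vocabulary numbers above; this assumes the standard `gpt2` tokenizer checkpoint is the one used.

from transformers import GPT2TokenizerFast

patches = (224 // 16) ** 2                          # 14 x 14 = 196 patches per image
tok = GPT2TokenizerFast.from_pretrained("gpt2")     # standard GPT-2 BPE tokenizer
print(patches, len(tok))                            # 196 50257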

Citation

@misc{nthuku-fast-2025,
  title={Nthuku-Fast: Efficient Multimodal Vision-Language Model with MoE Architecture},
  author={Nthuku-Fast Team},
  year={2025},
  note={Efficient multimodal model inspired by xAI Grok Code Fast 1}
}

License

Apache 2.0


Built with efficiency and speed in mind. Perfect for applications requiring fast multimodal understanding.
