You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

You agree to not use the model to for government surveillance or law enforcement purposes.

Log in or Sign Up to review the conditions and access this model content.

Crashout: Real-time Human Gait Analysis

Crashout is a hybrid deep learning system for real-time human gait analysis, built entirely in Rust using the Burn deep learning framework. The system combines computer vision (pose detection) with temporal sequence modeling (LSTM + Transformer) to analyze human walking patterns for medical, sports, and research applications.

Architecture Overview

Input Video Stream β†’ Pose Detection β†’ LSTM Temporal Processing β†’ Transformer Attention β†’ Gait Analysis

Data Flow:

  1. Video Input: RGB frames [batch, 3, 640, 640]
  2. Pose Detection: Extract 17 keypoints β†’ [batch, seq_len, 17, 3] (x, y, confidence)
  3. Sequential Processing: LSTM processes flattened poses [batch, seq_len, 51]
  4. Attention Layer: Transformer attends to important temporal moments
  5. Gait Analysis: Classification, quality scoring, and feature extraction

Quick Start

Building the Library

# Build the library
cargo build --lib

# Check for compilation errors
cargo check --lib

# Run tests (including doctests)
cargo test

# Build with optimizations
cargo build --release --lib

# Run clippy for linting
cargo clippy

# Format code
cargo fmt

Training Pipeline

Train gait analysis models directly on video data with CSV labels:

# Build the training binary
cargo build --bin crashout

# Train a quality scoring model
cargo run --bin crashout train --data ./training_data --model-type quality --epochs 50

# Train a pathology classification model
cargo run --bin crashout train --data ./training_data --model-type classification --num-classes 5 --epochs 100

# Train a multi-task model with custom parameters
cargo run --bin crashout train \
  --data ./training_data \
  --model-type multi-task \
  --num-classes 5 \
  --epochs 200 \
  --batch-size 8 \
  --learning-rate 0.0005 \
  --max-seq-len 150 \
  --val-split 0.15 \
  --output ./models \
  --model-name gait_classifier

Video Resolution Standardization

All videos are automatically resized to 640x640 for consistent processing:

  • Input: Any size MP4 video (720p, 1080p, 4K, etc.)
  • Processing: Frames resized to 640x640 during extraction
  • Pose Detection: Operates on 640x640 frames (optimal input size)
  • Crashout Model: Processes 640x640 frames (consistent with pose detection)
  • Output: Labeled video at 640x640 resolution

This ensures:

  • βœ… Consistent model performance across all input videos
  • βœ… Optimal processing speed and memory usage
  • βœ… No resolution mismatches between components
  • βœ… Standardized training data format

CLI Options

The training command supports extensive customization:

# Full training command with all options
cargo run --bin crashout train \
  --data ./training_data \
  --model-type multi-task \
  --num-classes 5 \
  --epochs 200 \
  --batch-size 8 \
  --learning-rate 0.0005 \
  --max-seq-len 150 \
  --val-split 0.15 \
  --output ./models \
  --model-name gait_classifier \
  --skip-frames 2 \
  --video-extensions mp4,avi,mov

Available Commands:

  • train: Train gait analysis models on video data
  • inference: Run inference on new videos (coming soon)

Real-Time Streaming Analysis (Experimental)

Crashout's architecture is designed to support real-time streaming gait analysis using buffered sequences, though this capability is currently untested. The system could theoretically process live video streams with sufficient buffering to capture complete gait cycles.

Streaming Potential

Supported Stream Types (theoretical):

  • RTMP/RTMPS live streams
  • WebRTC video streams
  • Direct camera feeds (/dev/video0)
  • Network video streams (HTTP/HTTPS)

Minimum Buffer Requirements:

  • Technical minimum: ~30-60 frames (1-2 seconds at 30fps) for basic gait detection
  • Recommended: ~90-150 frames (3-5 seconds at 30fps) for robust analysis
  • Optimal: ~120-180 frames (4-6 seconds at 30fps) for highest accuracy

Expected Performance (untested):

  • Latency: 3-6 seconds (buffer fill time + inference)
  • Update frequency: Every 0.5-1 seconds using sliding windows
  • Memory per stream: ~50-100MB buffer overhead

Implementation Approach

The streaming system would use a sliding window buffer:

  1. Continuous buffering: Maintain rolling window of recent frames
  2. Gait cycle capture: Buffer length ensures complete step cycles are captured
  3. Overlapped inference: Run predictions on overlapping windows for smooth output
  4. Real-time pose extraction: Use pose detection on each incoming frame

Potential Applications

  • Security monitoring: Real-time person identification at checkpoints
  • Healthcare monitoring: Continuous gait quality assessment in facilities
  • Sports analysis: Live biomechanical feedback during training
  • Accessibility: Real-time mobility assistance and fall prevention

Current Status

⚠️ This functionality is theoretical and untested. The current implementation focuses on offline video analysis. Real-time streaming would require:

  • Streaming video input integration
  • Sliding window buffer implementation
  • Real-time inference pipeline optimization
  • Latency and throughput testing

The existing tensor pipeline and variable sequence length handling provide a solid foundation for future streaming implementation.

Pose Detection Architecture

Crashout implements a dual approach for pose detection to maximize flexibility:

πŸ”₯ Internal Pose Detection (Rust/Burn)

  • Purpose: Core gait analysis pipeline for real-time inference
  • Implementation: YOLOv5-inspired architecture in pure Rust using Burn
  • Backend: Burn deep learning framework with WGPU acceleration
  • Use case: Production gait analysis inference with full control over the pipeline

🌐 External YOLOv11 (ONNX)

  • Purpose: High-quality pose data extraction for training dataset creation
  • Implementation: Latest Ultralytics YOLOv11 models via ONNX Runtime
  • Models: YOLOv11n, YOLOv11s, YOLOv11m, YOLOv11l, YOLOv11x pose models
  • Use case: Preprocessing videos to create training datasets (default behavior)

This design allows you to:

  1. Extract training data using state-of-the-art YOLOv11 models (default)
  2. Train your gait models on high-quality pose sequences
  3. Deploy for inference using the fast, self-contained Rust implementation

Implementation Status

βœ… Completed Components

  • Real-Time Video Processing: Live MP4 β†’ labeled MP4 pipeline with pose visualization
  • 640x640 Standardization: Automatic video resizing for consistent model input
  • YOLOv11 Pose Detection: External YOLO model integration via ONNX Runtime
  • Frame Labeling System: Keypoint and skeleton overlay on video frames
  • Streaming Pipeline: Decode β†’ detect β†’ label β†’ encode without intermediate files
  • LSTM Temporal Processing: Bidirectional LSTM with sequence-to-sequence support
  • Transformer Attention Layers: Multi-head self-attention with positional encoding
  • End-to-End Gait Model: Complete pipeline from pose data to gait predictions
  • Video Training Pipeline: Direct video β†’ pose β†’ training without preprocessing
  • Multi-Command CLI: Extract, train, and inference commands with full parameter control
  • Model Downloading: Automatic YOLOv11 model download and caching system
  • Person Tracking: Spatial proximity-based tracking across video frames
  • Self-Contained Processing: Pure Rust implementation using FFmpeg-next

πŸ”§ Architecture Features

  • Dual Pose Detection: Internal architecture (Rust/Burn) + External YOLOv11 (ONNX) for data extraction
  • Input Size: Correctly configured for 51 flattened pose features (17 keypoints Γ— 3 values)
  • Bidirectional LSTM: Captures both past and future temporal context
  • Transformer Integration: LSTM outputs feed directly into transformer attention layers
  • Flexible Prediction Heads: Configurable for different gait analysis tasks
  • Device Support: Full WGPU backend support for GPU acceleration
  • Memory Efficient: Optimized for real-time inference with <50ms target latency
  • Self-Contained: No system dependencies required for video processing

πŸ“Š Model Configurations

Quality Scoring Model:

  • LSTM: 256 hidden units, 2 layers, bidirectional
  • Transformer: 512 d_model, 8 heads, 4 layers
  • Output: Single quality score (0.0-1.0)

Pathology Classification Model:

  • LSTM: 256 hidden units, 3 layers, bidirectional
  • Transformer: 512 d_model, 8 heads, 6 layers
  • Output: Multi-class pathology probabilities

Multi-Task Model:

  • LSTM: 320 hidden units, 3 layers, bidirectional
  • Transformer: 640 d_model, 8 heads, 6 layers
  • Output: Quality scores + classification + feature vectors

Dataset Format

Crashout uses a video-based training pipeline that processes videos directly with CSV label files. No preprocessing or JSON conversion is needed.

Directory Structure

Organize your training data like this:

training_data/
β”œβ”€β”€ quality_scores.csv    # Optional: video_name,quality_score
β”œβ”€β”€ pathology_labels.csv  # Optional: video_name,pathology_class
β”œβ”€β”€ class_names.txt       # Optional: class names, one per line
└── videos/
    β”œβ”€β”€ subject1_walk1.mp4
    β”œβ”€β”€ subject2_walk1.mp4
    β”œβ”€β”€ subject3_walk2.mp4
    └── ...

Label Files

quality_scores.csv (for quality scoring models):

video_name,quality_score
subject1_walk1.mp4,0.85
subject2_walk1.mp4,0.92
subject3_walk2.mp4,0.73

pathology_labels.csv (for classification models):

video_name,pathology_class
subject1_walk1.mp4,0
subject2_walk1.mp4,2
subject3_walk2.mp4,1

class_names.txt (for human-readable class names):

normal
limp
parkinson
arthritis
post_surgery

Video Requirements

  • Format: MP4, AVI, MOV (any size)
  • Resolution: Automatically resized to 640x640 during processing
  • Content: Walking sequences with clearly visible people
  • Duration: Variable length (handled automatically with padding/truncation)
  • Quality: Higher quality videos improve pose detection accuracy

COCO-17 Keypoint Format

Crashout uses the standard COCO-17 keypoint format:

Index Keypoint Description
0 nose Face center
1 left_eye Left eye
2 right_eye Right eye
3 left_ear Left ear
4 right_ear Right ear
5 left_shoulder Left shoulder
6 right_shoulder Right shoulder
7 left_elbow Left elbow
8 right_elbow Right elbow
9 left_wrist Left wrist
10 right_wrist Right wrist
11 left_hip Left hip
12 right_hip Right hip
13 left_knee Left knee
14 right_knee Right knee
15 left_ankle Left ankle
16 right_ankle Right ankle

Each keypoint is represented as [x, y, confidence] where:

  • x, y: Pixel coordinates in the original video frame
  • confidence: Detection confidence score (0.0-1.0)

Lower body keypoints (hips, knees, ankles) are particularly important for gait analysis.

Training Process

Crashout handles the complete training pipeline automatically:

  1. Video Loading: Reads MP4/AVI/MOV files from the videos/ directory
  2. Pose Extraction: Runs pose detection on each frame
  3. Sequence Creation: Groups consecutive frames into walking sequences
  4. Data Augmentation: Handles variable sequence lengths with padding/truncation
  5. Model Training: Uses LSTM + Transformer architecture with multi-task loss

Automatic Handling

  • Variable lengths: Sequences are automatically padded to max_seq_len or truncated
  • Missing frames: Gaps in pose detection are handled gracefully
  • Quality filtering: Low-confidence poses are filtered automatically
  • Batch processing: Efficient batching for GPU training
  • Validation split: Automatic train/validation splitting

Usage Examples

Creating a Gait Quality Model

use crashout::model::gait_model::{GaitModelConfig, utils};
use burn::backend::wgpu::WgpuDevice;
use burn::backend::Wgpu;

let device = WgpuDevice::default();

// Create a model optimized for gait quality scoring
let model = utils::create_quality_model::<Wgpu>(&device)?;

// Or create with custom configuration
let config = GaitModelConfig::quality_scoring();
let model = config.init::<Wgpu>(&device)?;

Gait-Based Person Identification

Crashout can be configured for person identification using gait as a biometric. Each person's walking pattern is unique, making this suitable for security, healthcare monitoring, and behavioral analysis applications.

How It Works

Gait biometrics leverage unique characteristics in how people walk:

  • Temporal patterns: Walking rhythm, step frequency, cadence
  • Spatial patterns: Stride length, step width, body movement
  • Biomechanical signatures: Joint angles, limb coordination, balance
  • Individual variations: Height, leg length, muscle strength, injuries

Training Data Structure for Person ID

person_labels.csv:
video_name,person_id,session,environment
subject001_session1.mp4,person_001,indoor_treadmill,controlled
subject001_session2.mp4,person_001,outdoor_natural,uncontrolled
subject002_session1.mp4,person_002,indoor_treadmill,controlled
subject003_session1.mp4,person_003,outdoor_natural,uncontrolled
...

Each person should have multiple walking sequences recorded across different:

  • Sessions: Different days/times to capture consistency
  • Environments: Indoor/outdoor, treadmill/natural walking
  • Conditions: Normal speed, fast walking, different clothing

Model Configuration for Person ID

use crashout::model::gait_model::GaitModelConfig;

// Person identification model (100 people)
let config = GaitModelConfig {
    // LSTM Configuration
    lstm_hidden_size: 256,
    lstm_num_layers: 3,
    lstm_bidirectional: true,

    // Transformer Configuration
    transformer_d_model: 512,
    transformer_num_heads: 8,
    transformer_num_layers: 6,

    // Classification head for person IDs
    enable_quality_head: false,
    enable_classification_head: true,
    num_classes: 100, // Number of unique people

    // Feature extraction for similarity matching
    enable_feature_head: true,
    feature_dim: 256, // Gait embeddings

    final_dropout: 0.1,
};

let person_id_model = config.init::<Wgpu>(&device)?;

Training for Person Identification

use crashout::model::training::{VideoGaitDataset, TrainingConfig, GaitTrainer};

// Create dataset with person ID labels
let mut dataset_config = TrainingDatasetConfig::default();
dataset_config.data_root = PathBuf::from("./person_id_data");

let dataset = VideoGaitDataset::from_directory(dataset_config)?;

// Training focused on classification accuracy
let training_config = TrainingConfig {
    num_epochs: 150,
    learning_rate: 1e-4,
    loss_config: LossConfig {
        task_weights: TaskWeights {
            quality: 0.0,          // Disable quality loss
            classification: 1.0,   // Focus on person ID classification
            temporal_consistency: 0.1, // Smooth gait patterns
        },
        use_focal_loss: true,      // Handle person ID imbalance
        focal_alpha: 0.25,
        focal_gamma: 2.0,
    },
    ..Default::default()
};

let mut trainer = GaitTrainer::new(person_id_model, dataset, training_config, device);
let metrics = trainer.train()?;

Inference for Person Recognition

// Single person identification
let prediction = model.predict_single(&unknown_sequence, &device, 100);

if let Some(class_probs) = prediction.class_probabilities {
    let person_id = class_probs.argmax(1).into_scalar();
    let confidence = class_probs.max_dim(1).into_scalar();

    println!("Identified as person_{:03}: {:.2}% confidence",
             person_id, confidence * 100.0);
}

// Feature-based similarity matching
if let Some(features) = prediction.features {
    // Compare against known person embeddings
    let similarities = compute_cosine_similarity(features, known_embeddings);
    let most_similar = similarities.argmax(0).into_scalar();

    println!("Most similar to person_{:03}", most_similar);
}

Multi-Task: Health + Identity

Combine person identification with health monitoring:

// Multi-task model: identify person AND assess their gait health
let config = GaitModelConfig {
    enable_quality_head: true,     // Health assessment
    enable_classification_head: true, // Person ID
    enable_feature_head: true,     // Similarity matching
    num_classes: 50,               // 50 people in system
    // ... other config
};

let health_id_model = config.init::<Wgpu>(&device)?;

// Inference provides both identification and health status
let prediction = health_id_model.predict_single(&sequence, &device, 100);

if let (Some(person_probs), Some(quality)) =
    (prediction.class_probabilities, prediction.quality_score) {

    let person_id = person_probs.argmax(1).into_scalar();
    let health_score = quality.into_scalar();

    println!("Person {}: Health score {:.3}/1.0", person_id, health_score);

    // Track health changes over time
    if health_score < previous_scores[person_id] - 0.1 {
        println!("⚠️  Health decline detected for person {}", person_id);
    }
}

Real-World Applications

Security & Access Control:

  • Identify individuals at security checkpoints without face visibility
  • Long-range person recognition for perimeter security
  • Continuous authentication while walking through facilities

Healthcare Monitoring:

  • Track specific patients' gait changes over time
  • Early detection of mobility issues or neurological conditions
  • Personalized rehabilitation progress monitoring

Research Applications:

  • Longitudinal studies of gait changes with age
  • Biomechanical analysis for sports performance
  • Population health studies with privacy preservation

Privacy Considerations:

  • Gait data can identify individuals - ensure proper data protection
  • Consider anonymization techniques for research applications
  • Implement access controls for person identification databases

Performance Expectations

Training Requirements:

  • Minimum: 10-20 walking sequences per person across multiple sessions
  • Recommended: 50+ sequences per person in varied conditions
  • Training time: 2-4 hours for 100 people on modern GPU

Identification Accuracy:

  • Controlled environment: High accuracy for known individuals
  • Natural conditions: Degraded but high accuracy with environmental variation
  • Degradation factors: Clothing changes, injuries, extreme weather

Real-time Performance (to be tested):

  • Identification latency: <100ms for sequence classification
  • Memory usage: ~500MB for 100-person model
  • Throughput: 10+ simultaneous video streams on GPU

Direct Video Inference

use crashout::model::gait_model::GaitModel;
use crashout::video_processor::FrameIterator;

// Load trained model
let model: GaitModel<Wgpu> = GaitModel::load_from_file("./models/gait_model.burn", &device)?;

// Process video directly
let mut frame_iter = FrameIterator::new("./test_video.mp4")?;
let mut pose_sequence = Vec::new();

while let Some(frame) = frame_iter.decode_frame()? {
    // Extract pose and add to sequence
    // (pose extraction details handled internally during training)
}

// Make prediction on video
let prediction = model.forward(pose_tensor, Some(&sequence_lengths), false);

match prediction.quality_score {
    Some(score) => println!("Gait quality: {:.3}", score.into_scalar()),
    None => println!("Quality head not enabled"),
}

Multi-Task Learning with Video Data

// Train multi-task model on video dataset
let config = GaitModelConfig::multi_task(5); // 5 pathology classes

// Use the CLI for easy training
cargo run --bin crashout train \
  --data ./medical_videos \
  --model-type multi-task \
  --num-classes 5 \
  --epochs 100 \
  --output ./trained_models

Key Design Decisions

Why Video-Based Training?

  • No preprocessing: Direct video input eliminates intermediate steps
  • Real-time pipeline: Same pose detection used for training and inference
  • Simple setup: Just videos + CSV labels - no complex data preparation
  • Flexible labeling: Easy to add new label types with CSV files

Why 51 Features?

  • 17 keypoints Γ— 3 values each (x, y, confidence) = 51 features
  • Flattened format is optimal for LSTM input
  • Preserves all spatial and confidence information

Why Per-Video Training?

  • Natural units: Each video represents one complete walking sequence
  • Variable lengths: Videos have different durations, handled automatically
  • Efficient processing: Batch multiple videos for GPU training
  • Simple labeling: One label per video file

Tensor Shape Flow

Understanding the tensor transformations through the video training pipeline:

Video Frames:       [height, width, 3] β†’ MP4/AVI/MOV files
↓
Pose Detection:     Extracts keypoints β†’ [[x,y,c], [x,y,c], ...] Γ— 17
↓
Flattened:          [x,y,c,x,y,c,x,y,c,...] β†’ 51 features per frame
↓
Sequence:           [[51], [51], [51], ...] β†’ [seq_len, 51] per video
↓
Batched:            Multiple videos β†’ [batch, seq_len, 51]
↓
LSTM:               Temporal processing β†’ [batch, seq_len, hidden_size]
↓
Transformer:        Attention mechanism β†’ [batch, seq_len, d_model]
↓
Gait Analysis:      Final prediction β†’ [batch, output_size]

Contributing

  1. Ensure your changes maintain tensor shape compatibility
  2. Add tests for new data structures
  3. Update documentation for any format changes
  4. Run cargo test before submitting PRs

Citation

[Add citation information if this becomes a research project]

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support