Human Action Classification v2.0

State-of-the-art human action recognition model trained on the Stanford 40 Actions dataset. GitHub project: https://github.com/dronefreak/human-action-classification

Model Description

This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.

  • Developed by: Saumya Kumaar Saksena (@dronefreak)
  • Model type: Image Classification (Action Recognition)
  • Language(s): English (action labels)
  • Finetuned from: ImageNet pretrained ResNet34
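
At its core, the classifier is an ImageNet-pretrained ResNet34 with the final fully connected layer replaced by a 40-way head. Below is a minimal sketch of that architecture in torchvision; the packaged ActionPredictor wraps this (plus preprocessing and optional pose estimation), so treat the exact head details here as illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 40  # Stanford 40 Actions categories

# ImageNet-pretrained ResNet34 backbone with a new 40-class classification head
backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)

# Forward pass on a single (dummy) 224x224 RGB image
dummy = torch.randn(1, 3, 224, 224)
probs = backbone(dummy).softmax(dim=-1)  # shape (1, 40), per-class confidences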

Key Features

  • 🎯 86% accuracy on Stanford 40 Actions test set
  • ⚡ Real-time inference (~25ms per image on GTX 1050 Ti)
  • 🎨 Optional pose-aware MediaPipe integration
  • 📦 Easy to use with a simple Python API
  • 🔧 Production-ready with comprehensive evaluation metrics

Model Variants

All models were trained on the Stanford 40 Actions dataset:

| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* |
|---|---|---|---|---|---|
| ResNet50 | 88.5% | 0.8842 | 23.5M | 94MB | ~30ms |
| ResNet34 (this model) | 86.4% | 0.8618 | 21.3M | 85MB | ~25ms |
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms |
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms |
| ViT Base | 76.8% | 0.7650 | 86M | 330MB | ~45ms |
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms |

*Single image on NVIDIA GTX 1050 Ti
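
The timings above are single-image measurements. For a rough, reproducible estimate on your own hardware, the sketch below averages several warmed-up runs using the ActionPredictor API shown in the Quick Start further down; exact numbers will vary with GPU, drivers, and image size.

import time
import torch

def measure_latency(predictor, image_path, warmup=5, runs=50):
    """Average per-image inference time in milliseconds (rough estimate)."""
    for _ in range(warmup):            # warm up CUDA kernels and caches
        predictor.predict_image(image_path)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        predictor.predict_image(image_path)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

# print(f"{measure_latency(predictor, 'photo.jpg'):.1f} ms/image")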

Detailed Performance Comparison

| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 |
|---|---|---|---|---|---|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 |
| ResNet34 | 86.4 | 0.8686 | 0.8640 | 0.8618 | 0.8618 |
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 |
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 |
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 |
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 |

Trade-offs:

  • ResNet50: Best accuracy but slower and larger
  • ResNet34: Optimal balance of accuracy and speed ⭐
  • MobileNet V3 Large: Best mobile/edge deployment option
  • MobileNet V3 Small: Fastest inference for resource-constrained devices
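
One practical way to use the comparison above is to pick the most accurate variant that fits a latency budget. The snippet below simply encodes the numbers from the variants table (accuracy in %, approximate ms/image on a GTX 1050 Ti) as an illustration of that trade-off.

# Accuracy (%) and approximate latency (ms/image) copied from the table above
VARIANTS = {
    "ResNet50": (88.5, 30),
    "ResNet34": (86.4, 25),
    "ResNet18": (82.3, 18),
    "MobileNet V3 Large": (82.1, 15),
    "ViT Base": (76.8, 45),
    "MobileNet V3 Small": (74.35, 10),
}

def pick_variant(latency_budget_ms):
    """Most accurate variant whose inference time fits the given budget."""
    fits = {name: acc for name, (acc, ms) in VARIANTS.items() if ms <= latency_budget_ms}
    return max(fits, key=fits.get) if fits else None

print(pick_variant(20))  # -> ResNet18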

Supported Actions (40 Classes)

  • applauding
  • blowing_bubbles
  • brushing_teeth
  • cleaning_the_floor
  • climbing
  • cooking
  • cutting_trees
  • cutting_vegetables
  • drinking
  • feeding_a_horse
  • fishing
  • fixing_a_bike
  • fixing_a_car
  • gardening
  • holding_an_umbrella
  • jumping
  • looking_through_a_microscope
  • looking_through_a_telescope
  • playing_guitar
  • playing_violin
  • pouring_liquid
  • pushing_a_cart
  • reading
  • phoning
  • riding_a_bike
  • riding_a_horse
  • rowing_a_boat
  • running
  • shooting_an_arrow
  • smoking
  • taking_photos
  • texting_message
  • throwing_frisby
  • using_a_computer
  • walking_the_dog
  • washing_dishes
  • watching_TV
  • waving_hands
  • writing_on_a_board
  • writing_on_a_book

Quick Start

Installation

pip install git+https://github.com/dronefreak/human-action-classification.git

Basic Usage

from hac import ActionPredictor

# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)

# Predict on image
result = predictor.predict_image('photo.jpg', top_k=3)

# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")

# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")

With Pose Estimation

predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)

result = predictor.predict_image('photo.jpg', return_pose=True)

print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")

Batch Prediction

from pathlib import Path

image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)

for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")

Performance Metrics

Evaluated on Stanford 40 Actions test set (5,532 images):

| Metric | Score |
|---|---|
| Accuracy | 86.4% |
| Macro F1-Score | 0.8618 |
| Weighted F1-Score | 0.8618 |
| Macro Precision | 0.8686 |
| Macro Recall | 0.8640 |
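
These are standard multi-class metrics. Assuming you have ground-truth labels and predicted class indices for the test images, they can be reproduced with scikit-learn as in the sketch below (this helper is illustrative, not part of the package API).

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, y_pred):
    """Compute the headline metrics from class-index labels and predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "precision_macro": precision_score(y_true, y_pred, average="macro"),
        "recall_macro": recall_score(y_true, y_pred, average="macro"),
    }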

Top Performing Classes

| Class | F1-Score |
|---|---|
| Applauding | 0.935 |
| Jumping | 0.925 |
| Running | 0.918 |
| Waving Hands | 0.912 |
| Drinking | 0.905 |

Confusion Analysis

Most commonly confused actions:

  • Cooking ↔ Washing Dishes (similar kitchen setting)
  • Reading ↔ Using Computer (similar seated poses)
  • Fixing Bike ↔ Fixing Car (similar repair actions)

Full metrics available in metrics.json
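
Pairwise confusions like these can be read off a confusion matrix. The sketch below (using the same y_true / y_pred arrays assumed above) returns the off-diagonal (true, predicted) pairs with the most errors.

import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pairs(y_true, y_pred, class_names, top_n=3):
    """Return the (true class, predicted class, error count) pairs with the most errors."""
    cm = confusion_matrix(y_true, y_pred)
    np.fill_diagonal(cm, 0)                        # ignore correct predictions
    top = np.argsort(cm, axis=None)[::-1][:top_n]  # largest off-diagonal counts
    return [(class_names[i // cm.shape[1]], class_names[i % cm.shape[1]], int(cm.flat[i]))
            for i in top]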

Training Details

Training Data

  • Dataset: Stanford 40 Actions
  • Training split: ~4,000 images
  • Test split: ~5,532 images
  • Classes: 40 human action categories
  • Image resolution: 224×224 (resized)

Note that the dataset's proposed train-test split is somewhat unconventional (there are more test images than training images), so I created a custom 80/20 train-test split, in line with standard machine learning practice.
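
A stratified 80/20 split of this kind is straightforward to reproduce with scikit-learn. The sketch below assumes the images are stored one folder per class (e.g. stanford40/<class>/<image>.jpg); the folder layout and the fixed seed are illustrative assumptions, not the exact split used here.

from pathlib import Path
from sklearn.model_selection import train_test_split

data_dir = Path("stanford40")                    # assumed folder-per-class layout
image_paths = sorted(data_dir.glob("*/*.jpg"))
labels = [p.parent.name for p in image_paths]    # class name taken from the folder

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=0.2,       # 80/20 split
    stratify=labels,     # preserve per-class proportions in both splits
    random_state=42,     # illustrative fixed seed for reproducibility
)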

Training Procedure

Preprocessing

from torchvision import transforms

# Training augmentation
transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])
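
At evaluation and inference time the random augmentations are dropped. A plausible counterpart is standard ImageNet-style preprocessing; the exact pipeline shipped in the package may differ, so treat this as an assumption.

from torchvision import transforms

# Evaluation / inference preprocessing (no random augmentation)
transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])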

Training Hyperparameters

  • Backbone: ResNet34 (ImageNet pretrained)
  • Optimizer: AdamW
  • Learning rate: 1e-3 → 1e-5 (cosine decay)
  • Weight decay: 1e-3
  • Batch size: 32
  • Epochs: 200
  • Augmentation: Mixup (α=0.4)
  • Scheduler: CosineAnnealingLR
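
As a rough illustration of how these pieces fit together (a sketch, not the actual training script), the optimizer, cosine schedule, and a mixup training step might be wired up as below.

import numpy as np
import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# ResNet34 backbone with a 40-class head (see Model Description)
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 40)
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
# Cosine decay from 1e-3 down to 1e-5 over 200 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-5)

def mixup_step(images, targets, alpha=0.4):
    """One training step with mixup on a batch of images and integer class labels."""
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[perm]
    logits = model(mixed)
    loss = lam * F.cross_entropy(logits, targets) \
        + (1.0 - lam) * F.cross_entropy(logits, targets[perm])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# scheduler.step() is called once per epoch, after iterating over the training set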

Training Hardware

  • GPU: NVIDIA RTX 4070 Super (12GB)
  • Training time: ~0.5 hours
  • Framework: PyTorch 2.0+

This approach reduced overfitting, from 99% train / 62% test accuracy to 82% train / 86% test accuracy.

Evaluation

from hac.evaluation import evaluate_model

# Evaluate on test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)

print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")

Limitations

  • Trained on Stanford 40 which has limited diversity
  • Best performance on indoor/outdoor daily activities
  • May struggle with unusual camera angles or occlusions
  • Requires clear view of person performing action
  • Not suitable for fine-grained action recognition (e.g., different sports moves)

Bias and Fairness

The model inherits biases from the Stanford 40 dataset:

  • Limited demographic diversity
  • Western-centric activities
  • Imbalanced class distribution

Users should evaluate performance on their specific use case.

Citation

@software{saksena2025hac,
  author = {Saksena, Saumya Kumaar},
  title = {Human Action Classification v2.0},
  year = {2025},
  url = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}

Model Card Authors

Saumya Kumaar Saksena

License

Apache License 2.0 - Free for research and commercial use.

See LICENSE for full details.
