Human Action Classification v2.0
State-of-the-art human action recognition model trained on the Stanford 40 Actions dataset. GitHub project: human-action-classification
Model Description
This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.
- Developed by: Saumya Kumaar Saksena (@dronefreak)
- Model type: Image Classification (Action Recognition)
- Language(s): English (action labels)
- Finetuned from: ImageNet pretrained ResNet34
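As a minimal sketch of the classifier side of that architecture, assuming the standard torchvision layer names (the project's actual module layout may differ), the 1000-way ImageNet head of ResNet34 is replaced by a 40-way action head:

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 40  # Stanford 40 Actions

# ImageNet-pretrained ResNet34 backbone (assumed torchvision equivalent)
backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Swap the 1000-way ImageNet head for a 40-way action head
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)

# Forward pass on a single 224x224 RGB image tensor
dummy = torch.randn(1, 3, 224, 224)
probs = backbone(dummy).softmax(dim=-1)  # shape: (1, 40)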
Key Features
- 🎯 86.4% accuracy on the Stanford 40 Actions test set
- ⚡ Real-time inference (~25ms per image on a GTX 1050 Ti)
- 🎨 Optional pose-aware MediaPipe integration
- 📦 Easy to use with a simple Python API
- 🔧 Production-ready with comprehensive evaluation metrics
Model Variants
All models trained on Stanford 40 Actions dataset:
| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* | 
|---|---|---|---|---|---|
| ResNet50 | 88.5% | 0.8842 | 23.5M | 94MB | ~30ms | 
| ResNet34 (this model) | 86.4% | 0.8618 | 21.3M | 85MB | ~25ms | 
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms | 
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms | 
| ViT Base | 76.8% | 0.7650 | 86M | 330MB | ~45ms | 
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms | 
*Single image on NVIDIA GTX 1050 Ti
Detailed Performance Comparison
| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 | 
|---|---|---|---|---|---|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 | 
| ResNet34 | 86.4 | 0.8686 | 0.8640 | 0.8618 | 0.8618 | 
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 | 
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 | 
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 | 
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 | 
Trade-offs (a variant-loading sketch follows this list):
- ResNet50: Best accuracy but slower and larger
- ResNet34: Optimal balance of accuracy and speed ✅
- MobileNet V3 Large: Best mobile/edge deployment option
- MobileNet V3 Small: Fastest inference for resource-constrained devices
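The same ActionPredictor API shown in the Quick Start below should work for any variant; the checkpoint path here is a hypothetical placeholder (the weights published under this repo are the ResNet34 model), so substitute whatever checkpoint you actually have:

from hac import ActionPredictor

# Hypothetical sketch: load a lighter variant for CPU/edge use.
# "checkpoints/mobilenet_v3_large_best.pth" is a placeholder path, not a published file.
edge_predictor = ActionPredictor(
    model_path="checkpoints/mobilenet_v3_large_best.pth",
    device="cpu",
)
print(edge_predictor.predict_image("photo.jpg", top_k=1)["action"]["top_class"])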
Supported Actions (40 Classes)
- applauding
- blowing_bubbles
- brushing_teeth
- cleaning_the_floor
- climbing
- cooking
- cutting_trees
- cutting_vegetables
- drinking
- feeding_a_horse
- fishing
- fixing_a_bike
- fixing_a_car
- gardening
- holding_an_umbrella
- jumping
- looking_through_a_microscope
- looking_through_a_telescope
- playing_guitar
- playing_violin
- pouring_liquid
- pushing_a_cart
- reading
- phoning
- riding_a_bike
- riding_a_horse
- rowing_a_boat
- running
- shooting_an_arrow
- smoking
- taking_photos
- texting_message
- throwing_frisby
- using_a_computer
- walking_the_dog
- washing_dishes
- watching_TV
- waving_hands
- writing_on_a_board
- writing_on_a_book
Quick Start
Installation
pip install git+https://github.com/dronefreak/human-action-classification.git
Basic Usage
from hac import ActionPredictor
# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)
# Predict on image
result = predictor.predict_image('photo.jpg', top_k=3)
# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")
# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
With Pose Estimation
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)
result = predictor.predict_image('photo.jpg', return_pose=True)
print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
Batch Prediction
from pathlib import Path
image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)
for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")
Performance Metrics
Evaluated on Stanford 40 Actions test set (5,532 images):
| Metric | Score | 
|---|---|
| Accuracy | 86.4% | 
| Macro F1-Score | 0.8618 | 
| Weighted F1-Score | 0.8618 | 
| Macro Precision | 0.8686 | 
| Macro Recall | 0.8640 | 
Top Performing Classes
| Class | F1-Score | 
|---|---|
| Applauding | 0.935 | 
| Jumping | 0.925 | 
| Running | 0.918 | 
| Waving Hands | 0.912 | 
| Drinking | 0.905 | 
Confusion Analysis
Most commonly confused actions:
- Cooking ↔ Washing Dishes (similar kitchen setting)
- Reading ↔ Using Computer (similar seated poses)
- Fixing Bike ↔ Fixing Car (similar repair actions)
Full metrics available in metrics.json
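The aggregate scores and confused pairs above can be reproduced from raw predictions with scikit-learn; a minimal sketch (the toy label arrays are stand-ins for real per-image predictions over the test split):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy stand-ins; in practice collect these by running the predictor on the test set
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 2])

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.2%}")
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.4f}")
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.4f}")

# Off-diagonal peaks in the confusion matrix reveal commonly confused class pairs
cm = confusion_matrix(y_true, y_pred)
np.fill_diagonal(cm, 0)
i, j = np.unravel_index(cm.argmax(), cm.shape)
print(f"Most confused: true class {i} predicted as class {j} ({cm[i, j]} images)")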
Training Details
Training Data
- Dataset: Stanford 40 Actions
- Training split: ~4,000 images
- Test split: ~5,532 images
- Classes: 40 human action categories
- Image resolution: 224×224 (resized)
Note: the dataset's proposed train-test split is unconventional (the test set is larger than the training set), so a custom 80-20 split was created instead, which is standard practice in machine learning.
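A stratified 80-20 split like the one described above can be generated with scikit-learn; the ImageFolder-style directory layout used here is an assumption, not the project's exact script:

from pathlib import Path
from sklearn.model_selection import train_test_split

# Assumes an ImageFolder-style layout: stanford40/<class_name>/<image>.jpg
image_paths = sorted(Path("stanford40").glob("*/*.jpg"))
labels = [p.parent.name for p in image_paths]

# Stratify so every action class keeps the same 80/20 ratio
train_paths, test_paths = train_test_split(
    image_paths, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_paths), "train images,", len(test_paths), "test images")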
Training Procedure
Preprocessing
# Training augmentation
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
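At inference time the random augmentations are dropped; a plausible deterministic counterpart (an assumption, reusing only the normalization statistics from the training pipeline above):

from torchvision import transforms

# Deterministic preprocessing for validation/inference (assumed, not the project's exact code)
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])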
Training Hyperparameters
- Backbone: ResNet34 (ImageNet pretrained)
- Optimizer: AdamW
- Learning rate: 1e-3 → 1e-5 (cosine decay)
- Weight decay: 1e-3
- Batch size: 32
- Epochs: 200
- Augmentation: Mixup (α=0.4)
- Scheduler: CosineAnnealingLR
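A minimal sketch of how these hyperparameters map onto PyTorch objects (the mixup helper is a simplified illustration, not the project's exact training loop):

import numpy as np
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 200
model = nn.Linear(512, 40)  # placeholder for the ResNet34 classifier

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
# Cosine decay from the initial 1e-3 down to 1e-5 over the full run
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-5)

def mixup(images, labels, alpha=0.4):
    # Blend random pairs of samples; the loss uses the same mixing weight:
    #   loss = lam * criterion(out, labels) + (1 - lam) * criterion(out, labels[perm])
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    return lam * images + (1.0 - lam) * images[perm], labels, labels[perm], lam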
Training Hardware
- GPU: NVIDIA RTX 4070 Super (12GB)
- Training time: ~0.5 hours
- Framework: PyTorch 2.0+
This approach reduced overfitting, moving accuracy from 99% train / 62% test to 82% train / 86% test.
Evaluation
from hac.evaluation import evaluate_model
# Evaluate on test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)
print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")
Limitations
- Trained on Stanford 40 which has limited diversity
- Best performance on indoor/outdoor daily activities
- May struggle with unusual camera angles or occlusions
- Requires clear view of person performing action
- Not suitable for fine-grained action recognition (e.g., different sports moves)
Bias and Fairness
The model inherits biases from the Stanford 40 dataset:
- Limited demographic diversity
- Western-centric activities
- Imbalanced class distribution
Users should evaluate performance on their specific use case.
Citation
@software{saksena2025hac,
  author = {Saksena, Saumya Kumaar},
  title = {Human Action Classification v2.0},
  year = {2025},
  url = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}
Model Card Authors
Saumya Kumaar Saksena
Model Card Contact
- GitHub: @dronefreak
- Repository: human-action-classification
License
Apache License 2.0 - Free for research and commercial use.
See LICENSE for full details.