Human Action Classification v2.0
State-of-the-art human action recognition model trained on the Stanford 40 Actions dataset. GitHub project: human-action-classification
Model Description
This model performs real-time human action classification from images, recognizing 40 different human activities. It combines a ResNet34 backbone with optional MediaPipe pose estimation for enhanced accuracy.
- Developed by: Saumya Kumaar Saksena (@dronefreak)
- Model type: Image Classification (Action Recognition)
- Language(s): English (action labels)
- Finetuned from: ImageNet pretrained ResNet34
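As a minimal sketch of the classifier side of that architecture, assuming the standard torchvision layer names (the project's actual module layout may differ), the 1000-way ImageNet head of ResNet34 is replaced by a 40-way action head:

import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 40  # Stanford 40 Actions

# ImageNet-pretrained ResNet34 backbone (assumed torchvision equivalent)
backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)

# Swap the 1000-way ImageNet head for a 40-way action head
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_CLASSES)

# Forward pass on a single 224x224 RGB image tensor
dummy = torch.randn(1, 3, 224, 224)
probs = backbone(dummy).softmax(dim=-1)  # shape: (1, 40)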
Key Features
- 🎯 86.4% accuracy on the Stanford 40 Actions test set
- ⚡ Real-time inference (~25ms per image on a GTX 1050 Ti)
- 🎨 Optional pose-aware MediaPipe integration
- 📦 Easy to use with a simple Python API
- 🔧 Production-ready with comprehensive evaluation metrics
Model Variants
All models trained on Stanford 40 Actions dataset:
| Model | Accuracy | Macro F1 | Parameters | Size | Inference Time* | 
|---|---|---|---|---|---|
| ResNet50 | 88.5% | 0.8842 | 23.5M | 94MB | ~30ms | 
| ResNet34 (this model) | 86.4% | 0.8618 | 21.3M | 85MB | ~25ms | 
| ResNet18 | 82.3% | 0.8178 | 11.2M | 45MB | ~18ms | 
| MobileNet V3 Large | 82.1% | 0.8169 | 5.4M | 20MB | ~15ms | 
| ViT Base | 76.8% | 0.7650 | 86M | 330MB | ~45ms | 
| MobileNet V3 Small | 74.35% | 0.7350 | 2.5M | 10MB | ~10ms | 
*Single image on NVIDIA GTX 1050 Ti
Detailed Performance Comparison
| Model | Accuracy (%) | Macro Precision | Macro Recall | Macro F1 | Weighted F1 | 
|---|---|---|---|---|---|
| ResNet50 | 88.5 | 0.8874 | 0.8850 | 0.8842 | 0.8842 | 
| ResNet34 | 86.4 | 0.8686 | 0.8640 | 0.8618 | 0.8618 | 
| ResNet18 | 82.3 | 0.8211 | 0.8230 | 0.8178 | 0.8178 | 
| MobileNet V3 Large | 82.1 | 0.8216 | 0.8210 | 0.8169 | 0.8169 | 
| ViT Base Patch16 | 76.8 | 0.7774 | 0.7680 | 0.7650 | 0.7650 | 
| MobileNet V3 Small | 74.35 | 0.7382 | 0.7435 | 0.7350 | 0.7350 | 
Trade-offs (a variant-loading sketch follows this list):
- ResNet50: Best accuracy but slower and larger
- ResNet34: Optimal balance of accuracy and speed ✅
- MobileNet V3 Large: Best mobile/edge deployment option
- MobileNet V3 Small: Fastest inference for resource-constrained devices
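The same ActionPredictor API shown in the Quick Start below should work for any variant; the checkpoint path here is a hypothetical placeholder (the weights published under this repo are the ResNet34 model), so substitute whatever checkpoint you actually have:

from hac import ActionPredictor

# Hypothetical sketch: load a lighter variant for CPU/edge use.
# "checkpoints/mobilenet_v3_large_best.pth" is a placeholder path, not a published file.
edge_predictor = ActionPredictor(
    model_path="checkpoints/mobilenet_v3_large_best.pth",
    device="cpu",
)
print(edge_predictor.predict_image("photo.jpg", top_k=1)["action"]["top_class"])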
Supported Actions (40 Classes)
- applauding
- blowing_bubbles
- brushing_teeth
- cleaning_the_floor
- climbing
- cooking
- cutting_trees
- cutting_vegetables
- drinking
- feeding_a_horse
- fishing
- fixing_a_bike
- fixing_a_car
- gardening
- holding_an_umbrella
- jumping
- looking_through_a_microscope
- looking_through_a_telescope
- playing_guitar
- playing_violin
- pouring_liquid
- pushing_a_cart
- reading
- phoning
- riding_a_bike
- riding_a_horse
- rowing_a_boat
- running
- shooting_an_arrow
- smoking
- taking_photos
- texting_message
- throwing_frisby
- using_a_computer
- walking_the_dog
- washing_dishes
- watching_TV
- waving_hands
- writing_on_a_board
- writing_on_a_book
Quick Start
Installation
pip install git+https://github.com/dronefreak/human-action-classification.git
Basic Usage
from hac import ActionPredictor
# Initialize predictor
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    device='cuda'
)
# Predict on image
result = predictor.predict_image('photo.jpg', top_k=3)
# Print results
print(f"Action: {result['action']['top_class']}")
print(f"Confidence: {result['action']['top_confidence']:.2%}")
# Top 3 predictions
for pred in result['action']['predictions']:
    print(f"  {pred['class']}: {pred['confidence']:.2%}")
With Pose Estimation
predictor = ActionPredictor(
    model_path="hf://dronefreak/human-action-classification",
    use_pose_estimation=True,  # Enable MediaPipe
    device='cuda'
)
result = predictor.predict_image('photo.jpg', return_pose=True)
print(f"Detected pose: {result['pose']['class']}")
print(f"Action: {result['action']['top_class']}")
Batch Prediction
from pathlib import Path
image_paths = list(Path('images/').glob('*.jpg'))
results = predictor.predict_batch(image_paths, batch_size=32)
for img_path, result in zip(image_paths, results):
    print(f"{img_path.name}: {result['action']['top_class']}")
Performance Metrics
Evaluated on Stanford 40 Actions test set (5,532 images):
| Metric | Score | 
|---|---|
| Accuracy | 86.4% | 
| Macro F1-Score | 0.8618 | 
| Weighted F1-Score | 0.8618 | 
| Macro Precision | 0.8686 | 
| Macro Recall | 0.8640 | 
Top Performing Classes
| Class | F1-Score | 
|---|---|
| Applauding | 0.935 | 
| Jumping | 0.925 | 
| Running | 0.918 | 
| Waving Hands | 0.912 | 
| Drinking | 0.905 | 
Confusion Analysis
Most commonly confused actions:
- Cooking ↔ Washing Dishes (similar kitchen setting)
- Reading ↔ Using Computer (similar seated poses)
- Fixing Bike ↔ Fixing Car (similar repair actions)
Full metrics available in metrics.json
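The aggregate scores and confused pairs above can be reproduced from raw predictions with scikit-learn; a minimal sketch (the toy label arrays are stand-ins for real per-image predictions over the test split):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy stand-ins; in practice collect these by running the predictor on the test set
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 2])

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.2%}")
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.4f}")
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.4f}")

# Off-diagonal peaks in the confusion matrix reveal commonly confused class pairs
cm = confusion_matrix(y_true, y_pred)
np.fill_diagonal(cm, 0)
i, j = np.unravel_index(cm.argmax(), cm.shape)
print(f"Most confused: true class {i} predicted as class {j} ({cm[i, j]} images)")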
Training Details
Training Data
- Dataset: Stanford 40 Actions
- Training split: ~4,000 images
- Test split: ~5,532 images
- Classes: 40 human action categories
- Image resolution: 224×224 (resized)
Note: the dataset's proposed train-test split is unconventional (the test set is larger than the training set), so a custom 80-20 split was created instead, which is standard practice in machine learning.
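A stratified 80-20 split like the one described above can be generated with scikit-learn; the ImageFolder-style directory layout used here is an assumption, not the project's exact script:

from pathlib import Path
from sklearn.model_selection import train_test_split

# Assumes an ImageFolder-style layout: stanford40/<class_name>/<image>.jpg
image_paths = sorted(Path("stanford40").glob("*/*.jpg"))
labels = [p.parent.name for p in image_paths]

# Stratify so every action class keeps the same 80/20 ratio
train_paths, test_paths = train_test_split(
    image_paths, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_paths), "train images,", len(test_paths), "test images")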
Training Procedure
Preprocessing
# Training augmentation
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
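At inference time the random augmentations are dropped; a plausible deterministic counterpart (an assumption, reusing only the normalization statistics from the training pipeline above):

from torchvision import transforms

# Deterministic preprocessing for validation/inference (assumed, not the project's exact code)
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])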
Training Hyperparameters
- Backbone: ResNet34 (ImageNet pretrained)
- Optimizer: AdamW
- Learning rate: 1e-3 → 1e-5 (cosine decay)
- Weight decay: 1e-3
- Batch size: 32
- Epochs: 200
- Augmentation: Mixup (α=0.4)
- Scheduler: CosineAnnealingLR
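A minimal sketch of how these hyperparameters map onto PyTorch objects (the mixup helper is a simplified illustration, not the project's exact training loop):

import numpy as np
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 200
model = nn.Linear(512, 40)  # placeholder for the ResNet34 classifier

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)
# Cosine decay from the initial 1e-3 down to 1e-5 over the full run
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-5)

def mixup(images, labels, alpha=0.4):
    # Blend random pairs of samples; the loss uses the same mixing weight:
    #   loss = lam * criterion(out, labels) + (1 - lam) * criterion(out, labels[perm])
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(images.size(0))
    return lam * images + (1.0 - lam) * images[perm], labels, labels[perm], lam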
Training Hardware
- GPU: NVIDIA RTX 4070 Super (12GB)
- Training time: ~0.5 hours
- Framework: PyTorch 2.0+
This approach reduced overfitting, moving accuracy from 99% train / 62% test to 82% train / 86% test.
Evaluation
from hac.evaluation import evaluate_model
# Evaluate on test set
metrics = evaluate_model(
    checkpoint='resnet34_best.pth',
    data_dir='stanford40/',
    split='test'
)
print(f"Accuracy: {metrics['accuracy']:.2%}")
print(f"F1-Score: {metrics['f1_macro']:.4f}")
Limitations
- Trained on Stanford 40 which has limited diversity
- Best performance on indoor/outdoor daily activities
- May struggle with unusual camera angles or occlusions
- Requires clear view of person performing action
- Not suitable for fine-grained action recognition (e.g., different sports moves)
Bias and Fairness
The model inherits biases from the Stanford 40 dataset:
- Limited demographic diversity
- Western-centric activities
- Imbalanced class distribution
Users should evaluate performance on their specific use case.
Citation
@software{saksena2025hac,
  author = {Saksena, Saumya Kumaar},
  title = {Human Action Classification v2.0},
  year = {2025},
  url = {https://github.com/dronefreak/human-action-classification},
  version = {2.0}
}
Model Card Authors
Saumya Kumaar Saksena
Model Card Contact
- GitHub: @dronefreak
- Repository: human-action-classification
License
Apache License 2.0 - Free for research and commercial use.
See LICENSE for full details.