# Edge-AI YOLOv8n Demo: Precision and Performance Across Platforms
This repository documents my hands-on exploration of edge-AI inference performance across platforms and numerical precisions.
Each phase benchmarks YOLOv8n under different hardware, precision modes, and input resolutions to understand how each factor impacts real-time performance.
## Input Dataset
All tests used a single publicly available annotated video:
A Busy Intersection Road in London – by Mikhail Nilov (Pexels)
- License: Free to use under the Pexels License
- Cropped to 640×640 for model input (see the crop sketch below)
- Contains moving pedestrians, vehicles, and buses in an urban setting
- Provides a consistent benchmark for comparing precision and performance
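As a concrete illustration of the cropping step, here is a minimal sketch (not a script from this repo) that center-crops the downloaded clip and resizes it to 640×640 with OpenCV; the input filename is an assumption.

```python
import cv2

cap = cv2.VideoCapture("london_intersection.mp4")  # assumed name of the downloaded Pexels clip
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
out = cv2.VideoWriter("test.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (640, 640))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2  # central square crop
    out.write(cv2.resize(frame[y0:y0 + side, x0:x0 + side], (640, 640)))

cap.release()
out.release()
```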
## Phase 1 – MacBook Pro (M2) Baseline: CPU FP32
The first phase establishes an FP32 CPU baseline using a 640×640 video input.
It reflects an unoptimized, CPU-only inference path to provide a realistic reference for later GPU acceleration.
### Setup
- Hardware: MacBook Pro (M2, CPU)
- Framework: Ultralytics YOLOv8n (PyTorch + ONNX, FP32)
- Input: 640×640 video (test.mp4, from Pexels – London Intersection)
### Results
| Resolution | FPS | Precision | Accelerator | Framework |
|---|---|---|---|---|
| 320×320 | 27.38 | FP32 | CPU | Ultralytics YOLOv8n |
| 480×480 | 25.53 | FP32 | CPU | Ultralytics YOLOv8n |
| 640×640 | 22.15 | FP32 | CPU | Ultralytics YOLOv8n |
Demo: demo_clip.mp4
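A minimal sketch of how end-to-end FPS like the numbers above can be measured with Ultralytics on the CPU; the file names and timing loop are illustrative, not the repo's exact benchmark script.

```python
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # FP32 PyTorch weights

for imgsz in (320, 480, 640):
    start, frames = time.perf_counter(), 0
    # stream=True yields one result per frame instead of buffering the whole video
    for _ in model.predict(source="test.mp4", imgsz=imgsz, device="cpu", stream=True, verbose=False):
        frames += 1
    elapsed = time.perf_counter() - start
    print(f"{imgsz}x{imgsz}: {frames / elapsed:.2f} FPS (end-to-end)")
```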
## Phase 2 – Jetson Orin Nano Acceleration: FP32 vs FP16 (TensorRT)
Phase 2 compares FP32 vs FP16 inference on the Jetson Orin Nano across three input sizes (640, 480, 320).
The goal is to quantify both precision impact and resolution scaling using TensorRT optimization.
### Setup
- Model: YOLOv8n (ONNX → TensorRT, FP32 & FP16)
- Hardware: Jetson Orin Nano 8 GB
- Frameworks: Ultralytics YOLO + TensorRT 10.3 + CUDA 12.6
- Input: 640×640 video (test.mp4, from Pexels – London Intersection)
### Jetson GPU Benchmarks
| Resolution | FPS (FP32) | FPS (FP16) | Accelerator | Framework |
|---|---|---|---|---|
| 320×320 | 195.684 | 298.535 | GPU (TensorRT) | YOLOv8n |
| 480×480 | 128.831 | 205.119 | GPU (TensorRT) | YOLOv8n |
| 640×640 | 75.7129 | 132.516 | GPU (TensorRT) | YOLOv8n |
Demo: jetson_demo.mp4
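Besides the trtexec route shown in the How to Reproduce section, Ultralytics can also build a TensorRT engine directly on the Jetson; this short sketch is an alternative path, not the exact commands used for the table above.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# half=True requests an FP16 engine; drop it for FP32. Produces yolov8n.engine next to the weights.
model.export(format="engine", half=True, imgsz=640, device=0)
```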
### Visual Comparison – FPS Across Precision and Resolution
To illustrate the performance scaling between FP32 and FP16, the following chart compares YOLOv8n throughput (FPS) across three resolutions on the Jetson Orin Nano.
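If the chart needs to be regenerated, a small matplotlib sketch using the values from the benchmark table above is enough; matplotlib is assumed to be installed and the output filename is arbitrary.

```python
import matplotlib.pyplot as plt

resolutions = ["320x320", "480x480", "640x640"]
fp32 = [195.684, 128.831, 75.7129]   # FPS from the Jetson FP32 column
fp16 = [298.535, 205.119, 132.516]   # FPS from the Jetson FP16 column

x = range(len(resolutions))
plt.bar([i - 0.2 for i in x], fp32, width=0.4, label="FP32 (TensorRT)")
plt.bar([i + 0.2 for i in x], fp16, width=0.4, label="FP16 (TensorRT)")
plt.xticks(list(x), resolutions)
plt.ylabel("FPS")
plt.title("YOLOv8n throughput on Jetson Orin Nano")
plt.legend()
plt.savefig("fps_comparison.png", dpi=150)
```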
### Observations
- FP16 inference yields roughly 1.5–1.8× higher throughput than FP32 on the same GPU.
- Input resolution scales directly with computational cost: going from 320×320 to 640×640 drops FPS by roughly 2.3–2.6× in these runs.
- Compared to the Mac CPU FP32 baseline (22.15 FPS), the Jetson reaches roughly 3.4× the throughput in FP32 and roughly 6× in FP16 at the same 640×640 resolution.
### Precision & Resolution Awareness
| Precision | Description | Typical Use |
|---|---|---|
| FP32 | Full-precision; highest accuracy, slower throughput | CPU / training |
| FP16 | Half-precision; faster compute, small accuracy trade-off | GPU / TensorRT |
| INT8 (Planned) | Quantized; fastest, needs calibration | NPU / Hailo / edge |
Key insight: both precision and input resolution are powerful levers for optimizing edge AI latency and power draw.
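To make the accuracy side of the FP32 vs FP16 trade-off concrete, here is a tiny NumPy illustration; it is purely illustrative and not part of the benchmark code.

```python
import numpy as np

# FP16 has a much coarser machine epsilon than FP32, so casting weights and
# activations down loses a few decimal digits of precision.
print("fp32 eps:", np.finfo(np.float32).eps)   # ~1.19e-07
print("fp16 eps:", np.finfo(np.float16).eps)   # ~9.77e-04

w = np.float32(0.1234567)
print("fp32:", w, "-> fp16:", np.float16(w))   # FP16 keeps only ~3 significant decimal digits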
## Planned Future Phase – Raspberry Pi 5 + Hailo-8 NPU
A planned follow-up is to benchmark the same YOLOv8n pipeline on a Raspberry Pi 5 + Hailo-8 accelerator to compare:
- NPU vs GPU vs CPU performance
- Power efficiency / throughput trade-offs
- Layer-wise operator mapping (Conv / BN / ReLU, etc.)
## Phase 3 – Live Webcam Inference on Jetson Orin Nano (CPU Path)
While resolving CUDA/PyTorch compatibility for JetPack 6.x, I ran live YOLOv8n inference directly on the CPU to keep progress moving forward. Below are the end-to-end frame stage timings captured during live webcam inference.
| Resolution | Preprocessing (ms) | Inference (ms) | Postprocessing (ms) | Approx FPS |
|---|---|---|---|---|
| 640×640 | 2.0 | 435.6 | 2.3 | ~2.2 |
| 480×480 | 4.2 | 289.8 | 1.8 | ~3.4 |
| 320×320 | 1.6 | 148.5 | 1.4 | ~6.4 |
Demo video: Live_webcam
Live video Frame (CPU): Frame_Capture
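The per-stage numbers in the table correspond to the kind of breakdown Ultralytics reports for every frame; a hedged sketch of a live CPU webcam loop that prints those timings (device index and window handling are assumptions) looks like this:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# stream=True keeps the webcam loop going frame by frame on the CPU
for result in model.predict(source=0, imgsz=640, device="cpu", stream=True, verbose=False):
    speed = result.speed  # per-frame stage times in ms: preprocess / inference / postprocess
    total_ms = sum(speed.values())
    print(f"pre {speed['preprocess']:.1f} ms | infer {speed['inference']:.1f} ms | "
          f"post {speed['postprocess']:.1f} ms | ~{1000.0 / total_ms:.1f} FPS")
    cv2.imshow("YOLOv8n live (CPU)", result.plot())
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()
```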
### Key Takeaways
- The CPU bottleneck is entirely in inference, not preprocessing or drawing.
- Reducing resolution from 640 → 320 provides a ~3× improvement in FPS.
- This benchmark establishes a baseline for measuring TensorRT FP16 GPU speed-up in the next phase.
### Next Step (Phase 4)
Run the same live webcam pipeline using the TensorRT FP16 engine inside the NVIDIA l4t-ml container to demonstrate GPU acceleration and quantify the performance gain.
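A minimal sketch of what that Phase 4 pipeline could look like, assuming an Ultralytics-exported FP16 engine (a raw trtexec engine lacks the metadata Ultralytics expects, so the filename here is an assumption):

```python
from ultralytics import YOLO

# Engine exported on the Jetson, e.g. via model.export(format="engine", half=True)
model = YOLO("yolov8n.engine")
model.predict(source=0, imgsz=640, show=True)  # live webcam inference on the GPU
```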
## Overall Key Takeaways
- TensorRT FP16 provides roughly 1.5–1.8× the FP32 throughput with typically negligible accuracy loss.
- Lower input resolutions drastically improve FPS, which is ideal for mobile robotics.
- Edge AI optimization = balancing precision, resolution, and hardware parallelism.
- This experiment demonstrates real, measurable acceleration gains achievable through practical deployment steps.
## How to Reproduce
```bash
# Export model (dynamic=True gives the ONNX dynamic input dims so trtexec can build one engine for 320/480/640)
yolo export model=yolov8n.pt format=onnx dynamic=True simplify=True
# Build engines
# FP32 engine
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8n.onnx \
--saveEngine=yolov8n_fp32.engine \
--minShapes=images:1x3x320x320 \
--optShapes=images:1x3x480x480 \
--maxShapes=images:1x3x640x640 \
--avgRuns=50
# FP16 engine
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8n.onnx \
--saveEngine=yolov8n_fp16.engine \
--minShapes=images:1x3x320x320 \
--optShapes=images:1x3x480x480 \
--maxShapes=images:1x3x640x640 \
--fp16 \
--avgRuns=50
# Run inference for each resolution
# FP32 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x640x640 --avgRuns=50
# FP16 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x640x640 --avgRuns=50
# Phase 3
pip install ultralytics opencv-python
#640x640
yolo detect model=yolov8n.pt source=0 imgsz=640 device=cpu save=True
#480x480
yolo detect model=yolov8n.pt source=0 imgsz=480 device=cpu save=True
#320x320
yolo detect model=yolov8n.pt source=0 imgsz=320 device=cpu save=True
```