# Edge-AI YOLOv8n Demo: Precision and Performance Across Platforms
This repository documents my hands-on exploration of edge-AI inference performance across platforms and numerical precisions.
Each phase benchmarks YOLOv8n under different hardware, precision modes, and input resolutions to understand how each factor impacts real-time performance.
## Input Dataset
All tests used a single publicly available annotated video:
A Busy Intersection Road in London – by Mikhail Nilov (Pexels)
- License: Free to use under the Pexels License
- Cropped to 640×640 for model input (see the crop sketch below)
- Contains moving pedestrians, vehicles, and buses in an urban setting
- Provides a consistent benchmark for comparing precision and performance
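As a concrete illustration of the cropping step, here is a minimal sketch (not a script from this repo) that center-crops the downloaded clip and resizes it to 640×640 with OpenCV; the input filename is an assumption.

```python
import cv2

cap = cv2.VideoCapture("london_intersection.mp4")  # assumed name of the downloaded Pexels clip
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
out = cv2.VideoWriter("test.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (640, 640))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    side = min(h, w)
    y0, x0 = (h - side) // 2, (w - side) // 2  # central square crop
    out.write(cv2.resize(frame[y0:y0 + side, x0:x0 + side], (640, 640)))

cap.release()
out.release()
```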
## Phase 1 – MacBook Pro (M2) Baseline: CPU FP32
The first phase establishes an FP32 CPU baseline using a 640×640 video input.
It reflects an unoptimized, CPU-only inference path to provide a realistic reference for later GPU acceleration.
### Setup
- Hardware: MacBook Pro (M2, CPU)
- Framework: Ultralytics YOLOv8n (PyTorch + ONNX, FP32)
- Input: 640×640 video (test.mp4, from Pexels – London Intersection)
### Results
| Resolution | FPS | Precision | Accelerator | Framework |
|---|---|---|---|---|
| 320×320 | 27.38 | FP32 | CPU | Ultralytics YOLOv8n |
| 480×480 | 25.53 | FP32 | CPU | Ultralytics YOLOv8n |
| 640×640 | 22.15 | FP32 | CPU | Ultralytics YOLOv8n |
Demo: demo_clip.mp4
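A minimal sketch of how end-to-end FPS like the numbers above can be measured with Ultralytics on the CPU; the file names and timing loop are illustrative, not the repo's exact benchmark script.

```python
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # FP32 PyTorch weights

for imgsz in (320, 480, 640):
    start, frames = time.perf_counter(), 0
    # stream=True yields one result per frame instead of buffering the whole video
    for _ in model.predict(source="test.mp4", imgsz=imgsz, device="cpu", stream=True, verbose=False):
        frames += 1
    elapsed = time.perf_counter() - start
    print(f"{imgsz}x{imgsz}: {frames / elapsed:.2f} FPS (end-to-end)")
```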
## Phase 2 – Jetson Orin Nano Acceleration: FP32 vs FP16 (TensorRT)
Phase 2 compares FP32 vs FP16 inference on the Jetson Orin Nano across three input sizes (640, 480, 320).
The goal is to quantify both precision impact and resolution scaling using TensorRT optimization.
### Setup
- Model: YOLOv8n (ONNX → TensorRT, FP32 & FP16)
- Hardware: Jetson Orin Nano 8 GB
- Frameworks: Ultralytics YOLO + TensorRT 10.3 + CUDA 12.6
- Input: 640×640 video (test.mp4, from Pexels – London Intersection)
### Jetson GPU Benchmarks
| Resolution | FPS (FP32) | FPS (FP16) | Accelerator | Framework |
|---|---|---|---|---|
| 320×320 | 195.684 | 298.535 | GPU (TensorRT) | YOLOv8n |
| 480×480 | 128.831 | 205.119 | GPU (TensorRT) | YOLOv8n |
| 640×640 | 75.7129 | 132.516 | GPU (TensorRT) | YOLOv8n |
Demo: jetson_demo.mp4
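Besides the trtexec route shown in the How to Reproduce section, Ultralytics can also build a TensorRT engine directly on the Jetson; this short sketch is an alternative path, not the exact commands used for the table above.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# half=True requests an FP16 engine; drop it for FP32. Produces yolov8n.engine next to the weights.
model.export(format="engine", half=True, imgsz=640, device=0)
```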
### Visual Comparison – FPS Across Precision and Resolution
To illustrate the performance scaling between FP32 and FP16, the following chart compares YOLOv8n throughput (FPS) across three resolutions on the Jetson Orin Nano.
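If the chart needs to be regenerated, a small matplotlib sketch using the values from the benchmark table above is enough; matplotlib is assumed to be installed and the output filename is arbitrary.

```python
import matplotlib.pyplot as plt

resolutions = ["320x320", "480x480", "640x640"]
fp32 = [195.684, 128.831, 75.7129]   # FPS from the Jetson FP32 column
fp16 = [298.535, 205.119, 132.516]   # FPS from the Jetson FP16 column

x = range(len(resolutions))
plt.bar([i - 0.2 for i in x], fp32, width=0.4, label="FP32 (TensorRT)")
plt.bar([i + 0.2 for i in x], fp16, width=0.4, label="FP16 (TensorRT)")
plt.xticks(list(x), resolutions)
plt.ylabel("FPS")
plt.title("YOLOv8n throughput on Jetson Orin Nano")
plt.legend()
plt.savefig("fps_comparison.png", dpi=150)
```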
### Observations
- FP16 inference yields roughly 1.5–1.8× higher throughput than FP32 on the same GPU.
- Input resolution scales directly with computational cost: going from 320×320 to 640×640 drops FPS by roughly 2.3–2.6× in these runs.
- Compared to the Mac CPU FP32 baseline (22.15 FPS), the Jetson reaches roughly 3.4× the throughput in FP32 and roughly 6× in FP16 at the same 640×640 resolution.
### Precision & Resolution Awareness
| Precision | Description | Typical Use |
|---|---|---|
| FP32 | Full-precision; highest accuracy, slower throughput | CPU / training |
| FP16 | Half-precision; faster compute, small accuracy trade-off | GPU / TensorRT |
| INT8 (Planned) | Quantized; fastest, needs calibration | NPU / Hailo / edge |
Key insight: both precision and input resolution are powerful levers for optimizing edge AI latency and power draw.
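To make the accuracy side of the FP32 vs FP16 trade-off concrete, here is a tiny NumPy illustration; it is purely illustrative and not part of the benchmark code.

```python
import numpy as np

# FP16 has a much coarser machine epsilon than FP32, so casting weights and
# activations down loses a few decimal digits of precision.
print("fp32 eps:", np.finfo(np.float32).eps)   # ~1.19e-07
print("fp16 eps:", np.finfo(np.float16).eps)   # ~9.77e-04

w = np.float32(0.1234567)
print("fp32:", w, "-> fp16:", np.float16(w))   # FP16 keeps only ~3 significant decimal digits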
## Planned Future Phase – Raspberry Pi 5 + Hailo-8 NPU
A planned follow-up is to benchmark the same YOLOv8n pipeline on a Raspberry Pi 5 + Hailo-8 accelerator to compare:
- NPU vs GPU vs CPU performance
- Power efficiency / throughput trade-offs
- Layer-wise operator mapping (Conv / BN / ReLU, etc.)
## Phase 3 – Live Webcam Inference on Jetson Orin Nano (CPU Path)
While resolving CUDA/PyTorch compatibility for JetPack 6.x, I ran live YOLOv8n inference directly on the CPU to keep progress moving forward. Below are the end-to-end frame stage timings captured during live webcam inference.
| Resolution | Preprocessing (ms) | Inference (ms) | Postprocessing (ms) | Approx FPS |
|---|---|---|---|---|
| 640×640 | 2.0 | 435.6 | 2.3 | ~2.2 |
| 480×480 | 4.2 | 289.8 | 1.8 | ~3.4 |
| 320×320 | 1.6 | 148.5 | 1.4 | ~6.4 |
Demo video: Live_webcam
Live video Frame (CPU): Frame_Capture
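The per-stage numbers in the table correspond to the kind of breakdown Ultralytics reports for every frame; a hedged sketch of a live CPU webcam loop that prints those timings (device index and window handling are assumptions) looks like this:

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# stream=True keeps the webcam loop going frame by frame on the CPU
for result in model.predict(source=0, imgsz=640, device="cpu", stream=True, verbose=False):
    speed = result.speed  # per-frame stage times in ms: preprocess / inference / postprocess
    total_ms = sum(speed.values())
    print(f"pre {speed['preprocess']:.1f} ms | infer {speed['inference']:.1f} ms | "
          f"post {speed['postprocess']:.1f} ms | ~{1000.0 / total_ms:.1f} FPS")
    cv2.imshow("YOLOv8n live (CPU)", result.plot())
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()
```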
### Key Takeaways
- The CPU bottleneck is entirely in inference, not preprocessing or drawing.
- Reducing resolution from 640 → 320 provides a ~3× improvement in FPS.
- This benchmark establishes a baseline for measuring TensorRT FP16 GPU speed-up in the next phase.
### Next Step (Phase 4)
Run the same live webcam pipeline using the TensorRT FP16 engine inside the NVIDIA l4t-ml container to demonstrate GPU acceleration and quantify the performance gain.
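A minimal sketch of what that Phase 4 pipeline could look like, assuming an Ultralytics-exported FP16 engine (a raw trtexec engine lacks the metadata Ultralytics expects, so the filename here is an assumption):

```python
from ultralytics import YOLO

# Engine exported on the Jetson, e.g. via model.export(format="engine", half=True)
model = YOLO("yolov8n.engine")
model.predict(source=0, imgsz=640, show=True)  # live webcam inference on the GPU
```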
## Overall Key Takeaways
- TensorRT FP16 provides roughly 1.5–1.8× the FP32 throughput with typically negligible accuracy loss.
- Lower input resolutions drastically improve FPS, which is ideal for mobile robotics.
- Edge AI optimization = balancing precision, resolution, and hardware parallelism.
- This experiment demonstrates real, measurable acceleration gains achievable through practical deployment steps.
## How to Reproduce
```bash
# Export model (dynamic=True gives the ONNX dynamic input dims so trtexec can build one engine for 320/480/640)
yolo export model=yolov8n.pt format=onnx dynamic=True simplify=True
# Build engines
# FP32 engine
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8n.onnx \
--saveEngine=yolov8n_fp32.engine \
--minShapes=images:1x3x320x320 \
--optShapes=images:1x3x480x480 \
--maxShapes=images:1x3x640x640 \
--avgRuns=50
# FP16 engine
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8n.onnx \
--saveEngine=yolov8n_fp16.engine \
--minShapes=images:1x3x320x320 \
--optShapes=images:1x3x480x480 \
--maxShapes=images:1x3x640x640 \
--fp16 \
--avgRuns=50
# Run inference for each resolution
# FP32 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x640x640 --avgRuns=50
# FP16 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x640x640 --avgRuns=50
# Phase 3
pip install ultralytics opencv-python
#640x640
yolo detect model=yolov8n.pt source=0 imgsz=640 device=cpu save=True
#480x480
yolo detect model=yolov8n.pt source=0 imgsz=480 device=cpu save=True
#320x320
yolo detect model=yolov8n.pt source=0 imgsz=320 device=cpu save=True
```