🧠 Edge-AI YOLOv8n Demo – Precision and Performance Across Platforms

This repository documents my hands-on exploration of edge-AI inference performance across platforms and numerical precisions.
Each phase benchmarks YOLOv8n under different hardware, precision modes, and input resolutions to understand how each factor impacts real-time performance.


🎥 Input Dataset

All tests used a single publicly available video:

A Busy Intersection Road in London – by Mikhail Nilov (Pexels)

  • License: Free to use under the Pexels License
  • Cropped to 640×640 for model input (see the cropping sketch after this list)
  • Contains moving pedestrians, vehicles, and buses in an urban setting
  • Provides a consistent benchmark for comparing precision and performance
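
A minimal sketch of the center-crop step, assuming the downloaded clip is saved as input.mp4 and is at least 640 px in both dimensions (file names are illustrative, not the original script):

```python
import cv2

# Center-crop each frame of the source clip to 640x640 and write test.mp4
cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
out = cv2.VideoWriter("test.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (640, 640))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    y0, x0 = (h - 640) // 2, (w - 640) // 2
    out.write(frame[y0:y0 + 640, x0:x0 + 640])

cap.release()
out.release()
```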

🚩 Phase 1 – MacBook Pro (M2) Baseline – CPU FP32

The first phase establishes an FP32 CPU baseline using a 640×640 video input.
It reflects an unoptimized, CPU-only inference path to provide a realistic reference for later GPU acceleration.

⚙️ Setup

  • Model: YOLOv8n (PyTorch weights via Ultralytics)
  • Hardware: MacBook Pro (M2), CPU-only inference
  • Precision: FP32
  • Input: 640×640 video (test.mp4, cropped from the Pexels clip)

📊 Results

| Resolution | FPS   | Precision | Accelerator | Framework           |
|------------|-------|-----------|-------------|---------------------|
| 320×320    | 27.38 | FP32      | CPU         | Ultralytics YOLOv8n |
| 480×480    | 25.53 | FP32      | CPU         | Ultralytics YOLOv8n |
| 640×640    | 22.15 | FP32      | CPU         | Ultralytics YOLOv8n |

🎬 Demo: demo_clip.mp4
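
A minimal sketch of how this baseline can be measured with the Ultralytics Python API (not the original benchmark script; the FPS accounting is an assumption, and imgsz can be swapped for 480 or 320 to cover the other rows):

```python
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # FP32 PyTorch weights

frames = 0
start = time.perf_counter()
# stream=True yields one Results object per frame without buffering the video
for _ in model.predict(source="test.mp4", imgsz=640, device="cpu", stream=True):
    frames += 1

elapsed = time.perf_counter() - start
print(f"{frames / elapsed:.2f} FPS (FP32, CPU, 640x640)")
```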


⚡ Phase 2 – Jetson Orin Nano Acceleration – FP32 vs FP16 (TensorRT)

Phase 2 compares FP32 vs FP16 inference on the Jetson Orin Nano across three input sizes (640, 480, 320).
The goal is to quantify both precision impact and resolution scaling using TensorRT optimization.

⚙️ Setup

  • Model: YOLOv8n (ONNX → TensorRT FP32 & FP16)
  • Hardware: Jetson Orin Nano 8 GB
  • Frameworks: Ultralytics YOLO + TensorRT 10.3 + CUDA 12.6
  • Input: 640×640 video (test.mp4, from Pexels – London Intersection)

📊 Jetson GPU Benchmarks

| Resolution | FPS (FP32) | FPS (FP16) | Accelerator    | Framework |
|------------|------------|------------|----------------|-----------|
| 320×320    | 195.68     | 298.54     | GPU (TensorRT) | YOLOv8n   |
| 480×480    | 128.83     | 205.12     | GPU (TensorRT) | YOLOv8n   |
| 640×640    | 75.71      | 132.52     | GPU (TensorRT) | YOLOv8n   |

🎬 Demo: jetson_demo.mp4
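
As an alternative to the trtexec commands under "How to Reproduce" below, the engines can also be built and run through the Ultralytics exporter; a sketch, assuming TensorRT is installed on the device (as it is on JetPack):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# half=True requests an FP16 engine; omit it for FP32. Writes yolov8n.engine.
model.export(format="engine", half=True, imgsz=640)

trt_model = YOLO("yolov8n.engine")
results = trt_model.predict(source="test.mp4", imgsz=640)
```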


📈 Visual Comparison – FPS Across Precision and Resolution

To illustrate the performance scaling between FP32 and FP16, the following chart compares YOLOv8n throughput (FPS) across three resolutions on the Jetson Orin Nano.

FPS Comparison Chart
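
If the chart image is unavailable, it can be regenerated from the benchmark table above; a matplotlib sketch (not the original plotting script):

```python
import matplotlib.pyplot as plt

resolutions = ["320x320", "480x480", "640x640"]
fp32 = [195.68, 128.83, 75.71]   # FPS, from the Jetson benchmarks above
fp16 = [298.54, 205.12, 132.52]

x = range(len(resolutions))
width = 0.35
plt.bar([i - width / 2 for i in x], fp32, width, label="FP32")
plt.bar([i + width / 2 for i in x], fp16, width, label="FP16")
plt.xticks(list(x), resolutions)
plt.ylabel("FPS")
plt.title("YOLOv8n on Jetson Orin Nano (TensorRT)")
plt.legend()
plt.savefig("fps_comparison.png", dpi=150)
```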

🧮 Observations

  • FP16 inference yields ~1.5–1.8× higher throughput than FP32 on the same GPU (checked in the snippet below).
  • Resolution scales directly with compute cost: going from 320×320 to 640×640 cuts FPS by roughly 2.3–2.6×.
  • Compared to the Mac CPU FP32 baseline (22.15 FPS at 640×640), the Jetson reaches ~3.4× the throughput in FP32 and ~6× in FP16.
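
These ratios follow directly from the tables; a quick check using the rounded FPS values above:

```python
fp32 = {"320x320": 195.68, "480x480": 128.83, "640x640": 75.71}
fp16 = {"320x320": 298.54, "480x480": 205.12, "640x640": 132.52}

for res in fp32:
    print(f"{res}: FP16/FP32 = {fp16[res] / fp32[res]:.2f}x")
# 320x320: 1.53x, 480x480: 1.59x, 640x640: 1.75x

print(f"Jetson FP16 vs Mac CPU FP32 at 640x640: {132.52 / 22.15:.1f}x")  # ~6.0x
```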

🧩 Precision & Resolution Awareness

| Precision      | Description                                               | Typical Use        |
|----------------|-----------------------------------------------------------|--------------------|
| FP32           | Full precision; highest accuracy, slower throughput       | CPU / training     |
| FP16           | Half precision; faster compute, small accuracy trade-off  | GPU / TensorRT     |
| INT8 (planned) | Quantized; fastest, needs calibration                     | NPU / Hailo / edge |

Key insight: both precision and input resolution are powerful levers for optimizing edge AI latency and power draw.


🔭 Phase 5 (Planned) – Raspberry Pi 5 + Hailo 8 NPU

A later phase will benchmark the same YOLOv8n pipeline on a Raspberry Pi 5 + Hailo 8 accelerator to compare:
- NPU vs GPU vs CPU performance
- Power efficiency / throughput trade-offs
- Layer-wise operator mapping (Conv / BN / ReLU, etc.)


Phase 3 – Live Webcam Inference on Jetson Orin Nano (CPU Path)

While resolving CUDA/PyTorch compatibility for JetPack 6.x, I ran live YOLOv8n inference directly on the CPU to keep progress moving forward. Below are the end-to-end frame stage timings captured during live webcam inference.

| Resolution | Preprocessing (ms) | Inference (ms) | Postprocessing (ms) | Approx. FPS |
|------------|--------------------|----------------|---------------------|-------------|
| 640×640    | 2.0                | 435.6          | 2.3                 | ~2.2        |
| 480×480    | 4.2                | 289.8          | 1.8                 | ~3.4        |
| 320×320    | 1.6                | 148.5          | 1.4                 | ~6.4        |
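
A minimal sketch of how such stage timings can be captured during live webcam inference, using Ultralytics' per-frame speed report (the FPS estimate from the three stages is an approximation):

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# source=0 opens the default webcam; stream=True processes frames as they arrive
for result in model.predict(source=0, imgsz=640, device="cpu", stream=True):
    s = result.speed  # per-frame stage timings in milliseconds
    fps = 1000.0 / (s["preprocess"] + s["inference"] + s["postprocess"])
    print(f"pre {s['preprocess']:.1f} ms | inf {s['inference']:.1f} ms | "
          f"post {s['postprocess']:.1f} ms | ~{fps:.1f} FPS")
```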

Demo video: Live_webcam

Live video frame (CPU): Frame_Capture

Key Takeaways

  • The CPU bottleneck is almost entirely in inference, not in preprocessing or postprocessing.
  • Reducing resolution from 640 → 320 provides a ~3× improvement in FPS.
  • This benchmark establishes the baseline for measuring TensorRT FP16 GPU speed-up in the next phase.

Next Step (Phase 4)
Run the same live webcam pipeline using the TensorRT FP16 engine inside the NVIDIA l4t-ml container to demonstrate GPU acceleration and quantify the performance gain.
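
A preview sketch of that pipeline, assuming the engine was exported through the Ultralytics API (as in the Phase 2 sketch) so YOLO() can load it with its embedded metadata; at the Python level nothing changes inside the l4t-ml container:

```python
from ultralytics import YOLO

# Assumes yolov8n.engine was built with model.export(format="engine", half=True)
trt_model = YOLO("yolov8n.engine")

for result in trt_model.predict(source=0, imgsz=640, stream=True):
    print(f"inference {result.speed['inference']:.1f} ms")
```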


🧠 Key Takeaways

  • TensorRT FP16 provides near-double throughput with negligible accuracy loss.
  • Lower input resolutions drastically improve FPS – ideal for mobile robotics.
  • Edge AI optimization = balancing precision, resolution, and hardware parallelism.
  • This experiment demonstrates real, measurable acceleration gains achievable through practical deployment steps.

🛠 How to Reproduce

# Export model (dynamic=True enables the variable input shapes used by the engine builds below)
yolo export model=yolov8n.pt format=onnx simplify=True dynamic=True

# Build engines

# FP32 engine
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8n.onnx \
  --saveEngine=yolov8n_fp32.engine \
  --minShapes=images:1x3x320x320 \
  --optShapes=images:1x3x480x480 \
  --maxShapes=images:1x3x640x640 \
  --avgRuns=50

# FP16 engine (note: saved as yolov8n_fp16.engine, matching the benchmark commands below)
/usr/src/tensorrt/bin/trtexec \
  --onnx=yolov8n.onnx \
  --saveEngine=yolov8n_fp16.engine \
  --minShapes=images:1x3x320x320 \
  --optShapes=images:1x3x480x480 \
  --maxShapes=images:1x3x640x640 \
  --fp16 \
  --avgRuns=50

# Benchmark the built engines at each resolution

# FP32 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp32.engine --shapes=images:1x3x640x640 --avgRuns=50

# FP16 benchmarks
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x320x320 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x480x480 --avgRuns=50
/usr/src/tensorrt/bin/trtexec --loadEngine=yolov8n_fp16.engine --shapes=images:1x3x640x640 --avgRuns=50

# Phase 3 – live webcam on CPU
pip install ultralytics opencv-python

#640x640
yolo detect model=yolov8n.pt source=0 imgsz=640 device=cpu save=True

#480x480
yolo detect model=yolov8n.pt source=0 imgsz=480 device=cpu save=True

#320x320
yolo detect model=yolov8n.pt source=0 imgsz=320 device=cpu save=True