Hey! I'm working on a small-scale multi-drone control system and I'm looking for an open-source VLM that can run in real time on a Jetson Orin. If anyone knows a suitable model or is personally interested in this kind of edge robotics problem, I'd love pointers.
What I'm trying to solve:
I have four simultaneous video streams, one per drone (grayscale, 320×320). I can feed the model either:
• a 2×2 mosaic frame, or
• 4 separate frames as a batch.
Along with this, I provide a short text instruction describing the mission state.
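For context, the mosaic path is just a tiling step before the model ever sees the frame. A minimal numpy sketch of what I mean (frame ordering and the 640×640 output size are my own choices, not tied to any particular model's input format):

```python
import numpy as np

def make_mosaic(frames):
    """Tile four 320x320 grayscale frames into a single 640x640 2x2 mosaic.

    frames: list of 4 arrays shaped (320, 320), ordered
    [drone_1, drone_2, drone_3, drone_4] -> top-left, top-right,
    bottom-left, bottom-right.
    """
    assert len(frames) == 4 and all(f.shape == (320, 320) for f in frames)
    top = np.hstack([frames[0], frames[1]])
    bottom = np.hstack([frames[2], frames[3]])
    return np.vstack([top, bottom])  # shape (640, 640)

# quick check with dummy frames
mosaic = make_mosaic([np.zeros((320, 320), dtype=np.uint8) for _ in range(4)])
print(mosaic.shape)  # (640, 640)
```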
What I need from the model:
A single structured JSON command representing the next action for the swarm controller. Something like this (schema not finalized):
{"action":"move_forward","confidence":0.87,"reason":"front corridor detected, no obstacles in drone_2 and drone_4 views"}
So I need a VLM that can:
• handle multi-image or mosaic image input
• run efficiently on a Jetson Orin (ideally INT4/INT8-friendly and TensorRT-compatible)
• generate stable JSON outputs based on visual + textual context
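By "stable JSON" I mostly mean the controller can rely on the output being parseable. My current plan is to wrap generation in a bounded retry loop with a safe fallback action, roughly like below; vlm_generate is a stand-in for whatever model/runtime I end up using, not a real API:

```python
import json

SAFE_FALLBACK = {"action": "hold", "confidence": 0.0, "reason": "no valid command produced"}

def vlm_generate(images, prompt) -> str:
    """Stand-in for the actual VLM call (model and runtime still TBD)."""
    raise NotImplementedError

def next_command(images, instruction, max_retries=2):
    """Query the VLM, retry on unparseable output, and fall back to a safe hold."""
    prompt = instruction + "\nRespond with a single JSON object only."
    for _ in range(max_retries + 1):
        try:
            cmd = json.loads(vlm_generate(images, prompt))
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(cmd, dict) and "action" in cmd:
            return cmd
    return SAFE_FALLBACK  # keep the swarm in a safe state on repeated failures
```

Constrained/grammar-based decoding would obviously be nicer than retrying, so pointers to models or runtimes that support it on-device are also welcome.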
I would really appreciate suggestions, or even just thoughts on what architectures make sense here.
Happy to share benchmarks or test anything people want to throw at this problem. The multi-drone video + action JSON setup is niche but potentially useful to others building edge-deployed agents.