Over the past year I’ve been building X-AnyLabeling, an open-source project that started as a labeling tool and gradually evolved into a multimodal annotation ecosystem.
The original problem wasn’t drawing boxes or masks. It was that annotation, inference, and training are usually fragmented into separate tools, which makes iteration slow and painful — especially for small teams.
X-AnyLabeling tries to unify these pieces:
- A desktop-first annotation client (cross-platform, pure Python)
- Pluggable AI inference (local or remote GPU servers)
- Built-in support for multimodal data construction (VQA, image–text dialogs)
- Direct integration with training pipelines (Ultralytics), forming a label–train–infer loop (see the sketch below)
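To make the loop concrete, here is a minimal sketch of what the train–infer half looks like once labels are exported in YOLO format. The paths and `data.yaml` are hypothetical placeholders; only the Ultralytics calls themselves are standard:

```python
# Minimal label–train–infer loop, assuming annotations were exported
# from the tool in YOLO format (paths here are hypothetical).
from ultralytics import YOLO

# 1. Train on the exported annotations.
model = YOLO("yolov8n.pt")  # start from a pretrained checkpoint
model.train(data="exported/data.yaml", epochs=50, imgsz=640)

# 2. Run inference on unlabeled images; predictions can be loaded
#    back as pre-annotations for the next labeling pass.
results = model.predict(source="unlabeled/", conf=0.25)
for r in results:
    print(r.path, r.boxes.xyxy)  # per-image detections
```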
The system is designed to be modular: heavy models run remotely, annotation stays lightweight, and users can integrate anything from SAM and OCR to VLMs like Qwen or GPT-style APIs.
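As a rough illustration of that modularity, the client only needs to speak a small backend interface while the heavy model runs wherever it lives. This is a hypothetical sketch, not X-AnyLabeling's actual plugin API; the class names and endpoint are illustrative:

```python
# Sketch of the pluggable-inference idea (names are illustrative,
# not the project's real API): the annotation client depends only
# on this interface, so the heavy model can run anywhere.
from abc import ABC, abstractmethod

import requests


class InferenceBackend(ABC):
    @abstractmethod
    def predict(self, image_path: str) -> list[dict]:
        """Return annotations, e.g. [{'label': ..., 'points': ...}, ...]."""


class RemoteBackend(InferenceBackend):
    """Offloads inference to a GPU server; the client stays lightweight."""

    def __init__(self, url: str):
        self.url = url

    def predict(self, image_path: str) -> list[dict]:
        with open(image_path, "rb") as f:
            resp = requests.post(self.url, files={"image": f}, timeout=30)
        resp.raise_for_status()
        return resp.json()["annotations"]


# Swapping in SAM, OCR, or a VLM is just another backend implementation.
backend = RemoteBackend("http://gpu-server:8000/predict")
```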
Project: https://github.com/CVHub520/X-AnyLabeling
I’m interested in feedback from people who work on real-world CV or multimodal pipelines — especially around how annotation tools should evolve beyond manual labeling.