I built a support ticket classifier using a fine-tuned Qwen2.5-0.5B model. It determines intent, category, urgency, sentiment, and routing — all in a single inference.
*Why I built this:* A company needed to automate ticket routing but couldn't use cloud LLM APIs due to data privacy requirements. Self-hosted was the only option.
*Stack:*
- Qwen2.5-0.5B-Instruct (fine-tuned, not LoRA)
- GGUF Q4_K_M quantization (350MB)
- llama-cpp-python + FastAPI (rough serving sketch below)
- Docker on a $10/mo VPS
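
Roughly how the serving layer fits together (a simplified sketch; the model path, prompt, endpoint name, and output fields here are illustrative, not the exact production code):

```python
# Sketch of the serving layer: load the fine-tuned GGUF model once,
# expose one FastAPI endpoint, classify a ticket in a single inference.
import json

from fastapi import FastAPI
from llama_cpp import Llama
from pydantic import BaseModel

llm = Llama(
    model_path="qwen2.5-0.5b-tickets-q4_k_m.gguf",  # hypothetical file name
    n_ctx=2048,
    verbose=False,
)

app = FastAPI()

class Ticket(BaseModel):
    subject: str
    body: str

SYSTEM_PROMPT = (
    "Classify the support ticket. Respond with a JSON object containing the keys "
    "intent, category, urgency, sentiment, routing."
)

@app.post("/classify")
def classify(ticket: Ticket):
    # One generation returns all five fields; the fine-tuned model is
    # expected to emit valid JSON, which we parse and return as-is.
    result = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{ticket.subject}\n\n{ticket.body}"},
        ],
        temperature=0.0,
        max_tokens=128,
    )
    return json.loads(result["choices"][0]["message"]["content"])
```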
*Results:*
- ~90% accuracy on intent/category (on a synthetic ~4K-example dataset; accuracy should improve with real data and 5-10K examples)
- ~150ms per ticket on Apple Silicon, 3-5s on a budget VPS (old Xeon without AVX2)
*When this makes sense vs cloud APIs:*
- Data must stay on-premise
- High volume (>10K tickets/month) where API costs add up
- Narrow classification task (not general chat)
*Try it:*
- Demo: https://silentworks.tech/test
- API docs: https://silentworks.tech/docs
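
Example request (the endpoint path and payload fields here are illustrative; check the API docs above for the actual schema):

```python
# Hypothetical client call against the hosted demo; the real endpoint path
# and request/response schema are described at https://silentworks.tech/docs.
import requests

resp = requests.post(
    "https://silentworks.tech/classify",  # assumed path, not confirmed
    json={
        "subject": "Refund not received",
        "body": "I returned my order two weeks ago and still have no refund.",
    },
    timeout=30,
)
print(resp.json())  # expected: intent, category, urgency, sentiment, routing
```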
Happy to discuss the implementation details, training approach, or deployment setup.
---
Contact: https://t.me/var_molchanov