I wanted to test the new Qwen 3.5 Small models (released March 2) on a structured-output task. I fine-tuned the 0.8B, 2B, and 4B variants on text-to-SQL using LoRA on a Mac (64 GB, MLX), with Mistral-Nemo 12B as a baseline.
The 2B beat the 12B by 19 percentage points (50% vs 31% semantic accuracy). My hypothesis: the larger model is "too smart". It computes the answer in its head and outputs "42" instead of writing the SQL that would retrieve it. 81% of the 12B's errors were plain numbers rather than queries.
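Semantic (execution) accuracy is typically measured by running both the gold and the predicted SQL against the database and comparing result sets, and the "plain number" errors can be caught with a simple pattern check. A minimal sketch of both checks, using an in-memory SQLite database; the schema and helper names here are hypothetical illustrations, not taken from the repo:

```python
import re
import sqlite3

def execution_match(gold_sql: str, pred_sql: str, conn: sqlite3.Connection) -> bool:
    """True if both queries execute and return the same set of rows."""
    try:
        gold = sorted(conn.execute(gold_sql).fetchall())
        pred = sorted(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # predicted SQL failed to parse or execute
    return gold == pred

def is_bare_number(output: str) -> bool:
    """Detect the 'answered instead of queried' failure mode: a lone number."""
    return re.fullmatch(r"-?\d+(?:\.\d+)?", output.strip()) is not None

# Tiny in-memory demo schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.0)])

# Two syntactically different but semantically equivalent queries match.
print(execution_match("SELECT SUM(amount) FROM orders", "SELECT 10.0 + 32.0", conn))
print(is_bare_number("42"))                           # the failure mode described above
print(is_bare_number("SELECT COUNT(*) FROM orders"))  # real SQL is not flagged
```

Comparing executed result sets rather than SQL strings is what lets a reformulated but correct query still count as a hit, which is why it is the usual metric for text-to-SQL.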
Everything runs locally, zero cloud compute. The repo has scripts, data and full results to reproduce it.