OP here. I wanted to test whether the vision encoder's pre-training strategy matters when you stitch it into an LLM. So I froze three encoders (CLIP, I-JEPA, supervised ViT), stitched each into Qwen2.5 with a small trainable projector + LoRA (~3M params), and compared.
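For context on the architecture, here's a minimal sketch of the stitching in PyTorch + transformers/peft. It's illustrative, not the repo's exact code (shown with CLIP; I-JEPA and the supervised ViT load slightly differently, and the model names, LoRA targets, and two-layer MLP projector are assumptions on my part):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

class StitchedVLM(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-base-patch32",
                 llm_name="Qwen/Qwen2.5-0.5B"):
        super().__init__()
        # Frozen vision encoder: only its output features are used.
        self.vision = AutoModel.from_pretrained(vision_name).vision_model
        for p in self.vision.parameters():
            p.requires_grad = False

        llm = AutoModelForCausalLM.from_pretrained(llm_name)
        llm_dim = llm.config.hidden_size

        # LoRA adapters on the LLM attention projections; base weights stay frozen.
        lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                              target_modules=["q_proj", "v_proj"])
        self.llm = get_peft_model(llm, lora_cfg)

        # Small trainable projector: vision hidden size -> LLM hidden size.
        vis_dim = self.vision.config.hidden_size
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, pixel_values, input_ids, attention_mask, labels=None):
        # Patch embeddings from the frozen encoder.
        with torch.no_grad():
            patches = self.vision(pixel_values=pixel_values).last_hidden_state
        img_tokens = self.projector(patches)                      # (B, P, llm_dim)
        txt_tokens = self.llm.get_input_embeddings()(input_ids)   # (B, T, llm_dim)
        inputs_embeds = torch.cat([img_tokens, txt_tokens], dim=1)

        # Extend the attention mask and exclude image positions from the loss.
        img_mask = attention_mask.new_ones(img_tokens.shape[:2])
        attention_mask = torch.cat([img_mask, attention_mask], dim=1)
        if labels is not None:
            ignore = labels.new_full(img_tokens.shape[:2], -100)
            labels = torch.cat([ignore, labels], dim=1)

        return self.llm(inputs_embeds=inputs_embeds,
                        attention_mask=attention_mask, labels=labels)
```

Only the projector and LoRA adapters receive gradients; the encoder and the LLM base weights stay frozen, which is what keeps the trainable budget in the ~3M-parameter range.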
Key findings:
- CLIP dominates on average (language-aligned embeddings make the projector's job trivial).
- But I-JEPA — which has never seen text during pre-training — ties CLIP on compositional reasoning (CLEVR).
- Scaling the LLM from 0.5B to 1.5B helped more than swapping any encoder.
Code, trained weights, and eval scripts are all open: https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM/tree/main
Blog: https://teendifferent.substack.com/p/stitching-vision-into-l...
Curious what others think about I-JEPA-style representations for VLMs — the spatial reasoning results surprised me.