We released a Cosmos-Reason2-2B W4A16 + FlashHead build optimized for Jetson devices. FlashHead is a drop-in replacement for the LM head that increases token-generation throughput without sacrificing reasoning quality, and it stacks on top of techniques like quantization.
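For context on what is being swapped out (this is background only, not FlashHead's internals, which aren't described here): a standard LM head is a single linear projection from the final hidden state to vocabulary logits, and it is the layer a drop-in replacement like FlashHead targets. A toy numpy sketch with made-up dimensions:

```python
import numpy as np

# Toy dimensions for illustration only; real models use e.g. 2048 x ~150k.
hidden_size, vocab_size = 8, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)

def lm_head(hidden_state: np.ndarray) -> np.ndarray:
    """Standard LM head: one linear projection from hidden state to logits."""
    return W @ hidden_state

h = rng.standard_normal(hidden_size).astype(np.float32)
logits = lm_head(h)                 # one logit per vocabulary entry
next_token = int(np.argmax(logits)) # greedy pick of the next token id
```

Because this projection runs once per generated token over the full vocabulary, it is a natural target for speeding up decoding.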
Try it with vllm-serve:
ssh <your-orin>
docker run --rm -it \
  --network host \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN_HERE> \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --gpu-memory-utilization 0.75 \
    --trust-remote-code
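Once the server is up you can also hit its OpenAI-compatible chat endpoint from Python. A minimal stdlib-only sketch (the helper names build_chat_request/query are ours; model name and port are taken from the commands here):

```python
import json
import urllib.request

def build_chat_request(model: str, user_msg: str) -> dict:
    # OpenAI-compatible chat completions payload, as served by vLLM.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
    }

def query(base_url: str, payload: dict) -> dict:
    # POST the payload to the server's /v1/chat/completions endpoint.
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead", "Hi"
)
# query("http://localhost:8000", payload)  # uncomment with the server running
```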
curl localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead","messages":[{"role":"user","content":"Hi"}]}'

Jetson video inference benchmark (TPS, batch size = 1, 12 frames, 1280×720):
Device      FP16    W4A16   FlashHead
Orin Nano   OOM     43.7    53.5
AGX Orin    39.6    74.4    92.2
AGX Thor    56.2    88.3    128.2

Model: https://huggingface.co/embedl/Cosmos-Reason2-2B-W4A16-Edge2-...
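FlashHead's gain on top of W4A16 alone falls out of the table directly (numbers copied from above):

```python
# TPS numbers from the benchmark table; FP16 OOMs on Orin Nano.
bench = {
    "Orin Nano": {"fp16": None, "w4a16": 43.7, "flashhead": 53.5},
    "AGX Orin":  {"fp16": 39.6, "w4a16": 74.4, "flashhead": 92.2},
    "AGX Thor":  {"fp16": 56.2, "w4a16": 88.3, "flashhead": 128.2},
}

# Speedup of FlashHead over the W4A16-only build on each device.
speedups = {
    dev: round(r["flashhead"] / r["w4a16"], 2) for dev, r in bench.items()
}
# roughly 1.22x on Orin Nano, 1.24x on AGX Orin, 1.45x on AGX Thor
```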
We’re Embedl, a research startup from Gothenburg, Sweden, and the team behind FlashHead. Let us know what other models you’d like to see it applied to.