I built this because llama.cpp crashes with a type-36 error on
BitNet I2_S weights, and bitnet.cpp has no LoRA support.
The core problem: LoRA deltas (~1e-5 magnitude) are erased when
you merge them into ternary weights (~1.2 scale) and
re-quantize. The fine-tuning silently disappears. The only
solution is to never merge — keep the adapter separate, apply
it at full F32 precision at load time, then cast to F16.
ternative loads a base I2_S GGUF + a LoRA adapter GGUF, merges
in F32, and serves via OpenAI-compatible HTTP. Runs all 30
layers of a 2B model on a 4GB GPU at ~6-7 tok/s.
I used it to train and benchmark Orchid 1.0
(https://huggingface.co/MicheRomChis/orchid-1.0) — a BitNet
fine-tune aligned with ORPO. ARC-Challenge: 56.0% (+6.1pp over
base). Technical paper: https://huggingface.co/MicheRomChis/orc
hid-1.0/blob/main/orchid-1-0-technical-paper.pdf
michelangeloro•36m ago