
Show HN: Glq LLM quantization using E8 lattice

https://github.com/cnygaard/glq

1•acd•1h ago

With the help of AI, I have created glq, an open source library for E8 codebook quantization of LLMs. I built glq as a PC gamer and devops engineer interested in LLMs and AI. The current high RAM prices and the resource usage of LLMs also inspired me to write it. The question I wanted to answer: can you squeeze more out of a gaming GPU with limited VRAM by using alternative LLM compression methods?

Glq is effective compared to other LLM quantization algorithms in the range of 2 to 4 bits per weight. Its effectiveness at low bits per weight is due to the properties of the E8 lattice compared to linear methods. Glq also supports mixed-precision quantization, where different LLM layers use different bit widths depending on how sensitive each layer is to quantization. Think of mixed precision a bit like MP3 or MP4 variable bit rate encoding.
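To make the lattice idea concrete, here is a minimal sketch of snapping an 8-dimensional block of weights to its nearest E8 lattice point, using the standard Conway and Sloane decoder (E8 is the union of D8 and D8 shifted by 1/2). This is not glq's code: the function names `nearest_d8` and `nearest_e8` are made up for illustration, and a real quantizer would add scaling and a finite codebook on top.

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = np.rint(x)
    if int(r.sum()) % 2 != 0:
        # Parity is odd: re-round the coordinate with the largest rounding
        # error in the other direction, which is the cheapest way to fix it.
        err = x - r
        k = int(np.argmax(np.abs(err)))
        r[k] += 1.0 if err[k] >= 0 else -1.0
    return r

def nearest_e8(x):
    """Nearest E8 point, where E8 = D8 union (D8 + 1/2)."""
    x = np.asarray(x, dtype=np.float64)
    a = nearest_d8(x)              # candidate with integer coordinates
    b = nearest_d8(x - 0.5) + 0.5  # candidate with half-integer coordinates
    # Keep whichever candidate is closer to the input.
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

if __name__ == "__main__":
    block = np.array([0.9, -0.2, 0.4, 1.3, -0.6, 0.1, 0.7, -1.1])  # made-up weights
    print(nearest_e8(block))
```

The input is always a group of eight values, which is why E8 quantizers process weights in blocks of eight rather than one at a time.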

I currently develop glq on AWS g7e spot instances to keep costs reasonable.

Glq uses vLLM.

The 4-bit E8 key-value cache was inspired by NexusQuant. The goal is to fit about four times as much key-value cache in VRAM as BF16 would allow, or about two times as much as INT8.
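The four-times and two-times figures follow from the bit widths alone (16 vs 4 bits, and 8 vs 4 bits). A rough sketch of the arithmetic, assuming a hypothetical model shape and ignoring any overhead for scales or codebooks, so real savings would be somewhat smaller:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size in bytes for one sequence."""
    # The factor 2 accounts for storing both keys and values.
    total_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return total_values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

bf16 = kv_cache_bytes(**shape, bits_per_value=16)
int8 = kv_cache_bytes(**shape, bits_per_value=8)
four_bit = kv_cache_bytes(**shape, bits_per_value=4)

if __name__ == "__main__":
    for name, size in [("BF16", bf16), ("INT8", int8), ("4-bit", four_bit)]:
        print(f"{name}: {size / 2**20:.0f} MiB")
    print("BF16 / 4-bit:", bf16 / four_bit)
    print("INT8 / 4-bit:", int8 / four_bit)
```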

At the start I mistakenly picked an E8 codebook size of 65536 entries instead of 4096 entries, which fits better in GPU L1 cache. It turns out that 65536 codebook entries give a higher LLM compression rate, but at the cost of decode speed. I am trying to compensate by using Nvidia CUDA graphs and optimizing the decode path; this is currently a work in progress.
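A back-of-the-envelope look at why the codebook size matters for cache residency. This sketch assumes each codebook entry is an 8-dimensional vector stored as FP16 (2 bytes per value); that storage layout is my assumption for illustration, not something confirmed from glq's source.

```python
import math

E8_DIM = 8  # each codebook entry covers a block of 8 weights

def codebook_bytes(entries, bytes_per_value=2):
    """Size of the lookup table, assuming FP16 entries (an assumption)."""
    return entries * E8_DIM * bytes_per_value

def index_bits_per_weight(entries):
    """Bits spent on one codebook index, spread over the 8 weights it encodes."""
    return math.log2(entries) / E8_DIM

if __name__ == "__main__":
    for entries in (4096, 65536):
        kib = codebook_bytes(entries) / 1024
        bpw = index_bits_per_weight(entries)
        print(f"{entries} entries: {kib:.0f} KiB table, {bpw} index bits per weight")
```

Under these assumptions the 4096-entry table is 64 KiB while the 65536-entry table is 1 MiB, a 16x difference in the amount of data the decode kernel has to keep close to the compute units.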

To install glq in a Python virtual environment on Linux with an Nvidia GPU: pip install glq

Python PIP package https://pypi.org/project/glq/

Glq source code. https://github.com/cnygaard/glq

Current PC RAM Prices that inspired the library. https://pcpartpicker.com/trends/price/memory/

https://en.wikipedia.org/wiki/E8_lattice An eight-dimensional lattice that gives the optimal solution to the sphere packing problem in eight dimensions. Think of it a bit like stacking cannonballs or apples in an optimal way, only you swap the apples for LLM weights.

Picture of an E8 lattice https://en.wikipedia.org/wiki/E8_polytope#/media/File:E8_gra...

Credits: GLQ was inspired by QuIP# (E8), and the E8 key-value compression was inspired by NexusQuant.

Math: The sphere packing problem in dimension 8, Maryna Viazovska https://arxiv.org/abs/1603.04246

4bpw glq quantization of Gemma 4 E4B (instruction-tuned) https://huggingface.co/xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw

3.5bpw mixed precision quantization of SmolLM3 https://huggingface.co/xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

Docker image of glq for Nvidia GPUs (requires the Nvidia Container Toolkit):

docker run --rm --gpus all \
  -v "$HOME/.cache/huggingface:/cache/hf" \
  ghcr.io/cnygaard/glq-env:0.5.0 \
  python -c '
import glq.hf_integration, torch  # registers GLQ with HF
from transformers import AutoModelForCausalLM, AutoTokenizer
mid = "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, device_map="cuda", torch_dtype=torch.float16)
ids = tok("The capital of France is", return_tensors="pt").to("cuda")
# **ids unpacks input_ids and attention_mask as keyword arguments
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
'

Work in progress: improving decode speed and supporting more LLM architectures.

Open question: does glq work on the Nvidia DGX Spark and on gaming Nvidia hardware such as the RTX 4070 through 5090?
