Hi HN,
I built a system to run 35B-parameter language models on older Pascal GPUs (P100 +
GTX 1080 Ti) using multi-GPU memory spillover.
Problem: Most LLM inference tools (Ollama, LM Studio) are limited to a single GPU's VRAM,
which caps you at roughly 13B models on a 16GB card. If you have multiple older GPUs, the
second one sits idle.
Solution: Multi-GPU + CPU memory spillover combined with 4-bit NF4 (QLoRA-style)
quantization. The system automatically distributes layers across GPU0 → GPU1 → CPU RAM,
enabling 35B models on hardware that normally maxes out at around 13B.
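Under the hood it's the standard Transformers + Accelerate + bitsandbytes recipe. Roughly, the loading path looks like the sketch below (the memory caps, model name, and exact flags are illustrative, not copied from the repo):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization (QLoRA-style) via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,   # Pascal has no bfloat16 support
    llm_int8_enable_fp32_cpu_offload=True,  # allow layers that spill to CPU RAM
)

# Cap per-device memory so Accelerate spills layers GPU0 -> GPU1 -> CPU RAM.
# Caps are illustrative for a P100 16GB + GTX 1080 Ti 11GB box.
max_memory = {0: "15GiB", 1: "10GiB", "cpu": "48GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",          # Accelerate builds the layer-to-device map
    max_memory=max_memory,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

print(model.hf_device_map)      # shows which layers landed on gpu0 / gpu1 / cpu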
Benchmarks (P100 16GB + GTX 1080 Ti 11GB):
- Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
- OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
- CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)
Quick start:
docker pull rickeshtn/large-model-international_release:latest

docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=268435456 \
  -v $(pwd):/workspace -e HF_HOME=/workspace/model_cache \
  rickeshtn/large-model-international_release:latest \
  python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct
Technical details:
- QLoRA-style 4-bit NF4 quantization (~75% memory reduction vs. FP16)
- HuggingFace Transformers + Accelerate + bitsandbytes
- Automatic device mapping with CPU offload
- Interactive chat with conversation persistence (chat-loop sketch below)
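The chat loop follows the usual pattern: keep a running message list, re-apply the chat template every turn, and dump the history to JSON so a session survives restarts. A rough sketch (the function name and output path are placeholders, and it reuses model/tokenizer from the loading sketch above, not the repo's exact code):

import json

history = []  # list of {"role": ..., "content": ...} dicts

def chat_turn(user_text, max_new_tokens=256):
    # assumes `model` and `tokenizer` from the loading sketch above
    history.append({"role": "user", "content": user_text})
    inputs = tokenizer.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    reply = tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": reply})
    return reply

# persist the conversation between sessions (hypothetical path)
with open("conversation.json", "w") as f:
    json.dump(history, f, indent=2)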
GitHub: https://github.com/rickeshtn/locallm-pascal
Docker Hub: https://hub.docker.com/r/rickeshtn/large-model-international_release
34 users are already running it. Happy to answer technical questions!