Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant

2•freakynit•1h ago

https://w418ufqpha7gzj-80.proxy.runpod.net

Started for myself, but since Im not using it continuously, sharing it:

Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance).

5 parallel requests supported.. full context available (please don't misuse..there are no safety guards in place)

Open till spot instance lasts or max 4 hours.

And yes, no request logging (I don't even know how to do it with llama-server)

Prompt processing and generation speeds (at 8K context): 900t/s and 60t/s. And at 100K context: 450t/s and 30t/s.

Command used:

    ./build/bin/llama-server \
      -m ../Qwen3.6-35B-A3B-UD-Q5_K_M.gguf \
      --alias 'Qwen3-6-35B-A3B-turbo' \
      --ctx-size 262144 \
      --no-mmproj \
      --host 0.0.0.0 \
      --port 80 \
      --jinja \
      --flash-attn on \
      --cache-type-k turbo3 \
      --cache-type-v turbo3 \
      --reasoning off \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --parallel 5.0 \
      --cont-batching \
      --threads 16 \
      --threads-batch 16

Thanks..

Comments

freakynit•37m ago

Update: spot terminated

EnthrallingEmil•5m ago

Also 3090. using Q4_XL, reduced max context size, with 100k prompt length, I get 2520 tk/s for prompt processing, 68 token/s generation:

  llama-server \
    --model /mnt/ubuntu/models/llama-cpp-qwen/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --ctx-size 150000 \
    --n-gpu-layers 99 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --parallel 3 \
    --kv-unified \
    --ctx-checkpoints 32 \
    --checkpoint-every-n-tokens 8192 \
    --checkpoint-min-tokens 64 \
    --flash-attn on \
    --batch-size 4096 \
    --ubatch-size 1024 \
    --reasoning on \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20

I was wondering if turboquant is worth the effort right now, but I'm not yet seeing it speed wise.

checkpoint-min-tokens is a local patch I have so that small background tasks don't wreck my checkpoint cache.

AWS Security Agent on-demand penetration testing now generally available

The Strait of Hormuz is now open

AI Tool Blindness

Solo founders and indie hackers should have a backup plan

Two US citizens sentenced for running North Korean laptop farms

Stop Killing Games at the European Parliament Full Hearing [video]

Show HN: Using an AI agent to refine a ML model for Zephyr RTOS

Cloudflare: The Agent Readiness score. Is your site agent-ready?

Consider sending a list of everything you did to your coworkers everyday

Scientists Develop "Molecular Scissors" Alternative to Cas9

Rejoice: A concatenative multiset language built on Fractran-like primitives

Why Amazon Is Buying Globalstar–and What It Means for Your iPhone

Chinese fabs import US chipmaking equipment via Singapore and Malaysia

How should you change your life if we are being watched by alien drone probes?

Distill MCP – Turn your reading queue into a podcast, via Claude Code MCP

Is 1 Nit Enough? – Phone Minimum Display Brightness

Linux 7.1 Crypto Code Rework Enables More Optimizations by Default

What Is Infrastructure from Code?

A third of Americans don't drive. So why is our transportation so car-centric?

Teaching a Model to Code

Replaced Official Release Date Trailer [video]

Anthropic Quadruples London Office Amid US Regulatory Tensions

White House Investigating Wave of Missing or Dead Scientists

High Amplitude Disagreeableness – Stay SaaSy

Reflections on Trusting Trust [pdf]

Twilio Account Hacked

Show HN: Use real handwriting for messages and forums (Write Me, Maybe)

WorldSeed – define a world in YAML, let AI agents live in it

Great Docs for Python Project Documentation

PostgreSQL MVCC, Byte by Byte