Hi HN,
I built gflow, a single-node GPU job scheduler meant as a lightweight alternative to SLURM. If you've ever shared a multi-GPU machine with teammates and dealt with GPU conflicts, or just wanted a simple way to queue up training jobs overnight, this is for you.
The problem: SLURM is designed for clusters. Setting it up on a single machine is overkill, and most ML teams sharing a workstation end up with ad-hoc
solutions — checking nvidia-smi, shouting in Slack, or writing hacky bash scripts.
gflow gives you:
- Job queue with GPU-aware scheduling — auto-detects GPUs via NVML, handles allocation and sets CUDA_VISIBLE_DEVICES for you
- Job dependencies — with AND/OR logic, so you can chain training → eval → export (rough sketch after this list)
- Job arrays — for hyperparameter sweeps
- tmux-based execution — jobs run in tmux sessions, so you can attach to see live output, and they survive terminal disconnects
- Conda environment support — specify the env per job
- Webhook notifications — get pinged when jobs finish or fail
- SLURM-inspired CLI — gbatch, gqueue, gcancel, gjob — familiar if you've used SLURM, simpler if you haven't
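To make the dependency and GPU-allocation pieces concrete, here's a rough sketch. The --after flag name, the script and checkpoint paths, and the assumption that gbatch hands the trailing command to the shell unchanged are mine for illustration only; see the README for the real dependency syntax.

    # Train, then run eval only once training has finished
    # (--after is an illustrative flag name, not necessarily the real one).
    gbatch --gpus 2 --conda myenv python train.py --lr 0.001
    gbatch --gpus 1 --conda myenv --after 1 python eval.py --ckpt out/best.pt

    # Each job only sees the GPUs gflow allocated to it, because gflow
    # sets CUDA_VISIBLE_DEVICES in the job's environment.
    gbatch --gpus 1 bash -c 'echo "allocated GPUs: $CUDA_VISIBLE_DEVICES"'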
It's a single Rust binary. Install via pip install runqd (pre-built binaries), cargo install, or grab a release from GitHub. Run gflowd init to set up, gflowd up to start the daemon, and you're scheduling jobs.
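For completeness, a first run end to end (pip route shown; cargo install or a GitHub release gets you the same binary):

    pip install runqd    # pre-built binary, no Rust toolchain needed
    gflowd init          # one-time setup
    gflowd up            # start the scheduler daemon
    gqueue               # sanity check: lists the (currently empty) job queue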
Example:

    gbatch --gpus 2 --conda myenv python train.py --lr 0.001
    gqueue
    gjob log 1
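Because jobs run inside tmux, you can also jump onto the live output instead of tailing logs. The session naming and the gcancel argument form below are my assumptions; tmux ls will show the actual session names.

    tmux ls                    # gflow runs each job in its own tmux session
    tmux attach -t <session>   # watch live output; detach with Ctrl-b d
    gcancel 1                  # assumed form: cancel job by ID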
Tech stack: Rust, Tokio, Axum for the REST API, NVML for GPU detection, and MessagePack for state persistence.
GitHub: https://github.com/AndPuQing/gflow
Happy to answer any questions.