
The Genus Amanita

https://www.mushroomexpert.com/amanita.html
1•rolph•49s ago•0 comments

We have broken SHA-1 in practice

https://shattered.io/
1•mooreds•1m ago•1 comment

Ask HN: Was my first management job bad, or is this what management is like?

1•Buttons840•2m ago•0 comments

Ask HN: How to Reduce Time Spent Crimping?

1•pinkmuffinere•3m ago•0 comments

KV Cache Transform Coding for Compact Storage in LLM Inference

https://arxiv.org/abs/2511.01815
1•walterbell•8m ago•0 comments

A quantitative, multimodal wearable bioelectronic device for stress assessment

https://www.nature.com/articles/s41467-025-67747-9
1•PaulHoule•10m ago•0 comments

Why Big Tech Is Throwing Cash into India in Quest for AI Supremacy

https://www.wsj.com/world/india/why-big-tech-is-throwing-cash-into-india-in-quest-for-ai-supremac...
1•saikatsg•10m ago•0 comments

How to shoot yourself in the foot – 2026 edition

https://github.com/aweussom/HowToShootYourselfInTheFoot
1•aweussom•10m ago•0 comments

Eight More Months of Agents

https://crawshaw.io/blog/eight-more-months-of-agents
3•archb•12m ago•0 comments

From Human Thought to Machine Coordination

https://www.psychologytoday.com/us/blog/the-digital-self/202602/from-human-thought-to-machine-coo...
1•walterbell•13m ago•0 comments

The new X API pricing must be a joke

https://developer.x.com/
1•danver0•14m ago•0 comments

Show HN: RMA Dashboard fast SAST results for monorepos (SARIF and triage)

https://rma-dashboard.bukhari-kibuka7.workers.dev/
1•bumahkib7•14m ago•0 comments

Show HN: Source code graphRAG for Java/Kotlin development based on jQAssistant

https://github.com/2015xli/jqassistant-graph-rag
1•artigent•19m ago•0 comments

Python Only Has One Real Competitor

https://mccue.dev/pages/2-6-26-python-competitor
3•dragandj•20m ago•0 comments

Tmux to Zellij (and Back)

https://www.mauriciopoppe.com/notes/tmux-to-zellij/
1•maurizzzio•21m ago•1 comment

Ask HN: How are you using specialized agents to accelerate your work?

1•otterley•22m ago•0 comments

Passing user_id through 6 services? OTel Baggage fixes this

https://signoz.io/blog/otel-baggage/
1•pranay01•23m ago•0 comments

DavMail Pop/IMAP/SMTP/Caldav/Carddav/LDAP Exchange Gateway

https://davmail.sourceforge.net/
1•todsacerdoti•24m ago•0 comments

Visual data modelling in the browser (open source)

https://github.com/sqlmodel/sqlmodel
1•Sean766•26m ago•0 comments

Show HN: Tharos – CLI to find and autofix security bugs using local LLMs

https://github.com/chinonsochikelue/tharos
1•fluantix•26m ago•0 comments

Oddly Simple GUI Programs

https://simonsafar.com/2024/win32_lights/
1•MaximilianEmel•27m ago•0 comments

The New Playbook for Leaders [pdf]

https://www.ibli.com/IBLI%20OnePagers%20The%20Plays%20Summarized.pdf
1•mooreds•27m ago•1 comment

Interactive Unboxing of J Dilla's Donuts

https://donuts20.vercel.app
1•sngahane•29m ago•0 comments

OneCourt helps blind and low-vision fans to track Super Bowl live

https://www.dezeen.com/2026/02/06/onecourt-tactile-device-super-bowl-blind-low-vision-fans/
1•gaws•30m ago•0 comments

Rudolf Vrba

https://en.wikipedia.org/wiki/Rudolf_Vrba
1•mooreds•31m ago•0 comments

Autism Incidence in Girls and Boys May Be Nearly Equal, Study Suggests

https://www.medpagetoday.com/neurology/autism/119747
1•paulpauper•32m ago•0 comments

Wellness Hotels Discovery Application

https://aurio.place/
1•cherrylinedev•32m ago•1 comment

NASA delays moon rocket launch by a month after fuel leaks during test

https://www.theguardian.com/science/2026/feb/03/nasa-delays-moon-rocket-launch-month-fuel-leaks-a...
1•mooreds•33m ago•0 comments

Sebastian Galiani on the Marginal Revolution

https://marginalrevolution.com/marginalrevolution/2026/02/sebastian-galiani-on-the-marginal-revol...
2•paulpauper•36m ago•0 comments

Ask HN: Are we at the point where software can improve itself?

1•ManuelKiessling•36m ago•2 comments

Predict your distributed LLM training time before you burn GPU hours

https://github.com/DebarghaG/estimate-train-time
2•barthelomew•2w ago

Comments

barthelomew•2w ago
Predict your distributed LLM training time before you burn GPU hours.

We've open-sourced a tool (https://github.com/DebarghaG/estimate-train-time) that estimates wall-clock time for LLM training across multi-GPU setups with 3D parallelism (pipeline, tensor, and data).

This problem is extremely hard: you're modeling the interplay of thousands of GPU kernels, NCCL collectives across heterogeneous network topologies, pipeline bubbles, activation recomputation, and ZeRO optimizer communication, all while these components interact in non-obvious ways at scale. Even estimates that are off by 2x are useless for capacity planning.
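
To give a flavor of just one of those communication terms, here is a minimal sketch of the standard alpha-beta cost model for a ring all-reduce. This is illustrative only (the tool profiles real NCCL collectives rather than assuming an analytical model), and the latency/bandwidth numbers are made up:

```python
def ring_allreduce_time(bytes_total: float, n_gpus: int,
                        latency_s: float, bandwidth_bps: float) -> float:
    """Alpha-beta model of a ring all-reduce: 2*(n-1) latency-bound steps,
    and each link carries 2*(n-1)/n of the total payload."""
    if n_gpus < 2:
        return 0.0  # nothing to synchronize
    steps = 2 * (n_gpus - 1)
    volume = 2 * (n_gpus - 1) / n_gpus * bytes_total
    return steps * latency_s + volume / bandwidth_bps

# Hypothetical example: a 1 GiB gradient all-reduce over 8 GPUs,
# 5 us per-step latency, 100 GB/s effective link bandwidth.
t = ring_allreduce_time(2**30, 8, 5e-6, 100e9)  # roughly 19 ms
```

Even this simple model shows why composition is hard: the same collective is latency-dominated for small messages and bandwidth-dominated for large ones, and real NCCL behavior deviates from both regimes.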

This took two years of painstaking work and ~$100k worth of cluster time, and is validated on real workloads at Perlmutter (NERSC) and Vista (TACC), some of the largest HPC clusters available for open science.

How it works:

1. Kernel-level profiling: we sample execution times for kernels like Flash Attention, fused GEMM (QKV/FFN projections), RMSNorm, embedding lookups, and cross-entropy loss across the (batch, seq_len, hidden_dim, num_heads, MP degree) parameter space.

2. Communication modeling: NCCL benchmarks capture ring all-reduce (tensor/data-parallel sync), all-gather (ZeRO-1 parameter collection), and P2P send/recv (pipeline-stage activation transfers) across intra-node NVLink and inter-node InfiniBand topologies.

3. Analytical composition: operator predictions feed into a pipeline scheduling model (AF-AB / 1F1B) that accounts for bubble overhead (an idle fraction of (PP - 1) / (num_microbatches + PP - 1)), layer distribution across head/middle/tail stages, and overlapped DP gradient sync.

4. CPU-only estimation: after sampling, predictions run on CPU; no GPU access is needed to estimate training time.
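
The bubble term in step 3 can be sketched directly from that formula. This is a simplified illustration of how a pipeline-bubble fraction inflates an ideal step time, not the tool's actual composition code:

```python
def bubble_fraction(pp: int, num_microbatches: int) -> float:
    """Idle fraction of a 1F1B pipeline: (PP - 1) / (m + PP - 1)."""
    return (pp - 1) / (num_microbatches + pp - 1)

def step_time(t_microbatch: float, pp: int, num_microbatches: int) -> float:
    """Ideal compute time stretched by pipeline under-utilization."""
    ideal = t_microbatch * num_microbatches
    return ideal / (1.0 - bubble_fraction(pp, num_microbatches))

# With 8 pipeline stages and 32 microbatches, 7/39 of the schedule
# is bubble, so a 3.2 s ideal step stretches to 3.9 s.
print(round(bubble_fraction(8, 32), 4))  # 0.1795
print(round(step_time(0.1, 8, 32), 4))   # 3.9
```

The formula also makes the usual tuning advice quantitative: raising num_microbatches relative to PP shrinks the bubble, which is why microbatch count is a first-order knob in parallelism-strategy search.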

The tool is designed as an extensible recipe: you can profile your own hardware with the bundled kernel-sampling and NCCL-benchmarking scripts, and add custom operators by implementing the regressor interface.
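
As a rough idea of what such a regressor interface might look like, here is a hypothetical sketch. The names (`OperatorRegressor`, `fit`, `predict`) and the toy FLOPs-proportional model are illustrative assumptions, not the repo's actual API:

```python
from typing import Mapping, Protocol

class OperatorRegressor(Protocol):
    """Hypothetical interface: fit on measured kernel times, predict new configs."""
    def fit(self, samples: list[tuple[Mapping[str, int], float]]) -> None: ...
    def predict(self, config: Mapping[str, int]) -> float: ...

class LinearFlopsRegressor:
    """Toy regressor: time ~ coeff * FLOPs, fit by one-feature least squares."""
    def __init__(self) -> None:
        self.coeff = 0.0

    @staticmethod
    def _flops(cfg: Mapping[str, int]) -> float:
        # e.g. a (batch x m) @ (m x n)-shaped op: ~2 * batch * m * n FLOPs
        return 2.0 * cfg["batch"] * cfg["m"] * cfg["n"]

    def fit(self, samples):
        num = sum(self._flops(c) * t for c, t in samples)
        den = sum(self._flops(c) ** 2 for c, _ in samples)
        self.coeff = num / den

    def predict(self, cfg):
        return self.coeff * self._flops(cfg)
```

A real operator model would regress on the full (batch, seq_len, hidden_dim, num_heads, MP degree) space rather than a single FLOPs feature, but the fit/predict shape stays the same.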

This work builds on our HiPC 2025 paper on fine-grained GPU performance modeling. Earlier code reproducing the paper's results: https://github.com/ICICLE-ai/distributed_training_estimator_...

We're looking for early adopters and feedback, especially from teams doing parallelism-strategy search or capacity planning at scale.