Instead of working on the model itself, we spent days dealing with:
- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually
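To make the driver / PyTorch conflict pain concrete, this is the kind of sanity check that ends up being run on every new box before anything trains (plain PyTorch calls, nothing specific to our stack):

```python
import torch

# Quick environment sanity check: mismatches between the installed PyTorch
# build, the CUDA toolkit it was compiled against, and the host driver
# usually show up here before any training script runs.
print(torch.__version__)           # e.g. "2.3.0+cu121": the CUDA build this wheel targets
print(torch.version.cuda)          # CUDA toolkit version bundled with this PyTorch build
print(torch.cuda.is_available())   # False on a GPU box usually means a driver/toolkit mismatch
print(torch.cuda.device_count())   # should match the number of GPUs you're paying for
```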
It felt like we were rebuilding the same orchestration layer every team probably rebuilds:
- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.
But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.
We’re exploring building an opinionated platform that lets you:
1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose an infra tier
4. Launch LoRA/full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in evals
7. Deploy with one click
Basically: abstract away the CUDA + orchestration layer.
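For a sense of what that layer looks like today, here is roughly what step 4 alone (a LoRA fine-tune) involves with open-source tooling, before any infra selection, cost tracking, eval, or deployment. This is a minimal sketch using Hugging Face transformers + peft; the base model, dataset file, and hyperparameters are placeholders, not from our stack:

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Mistral-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach LoRA adapters so only a small fraction of the weights are trained.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Tokenize a placeholder dataset: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

And that is the happy path on a single GPU: multi-GPU launch, checkpoint storage, evaluation, and serving are all still separate tools.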
Before we go too deep, I’d love honest feedback:
- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be: training, evaluation, or continuous retraining?
We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration.
Would appreciate blunt feedback.
genxy•1h ago
This shouldn't take days, and CC can already set up all of this using whatever level of rigor you need.
Your business will get replaced with a prompt.