Instead of working on the model itself, we spent days dealing with:
- CUDA version mismatches
- Driver / PyTorch conflicts
- OOM crashes when scaling to multi-GPU
- Broken or outdated open-source training scripts
- Gluing together tracking + eval + deployment manually
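To make the driver / PyTorch conflict pain concrete, this is the kind of sanity check that ends up being run on every new box before anything trains (plain PyTorch calls, nothing specific to our stack):

```python
import torch

# Quick environment sanity check: mismatches between the installed PyTorch
# build, the CUDA toolkit it was compiled against, and the host driver
# usually show up here before any training script runs.
print(torch.__version__)           # e.g. "2.3.0+cu121": the CUDA build this wheel targets
print(torch.version.cuda)          # CUDA toolkit version bundled with this PyTorch build
print(torch.cuda.is_available())   # False on a GPU box usually means a driver/toolkit mismatch
print(torch.cuda.device_count())   # should match the number of GPUs you're paying for
```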
It felt like we were rebuilding the same orchestration layer every team probably rebuilds:
- Cloud providers give raw GPUs.
- MLOps tools give experiment tracking.
- Open-source gives training scripts.
But the end-to-end workflow (dataset → fine-tune → monitor → evaluate → deploy → retrain) still feels stitched together.
We’re exploring building an opinionated platform that lets you:
1. Select a base model (e.g. Llama/Mistral-style open models)
2. Upload or connect datasets
3. Choose an infra tier
4. Launch LoRA/full fine-tuning
5. Monitor loss + cost in real time
6. Run built-in evals
7. Deploy with one click
Basically: abstract away the CUDA + orchestration layer.
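For a sense of what that layer looks like today, here is roughly what step 4 alone (a LoRA fine-tune) involves with open-source tooling, before any infra selection, cost tracking, eval, or deployment. This is a minimal sketch using Hugging Face transformers + peft; the base model, dataset file, and hyperparameters are placeholders, not from our stack:

```python
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Mistral-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach LoRA adapters so only a small fraction of the weights are trained.
lora = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                  lora_dropout=0.05, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Tokenize a placeholder dataset: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

And that is the happy path on a single GPU: multi-GPU launch, checkpoint storage, evaluation, and serving are all still separate tools.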
Before we go too deep, I’d love honest feedback:
- Is this still a painful problem at your company?
- Would serious AI teams use this, or do larger companies just build infra in-house?
- Is this doomed to be a hobbyist tool?
- Where would the real wedge be: training, evaluation, or continuous retraining?
We’ve launched a simple landing page and started building, but we’re still early and trying to validate whether this is a real infra gap or just our own frustration.
Would appreciate blunt feedback.
genxy•1h ago
This shouldn't take days, and CC can already set up all of this using whatever level of rigor you need.
Your business will get replaced with a prompt.