Sure, they have huge GPU clusters, but there must be more going on: model optimizations, sharding, custom hardware, clever load balancing, etc.
What engineering tricks make this possible at such massive scale while keeping latency low?
Curious to hear insights from people who've built large-scale ML systems.
minimaxir•6mo ago
That's a really, really big "sure."
Almost every trick to run an LLM at OpenAI's scale is a trade secret, and many may not be easily understood by mere mortals anyway (e.g., bare-metal CUDA optimizations).
v5v3•6mo ago
With all the staff poaching, might the trade secrets have leaked by now?
minimaxir•6mo ago
thrown-0825•6mo ago
It's also the reason John Carmack got sued by ZeniMax when he went to Oculus.
handfuloflight•6mo ago