We Made CUDA Optimization Suck Less

https://www.rightnowai.co/

47•jaberjaber23•9mo ago

Comments

jaberjaber23•9mo ago

We’re RightNow AI. We built a tool that automatically profiles CUDA code, detects bottlenecks, and generates optimized kernels using AI.

If you’ve written CUDA before, you know how it goes. You spend hours tweaking memory access, digging through profiler dumps, swapping out intrinsics, and praying it’ll run faster. Most of the time, you're guessing.

We got tired of it. So we built something that just works.

What RightNow AI Actually Does

Prompt-based CUDA Kernel Generation: Describe what you want in plain English. Get fast, optimized CUDA code back. No need to know the difference between global and shared memory layouts.

Serverless GPU Profiling: Run your code on real GPUs without having local hardware. Get detailed reports about where it's slow and why.

Performance Optimizations That Deliver: Not vague advice like “try more threads.” We return rewritten code. Our users are seeing 2x to 4x improvements out of the box. Some hit 20x.

Why We Built It

We needed it for our own work. Our ML stack was bottlenecked by GPU code we didn’t have time to optimize. Existing tools felt ancient. The workflow was slow, clunky, and filled with trial and error.

We thought: what if I could just say "optimize this kernel for A100" and get something useful?

So we built it.

RightNow AI is live. You can try it for free: https://www.rightnowai.co/

If you use it and hit something rough, tell us. We’ll fix it.

paulirish•8mo ago

What does one of the GPU profiling reports look like?

Edit: oh is it this? https://youtu.be/b-yh3FFpSX8?t=28

jaberjaber23•8mo ago

Yup, that video shows our profiling reports :D

3abiton•8mo ago

How is this different from what unsloth is doing?

jaberjaber23•8mo ago

We profile and optimize kernels live on real GPUs!! So we’re different than unsloth

PontifexCipher•8mo ago

No examples of before/after? Maybe I missed something.

jaberjaber23•8mo ago

we’re adding those soon!!

godelski•8mo ago

I was expecting something like TensorRT or Triton, but found "Vibe Coding"

The project seems very naive. CUDA programming sucks because there's a lot of little gotchas and nuances that dramatically change performance. These optimizations can also significantly change between GPU architectures: you'll get different performance out of Volta, Ampere, or Blackwell. Parallel programming is hard in the first place, and it gets harder on GPUs because of all these little intricacies. People that have been doing CUDA programming for years are still learning new techniques. It takes a very different type of programming skill. Like actually understanding that Knuth's "premature optimization is the root of all evil" means "get a profiler" not "don't optimize". All this is what makes writing good kernels take so long. That's even after Nvidia engineers are spending tons of time trying to simplify it.

So I'm not surprised people are getting 2x or 4x out of the box. I'd expect that much if a person grabbed a profiler. I'd honestly expect more if they spent a week or two with the documentation and serious effort. But nothing in the landing page is convincing me the LLM can actually significantly help. Maybe I'm wrong! But it is unclear if the lead dev has significant CUDA experience. And I don't want something that optimizes a kernel for an A100, I want kernelS that are optimized for multiple architectures. That's the hard part and all those little nuances are exactly what LLM coding tends to be really bad at.

germanjoey•8mo ago

TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! I mean, it depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best case scenario. Whereas the best case for hand-optimized could be in the tens or even hundreds of X.

godelski•8mo ago

I'm a bit confused. It sounds like you are disagreeing ("TBH") but the content seems like a summary of my comment. So, I agree.

Fwiw, they did say they got up to 20x improvement, but given the issues we both mention that may not be surprising, since it seems to be an outlier by their own admission.

jaberjaber23•8mo ago

absolutely. it really depends on the kernel type, target architecture, and what you're optimizing for. the 2x-4x isn’t the limit, it's just what users often see out of the box. we do real-time profiling on actual GPUs, so you get results based on real performance on a specific arch, not guesses. when the baseline is rough, we’ve seen well over 10x

jaberjaber23•8mo ago

totally agree. we're not trying to replace deep CUDA knowledge :) just wanted to skip the constant guess and check.

every time we generate a kernel, we profile it on real GPUs (serverless) so you see how it runs on specific architectures. not just "trust the code"; we show you what it does. still early, but it’s helping people move faster

godelski•8mo ago

Btw, I'm not talking deep CUDA knowledge. That takes years. I'm specifically talking about novices. The knowledge you get from a few weeks. I'd be quite hesitant to call someone an expert in a topic when they have less than a few years of experience. There's exceptions but expertise isn't quickly gained. Hell, you could have years of experience but if all you did is read Medium blogs and Stack Overflow you'd probably still be a novice.

I get that you profile. I liked that part. But even as the other commenter says, it's unclear how to evaluate given the examples. Showing some real examples would be critical to sell people on this. Idk, maybe people blindly buy too but personally I'd be worried about integrating significant tech debt. It's easy to do that with kernels or anytime you're close to the metal. The nuances dominate these domains

jaberjaber23•8mo ago

Do you have a place where we can chat? LinkedIn, ...

godelski•8mo ago

Sorry, I'm not the CUDA expert you should be looking for. My efforts are in ML and I only dabble in CUDA and am friends with systems people. I'd suggest reaching out to system people.

I'd suggest you use that NVIDIA connection and reach out to the HPC teams there. Anyone working on CUTLASS, TensorRT, cuTENSOR, or maybe even the CuPy team could give you a lot better advice than me.

jaberjaber23•8mo ago

I really appreciate that!! thanks:D

cjbgkagh•8mo ago

The website appears vibe coded, as do the Product Hunt reviews, with "RightNow AI is an impressive..." appearing more than would be expected by random chance.

Either someone is good at writing CUDA Kernels and a 1-10% perf improvement is impressive, or they're bad at writing CUDA Kernels and a 2x-4x over naïve very often isn't impressive.

What percentage of people who do write custom CUDA kernels are bad at it? How many are so bad at it that they leave 20x on the table as claimed on the website?
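To put the figures in this thread on one scale, here is a minimal Python sketch that converts a speedup factor into the share of the original runtime that was removed. The helper name `runtime_saved` is hypothetical, and reading a "10% perf improvement" as a 1.1x speedup is an assumption; the inputs are only the numbers quoted above, not measurements.

```python
def runtime_saved(speedup: float) -> float:
    """Fraction of the original runtime eliminated by a given speedup factor.

    Hypothetical helper for illustration: a kernel that runs `speedup` times
    faster takes 1/speedup of the original time, so 1 - 1/speedup was removed.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Figures quoted in the thread; 1.1x is an assumed reading of "10% faster".
for factor in (1.1, 2.0, 4.0, 20.0):
    print(f"{factor:>4}x faster -> {runtime_saved(factor):.1%} of the original runtime removed")
```

Read this way, a 2x gain means half of the baseline runtime was avoidable and a 20x gain means 95% of it was, which is why the quality of the starting kernel matters so much when judging these claims.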

What could have helped sell it to me as a concept is an example of a before and after.

EDIT: One of the reviews states "RightNow AI is an innovative tool designed to help developers profile and optimize CUDA code efficiently. Users have praised its ability to identify bottlenecks and enhance GPU performance. For example, one user stated, "RightNow AI is a game-changer for GPU optimization."" I think some of the AI prompt has leaked into the output.

jaberjaber23•8mo ago

2x-4x improvements are normal when starting from a naive kernel, but sometimes we see gains well over 10x. Every kernel is profiled live on real GPUs (serverless), so you get accurate performance data for the specific architecture.

Before-and-after examples would definitely help, and we’re adding those soon. Thanks for the feedback.

godelski•8mo ago

  > helps me optimize  kernels without spending nights debugging.
    - Vender Relations Manager

  > Wishing you best of luck.You managed to take one of the most painful parts of CUDA dev and turn it effortless.
    - Smart Home Innovators

  > No more wrestling with annual performance tuning - just hit go and let AI handle the heavy lifting boosting your CUDA code by up to 20x with zero extra effort.
    - B2B SaaS Growth Marketing Consultant

  > great！This is what I want！
    - Serial entrepreneur, started in finance

I didn't even look at the Product Hunt reviews until you mentioned it. Is it always this botty?

cjbgkagh•8mo ago

This is the only product I've ever looked up on Product Hunt and I only looked it up because I was wondering who was giving them positive reviews. It appears that Product Hunt is a growth hacker den and I guess they 'hacked' it with bots. There does appear to be a lack of quality control for reviews on that site, which I guess is Product Hunt's own version of a growth hack. I think this is why people are retreating to cloistered private communities where users can have a higher degree of confidence that they're interacting with real people.

godelski•8mo ago

Fair enough. I've even been considering dumping HN for months, and it is my only refuge left. Can't seem to find anywhere I can have strong confidence I'm talking to real people anymore. Dark Forest I guess...

cjbgkagh•8mo ago

I've been finding comfort in old books and have largely given up on interfacing with humanity. Hacker News is the last place I regularly frequent and it is getting noticeably worse as time goes on.

jaberjaber23•8mo ago

what's up guys, take it easy. Just to clarify: I didn’t add any reviews myself. I’m building this SOLO and barely have time to finish the product, let alone fake comments. I wasn’t even aware of the Product Hunt stuff until people here pointed it out!!!!

I just put the product out there for anyone who wants to try it for FREE and share feedback. I already have a good number of real users, and they’re happy with it

godelski•8mo ago

We're not talking about you or your project. We're talking about a bigger problem. Sure, you're kinda contributing to it but so am I with my research. We're more concerned with how people are abusing the technologies. Sure, I don't like vibe coding, but I'll let the security people explain that one to you. Really my issue with vibe coding is that I don't think we should be shipping products that are below TRL 6 and vibe coding is really around 3 or 4.

cjbgkagh•8mo ago

You've taken on one of the most difficult problems in tech and yet you become indignant when you've not been given sufficient deference. You have not yet demonstrated that you have earned it.

Perhaps if you were not aware of the reviews then you shouldn't be linking to them from your website.

As an aside, I don't care how hard you have worked, that's not relevant to me when assessing if I should use a product.

techbro92•8mo ago

CUDA optimization actually doesn’t suck that much. I think Nsight studio is amazing and super helpful for profiling and identifying bottlenecks in kernels.

jaberjaber23•8mo ago

Totally, Nsight is great. We do something similar: generate kernels, profile them on real GPUs, then optimize based on that :D

saberience•8mo ago

A vibe-coded product on top of a vibe-coded website, with a load of AI-generated Product Hunt comments.
