The problem: Current ML profilers either dump too much data (torch.profiler) or abstract away the details you need. You can't see why your model is actually slow - is it memory bandwidth? Kernel launch overhead? Cache misses?
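For contrast, here's the standard torch.profiler workflow (this uses the real torch.profiler API): you get a long flat table of kernel timings, and correlating those rows back to your model code and the actual bottleneck is left to you:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

    # Prints a long table of kernel timings sorted by CUDA time; figuring
    # out which Python op launched each kernel, and whether it's limited by
    # bandwidth, launch overhead, or cache misses, is still on you.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))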
Our approach: We're reverse engineering GPU execution to trace from Python ops down to PTX instructions. One decorator gives you the full execution graph with the actual bottlenecks highlighted (rough sketch after the details below).
Technical details:

- Traces Python → CUDA kernels → PTX with timing breakdowns
- Shows memory access patterns and bandwidth utilization
- Analyzes kernel occupancy and scheduling
- Works with PyTorch and JAX; TensorFlow support coming
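Here's a rough sketch of the intended workflow. The decorator name below is illustrative, not necessarily the exact kandc API; check the docs linked below for the real entry point:

    import torch
    import kandc  # the profiling library (free beta)

    # Illustrative decorator name (not necessarily the real kandc API):
    # wrapping a function captures everything it launches on the GPU.
    @kandc.trace
    def step(model, x):
        return model(x)

    model = torch.nn.Linear(4096, 4096).cuda()
    out = step(model, torch.randn(64, 4096, device="cuda"))
    # Produces an execution graph (Python op -> CUDA kernel -> PTX) with
    # per-kernel timing, bandwidth utilization, and occupancy attached.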
We used this to optimize Llama inference and found bottlenecks we couldn't see before, which got us a 50%+ speedup: https://www.herdora.com/blog/the-overlooked-gpu
Free beta with 10 hours of profiling: https://keysandcaches.com
GitHub: https://github.com/Herdora/kandc
Docs: https://www.keysandcaches.com/docs
Curious what inference bottlenecks others are hitting that current tools can't diagnose. What's your experience with existing profilers? It would be great to hear thoughts from the community :)