What do y'all think – weekend project

2•Venkymatam•8mo ago

Today, many software teams are adding AI features to their apps (customer support bots, writing tools, internal copilots) by writing “prompts” directly into their code. These prompts tell the AI what to say or do. But once the product is live, there's no visibility into what the AI is actually saying to users, how much it’s costing, or when things silently go wrong through hallucinations, tone drift, or token overuse.

I’m hoping to build a solution that helps teams keep these AI features healthy and reliable in production. They’ll have a central database to store all their prompts, test different versions across multiple AI models, compare costs and outputs, and, most importantly, evaluate the “human touch” of the responses. The platform would enable A/B testing across prompt versions to identify which responses perform best, whether in terms of marketing impact, sales conversion, engagement, or overall usage. It would track every AI response, detect unusual or risky behavior, and suggest (or even apply) fixes automatically. Think of it as a real-time quality control system for the AI layer of your product.

The system would be powered by lightweight autonomous agents that watch every model call, flag anomalies, and make context-aware recommendations, or take direct action when safe to do so. These agents would monitor prompt behavior over time, compare version performance, and optimize for clarity, safety, and cost.

Technically, it’s a real-time observability and correction runtime: like Datadog + LaunchDarkly, but built specifically for managing AI prompts and agentic behavior in production.
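To make the idea concrete, here is a rough sketch of two of the pieces described above: a central prompt store with deterministic A/B assignment, and a per-call log that flags unusual token usage. Every name here (`PromptRegistry`, `CallLog`, `z_threshold`) is hypothetical, and the z-score check is just one simple stand-in for anomaly detection, not a finished design.

```python
import hashlib
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class PromptRegistry:
    """Hypothetical central store of prompt versions, keyed by prompt name."""
    versions: dict = field(default_factory=dict)

    def register(self, name: str, version: str, template: str) -> None:
        self.versions.setdefault(name, {})[version] = template

    def assign(self, name: str, user_id: str) -> str:
        """Deterministically bucket a user into one version (A/B split)."""
        options = sorted(self.versions[name])
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        return options[int(digest, 16) % len(options)]


@dataclass
class CallLog:
    """Records token usage per call and flags unusually large calls."""
    tokens: list = field(default_factory=list)
    z_threshold: float = 3.0  # assumed cutoff, would need tuning

    def record(self, n_tokens: int) -> bool:
        """Return True if this call looks anomalous against prior calls."""
        anomalous = False
        if len(self.tokens) >= 5:  # wait for a small baseline first
            mu, sigma = mean(self.tokens), pstdev(self.tokens)
            anomalous = sigma > 0 and (n_tokens - mu) / sigma > self.z_threshold
        self.tokens.append(n_tokens)
        return anomalous
```

Hashing the user ID keeps each user on the same prompt version across sessions, which matters if conversion or engagement is compared per version later.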

Comments

airylizard•8mo ago

I like the idea. The TSCE framework should make the individual agents more reliable and deterministic: https://github.com/AutomationOptimization/tsce_demo

Venkymatam•8mo ago

Thanks for sharing this! I appreciate it. Is it good enough in your opinion for YC?
