**Benchmark results**

Duration: 18.4s | Throughput: 704.5 req/s | Avg latency: 17.6 ms | P50 latency: 17.9 ms | P95 latency: 32.0 ms | P99 latency: 33.8 ms

| Category | Accuracy |
| --- | --- |
| Financial Leak Detection | 100% |
| PII / Private Data | 100% |
| Strategic Data | 100% |
| Malicious Code | 95% |
| OWASP LLM Top 10 | 87% |
| Multi-Turn Attacks | 67% |
| General Benign (False Positives) | 66% |
| **Overall** | **88.7%** |

F1: 0.927 | Precision: 0.922 | Recall: 0.932 | Specificity: 0.740

**The problem I'm facing:** After 2 months of tuning, I've gone from 74% to 88.7% overall accuracy, but I've hit a wall where improving one category hurts another. Specifically:

- The false positive rate is too high for general/technical content (the system over-blocks benign code and text).
- Multi-turn conversation attacks sit at 67%; the model doesn't fully leverage conversation context yet.
- Every time I push one metric up, something else drops.

**My actual question:** Do I ship a limited beta now, restricted to the use cases where it performs at 95-100% (financial data, PII, strategic leaks), or do I keep tuning before any real-world exposure?

**Why I want to ship:**

- Real-world data will teach me more than synthetic test cases.
- The high-value use cases already work extremely well.
- I've been optimizing against synthetic benchmarks for 2 months.

**Why I want to wait:**

- A 34% false positive rate on general content will frustrate users.
- Multi-turn is a known attack vector that's currently weak.
- First impressions matter.

Website if you want to see more details: https://ebmsovereign.com/ (all forms on the site are currently disabled except for email, which will be available for testing within 24 hours).

I genuinely want to hear from people who've shipped security products or ML systems in production. What would you do?
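For anyone sanity-checking the headline numbers: the F1 follows directly from the stated precision and recall (it is their harmonic mean), and the specificity is what drives the benign-content over-blocking. A quick check using only the figures quoted in the post:

```python
# F1 is the harmonic mean of precision and recall.
precision = 0.922
recall = 0.932
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # -> 0.927, matching the reported F1

# Specificity = TN / (TN + FP): the fraction of benign traffic
# correctly allowed through. 0.740 overall implies a 26% false
# positive rate across all benign inputs; on the "General Benign"
# slice alone it is 34% (accuracy 66%).
specificity = 0.740
fp_rate = 1 - specificity
print(round(fp_rate, 2))  # -> 0.26
```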
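On the "limited beta" option itself, one common middle ground is to enforce blocking only for categories that already clear a high accuracy bar, and run the weaker categories in log-only (shadow) mode, so real traffic still generates labeled data for them without user-facing false positives. A minimal sketch of that routing idea, with hypothetical category names and thresholds (none of this comes from the actual product):

```python
# Hypothetical per-category benchmark accuracies, taken from the
# numbers quoted in the post.
CATEGORY_ACCURACY = {
    "financial_leak": 1.00,
    "pii": 1.00,
    "strategic_data": 1.00,
    "malicious_code": 0.95,
    "owasp_llm_top10": 0.87,
    "multi_turn": 0.67,
}

ENFORCE_THRESHOLD = 0.95  # only block on categories at or above this bar


def routing_mode(category: str) -> str:
    """Return 'enforce' (block on detection) or 'log_only' (shadow mode)."""
    accuracy = CATEGORY_ACCURACY.get(category, 0.0)
    return "enforce" if accuracy >= ENFORCE_THRESHOLD else "log_only"


enforced = sorted(c for c in CATEGORY_ACCURACY if routing_mode(c) == "enforce")
print(enforced)
# -> ['financial_leak', 'malicious_code', 'pii', 'strategic_data']
```

Shadow mode would let multi-turn detections be collected and reviewed from real conversations without blocking anyone, which directly serves the "real-world data will teach me more" goal while avoiding the first-impressions risk.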