The problem: Most moderation APIs either flag everything containing "ass" (including "assistant" and "class") or miss obviously toxic content, because they rely on keyword matching alone. They can't tell the difference between "I'll destroy you in this game" (friendly gaming banter) and actual threats.
The Profanity API uses a 5-layer detection pipeline:

- L0: Instant blocklist lookup (~2ms)
- L1: Fuzzy matching for obfuscated text like "f*ck" or "sh1t"
- L2: 768-dimensional semantic embeddings to catch meaning, not just words
- L3: Context classification (gaming, professional, child_safe, etc.)
- L4: LLM-powered intent analysis for edge cases
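Roughly how the layering plays out, as a hedged TypeScript sketch (the Layer interface, thresholds, and names are my illustration, not the actual code): cheap layers run first and short-circuit when they're confident either way, so the expensive layers only ever see ambiguous input.

```typescript
// Illustrative only -- layer names, types, and thresholds are assumptions.
type Verdict = { score: number; flagged: boolean; layer: string };

interface Layer {
  name: string;
  run(text: string, context: string): Promise<Verdict>;
}

async function moderate(text: string, context: string, layers: Layer[]): Promise<Verdict> {
  let last: Verdict = { score: 0, flagged: false, layer: "none" };
  for (const layer of layers) {
    last = await layer.run(text, context);
    // Confidently clean or confidently toxic? Stop here -- the later,
    // slower layers (embeddings, LLM) only run for the murky middle.
    if (last.score < 0.1 || last.score > 0.9) return last;
  }
  return last;
}
```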
The key insight: layers can disagree. If the blocklist flags "kill" but semantic analysis scores it low in a gaming context, that disagreement triggers LLM analysis. That's how "I'll kill it in the presentation" gets separated from genuinely concerning content.
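A minimal sketch of that escalation rule, assuming a boolean blocklist hit from L0 and a 0-1 toxicity score from L2 (the 0.4 threshold and function name are hypothetical):

```typescript
// Hypothetical shape of the cross-layer signals used for escalation.
interface LayerScores {
  blocklistHit: boolean;   // L0: a blocklisted term was found
  semanticScore: number;   // L2: 0..1 toxicity from context-aware embeddings
}

function needsLlmReview(scores: LayerScores): boolean {
  // Blocklist says "toxic" but semantics say "benign in this context"
  // (e.g. "I'll kill it in the presentation") -> escalate to L4.
  return scores.blocklistHit && scores.semanticScore < 0.4;
}

// "kill" is blocklisted, but embeddings score it low in a gaming context:
console.log(needsLlmReview({ blocklistHit: true, semanticScore: 0.15 })); // true -> run L4
```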
Built with Cloudflare Workers, Durable Objects, and Groq's llama-3.1-8b for the LLM layer. Pricing is tiered, so you only pay for LLM calls when they're actually needed.
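For context, a stripped-down Worker handler could look roughly like this; the request shape, the cheap-layer placeholder, and the exact Groq model id are assumptions, with Groq reached through its OpenAI-compatible chat completions endpoint only when the cheap layers can't decide:

```typescript
// Sketch of a Workers entry point, not the production code.
export interface Env { GROQ_API_KEY: string }

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const { text, context } = (await req.json()) as { text: string; context: string };

    // Placeholder for L0-L3 (blocklist, fuzzy match, embeddings, context classifier).
    const cheapVerdict = { flagged: false, ambiguous: true };
    if (!cheapVerdict.ambiguous) return Response.json(cheapVerdict);

    // L4: only ambiguous cases pay for an LLM call.
    const llm = await fetch("https://api.groq.com/openai/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.GROQ_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "llama-3.1-8b-instant", // assumed Groq model id
        messages: [
          { role: "system", content: "Classify the user's text as TOXIC or BENIGN given the context." },
          { role: "user", content: `Context: ${context}\nText: ${text}` },
        ],
      }),
    });
    const data: any = await llm.json();
    return Response.json({ layer: "L4", verdict: data.choices?.[0]?.message?.content });
  },
};
```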
Happy to go deep on the detection logic, false positive reduction, or the skip engine that decides which layers to run.