Traditional regex-based secrets scanners (Gitleaks, TruffleHog, detect-secrets) face a fundamental tradeoff: crank up sensitivity and drown in false positives on placeholders like "YOUR_API_KEY_HERE", or tune it down and miss real credentials. We kept hearing from security teams that they couldn't trust their scanning tools because of the noise, and that developers would just ignore the alerts.
Regex is great at fast pattern matching, but terrible at understanding context. So instead of trying to make regex smarter, we built a hybrid system: regex does the initial high-recall sweep, then Narada, a fine-tuned 3B model, filters out false positives by looking at the code surrounding each match.
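To make the two-stage idea concrete, here is a minimal sketch. The regex, the `Candidate` type, and `placeholder_classifier` are all made up for illustration: the pattern is not taken from any real scanner's rule set, and the classifier is a crude keyword heuristic standing in for the model so the example runs without any dependencies.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Illustrative high-recall pattern (an assumption, not a real scanner rule):
# any quoted string of 8+ characters assigned to a secret-looking name.
CANDIDATE_RE = re.compile(
    r"""(?i)(?:api[_-]?key|secret|token|password)\s*[:=]\s*["']([^"']{8,})["']"""
)


@dataclass
class Candidate:
    line_no: int  # 1-based line number of the match
    value: str    # the quoted string the regex captured


def regex_sweep(source: str) -> list[Candidate]:
    """Stage 1: cheap, high-recall pass that over-reports on purpose."""
    found = []
    for i, line in enumerate(source.splitlines(), start=1):
        for m in CANDIDATE_RE.finditer(line):
            found.append(Candidate(i, m.group(1)))
    return found


def hybrid_scan(
    source: str, is_real_secret: Callable[[str, Candidate], bool]
) -> list[Candidate]:
    """Stage 2: keep only the candidates the classifier accepts."""
    return [c for c in regex_sweep(source) if is_real_secret(source, c)]


def placeholder_classifier(source: str, cand: Candidate) -> bool:
    """Hypothetical stand-in for the model: rejects obvious placeholders."""
    return not re.search(r"(?i)your_|example|changeme|placeholder|xxxx", cand.value)
```

In a real setup the callable passed to `hybrid_scan` would call the model rather than a keyword list; keeping it as a plain function argument means the regex stage does not need to know which classifier sits behind it.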
Technical approach:
- Started with teacher-student architecture using DeepSeek R1 as teacher
- Curated ~8K diverse secrets from Samsung's CredData dataset, relabeled for consistency
- Generated synthetic edge cases using Gemini 2.5 Pro and Claude Sonnet 4
- Fine-tuned on ~900 examples with deterministic outputs (not chain-of-thought)
Integration is straightforward – run your existing regex tool, feed candidates to Narada with ±20 lines of context, get structured JSON output with true/false positive classification and reasoning.
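A rough sketch of the glue code on either side of the model call, under some assumptions: the JSON field names (`label`, `reasoning`) and label values (`true_positive`, `false_positive`) are hypothetical, since the exact output schema isn't spelled out here, and the model call itself is left out.

```python
import json


def context_window(lines: list[str], line_no: int, radius: int = 20) -> str:
    """Return the flagged line plus up to `radius` lines on each side.

    `line_no` is 1-based, matching how scanners usually report locations.
    The window is clipped at the start and end of the file.
    """
    start = max(0, line_no - 1 - radius)
    end = min(len(lines), line_no + radius)
    return "\n".join(lines[start:end])


def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the model's JSON reply into (is_true_positive, reasoning).

    Field names and label values are assumptions made for this sketch.
    """
    data = json.loads(raw)
    label = data["label"]
    if label not in ("true_positive", "false_positive"):
        raise ValueError(f"unexpected label: {label!r}")
    return label == "true_positive", data.get("reasoning", "")
```

Raising on an unexpected label, rather than silently treating it as a false positive, means a malformed reply shows up as an error instead of a suppressed finding.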
We built this as part of Autofix Bot's secrets detection agent, and it outperformed static-only tools significantly in our benchmarks [2]. Figured the security community would benefit from having this available as an open-source building block. Would love to hear your feedback and learn what other edge cases you encounter.
[2] https://autofix.bot/benchmarks#benchmarks-secrets-detection
[3] https://autofix.bot/news/narada-secrets-detection-classifica...