I'm Kristiyan, former Engineering Manager for Redis' Visual Developer Tools (including Redis Insight). I built BetterDB because Valkey is growing fast but lacks proper observability tooling.
BetterDB is a monitoring platform for Valkey (and Redis) that focuses on what existing tools miss:
Historical persistence – Slowlog entries disappear when the buffer fills. BetterDB persists them so you can see what queries were running at 3am, which clients were connected, and what anomalies were detected — not just current state.
Pattern analysis – Stop scrolling through raw slowlog entries. BetterDB aggregates them and shows you "HGETALL user:* is 80% of your slow queries" — actionable insights, not raw data.
COMMANDLOG support – Valkey 8.1 introduced COMMANDLOG for tracking large requests/replies, not just slow ones. That 50MB MSET that's killing your network? Now you'll see it. BetterDB is the first monitoring tool to support it.
Anomaly detection – Automatic baseline learning with Z-score analysis across 15+ metrics. Know when something's off before your users do.
Prometheus-native – 99 metrics exposed at /prometheus/metrics. No new dashboards to learn: plug into your existing Grafana/Datadog setup and get Valkey-specific data you can't get elsewhere (a sample scrape config is sketched after this list).
Cluster-aware – Automatic node discovery, topology visualization, per-slot metrics, and aggregated slowlogs across all nodes.
ACL audit trail – Track who accessed what, when. ACL denied events by reason and user, persisted for compliance and debugging.
Memory & Latency Doctor – Built-in diagnostics that tell you what's wrong, not just that something is wrong.
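For the Prometheus endpoint mentioned above, a minimal scrape config looks something like this (job name and target are placeholders; point the target at wherever BetterDB is running):

  scrape_configs:
    - job_name: 'betterdb'
      metrics_path: /prometheus/metrics
      static_configs:
        - targets: ['betterdb-host:3001']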
The core is MIT licensed. Pro features (key analytics, AI assistant) live in a separate proprietary/ directory under a source-available license. During the beta, the BETA-TEST license key unlocks everything for free.
Website: https://betterdb.com
GitHub: https://github.com/BetterDB-inc/monitor
Release notes: https://github.com/BetterDB-inc/monitor/releases
Docs: https://docs.betterdb.com
Quick start:
docker pull betterdb/monitor:latest
docker run -d -p 3001:3001 -e DB_HOST=your-valkey-host -e BETTERDB_LICENSE_KEY=BETA-TEST betterdb/monitor:latest
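Once the container is up, a quick way to confirm the metrics endpoint is serving (assuming the port mapping from the run command above):

  curl -s http://localhost:3001/prometheus/metrics | head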
All ideas are welcome and all feedback is important — don't be shy! Star the repo if this is useful, open issues for bugs or feature requests, or just drop a comment here. What pain points do you have with your current Valkey/Redis monitoring setup?
incidentiq•2w ago
The pattern analysis ("HGETALL user:* is 80% of your slow queries") is what teams manually do during postmortems - automating that correlation saves real debugging time.
Two questions:
1. How does the Prometheus integration handle high-cardinality key patterns? One of the pain points with Redis metrics is that per-key metrics can explode label cardinality. Are you sampling or aggregating at the pattern level?
2. For the anomaly detection - what's the baseline learning window? Redis workloads can be very bursty (batch jobs, cache warming after deploy), so false positives on "anomaly" can be noisy if the baseline doesn't account for periodic patterns.
Good timing on the Valkey support - with the Redis license change, a lot of teams are evaluating migration and will need tooling that supports both.
kaliades•1w ago
1. Cardinality: We don't do per-key metrics — that's a guaranteed way to blow up Prometheus. All pattern metrics are aggregated at the command pattern level (e.g., HGETALL user:* not HGETALL user:12345). The pattern extraction normalizes keys so you see the shape of your queries, not the individual keys. For cluster slot metrics, we automatically cap at top 100 slots by key count — otherwise you'd get 16,384 slots × 4 metrics = 65k series just from slot stats. The metrics that can grow are client connections by name/user, but those scale with unique client names, not keys. If it becomes an issue, standard Prometheus relabel_configs can aggregate or drop those labels.
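To make that last point concrete, dropping a high-cardinality client label in your scrape job would look roughly like this (the label name client_name here is just for illustration):

  metric_relabel_configs:
    - action: labeldrop
      regex: 'client_name'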
2. Baseline window: We use a rolling circular buffer of 300 samples (5 minutes at 1-second polling). Minimum 30 samples to warm up before detection kicks in. To reduce noise from bursty workloads, we require 3 consecutive samples above threshold before firing, plus a 60-second cooldown between alerts for the same metric. This helps with the "batch job at 2am" scenario — a single spike won't trigger, but sustained deviation will. That said, you're right that periodic patterns (daily batch jobs, cache warming after deploy) aren't explicitly modeled yet. It's on the roadmap — likely as configurable "expected variance windows" or integration with deployment events. Would love to hear what approach would work best for your use case.
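For anyone who wants to picture the mechanics, here's a rough standalone sketch of the scheme described above (rolling window, warm-up, consecutive-sample requirement, per-metric cooldown). The parameter names and exact math are illustrative, not the actual implementation:

  import time
  from collections import deque
  from statistics import mean, pstdev

  class ZScoreDetector:
      # Illustrative sketch only, not BetterDB's real code.
      def __init__(self, window=300, warmup=30, z_threshold=3.0,
                   consecutive=3, cooldown_s=60):
          self.samples = deque(maxlen=window)  # rolling circular buffer
          self.warmup = warmup                 # min samples before detection starts
          self.z_threshold = z_threshold
          self.consecutive = consecutive       # sustained-deviation requirement
          self.cooldown_s = cooldown_s         # per-metric alert cooldown
          self.streak = 0
          self.last_alert = 0.0

      def observe(self, value, now=None):
          now = time.time() if now is None else now
          baseline = list(self.samples)        # baseline excludes the new sample
          self.samples.append(value)
          if len(baseline) < self.warmup:
              return False                     # still learning the baseline
          mu, sigma = mean(baseline), pstdev(baseline)
          z = 0.0 if sigma == 0 else abs(value - mu) / sigma
          self.streak = self.streak + 1 if z >= self.z_threshold else 0
          if self.streak >= self.consecutive and now - self.last_alert >= self.cooldown_s:
              self.last_alert = now
              return True                      # sustained deviation, fire anomaly
          return False

Each monitored metric gets its own detector fed once per polling tick, so swapping in something seasonality-aware later wouldn't change this interface much.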
I think the licensing controversy is mostly behind us (the change was back in 2024) and most people have moved on, but monitoring and observability are the gaps people keep bringing up over and over.