I'm Kristiyan, former Engineering Manager for Redis' Visual Developer Tools (including Redis Insight). I built BetterDB because Valkey is growing fast but lacks proper observability tooling.
BetterDB is a monitoring platform for Valkey (and Redis) that focuses on what existing tools miss:
Historical persistence – Slowlog entries disappear when the buffer fills. BetterDB persists them so you can see what queries were running at 3am, which clients were connected, and what anomalies were detected — not just current state.
Pattern analysis – Stop scrolling through raw slowlog entries. BetterDB aggregates them and shows you "HGETALL user:* is 80% of your slow queries" — actionable insights, not raw data.
COMMANDLOG support – Valkey 8.1 introduced COMMANDLOG for tracking large requests/replies, not just slow ones. That 50MB MSET that's killing your network? Now you'll see it. BetterDB is the first monitoring tool to support it.
Anomaly detection – Automatic baseline learning with Z-score analysis across 15+ metrics. Know when something's off before your users do.
Prometheus-native – 99 metrics exposed at /prometheus/metrics. No new dashboards to learn: plug into your existing Grafana/Datadog setup and get Valkey-specific data you can't get elsewhere (a sample scrape config is sketched after this list).
Cluster-aware – Automatic node discovery, topology visualization, per-slot metrics, and aggregated slowlogs across all nodes.
ACL audit trail – Track who accessed what, when. ACL denied events by reason and user, persisted for compliance and debugging.
Memory & Latency Doctor – Built-in diagnostics that tell you what's wrong, not just that something is wrong.
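For the Prometheus endpoint mentioned above, a minimal scrape config looks something like this (job name and target are placeholders; point the target at wherever BetterDB is running):

  scrape_configs:
    - job_name: 'betterdb'
      metrics_path: /prometheus/metrics
      static_configs:
        - targets: ['betterdb-host:3001']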
The core is MIT licensed. Pro features (key analytics, AI assistant) live in a separate proprietary/ directory under a source-available license. During the beta, the BETA-TEST license key unlocks everything for free.
Website: https://betterdb.com
GitHub: https://github.com/BetterDB-inc/monitor
Release notes: https://github.com/BetterDB-inc/monitor/releases
Docs: https://docs.betterdb.com
Quick start:
docker pull betterdb/monitor:latest
docker run -d -p 3001:3001 -e DB_HOST=your-valkey-host -e BETTERDB_LICENSE_KEY=BETA-TEST betterdb/monitor:latest
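Once the container is up, a quick way to confirm the metrics endpoint is serving (assuming the port mapping from the run command above):

  curl -s http://localhost:3001/prometheus/metrics | head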
All ideas are welcome and all feedback is important — don't be shy! Star the repo if this is useful, open issues for bugs or feature requests, or just drop a comment here. What pain points do you have with your current Valkey/Redis monitoring setup?
incidentiq•2w ago
The pattern analysis ("HGETALL user:* is 80% of your slow queries") is what teams manually do during postmortems - automating that correlation saves real debugging time.
Two questions:
1. How does the Prometheus integration handle high-cardinality key patterns? One of the pain points with Redis metrics is that per-key metrics can explode label cardinality. Are you sampling or aggregating at the pattern level?
2. For the anomaly detection - what's the baseline learning window? Redis workloads can be very bursty (batch jobs, cache warming after deploy), so false positives on "anomaly" can be noisy if the baseline doesn't account for periodic patterns.
Good timing on the Valkey support - with the Redis license change, a lot of teams are evaluating migration and will need tooling that supports both.
kaliades•1w ago
1. Cardinality: We don't do per-key metrics — that's a guaranteed way to blow up Prometheus. All pattern metrics are aggregated at the command pattern level (e.g., HGETALL user:* not HGETALL user:12345). The pattern extraction normalizes keys so you see the shape of your queries, not the individual keys. For cluster slot metrics, we automatically cap at top 100 slots by key count — otherwise you'd get 16,384 slots × 4 metrics = 65k series just from slot stats. The metrics that can grow are client connections by name/user, but those scale with unique client names, not keys. If it becomes an issue, standard Prometheus relabel_configs can aggregate or drop those labels.
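To make that last point concrete, dropping a high-cardinality client label in your scrape job would look roughly like this (the label name client_name here is just for illustration):

  metric_relabel_configs:
    - action: labeldrop
      regex: 'client_name'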
2. Baseline window: We use a rolling circular buffer of 300 samples (5 minutes at 1-second polling). Minimum 30 samples to warm up before detection kicks in. To reduce noise from bursty workloads, we require 3 consecutive samples above threshold before firing, plus a 60-second cooldown between alerts for the same metric. This helps with the "batch job at 2am" scenario — a single spike won't trigger, but sustained deviation will. That said, you're right that periodic patterns (daily batch jobs, cache warming after deploy) aren't explicitly modeled yet. It's on the roadmap — likely as configurable "expected variance windows" or integration with deployment events. Would love to hear what approach would work best for your use case.
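For anyone who wants to picture the mechanics, here's a rough standalone sketch of the scheme described above (rolling window, warm-up, consecutive-sample requirement, per-metric cooldown). The parameter names and exact math are illustrative, not the actual implementation:

  import time
  from collections import deque
  from statistics import mean, pstdev

  class ZScoreDetector:
      # Illustrative sketch only, not BetterDB's real code.
      def __init__(self, window=300, warmup=30, z_threshold=3.0,
                   consecutive=3, cooldown_s=60):
          self.samples = deque(maxlen=window)  # rolling circular buffer
          self.warmup = warmup                 # min samples before detection starts
          self.z_threshold = z_threshold
          self.consecutive = consecutive       # sustained-deviation requirement
          self.cooldown_s = cooldown_s         # per-metric alert cooldown
          self.streak = 0
          self.last_alert = 0.0

      def observe(self, value, now=None):
          now = time.time() if now is None else now
          baseline = list(self.samples)        # baseline excludes the new sample
          self.samples.append(value)
          if len(baseline) < self.warmup:
              return False                     # still learning the baseline
          mu, sigma = mean(baseline), pstdev(baseline)
          z = 0.0 if sigma == 0 else abs(value - mu) / sigma
          self.streak = self.streak + 1 if z >= self.z_threshold else 0
          if self.streak >= self.consecutive and now - self.last_alert >= self.cooldown_s:
              self.last_alert = now
              return True                      # sustained deviation, fire anomaly
          return False

Each monitored metric gets its own detector fed once per polling tick, so swapping in something seasonality-aware later wouldn't change this interface much.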
I think the licensing controversy is mostly behind us (the change was back in 2024) and most people have moved on, but monitoring and observability are the gaps people keep bringing up over and over.