I needed analytics for my side projects. PostHog was overkill for what I wanted (country, origin, UTMs, per-user attribution, entry page, revenue), and its events are immutable, so removing test data means manual SQL filters everywhere.
Plausible had no per-user attribution. DataFast looked perfect, so I installed it behind a proxy. Months later the bill hit $40/m; my whole infra runs at $150/m. I'm not paying ~$500/yr for analytics, but switching meant losing historical data and attribution. So I built it myself.
*Getting the data out.* DataFast has no export option (red flag #1). I wrote a script that paginated every exposed endpoint and transformed the responses into SQL inserts for my DB.
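The script was little more than a pagination loop. A minimal sketch of the idea - the `fetch_page` callback, table name, and quoting helper are my assumptions, not DataFast's real API:

```python
def sql_quote(value):
    # Naive escaping for the sketch; use parameterized inserts in real code.
    return "'" + str(value).replace("'", "''") + "'"

def export_to_sql(fetch_page, table="events"):
    """Paginate an endpoint and turn each page into an INSERT statement.

    fetch_page(page) -> list of row dicts, or [] when exhausted.
    """
    statements = []
    page = 0
    while rows := fetch_page(page):
        cols = sorted(rows[0])
        values = ", ".join(
            "(" + ", ".join(sql_quote(r[c]) for c in cols) + ")" for r in rows
        )
        statements.append(
            f"INSERT INTO {table} ({', '.join(cols)}) VALUES {values};"
        )
        page += 1
    return statements
```

One statement per page keeps the output easy to replay and to diff against a re-export.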
For context: I have a microservices setup (Kafka, Redis, gateway, auth) and a monorepo front-end with shared components. So I just needed the "core" analytics feature.
A weekend in I had an ugly dashboard, services, a DB, and no tracking. The DataFast data turned out to be broken, with missing values. I connected my read-only DB via MCP plus the read-only key from my payment processor and re-attributed everything. Got to ~95% and moved on.
*Backend refactor.* Claude's boilerplate did attribution with direct Postgres calls - one roundtrip per visitor. Built a caching layer: events go to Redis, flushed to Postgres every ~30s. A distributed Redis lock means only one instance flushes at a time (no duplicates, no races). Each flush processes 5,000 records per SQL statement (Postgres parameter limits); failed chunks get re-buffered to Redis with up to 5 retries. ClickHouse would solve this too, but Redis scales fine.
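The flush cycle, sketched. The 5,000-row chunk size and the 5-retry re-buffer match the description above; the lock is redis-py's `set(..., nx=True, ex=...)`; the key name, the `(row, retries)` buffer shape, and the sink callbacks are my assumptions:

```python
import itertools

FLUSH_CHUNK = 5_000   # rows per INSERT, staying under Postgres's 65,535 bind params
MAX_RETRIES = 5

def acquire_flush_lock(redis, key="analytics:flush-lock", ttl=60):
    # SET NX EX: only the instance that wins the lock flushes this window.
    return bool(redis.set(key, "1", nx=True, ex=ttl))

def flush(events, write_chunk, rebuffer):
    """Drain buffered (row, retries) pairs into Postgres in fixed-size chunks.

    write_chunk(rows) raises on failure; each row of a failed chunk goes
    back to Redis via rebuffer(row, retries) until it exceeds MAX_RETRIES.
    """
    it = iter(events)
    while chunk := list(itertools.islice(it, FLUSH_CHUNK)):
        try:
            write_chunk([row for row, _ in chunk])
        except Exception:
            for row, retries in chunk:
                if retries + 1 <= MAX_RETRIES:
                    rebuffer(row, retries + 1)
                # else: dropped after 5 failed attempts
```

The TTL on the lock doubles as crash recovery: if the flushing instance dies, the lock expires and another instance picks up the next window.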
Then extraction. LLM-generated code has no concept of the heap - everything was loaded into memory and iterated over. With 100k+ events that kills the server. I rewrote it with pagination and batched queries, plus a pre-aggregated daily rollup table for historical queries with no filters. The dashboard now feels instant for past date ranges.
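The pagination fix boils down to keyset iteration - stream rows after a cursor instead of loading the whole table. A sketch, with the `fetch_after` query callback as an assumption:

```python
def iter_rows(fetch_after, batch_size=1_000):
    """Stream rows in id order without holding the whole table in memory.

    fetch_after(last_id, limit) -> up to `limit` rows with id > last_id,
    ordered by id; [] when exhausted. In SQL terms roughly:
    SELECT ... WHERE id > %s ORDER BY id LIMIT %s
    """
    last_id = 0
    while rows := fetch_after(last_id, batch_size):
        yield from rows
        last_id = rows[-1]["id"]
```

Unlike OFFSET pagination, the `id > cursor` predicate stays index-friendly no matter how deep into the table you are.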
*Front-end.* DataFast's filter system is unusable, so I ported PostHog's pattern. Their rate limits: the dashboard fires 20 concurrent requests per day of data, and stepping back a day doesn't abort the previous batch, so going 3 days back = 60 requests in flight = rate limited. No AbortSignal in a prod app in 2025 (red flag #2). I batched my FE down to 5 requests with proper aborts on filter changes.
*Bot protection - this is where it got bad.* Running my tracker side-by-side with DataFast, mine recorded 30-50% fewer attributions. I added Arcjet, hit 100k bot requests within days, and disabled it before it bankrupted me.
DataFast has zero bot protection (red flag #3). Datacenter IPs - passed. Null user-agent - passed. A 10x10000 resolution - welcome aboard. I read Arcjet's posts and got to ~96% of bots blocked: filter obvious user-agents and impossible displays, use the MaxMind DB to block datacenter IPs (I briefly blocked my own infra and got 0 attributions, oops), and proxy the real client IP through Cloudflare to my Fly backend.
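The hard filters are simple predicates. A sketch - the token list, screen bounds, and the `datacenter_ranges` set (standing in for the real MaxMind lookup) are illustrative:

```python
BAD_UA_TOKENS = ("bot", "crawler", "spider", "headless", "curl", "python-requests")

def hard_filtered(ua, width, height, ip, datacenter_ranges=frozenset()):
    """Requests that are obviously non-human never reach the DB."""
    if not ua:
        return True                      # null user-agent
    if any(t in ua.lower() for t in BAD_UA_TOKENS):
        return True                      # obvious bot UA
    if width < 240 or height < 240 or width > 8000 or height > 8000:
        return True                      # impossible display (e.g. 10x10000)
    if ip in datacenter_ranges:          # real check: MaxMind ASN / anonymous-IP DB
        return True
    return False
```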
While doing this I checked how DataFast handles IPs and... they don't (red flag #4). Maybe it was my misconfig, but their docs don't say. Either way, all my tracked users were attributed to the nearest Cloudflare CDN node - apparently I take regular trips to Germany from Poland. Most of my DataFast tracking was garbage.
Added behavioral signals - bounces, no engagement + weird screens, weird browser versions - dozens of params combined into a per-session bot score with a "probably bots" toggle. Hard-filtered cases never hit the DB.
The bot scorer is import-aware: DataFast never tracked scroll depth, engagement, or interactions, so imported sessions have zero behavioral data. The scorer detects this and uses a fingerprint-only algorithm instead of penalizing them for data they never had.
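Put together, the scorer looks roughly like this. The weights and field names are illustrative, not the real ones; the point is the fingerprint-only fallback for imported sessions:

```python
def bot_score(session):
    """Per-session bot score in [0, 1]; higher = more likely a bot.

    Imported DataFast sessions carry no behavioral fields (scroll depth,
    engagement, interactions), so the scorer falls back to fingerprint-only
    signals instead of treating missing engagement as suspicious.
    """
    fingerprint = 0.0
    if session.get("ua_suspicious"):
        fingerprint += 0.5
    if session.get("screen_impossible"):
        fingerprint += 0.5

    has_behavior = session.get("scroll_depth") is not None
    if not has_behavior:                 # imported session: fingerprint-only
        return min(fingerprint, 1.0)

    behavior = 0.0
    if session["scroll_depth"] == 0 and not session.get("interactions", 0):
        behavior += 0.4                  # bounced with zero engagement
    if session.get("duration_s", 0) < 1:
        behavior += 0.2                  # sub-second "visit"
    return min(0.6 * fingerprint + behavior, 1.0)
```

The "probably bots" toggle then just hides sessions above a threshold, while hard-filtered cases were already dropped before the DB.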
Stress-tested the backend (it died; I bumped the RAM). The front-end is looking good.
*The savings:* the new microservice costs $25/m, so $39 - $25 = $14/m saved. It took about a month, on and off. Truly genius idea - replace every SaaS and never look back.
Link if curious: https://flowsery.com/