I needed analytics for my side projects. PostHog was overkill for what I wanted (country, origin, UTMs, per-user attribution, entry page, revenue), and its events are immutable, so removing test data means manual SQL filters everywhere.
Plausible had no per-user attribution. DataFast looked perfect, so I installed it behind a proxy. Months later the bill hit $40/m; my whole infra runs at $150/m. I'm not paying ~$500/yr for analytics, but switching meant losing historical data and attribution. So I built it myself.
*Getting the data out.* DataFast has no export option (red flag #1). I wrote a script that paginated every exposed endpoint and transformed the responses into SQL inserts for my DB.
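The script was little more than a pagination loop. A minimal sketch of the idea - the `fetch_page` callback, table name, and quoting helper are my assumptions, not DataFast's real API:

```python
def sql_quote(value):
    # Naive escaping for the sketch; use parameterized inserts in real code.
    return "'" + str(value).replace("'", "''") + "'"

def export_to_sql(fetch_page, table="events"):
    """Paginate an endpoint and turn each page into an INSERT statement.

    fetch_page(page) -> list of row dicts, or [] when exhausted.
    """
    statements = []
    page = 0
    while rows := fetch_page(page):
        cols = sorted(rows[0])
        values = ", ".join(
            "(" + ", ".join(sql_quote(r[c]) for c in cols) + ")" for r in rows
        )
        statements.append(
            f"INSERT INTO {table} ({', '.join(cols)}) VALUES {values};"
        )
        page += 1
    return statements
```

One statement per page keeps the output easy to replay and to diff against a re-export.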
For context: I have a microservices setup (Kafka, Redis, gateway, auth) and a monorepo front-end with shared components. So I just needed the "core" analytics feature.
A weekend in I had an ugly dashboard, services, a DB, and no tracking. The DataFast data turned out to be broken, with missing values. I connected my read-only DB via MCP plus the read-only key from my payment processor and re-attributed everything. Got to ~95% and moved on.
*Backend refactor.* Claude's boilerplate did attribution with direct Postgres calls - one roundtrip per visitor. Built a caching layer: events go to Redis, flushed to Postgres every ~30s. A distributed Redis lock means only one instance flushes at a time (no duplicates, no races). Each flush processes 5,000 records per SQL statement (Postgres parameter limits); failed chunks get re-buffered to Redis with up to 5 retries. ClickHouse would solve this too, but Redis scales fine.
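The flush cycle, sketched. The 5,000-row chunk size and the 5-retry re-buffer match the description above; the lock is redis-py's `set(..., nx=True, ex=...)`; the key name, the `(row, retries)` buffer shape, and the sink callbacks are my assumptions:

```python
import itertools

FLUSH_CHUNK = 5_000   # rows per INSERT, staying under Postgres's 65,535 bind params
MAX_RETRIES = 5

def acquire_flush_lock(redis, key="analytics:flush-lock", ttl=60):
    # SET NX EX: only the instance that wins the lock flushes this window.
    return bool(redis.set(key, "1", nx=True, ex=ttl))

def flush(events, write_chunk, rebuffer):
    """Drain buffered (row, retries) pairs into Postgres in fixed-size chunks.

    write_chunk(rows) raises on failure; each row of a failed chunk goes
    back to Redis via rebuffer(row, retries) until it exceeds MAX_RETRIES.
    """
    it = iter(events)
    while chunk := list(itertools.islice(it, FLUSH_CHUNK)):
        try:
            write_chunk([row for row, _ in chunk])
        except Exception:
            for row, retries in chunk:
                if retries + 1 <= MAX_RETRIES:
                    rebuffer(row, retries + 1)
                # else: dropped after 5 failed attempts
```

The TTL on the lock doubles as crash recovery: if the flushing instance dies, the lock expires and another instance picks up the next window.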
Then extraction. LLM-generated code has no concept of the heap - everything was loaded into memory and iterated over. With 100k+ events that kills the server. I rewrote it with pagination and batched queries, plus a pre-aggregated daily rollup table for historical queries with no filters. The dashboard now feels instant for past date ranges.
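The pagination fix boils down to keyset iteration - stream rows after a cursor instead of loading the whole table. A sketch, with the `fetch_after` query callback as an assumption:

```python
def iter_rows(fetch_after, batch_size=1_000):
    """Stream rows in id order without holding the whole table in memory.

    fetch_after(last_id, limit) -> up to `limit` rows with id > last_id,
    ordered by id; [] when exhausted. In SQL terms roughly:
    SELECT ... WHERE id > %s ORDER BY id LIMIT %s
    """
    last_id = 0
    while rows := fetch_after(last_id, batch_size):
        yield from rows
        last_id = rows[-1]["id"]
```

Unlike OFFSET pagination, the `id > cursor` predicate stays index-friendly no matter how deep into the table you are.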
*Front-end.* DataFast's filter system is unusable, so I ported PostHog's pattern. Their rate limits: the dashboard fires 20 concurrent requests per day of data, and stepping back a day doesn't abort the previous batch, so going 3 days back = 60 requests in flight = rate limited. No AbortSignal in a prod app in 2025 (red flag #2). I batched my FE down to 5 requests with proper aborts on filter changes.
*Bot protection - this is where it got bad.* Running my tracker side-by-side with DataFast, mine recorded 30-50% fewer attributions. I added Arcjet, hit 100k bot requests within days, and disabled it before it bankrupted me.
DataFast has zero bot protection (red flag #3). Datacenter IPs - passed. Null user-agent - passed. A 10x10000 resolution - welcome aboard. I read Arcjet's posts and got to ~96% of bots blocked: filter obvious user-agents and impossible displays, use the MaxMind DB to block datacenter IPs (I briefly blocked my own infra and got 0 attributions, oops), and proxy the real client IP through Cloudflare to my Fly backend.
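The hard filters are simple predicates. A sketch - the token list, screen bounds, and the `datacenter_ranges` set (standing in for the real MaxMind lookup) are illustrative:

```python
BAD_UA_TOKENS = ("bot", "crawler", "spider", "headless", "curl", "python-requests")

def hard_filtered(ua, width, height, ip, datacenter_ranges=frozenset()):
    """Requests that are obviously non-human never reach the DB."""
    if not ua:
        return True                      # null user-agent
    if any(t in ua.lower() for t in BAD_UA_TOKENS):
        return True                      # obvious bot UA
    if width < 240 or height < 240 or width > 8000 or height > 8000:
        return True                      # impossible display (e.g. 10x10000)
    if ip in datacenter_ranges:          # real check: MaxMind ASN / anonymous-IP DB
        return True
    return False
```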
While doing this I checked how DataFast handles IPs and... they don't (red flag #4). Maybe it was my misconfig, but their docs don't say. Either way, all my tracked users were attributed to the nearest Cloudflare CDN node - apparently I take regular trips to Germany from Poland. Most of my DataFast tracking was garbage.
Added behavioral signals - bounces, no engagement + weird screens, weird browser versions - dozens of params combined into a per-session bot score with a "probably bots" toggle. Hard-filtered cases never hit the DB.
The bot scorer is import-aware: DataFast never tracked scroll depth, engagement, or interactions, so imported sessions have zero behavioral data. The scorer detects this and uses a fingerprint-only algorithm instead of penalizing them for data they never had.
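Put together, the scorer looks roughly like this. The weights and field names are illustrative, not the real ones; the point is the fingerprint-only fallback for imported sessions:

```python
def bot_score(session):
    """Per-session bot score in [0, 1]; higher = more likely a bot.

    Imported DataFast sessions carry no behavioral fields (scroll depth,
    engagement, interactions), so the scorer falls back to fingerprint-only
    signals instead of treating missing engagement as suspicious.
    """
    fingerprint = 0.0
    if session.get("ua_suspicious"):
        fingerprint += 0.5
    if session.get("screen_impossible"):
        fingerprint += 0.5

    has_behavior = session.get("scroll_depth") is not None
    if not has_behavior:                 # imported session: fingerprint-only
        return min(fingerprint, 1.0)

    behavior = 0.0
    if session["scroll_depth"] == 0 and not session.get("interactions", 0):
        behavior += 0.4                  # bounced with zero engagement
    if session.get("duration_s", 0) < 1:
        behavior += 0.2                  # sub-second "visit"
    return min(0.6 * fingerprint + behavior, 1.0)
```

The "probably bots" toggle then just hides sessions above a threshold, while hard-filtered cases were already dropped before the DB.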
Stress-tested the backend (it died; I bumped the RAM). The front-end is looking good.
*The savings:* the new microservice costs $25/m, so $39 - $25 = $14/m saved. It took about a month, on and off. Truly genius idea - replace every SaaS and never look back.
Link if curious: https://flowsery.com/