Datadog's $65M/year customer mystery solved

https://blog.pragmaticengineer.com/datadog-65m-year-customer-mystery/

152•thunderbong•7mo ago

Comments

mrkramer•7mo ago

And who says that SaaS doesn't pay off?! It pays off like hell!

delichon•7mo ago

> For observability, Coinbase spun up a dedicated team with the goal of moving off of Datadog, and onto a Grafana/Prometheus/Clickhouse stack.

We recently did the same, and our Datadog bill was only five figures. We're finding the new stack to not be a poor man's anything, but more flexible, complete and manageable than yet another SaaS. With just a little extra learning curve, observability is a domain where open source trounces proprietary, and not just if you don't have money to set on fire.

oulipo•7mo ago

Have you tried the ClickStack? https://news.ycombinator.com/item?id=44194082

asnyder•7mo ago

There's also https://openobserve.ai, which, while not as stable as Grafana/Prometheus/Clickhouse, feels a bit easier to set up and manage. It has a ways to go, but it does the basics and more without issue.

Crazy that they spent so much on observability. Even with Datadog they could've optimized that spend. Datadog does lots of bad things with billing: by default, especially with on-demand instances, you get charged significantly more than you should, because they have (had?) pretty deficient counting of instance hours and instances.

For example, rather than run the agent (which counts as an instance regardless of if it's on for a minute), you can send the logs, metrics, etc. directly to their ingestion endpoints and not have those instances counted towards their usage other than log and metric usage.

Maybe at that level they don't even get into actual usage-based billing anymore, and they just negotiate arbitrary amounts for some absurd quota of use.

ljm•7mo ago

I wonder how much that no-expense-spared, money-is-no-object attitude to buying SaaS impacts an engineer's ability to make sensible decisions around infra and architecture. Coinbase might have been fine blowing 65 mil but take that approach to a new startup and you could trivially eat up a significant amount of runway with it.

I won’t single out Datadog on this because the exact same thing happens with cloud spend, and it’s very literally burning money.

swyx•7mo ago

the visible cost of burning runway on a bill is very often far less than the invisible cost of burning engineer time rebuilding undifferentiated heavy lifting rather than working on product/customer needs

9283409232•7mo ago

People say this but I wonder about this from time to time. I don't think anyone is asking to rebuild datadog from scratch for your company but surely it's worth it to migrate to something not as expensive even if it takes a bit of elbow grease.

closeparen•7mo ago

Assuming there's nothing else you could do with that elbow grease that would create more value than the SaaS bill costs.

9283409232•7mo ago

Value is not a hard science. I've seen people shelve tech debt in favor of work on a feature that no one ends up using.

nemothekid•7mo ago

1. Leadership doesn’t want to burn engineer cycles on undifferentiated features.

2. Management doesn’t get recognized for working on undifferentiated features.

3. Engineers working on undifferentiated features aren’t recognized when looking for new jobs.

Saving money “makes” sense but getting people to actually prioritize it is hard.

uaas•7mo ago

Well, saving money is a differentiator, and one of the best things an engineer can put on their CVs.

pphysch•7mo ago

Most of the complexity in observability is clientside.

It is not hard to spin up Grafana and VictoriaMetrics (and now VictoriaLogs) and keep them running. It is not hard to build a Grafana dashboard that correlates data across both metrics and logs sources, and alerting functionality is pretty good now.

The "heavy lift" is instrumenting your applications and infrastructure to provide valuable metrics and logs without exceeding a performance budget. I'm skeptical that Datadog actually does much of that heavy-lifting and that they are actually worth the money. You can probably save 10x with same/better outcomes by paying for managed Grafana + managed DBs and a couple FTEs as observability experts.

lerchmo•7mo ago

You could hire 100 people to manage your timeseries data and save 70%

ljm•7mo ago

I used to be quite fond of Datadog but after one or two completely surprising bills (thanks to their granular but unintuitive pricing model), I wouldn't recommend it to anybody any more. If I were more cynical I would say the pricing model is designed to be confusing so customers spend more than they need, and this is only made worse by the extreme breadth of the platform now.

These days I'd suggest to just suck it up, spin up a Grafana box, and wire up OpenTelemetry.

QuinnyPig•7mo ago

This is very well stated.

swyx•7mo ago

haha thanks Corey, i echo the best (you)

wavemode•7mo ago

I wouldn't really say "very often". Occasionally, perhaps.

Even from a pure zero-sum mathematical perspective, it can make sense to invest as much as 2 or 3 months of engineer time in cloud cost-saving measures. If the engineer is making $200K, that's a $30,000 to $50,000 investment. When you see the eye-watering cloud bills many startups have, you realize that investment is peanuts in comparison to the potential savings over the next several years.

And then you also have to keep in mind that these things are usually not actually zero-sum. The engineer could be new, and working on the efficiency project helps them onboard to your stack. It could be the case that customers are complaining (or could start complaining in the future) about how slow your product is, so you actually improve the product by improving the infrastructure. Or it could just be the very common case that there isn't actually a higher-value thing for that engineer to be working on at that time.
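
The payback arithmetic in the comment above can be sketched in a few lines. This is only an illustration: the $200K salary and the 2 to 3 month range come from the comment, while the $10,000/month savings figure is a made-up placeholder, and salary overhead is ignored.

```python
def engineer_cost(annual_salary: float, months: float) -> float:
    """Salary cost of `months` of one engineer's time (overhead ignored)."""
    return annual_salary * months / 12


def payback_months(investment: float, monthly_savings: float) -> float:
    """Months of savings needed to recover the up-front investment."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return investment / monthly_savings


# Figures from the comment: a $200K engineer for 2 or 3 months.
low = engineer_cost(200_000, 2)   # roughly $33K
high = engineer_cost(200_000, 3)  # $50K

# Hypothetical assumption: the project trims $10,000/month off the cloud bill.
months_to_break_even = payback_months(high, 10_000)
print(f"investment: ${low:,.0f} to ${high:,.0f}; "
      f"break-even after {months_to_break_even:.1f} months")
```

Plugging in a real monthly saving estimate is the whole exercise; the larger the bill, the shorter the break-even period.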

happymellon•7mo ago

> It could be the case that customers are complaining (or could start complaining in the future) about how slow your product is

If Jira has taught me anything, it's that ignoring customers when they complain it's too slow makes financial sense.

viccis•7mo ago

>I wonder how much that no-expense-spared, money-is-no-object attitude to buying SaaS impacts an engineer's ability to make sensible decisions around infra and architecture

I saw this a lot at a previous company. Being able to just "have more Lambdas scale up to handle it" got some very mediocre engineers past challenges they encountered. But it did so at the cost of wasting VAST amounts of money and saddling themselves with tech debt that completely hobbled the company's ability to scale.

It was very frustrating to be too junior to be able to change minds. Even basic things like "I know it worked for you with old on-prem NFS designs but we shouldn't be storing our data in 100 KB files in S3 and firing off thousands of Lambda invocations to process workloads, we should be storing it in 100 MB files and using industry-leading ETL frameworks on it". They were old school guys who hadn't adjusted to best practices for object storage and modern large scale data loads (this was a 1M event per second system) and so the company never really succeeded despite thousands of customers and loads of revenue.

I consider cost consideration and profiling to be an essential skill that any engineer working in cloud style environments should have, but it's especially important that a staff engineer or person in a similar position have this skill set and be ready to grill people who come up with wasteful solutions.

nasmorn•7mo ago

It is also not a very hard skill. You do a back of the envelope calculation and if your proposed architecture is crazy expensive for your reasonable load, then you have to figure out if you are a special snowflake or just doing it wrong.

viccis•7mo ago

This is correct. It's really more of a mindset than anything. You take a guess at how much something will cost based on a quick calculation (good cloud providers make this easy; some, *cough* Databricks *cough*, just use a black box and bill you whatever they feel like) and then once you test it at a small scale, you verify that it's as expected and continue to monitor.

happymellon•7mo ago

What's also frustrating is that a lot of times, costs are hidden from engineering.

I don't know if I would call them mediocre, but without a feedback loop it's hard to get engineers to agree whether it's worth time reviewing the code to make it faster compared to just making the db one size larger.

viccis•7mo ago

>costs are hidden from engineering

Yeah one of my big pet peeves was when engineering teams build platforms to run things on that obscure the cost. There have been times where they said "hey we made this big platform for analytics, just ship your stuff as configuration changes and it's deployed!" Then when I did it with very simple small cases, some unoptimized stuff on their end (a lot of what I talked about before) resulted in runaway costs that they, of course, tagged to my team.

Ultimately, you can only control what's in your scope and anything else you will need to hope that management can take that runaway cost feedback and make the correct team optimize it away.

>I don't know if I would call them mediocre, but without a feedback loop it's hard to get engineers to agree whether it's worth time reviewing the code to make it faster compared to just making the db one size larger.

This started in the mid 2010s, by which point they should have understood that you don't put terabytes of data into S3 in 100 KB files. And if not, they should have been willing to take some very simple steps to address it (literally just bundling them all in 100 MB files with an index file containing the byte offsets of the individual ones would have solved a lot of their problems). There was a feedback loop. There just happened to be big egos more interested in their next fun project of reinventing another solution to another solved problem. I learned there that engineering-driven companies sometimes wind up in situations in which the staff engineers love fun new database and infrastructure projects like that more than they enjoy improving their existing product.
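
The bundling idea described above (many small files packed into one large object, plus an index of byte offsets) can be sketched roughly as follows. This is a minimal in-memory illustration, not the commenter's actual system: `build_bundle` and `read_record` are hypothetical names, and against real object storage the slice in `read_record` would instead be a byte-range read of the bundle object.

```python
import io
import json


def build_bundle(records: dict[str, bytes]) -> tuple[bytes, dict[str, list[int]]]:
    """Concatenate small records into one blob; index maps name -> [offset, length]."""
    buf = io.BytesIO()
    index: dict[str, list[int]] = {}
    for name, payload in records.items():
        index[name] = [buf.tell(), len(payload)]  # where this record starts, and its size
        buf.write(payload)
    return buf.getvalue(), index


def read_record(bundle: bytes, index: dict[str, list[int]], name: str) -> bytes:
    """Fetch one record using only its byte offset and length."""
    offset, length = index[name]
    return bundle[offset:offset + length]


# Two tiny stand-in records; a real bundle would hold thousands of them.
records = {"a.json": b'{"id": 1}', "b.json": b'{"id": 2, "x": true}'}
bundle, index = build_bundle(records)
index_file = json.dumps(index)  # the small index stored alongside the bundle

print(read_record(bundle, json.loads(index_file), "b.json"))
```

The point of the design is that one large object replaces thousands of small ones, while individual records stay addressable through the index.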

happymellon•7mo ago

Yeah, that sounds terrible.

JohnMakin•7mo ago

> Coinbase might have been fine blowing 65 mil but take that approach to a new startup and you could trivially eat up a significant amount of runway with it.

Most startups are not going to have anywhere near the scale to generate anything approaching this bill.

> I won’t single out Datadog on this because the exact same thing happens with cloud spend, and it’s very literally burning money.

Unless you're in the business of deploying and maintaining production-ready datacenters at scale, it very literally isn't.

closeparen•7mo ago

That's the point of usage-based pricing: it's cheap to adopt when you're small.

abxyz•7mo ago

(May 2023)

everfrustrated•7mo ago

>Originally published on 11 May 2023

cybice•7mo ago

An article that's basically an ad for Datadog: Pay us a ton of money - it’s still cheaper in the long run.

decimalenough•7mo ago

> Assume that Datadog cuts the number of outages by half, by preventing them with early monitoring. That would mean that without Datadog, we’d look at 24 hours’ worth of downtime, not 12. Let’s also assume that using Datadog results in mitigating outages 50% faster than without - thanks to being able to connect health metrics with logs, debug faster, pinpoint the root cause and mitigate faster. In that case, without Datadog, we could be looking at 36 hours worth of total downtime, versus the 12 hours with Datadog. To put it in numbers: the company would make around $9M in revenue it would otherwise lose, Now that $10M/year fee practically pays for itself!

Those are some pretty heroic assumptions. In particular, they assume the only options are Datadog or nothing, when there are far cheaper alternatives like the Prometheus/Grafana/Clickhouse stack mentioned in the article itself.
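
For reference, the quoted arithmetic can be reproduced as a short sketch. The 12-hour baseline, the halving of outages, the 36-hour total, the $9M revenue figure and the $10M fee all come from the quote; treating "50% faster" as a 1.5x multiplier is an inference from the article's 24 to 36 hour step, and attributing the full $9M to the 24 avoided hours is an assumption.

```python
def downtime_without_tool(hours_with: float,
                          prevention_factor: float = 2.0,
                          mitigation_factor: float = 1.5) -> float:
    """Downtime under the quote's assumptions: twice the outages, each 1.5x longer."""
    return hours_with * prevention_factor * mitigation_factor


hours_with = 12
hours_without = downtime_without_tool(hours_with)  # 12 * 2 * 1.5
hours_saved = hours_without - hours_with

revenue_protected = 9_000_000  # from the quote
annual_fee = 10_000_000        # from the quote

# Assumption: the $9M corresponds to the avoided downtime hours.
implied_revenue_per_hour = revenue_protected / hours_saved
net_benefit = revenue_protected - annual_fee

print(hours_without, hours_saved, implied_revenue_per_hour, net_benefit)
```

Changing either factor, or comparing against a cheaper tool rather than no tool at all, shifts the result considerably, which is the objection raised above.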

passivepinetree•7mo ago

Another assumption that bothers me here is that the $9M in revenue would be completely lost during an outage. I imagine many customers would simply wait until the outage was resolved before performing their intended transactions, meaning far less than $9M would be lost.

calt•7mo ago

On the other hand, customers can become frustrated at being unable to trade when they need to during an outage and go to a competitor.

secondcoming•7mo ago

We are moving from Datadog to Prometheus/Grafana and it's really not all a bed of roses. You'll need monitoring on your monitoring.

disgruntledphd2•7mo ago

Ideally you want two independent monitors for critical things.

hagen1778•7mo ago

Ofc you need to monitor your monitoring, because you run it. Datadog runs their own systems and monitors them, that's why they charge you so much. I can barely imagine a critical piece of software that I need to run and not monitor at the same time.

vjvjvjvjghv•7mo ago

I bet they would get much better results if they spent a fraction of the money on better understanding their systems and designing them better, rather than spending millions on Datadog

cloudking•7mo ago

What problems does Datadog solve that you can't solve with cheaper solutions?

remify•7mo ago

It's honestly easy to use and implement. It's a try-it-and-adopt-it product; that's even one of their sales tactics.

xp84•7mo ago

In my career, I have found that this is not usually the question to ask. I am very confident that I could put together and support a solution that does everything I need Datadog to do and has direct costs far, far lower. However, this would also consume a noticeable fraction of my time, both in predictable ways (maintenance, feature adds that I decide I need), and in unpredictable ways (oh no it’s down).

I believe a much more useful question to ask is just “is this the highest and best use of my finite attention and time?” It is much easier to find $100,000 a year of budget than it is to find an additional $50,000 worth of skilled[1] developer time.

[1] This skilled part is critical because if you have some flunky create your “SaaS alternative” you are in for an even worse time.

therein•7mo ago

I should have known it was Coinbase. I know that Coinbase used to spend $35,000 a month to back up the data directory of ETH nodes.

mrkramer•7mo ago

>I know that Coinbase used to spend $35,000 a month to back up the data directory of ETH nodes.

Whom did they pay for backups? Who was the vendor? I'm interested.

therein•7mo ago

Amazon. They put it on an EBS volume and kept it around.

mrkramer•7mo ago

Ah ok, I thought it was some startup.

xp84•7mo ago

Of all the things Coinbase could spend money on, backups may be the smartest! Imagine if they lost a few billion dollars worth of crypto to a faulty SSD or something!

aeyes•7mo ago

> we really work with customers to restructure their contracts

Does anyone have such an experience with Datadog? A few million wasn't enough to get them to talk about anything; we always paid list price and there was no negotiating either when they restructured their pricing.

arccy•7mo ago

you've got bad negotiators... getting at least 10% off list price should be the baseline, even on less than $1m/year

evulhotdog•7mo ago

They were completely unwilling to negotiate with us at all, and it forced our hand to go other open source routes so we don’t get locked in again.

GuinansEyebrows•7mo ago

> To put it in numbers: the company would make around $9M in revenue it would otherwise lose, Now that $10M/year fee practically pays for itself!

am i misunderstanding, or is the author saying it's better to spend $10m than $9m?

areyourllySorry•7mo ago

you spend that extra million to keep customers satisfied in a competitive industry. they have users trading hundreds of thousands - if there's downtime and they lose money because they weren't able to sell their positions at the right time, they might even try to sue, who knows

gneray•7mo ago

This person is like the Gossip Guy of tech. Who cares?

generalpf•7mo ago

When did this guy stop writing about engineering and start running a tech gossip rag?

willejs•7mo ago

I have run ELK, Grafana + Prom, Grafana + Thanos/Cortex, New Relic and all of the more traditional products for monitoring/observability. More recently, in the last few years, I have been running full observability stacks via either the Grafana LGTM stack or Datadog at reasonable scale and complexity. Ultimately you want one tool that can alert you off a metric, present you some traces, and drill down into logs, all the way down the stack.

I have found Datadog to have, hands down, the best developer experience from the get-go; the way it glues its mostly decent products together is unparalleled compared to other products (Grafana Cloud/LGTM). I usually say that if you're at a small to medium scale business, it just makes sense, IF you understand the product and configure it correctly, which is reasonably easy. The seamless integration between tracing, logging and metrics in the platform, which you can then easily combine with alerts, is great. However, it's easy to misconfigure it and spend a lot of money on seemingly nothing. If you do not implement tracing and structured logs (at the right volume and level) with trace/span ids etc. all the way through your services, it's hard to see the value, and it seems expensive. It requires some good knowledge and configuration of the product to make it pay off. The rest of the product features are generally good; for example, their security suite is a good entry-level option for cloud security monitoring and SIEM too.

However, when you get to a certain scale, the cost of APM and Infrastructure hosts in Datadog can become somewhat prohibitive. Also, Datadog's custom metrics pricing is somewhat expensive, and its query language capabilities do not quite match the power of PromQL, and you start to find yourself needing them to debug issues. At that point, the self-hosted LGTM stack starts to make sense; however, it involves a lot more education for end users in both integration (a little less now that OTel is popular) and querying/building dashboards etc., as well as running it yourself. The Grafana Cloud platform is more attractive, though.
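
As a rough illustration of what "structured logs with trace/span ids" can look like, here is a minimal sketch using only the Python standard library. The field names `trace_id` and `span_id` and the `new_ids` helper are assumptions for the example; in practice a tracing library would supply the ids, and each backend has its own naming conventions for correlating logs with traces.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, including correlation ids if present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not provide them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def new_ids() -> dict[str, str]:
    """Hypothetical id generator standing in for a real tracer."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16]}


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

ids = new_ids()
logger.info("payment authorized", extra=ids)  # emits one JSON line carrying both ids
```

Once every service emits the same ids on its log lines, a backend can join a log entry to the trace that produced it, which is the integration the comment above credits for most of the value.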

SOLAR_FIELDS•7mo ago

My experience mirrors yours wrt Datadog. It's incredible value at low scale, you get a full robust system with great devex for pennies. Once you hit that tipping point though, you are locked in pretty hardcore. Datadog snakes its way far into your codebase, with all the custom tracing and stuff like that. Migrating off of it is a very expensive endeavor, which is probably one of the reasons why they are such a money printing operation.

willejs•7mo ago

Yeah, the secret sauce of the dd libs was/is addictive for sure! I think it's perhaps better now that you can just use OTel for custom traces and OTel contrib libs for auto instrumentation and send that to the dd agent? I have not yet tried it because I suspected labels and other things might be named differently than the DD auto instrumentation/contrib packages, but I don't think the gap is as big now?

mbesto•7mo ago

I think "medium scale" is probably more appropriate. For a $3M~$5M revenue SaaS you're still paying $50k+/year. That's not nothing for a small owner or PE backed SaaS company that is focused on profits/EBITDA.

wenbin•7mo ago

Earlier this year, we at Listen Notes switched to Better Stack [0], replacing both Datadog and PagerDuty, and we couldn’t be happier :) Datadog offers a rich set of features, and as a public company, it makes sense for them to keep expanding their product and pushing larger contracts. But as a small team, we don't have a strong demand for constant new features. By switching to Better Stack, we were able to cut our monitoring and alerting costs by 90%, with basically the same things that we used from Datadog previously.

[0] https://www.listennotes.com/blog/use-betterstack-to-replace-...

iwovtb•7mo ago

Observability spend is super expensive. Realizing this, we built grepr.ai.... reduces 96% spend without change. Check it out... Not just a dumb pipeline either.

iwovtb•7mo ago

Observability is expensive. Rip/replace is hard... we built grepr.ai to solve this problem and are seeing 96% reduction in spend from Splunk/Sumo/New Relic/Datadog etc. No change mgt. The result set is: 96% reduction in spend and noise elimination. Pretty compelling.. come check us out. www.grepr.ai

MehdiHK•7mo ago

No pricing page? Also how does it compare to Cribl?

hagen1778•7mo ago

My understanding is that with Prometheus+Grafana, and the rest of their stack, you can achieve the same functionality as Datadog (or even more) at much lower costs. But, it requires engineering time to set up these tools, monitor them, build dashboards and alerts. Build an observability platform at home, in other words.

But what about other open source solutions that are already trying very hard to become an out-of-the-box solution for observability? Things like Netdata, Hyperdx, Coroot, etc. are already platforms for all telemetry signals, with fancy UIs and a lot of presets. Why don't people use them instead of Datadog?

Cpoll•7mo ago

> or even more

Grafana isn't quite as featureful as Datadog, though there's nothing to keep you from getting the job done.

> But, it requires engineering time to set up these tools

At some price point, you have to wonder if it doesn't make more sense to hire engineers to get it just right for your use case. I'd bet that price point is less than $65MM. Hell, you could have people full-time on Grafana to add features you want.

east4ming•7mo ago

Think the other way around, engineers are also a cost, and investment in open-source software is also a cost. If engineers are very cheap, then the company would choose to build it themselves, right?
