"A+" "B" "C-" "F", etc. feel a lot more intuitive than how stars are used.
I used to rate three stars for "performs as expected" until I realized it was punishing good products. Switching to A-F would produce the same behavior, except it'd be Uber drivers trying to make a living instead of noxious parents declaring that their kid deserves an A.
In US education you are taught that you need to get an A. Anything below a C gets you on the equivalent of a “Performance Improvement Plan” in the corporate world. And B is… well… B.
So with that rating ingrained, people would probably feel bad about rating their ride-share driver a C when they did what was expected. And it wouldn’t stop companies from pushing for A ratings.
Even elsewhere like the food industry where they do have letter ratings, A is the norm with anything lower being an outlier.
Perhaps for this to work, it would need a complete systemic shift where C truly is the average and A and F are the outliers. In school C would need to be “did the student do the assignment.” And A would need to be “the student did the assignment, and then some.”
Consider, for example, the "S" grade ranked above "A", which originated in Japan but is widely applied in gaming.
I wonder if companies are afraid of being accused of "cooking the books", especially in contexts where the individual ratings are visible.
If I saw a product with 3x 5-star reviews and 1x 3-star review, I'd be suspicious if the overall rating was still a perfect 5 stars.
You would start by estimating each driver's rating as the average of their ratings, and then estimate the bias of each rider by comparing the average rating they give to the estimated score of their drivers. Then you repeat the process iteratively until you see both scores (driver rating and rider bias) converge.
[0] https://en.wikipedia.org/wiki/Expectation%E2%80%93maximizati...
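A toy version of that iteration might look like this (the data and names are made up for illustration, not any real platform's API):

```python
# Toy sketch of the alternating estimation described above.
ratings = [  # (rider, driver, stars)
    ("alice", "d1", 5), ("alice", "d2", 5),   # alice rates generously
    ("bob",   "d1", 3), ("bob",   "d2", 4),   # bob rates harshly
]

driver_score = {d: 0.0 for _, d, _ in ratings}
rider_bias   = {r: 0.0 for r, _, _ in ratings}

for _ in range(50):  # iterate until both estimates stabilize
    # Update each driver's score from bias-corrected ratings
    for d in driver_score:
        xs = [s - rider_bias[r] for r, dd, s in ratings if dd == d]
        driver_score[d] = sum(xs) / len(xs)
    # Update each rider's bias: how far above/below the estimated
    # driver scores their ratings sit on average
    for r in rider_bias:
        gaps = [s - driver_score[dd] for rr, dd, s in ratings if rr == r]
        rider_bias[r] = sum(gaps) / len(gaps)
```

On this toy data it separates alice's positive bias from bob's negative one, and d2 ends up ranked above d1 even though both got one 5-star rating.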
Alternatively, there might be some hidden reason why a broken rating system is better than a good one, but if so I don't know it.
Anything really bad can be dealt with via a complaint system.
Anything exceptional could be captured by a free text field when giving a tip.
Who is going to read all those text fields and classify them? AI!
The big rating problem I have is with sites like boardgamegeek, where ratings are treated by different people as either an objective rating of how good the game is within its category or a subjective rating of how much they like (or approve of) the game. They're two very different things, and it makes the ratings much less useful than they could be.
They also suffer a similar problem in that most games score 7 out of 10. 8 is exceptional, 6 is bad, and 5 is disastrous.
2 and 4 are irrelevant: either a wild guess or something user-specific.
Most of the time our rating systems devolve into roughly this state anyways.
E.g.
5 is excellent, 4.x is fine, <4 is problematic.
And then there's a sub-domain of the area between 4 and 5, where a 4.1 is questionable, 4.5 is fine, and 4.7+ is excellent.
In the end, it's just 3 parts nested within 3 parts nested within 3 parts nested within....
Let's just do 3 stars (no decimal) and call it a day
The trick is collecting enough ratings to average out the underlying issues and keeping context. I.e., you want rankings relative to the area, but also on some kind of absolute scale, and also relative to the price point, etc.
A reviewer might round up a 7/10 to a 3 as it’s better than average, while someone else might round down an 8/10 because it’s not at that top tier. Both systems are equally useful with 1 or 10,000 reviews, but I’m not convinced they are equivalent with, say, 10 reviews.
Also, most restaurants that stick around are pretty good but you get some amazingly bad restaurants that soon fail. It’s worth separating overpriced from stay the fuck away.
However, the rounding issue is a big deal both in how people rate stuff and how they interpret the scores to the point where small numbers of responses become very arbitrary.
It doesn't mitigate the effect; the combination of the effect on rating and on interpretation is the source of the issue, which exists whenever the review reader isn't at the cultural midpoint of the raters.
Obviously. Yet when looking at a composite score the scale of the mismatch isn’t total, so the effect is being mitigated.
Further, even without that, the more consistent the cultural mix, the more consistent the ratings. Anyone can understand a consistent system.
Has anyone seen a live system (Uber, Goodreads, etc.) implement per-user z-score normalization?
"Here's your last 5 drivers, please rank them"
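The per-user z-score normalization asked about above could be sketched like this (a minimal sketch, not any live system's implementation):

```python
from statistics import mean, stdev

def normalize_user_ratings(ratings):
    """Map one user's raw stars to z-scores, so a grumpy rater's 3
    and a generous rater's 5 become comparable. Needs >= 2 ratings."""
    mu = mean(ratings)
    sigma = stdev(ratings) or 1.0  # avoid division by zero for uniform raters
    return [(x - mu) / sigma for x in ratings]

normalize_user_ratings([5, 5, 5, 4])  # → [0.5, 0.5, 0.5, -1.5]
```

The mostly-fives rater's single 4 comes out strongly negative, which is exactly the "a 4.1 is questionable" effect made explicit.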
nlh•14h ago
Take any given Yelp / Google / Amazon page and you'll see some distribution like this:
User 1: "5 stars. Everything was great!"
User 2: "5 stars. I'd go here again!"
User 3: "1 star. The food was delicious but the waiter was so rude!!!one11!! They forgot it was my cousin's sister's mother's birthday and they didn't kiss my hand when I sat down!! I love the food here but they need to fire that one waiter!!"
Yelp: 3.6 stars average rating.
One thing I always liked about FourSquare was that they did NOT use this lazy method. Their score was actually intelligent - it checked things like how often someone would return, how much time they spent there, etc. and weighted a review accordingly.
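An engagement-weighted score in that spirit might look like this (a hypothetical illustration; Foursquare's actual model isn't public):

```python
def weighted_score(reviews):
    """reviews: list of (stars, visit_count) tuples. Repeat visitors'
    opinions count proportionally more than a one-time ranter's."""
    total_visits = sum(visits for _, visits in reviews)
    return sum(stars * visits for stars, visits in reviews) / total_visits

# Two regulars at 5 stars vs. one angry first-timer at 1 star:
weighted_score([(5, 12), (5, 8), (1, 1)])  # ≈ 4.81, vs. a naive mean of 3.67
```

The angry one-timer from the example above barely moves the needle, because people who never came back get little say.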
theendisney•14h ago
If you normalized the ratings, they could change without the driver doing anything. A former customer may start giving good ratings elsewhere, making yours worse, or give poor ones, improving yours.
Maybe the relevance of old ratings should decline.
theendisney•9h ago
Alternatively you could apply the same rating to the customer and display it next to their user name along with their own review counter.
What also seems a great option is to simply add up all the stars :) Then the grumpy people won't have to do anything.
ajmurmann•13h ago
This actually touches on another pet peeve of mine with rating systems: I'd like to see ratings for how much I will like it. An extreme but simple example might be that the ratings of a vegan customer of a steak house could be very relevant to other vegans but irrelevant to non-vegans. More subtle versions are simply about shared preferences. I'd love to see ratings normalized and correlated to other users to create a personalized rating. I think Netflix used to do stuff like this back in the day, and you could request your personal predicted score via API, but now that's all hidden and I'm instead shown different covers of the same shows over and over.
Hizonner•13h ago
My favorites: A power supply got one star for not simultaneously delivering the selected limit voltage and the selected limit current into the person's random load. In other words, literally for not violating the laws of physics. An eccentric-cone flare tool got one star for the cone being off center. "Eccentric" is in the name, chum....
esperent•13h ago
I would personally frame that as a review for poor documentation. A device shouldn't expect users to know the laws of physics to understand its limitations.
Hizonner•12h ago
We're talking about a general-purpose device meant to drive a circuit you create yourself. I'm not sure what a good analogy would be. Expecting the documentation for a saw to tell you you have to cut all four table legs the same length?
esperent•9h ago
The saw analogy isn't a good one - saws work within the range of physics that humans have instinctual understanding of. We instinctively know what causes a table to wobble. We do not instinctively know the physical behaviors of electricity.
You might counter that people should know this before messing with electricity, and I'll agree. But what people should know and what they actually know are often very different.
A warning in the manual might prevent some overeager teenager who got their hands on this device from learning this particular law of physics the hard way.
anon7000•12h ago
Why can’t I downvote or comment on it? As a user, I just want more context.
But obviously, it’s not in Amazon’s interest to make me not want to buy something.
nlh•7h ago