Without figures for true positives, recall, or financial recoveries, its effectiveness remains completely in the dark.
In short: great for moral grandstanding in the comments section, but zero evidence that taxpayer money or investigative time was ever saved.
The model is considered fair if its performance is equal across these groups.
One can immediately see why this is problematic by considering an equivalent example in a less controversial (i.e. less emotionally charged) setting.
Should basketball performance be equal across racial, or sex groups? How about marathon performance?
It’s not unusual that relevant features are correlated with protected features. In the specific example above, being an immigrant is likely correlated with not knowing the local language, therefore being underemployed and hence more likely to apply for benefits.
In your basketball analogy, it's more like they have a model that predicts basketball performance, and they're saying that model should predict performance equally well across groups, not that the groups should themselves perform equally well.
The issue is that we don't know how many Danish commit fraud, and we don't know how many Arabs commit fraud, because we don't trust the old process to be unbiased. So how are we supposed to judge if the new model is unbiased? This seems fundamentally impossible without improving our ground truth in some way.
The project presented here instead tries to do some mental gymnastics to define a version of "fair" that doesn't require that better ground truth. They were able to evaluate their results on the false-positive rate by investigating the flagged cases, but they were completely in the dark about the false-negative rate.
In the end, the new model was just as biased, but in the other direction, and performance was simply worse:
> In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood to find investigation worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.
Very well written, but that last part is concerning and points to one thing: did they hire interns? How come they do not have systems? It just casts a big doubt on the whole experiment.
There's a huge problem with people trying to use umbrella usage to predict flooding. Some people are trying to develop a computer model that uses rainfall instead, but watchdog groups have raised concerns that rainfall may be used as a proxy for umbrella usage.
(It seems rather strange to expect a statistical model trained for accuracy to infer, and indirect through, a shadow variable that makes it less accurate, simply because that variable is easy for humans to observe directly and use as a lossy shortcut, or to promote other goals that aren't part of the labels it's trained on.)
> These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups. In evaluating its model, the city made a choice to focus on false positives and on reducing ethnicity/nationality based disparities. Precisely because the reweighting procedure made some gains in this direction, the model did worse on other dimensions.
Nice to see an investigation that's serious enough to acknowledge this.
1. In aggregate over any nationality, people face the same probability of a false positive.
2. Two people who are identical except for their nationality face the same probability of a false positive.
In general, it's impossible to achieve both properties. If the output and at least one other input correlate with nationality, then a model that ignores nationality fails (1). We can add back nationality and reweight to fix that, but then it fails (2).
This tradeoff is most frequently discussed in the context of statistical models, since those make it explicit. It applies to any decision process though, including human decisions.
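A toy simulation of that tradeoff (every number, threshold and the "feature" are invented purely for illustration): when a feature correlates with nationality, a nationality-blind threshold gives different false-positive rates per group, and equalizing those rates per group means treating two identical applicants differently:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two nationalities; an income-like feature correlates with nationality,
# and fraud correlates with that feature. All numbers are made up.
nationality = rng.integers(0, 2, n)                      # 0 = group A, 1 = group B
feature = rng.normal(loc=0.8 * nationality, scale=1.0)   # shifted for group B
fraud = rng.random(n) < 1.0 / (1.0 + np.exp(-(feature - 1.5)))

def false_positive_rate(flagged, group):
    honest = ~fraud & (nationality == group)
    return flagged[honest].mean()

# Nationality-blind model: one global threshold on the feature.
blind = feature > 1.0
print("blind FPR:", false_positive_rate(blind, 0), false_positive_rate(blind, 1))
# -> clearly different per group, so property (1) fails.

# "Reweighted" model: pick group B's threshold so its FPR matches group A's.
target = false_positive_rate(blind, 0)
honest_b = feature[~fraud & (nationality == 1)]
thr_b = np.quantile(honest_b, 1.0 - target)
aware = feature > np.where(nationality == 0, 1.0, thr_b)
print("aware FPR:", false_positive_rate(aware, 0), false_positive_rate(aware, 1))
# -> roughly equal, so property (1) holds...

# ...but two applicants with the identical feature value are now treated
# differently depending on nationality, so property (2) fails.
x = 1.2
print("flagged as A:", x > 1.0, "flagged as B:", x > thr_b)
```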
My suspicion is that in many situations you could build a detector/estimator which was fairly close to being blind without a significant total increase in false positives, but how much is too much?
I'm actually more concerned that where I live even accuracy has ceased to be the point.
It would be immoral to disadvantage one nationality over another. But we also cannot disadvantage one age group over another. Or one gender over another. Or one hair colour over another. Or one brand of car over another.
So if we update this statement:
> Two people who are identical except for any set of properties face the same probability of a false positive.
With that new constraint, I don't believe it is possible to construct a model which outperforms a data-less coin flip.
One can't change one's race, but changing marital status is possible.
Where it gets tricky is things like physical fitness or social groups...
We tend to distinguish between ascribed and achieved characteristics. It is considered to be unethical to discriminate upon things a person has no control over, such as their nationality, gender, age or natural hair color.
However, things like a car brand are entirely dependent on one's own actions, and if there's a meaningful, statistically significant correlation between owning a Maserati and fraudulently applying for welfare, I'm not entirely sure it would be unethical to consider such a factor.
And it also depends on what a false positive means for the person in question. Fairness (like most things social) is not binary, and while outright rejections can be very unfair, additional scrutiny can be less so, even though still not fair (causing prolonged waits and extra stress). If things are working normally, I believe there's a sort of unspoken social agreement (ever-changing, of course, as times and circumstances evolve) on the balance between fairness and abuse that can be afforded.
Nationality and natural hair color I understand, but age and gender? A lot of behaviors are not evenly distributed. Riots after a football match? You're unlikely to find a lot of elderly women (and men, but especially women) involved. Someone is fattening a child? That elderly woman you've excluded for riots suddenly becomes a prime suspect.
> things like a car brand are entirely dependent on one's own actions
If you assume perfect free will, sure. But do you?
That’s true. But the idea is that feeding it to a system as an input could be considered unethical, as one cannot control their age. Even though there’s a valid correlation.
> If you assume perfect free will, sure. But do you?
I don’t. If this matters, I’m actually currently persuaded that free will doesn’t exist. Which doesn’t change the fact that if one buys a car, its make is typically entirely their decision. Whether such a decision comes from free will or is entirely determined by antecedent causes doesn’t really matter for the purposes of fraud detection (or maybe I fail to see how it does).
I mean, we don’t need to care why people do things (at all, in general) - it matters for how we should act upon detection, but not for the detecting itself. And, as I understand it, we know we don’t want to put unfair pressure on groups defined by factors they cannot change, because when we did that it consistently contributed to various undesirable consequences. E.g. discrimination and stereotypes against women or men, or prejudice against younger or elderly people, didn’t do us any good.
To take a few examples, looking at employment characteristics will have a strong relationship with gender, generally creating more false positives for women. Similarly, academic success will create more false positives for men. Where a person chooses to live will proxy heavily for socioeconomic factors, which in turn have gender as a major factor.
Welfare fraud itself also differs between men and women. The sums tend to be higher for men. Women, in turn, dominate the users of the welfare system. Women and men also tend to receive welfare at different times in their lives. It's even possible that car brand correlates with gender, which would then act as a proxy.
In terms of defining fairness, I do find it interesting that the analogue process gave men an advantage, while both the initial and the reweighted model are the opposite and give women an even bigger advantage. The change in bias against men created by using the detection algorithms is actually about the same size as the change in bias against non-Dutch nationality between the initial model and the reweighted one.
One has to wonder if the study is more valid a predictor of the implementers' biases than that of the subjects.
Training on past human decisions inevitably bakes in existing biases.
Fraud detection models will never be fair. Their job is to find fraud. They will never be perfect, and the mistaken cases will cause a perfectly honest citizen to be disadvantaged in some way.
It does not matter if that group is predominantly 'people with skin colour X' or 'people born on a Tuesday'.
What matters is that the disadvantage those people face is so small as to be irrelevant.
I propose a good starting point would be for each person investigated to be paid money to compensate them for the effort involved - whether or not they committed fraud.
Nevertheless the idea of giving money is still good imo, because it also incentivizes making fraud detection more efficient, since mistakes now cost more. Unfortunately I have a feeling people might game that to get more money by triggering false investigations.
Not all misdeeds are equally likely to be detected. What matters is minimizing the false positives and false negatives. But it sounds like they don't even have a ground truth to compare against, making the whole thing an exercise in bureaucracy.
The post does talk about it when it briefly mentions that the goal of building the model (to decrease the number of cases investigated while increasing the rate of finding fraud) wasn't achieved. They don't say any more than that because that's not the point they are making.
Anyway, the project was shelved after a pilot. So your point is entirely false.
> In late November 2023, the city announced that it would shelve the pilot.
I would agree that implications regarding the use of those models do not hold, but not the ones about their quality.
Amsterdam didn't deploy their model when they found the outcome was not satisfactory. I find that a perfectly fine result.
What's the problem with this? It isn't racism, it's literally just Bayes' Law.
Upon evaluation, your model seems to accept everyone who mentions a "fraternity" and reject anyone who mentions a "sorority". Swapping out the words turns a strong reject into a strong accept, and vice versa.
But you removed any explicit mention of gender, so surely your model couldn't possibly be showing an anti-women bias, right?
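A minimal sketch of that failure mode, using invented résumé snippets and deliberately biased accept/reject labels (nothing here is from the article): the gender column is gone, but a plain bag-of-words classifier hands the signal to the proxy word anyway:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up resumes: identical except for one gender-correlated word.
resumes = [
    "president of engineering fraternity, led robotics team",
    "fraternity treasurer, built trading tools in python",
    "president of engineering sorority, led robotics team",
    "sorority treasurer, built trading tools in python",
]
labels = [1, 1, 0, 0]  # historically biased outcomes: 1 = accept, 0 = reject

vec = CountVectorizer()
X = vec.fit_transform(resumes)
clf = LogisticRegression().fit(X, labels)

weights = dict(zip(vec.get_feature_names_out(), clf.coef_[0]))
print("fraternity:", round(weights["fraternity"], 2))  # positive weight
print("sorority:  ", round(weights["sorority"], 2))    # negative weight
# All the genuinely informative words cancel out; the model has quietly
# reconstructed the removed attribute from its proxy.
```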
djoldman•15h ago
It's generally straightforward to develop one if we don't care much about the performance metric:
If we want the output to match a population distribution, we just force it by taking the top predicted for each class and then filling up the class buckets.
For example, if we have 75% squares and 25% circles, but circles are predicted at a 10-1 rate, who cares, just take the top 3 squares predicted and the top 1 circle predicted until we fill the quota.
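Something like this sketch (invented scores, my own `fill_quotas` helper), just to make the mechanics concrete:

```python
# Force the selected set to match the population mix: take the top-scored
# items per class until each class's bucket is full, ignoring how the raw
# scores compare across classes.
def fill_quotas(items, quotas):
    """items: list of (class_label, score); quotas: {class_label: how many to take}."""
    selected = []
    for label, count in quotas.items():
        in_class = sorted((it for it in items if it[0] == label),
                          key=lambda it: it[1], reverse=True)
        selected.extend(in_class[:count])
    return selected

# 75% squares / 25% circles in the population -> quota of 3 squares, 1 circle,
# even though the model scores every circle far higher than every square.
items = [("circle", 0.99), ("circle", 0.95), ("circle", 0.90), ("circle", 0.88),
         ("square", 0.40), ("square", 0.35), ("square", 0.20), ("square", 0.10)]
print(fill_quotas(items, {"square": 3, "circle": 1}))
```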
Scarblac•14h ago
djoldman•14h ago
As noted above, this doesn't do anything for performance.
wongarsu•14h ago
As you say, that would be a crappy model. But in my opinion it would also hardly be a fair or unbiased model. It would be a model unfairly biased in favor of HP, who barely sell anything worth recommending.
djoldman•14h ago
"Unbiased" and "fair" are quite overloaded here, to borrow a programming term.
I think it's one of those times where single words should expressly NOT be used to describe the intent.
The intent of this is to presume that the rate of the thing we are trying to detect is constant across subgroups. The definition of a "good" model therefore is one that approximates this.
I'm curious if their data matches that assumption. Do subgroups submit bad applications at the same rate?
It may be that they don't have the data and therefore can't answer that.
teekert•13h ago
Any model would be unfair, age-wise but also ethnically.
To be most effective the model would have to be unfair. It would suck to be a law-abiding young member of a specific ethnic minority.
But does it help to search elderly couples?
I’m genuinely curious what would be fair and effective here. You can’t be a Bayesian.
lostlogin•11h ago
E.g., police shooting and brutality stats wouldn’t be tolerated for very long.