These numbers are mind-boggling, and while I understand that a "few (extremely) bad apples" are probably responsible for an outsized amount of production, AND that AI-generated imagery is flooding the zone disproportionate to the amount of actual human children being physically harmed, it's still absolutely wild to me that we collectively are producing and consuming so much of this content, despite it being largely universally considered essentially the most abhorrent thing possible.
What would fixing this at the root cause even look like? How do we apply whatever combination of therapeutic intervention or societal pressure might work to reduce the incidence of people having these urges, exploring them, feeding them, and sometimes acting on them? We see signs in every airport bathroom telling us to look for signs of trafficking. Trafficking intervention training is a huge deal in the travel industry in general. There are early intervention and detection systems for social workers and case workers.
But has anyone spent any real time looking at this from the other side: the side of the offender? I imagine there's research on the typical chain of how someone gets "onboarded" here: it probably starts with some early abuse, or if not that, early exposure or early curiosity, and then snowballs from there. I'm just thinking out loud about how large the magnitude of the problem is on the offender side if we're talking about this volume of images, and how we might be able to evaluate things from the "ounce of prevention worth a pound of cure" side of things, because damn is this depressing.
I would be interested in statistics related to the percent of adults who would be considered child predators. I have zero scope on how large this issue is by percent of population.
If 3% of everyone is sexually attracted to children, that's one thing, but if it's 0.0000001%, then the issue really is just the producers of content.
Does anyone here know of any studies or statistics? My basic googling hasn't really turned up anything trustworthy.
For some, "child predators" are those who do harmful things to toddlers.
For others, "child predators" are anyone who you want to accuse of it, like in this story: https://www.the-independent.com/news/world/americas/crime/ke...
https://scispace.com/pdf/how-common-is-men-s-self-reported-s...
Ghastly.
> What does it say about us, as a society, or just as _humans_, where the scale and magnitude of this problem is so great and only growing?
That the people in power have too much power and they get away with it often enough that there is actual money to be made supplying them.
i am so sick of AI slop writing..
>Built with love and ~25 000 tokens. Conceived and directed by a human. Written by AI.
I appreciate the transparency, although it is at the bottom.
i should take a break from the internet, the past couple of weeks feel like being stuck in an asylum where everything is written by the same one author, using the same words, same tropes, same idioms. i'm slowly going insane.
If any of the leading AI companies are looking to get back in the good graces of the public, they should seriously think about releasing an open source model that reliably labels media (text, photo or video) with a probability said media is AI generated.
There is a 0% chance they don’t already have models for this to prevent feeding their models AI generated training data. So release it.
Why spend the limited law enforcement budget on giving officers a cushy job of catching people for the crime of using a computer, when the same limited budget can be spent on catching those who actually hurt others?
But the article fails to take its statements to their logical conclusion. In one section, the author writes,
> Every false positive means an innocent person's content was flagged — a family photo, a medical image, a piece of art. It means unnecessary investigation, potential harm to reputation, and erosion of trust in the system. At scale, even a 0.01% false positive rate means thousands of wrongful flags per day.
and,

> In practice, the industry errs heavily toward minimizing false negatives — catching every possible match — and then uses human review to resolve false positives. This means the system flags aggressively but confirms carefully. The cost of a false positive is an investigation. The cost of a false negative is a child.
>
> This is also why the hybrid approach from Chapter VI matters. Perceptual hashing against a verified database has a low false positive rate — but not zero. Certain images (blank, solid-color, simple gradients) produce hashes that collide with database entries by coincidence, not because they depict abuse. Production systems include collision detection to filter these out before matching. Classifiers for unknown material have a higher false positive rate still (the model is making a judgment, not a comparison). By layering them — hashing first, then classifiers, then human review — the system can be both aggressive and precise. But no layer is perfect, and the threshold remains a human decision.
If there is a way to "include collision detection to filter these out before matching," then why do they "then human review?" The author starts the next section with, "Three Steps. No One Sees the Image." But they do human review to eliminate false positives? Both statements can't be simultaneously true: either "no human ever sees it," or "by layering them — hashing first, then classifiers, then human review — the system can be both aggressive and precise."
Secondly, although I'm not a researcher, I think I and a lot of researchers would love to see this "aggressive, but precise" algorithm that eliminates collisions without becoming useless. ("Collision" is an imprecise term here: it means an image of a background or a setting that trips the similarity system, which isn't exactly a collision in the classical sense, since the algorithm is a type of clustering with hashes.) As far as I'm aware, no such algorithm exists without either becoming useless or producing significant false positives. But I might be wrong.
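To make the collision problem concrete, here is a minimal sketch of an "average hash" (aHash), one of the simplest perceptual hashes. This is emphatically not PhotoDNA's algorithm (which is proprietary and far more elaborate); it only illustrates why flat or near-flat images produce degenerate hashes that coincidentally match each other.

```python
def average_hash(pixels):
    """pixels: flat list of grayscale values (0-255), e.g. an 8x8 thumbnail.
    Returns a bit string: '1' where the pixel is above the mean, else '0'."""
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p > mean else "0" for p in pixels)

# Two completely different solid-color "images"...
white = [255] * 64
gray = [128] * 64

# ...hash identically: no pixel exceeds its own mean, so both collapse to
# the all-zero hash. The hash carries no information about the image and
# "matches" anything else that is equally flat — the collision class the
# article says production systems must filter out before matching.
assert average_hash(white) == average_hash(gray) == "0" * 64
```

Filtering these degenerate cases (blank, solid-color, simple gradients) is easy precisely because they are degenerate; it does nothing for the harder collisions between genuinely distinct, detailed images.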
At one point in the article, the author says, "The cost of a false negative is a child." This "aggressive and precise" system diverts resources from actual investigations and prosecution. A few examples,
A very famous case from 2022, https://www.nytimes.com/2022/08/21/technology/google-surveil...
A more specific example, since the author mentions PhotoDNA:
> LinkedIn found 75 accounts that were reported to EU authorities in the second half of 2021, due to files that it matched with known CSAM. But upon manual review, only 31 of those cases involved confirmed CSAM. (LinkedIn uses PhotoDNA, the software product specifically recommended by the U.S. sponsors of the EARN IT Bill.)
PhotoDNA's "aggressive and precise" matching had a 58.6% false positive rate when tested. That means nearly 60% of the cases it generated for investigation wasted investigators' time, leading to fewer investigations overall. (From https://www.eff.org/deeplinks/2022/08/googles-scans-private-... )
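The 58.6% figure follows directly from the LinkedIn numbers quoted above (75 reports, 31 confirmed). A quick sanity check, assuming those counts:

```python
# LinkedIn figures quoted above: 75 accounts reported, 31 confirmed as CSAM
# on manual review.
reported = 75
confirmed = 31
false_positives = reported - confirmed  # 44 reports did not involve CSAM

fp_rate = false_positives / reported
print(f"{fp_rate * 100:.2f}%")  # prints "58.67%", which EFF reports as 58.6%
```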
These systems are also flagging photos of adults,
> In the process of reporting images, the occurrence of false positives—instances where non-CSAM images are mistakenly reported as CSAM—is inevitable. *One officer told us that there are “a lot” of CyberTipline reports that are images of adults.124* More false positives will mean fewer cases going unreported, and platforms must decide what balance they are comfortable with. False positives and false negatives can be minimized with better detection technology. One respondent criticized platforms for relying on their in-house technology. They perceived those as inferior to solutions offered by start-ups, suggesting that this choice might be driven by profit motives.125 Platforms, however, might have reservations about using third-party services for screening potential CSAM due to legal and ethical considerations. An NGO employee highlighted platform concerns, asking, “Can we trust these organizations? What ethical due diligence have they done?”
via https://purl.stanford.edu/pr592kc5483

The uncomfortable truth is that people are trying to use technology to fix a structural problem. Most victims of CSA (including me) know the abuser. In my case and others, at least one adult knew (or suspected) and did nothing. More maddeningly, even when the CSA is reported and discovered and the perpetrator is punished, the victims are re-abused within the foster care system: https://ballardbrief.byu.edu/issue-briefs/sexual-abuse-of-ch... 40% of children in foster care experience some type of abuse. Most never get the help they need.
I think the impulse to create systems that monitor everyone's phones for CSAM comes from a good place. But it's energy misdirected; better investigations into exploitation networks, investment in foster care and in care for abused children and teens, heck, even child AI companions capable of reporting abuse for children suspected of being abused, would lead to better outcomes than scanning everyone's phone.
mystraline•1h ago
You WILL get a CSAM spam issue. It will get caught in your server cache, and you won't catch it until after the fact. And shit admin tools will not properly remove the spammer or the content.
Better yet, if you run Matrix, disable image caching and preloading.