AI is not hype. We have started to actually do something with all the data, and this process will not stop soon.
The RL that is now happening through human feedback alone (thumbs up/down) is massive.
This meant making a rich synthetic dataset first, to pre-train the model, before fine-tuning on real, expensive data to get the best results.
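As a rough sketch of what that two-stage recipe looks like (the model, data, and hyperparameters below are invented stand-ins, not anyone's actual setup): pre-train on a large, cheap synthetic set, then fine-tune the same weights on a small, expensive "real" set at a lower learning rate.

    # Minimal two-stage training sketch (hypothetical model and stand-in data).
    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    loss_fn = nn.MSELoss()

    def train(dataset, lr, epochs):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in DataLoader(dataset, batch_size=32, shuffle=True):
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()

    # Stage 1: large, cheap, noisy synthetic corpus (random tensors as a stand-in).
    synthetic = TensorDataset(torch.randn(10_000, 16), torch.randn(10_000, 1))
    train(synthetic, lr=1e-3, epochs=3)

    # Stage 2: small, expensive "real" set; lower LR so stage 1 isn't clobbered.
    real = TensorDataset(torch.randn(500, 16), torch.randn(500, 1))
    train(real, lr=1e-4, epochs=10)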
But this was always the case.
Then again, maybe we're still operating from a framework where the dataset is part of your moat. It seems like such a way of thinking will severely limit the sources of innovation to just a few big labs.
Very much this. It's the dataset that shapes the model; the model is a product of the dataset rather than the other way around (mind you, synthetic datasets are different...)
[1] https://web.archive.org/web/20190224031626/https://blog.open...
This was published before anyone knew that running an AI company would be very, very expensive.
The EU has started the process of opening discussions aiming to set the stage for opportunities to arise on facilitating talks looking forward to identify key strategies of initiating cooperation between member states that will enable vast and encompassing meetings generating avenues of reaching top level multi-lateral accords on passing legislation covering the process of processing processes while preparing for the moment when such processes will become processable in the process of processing such processes.
#justeuthings :)
Even Google DeepMind's relabeled MedQA dataset, created for MedGemini in 2024, has flaws.
Many healthcare datasets/benchmarks contain dirty data because accuracy incentives are absent and few annotators are qualified.
We had to pay Stanford MDs to annotate 900 new questions to evaluate frontier models and will release these as open source on Hugging Face for anyone to use. They cover VQA and specialties like neurology, pediatrics, and psychiatry.
If labs want early access, please reach out. (Info in profile.) We are finalizing the dataset format.
Unlike general LLMs, where noise is tolerable and sometimes even desirable, training on incorrect/outdated information may cause clinical errors, misfolded proteins, or drugs with off-target effects.
Complicating matters, shifting medical facts may invalidate training data and model knowledge. What was true last year may be false today. For instance, in April 2024 the U.S. Preventive Services Task Force reversed its longstanding advice and now urges biennial mammograms starting at age 40 -- down from the previous benchmark of 50 -- for average-risk women, citing rising breast-cancer incidence in younger patients.
(To clarify: this is not the fault of scientists. This is a byproduct of a severely broken system with the wrong incentives, which encourages publication of papers and not discovery of truth. Hug cancer researchers. They have accomplished an incredible amount while being handcuffed and tasked with decoding the most complex operating system ever designed.)
Are scientists not writing those papers? There may be bad incentives, but scientists are responding to those incentives.
This game is being undermined and destroyed by infamous anti-vaxxer, non-medical expert, non-public-policy expert RFK Jr.[1] The disastrous cuts to the NIH's public grant scheme are likely to amount to $8,200,000,000 ($8.2 billion USD) in terms of years of life lost.[2]
So, should scientists not write those papers? Should they not do science for public benefit? These are the only ways to not respond to the structure of the American public grant scheme. It seems to me that, if we want better outcomes, we should make incremental improvements to the institutions surrounding the public grant scheme. That seems far more sensible than installing Bobby Brainworms to burn it all down.
[1] https://youtu.be/HqI_z1OcenQ?si=ZtlffV6N1NuH5PYQ
[2] https://jamanetwork.com/journals/jama-health-forum/fullartic...
That said, your comment has an implication: in which fields can we trust data if incentives are poor?
For instance, many Alzheimer's papers were undermined after journalists unmasked foundational research as academic fraud. Which conclusions are reliable and which are questionable? Who should decide? Can we design model architectures and training to grapple with this messy reality?
These are hard questions.
ML/AI should help shield future generations of scientists from poor incentives by maximizing experimental transparency and reproducibility.
Apt quote from Supreme Court Justice Louis Brandeis: "Sunlight is the best disinfectant."
1. Nucleotides are a form of tokenization and encode bias. They're not as raw as people assume. For example, classic FASTA treats modified and canonical C as identical. Differences may alter gene expression -- akin to "polish" vs. "Polish" (see the sketch after point 2).
2. Sickle-cell anemia and other diseases are linked to nucleotide differences. These single nucleotide polymorphisms (SNPs) mean that hard attention for DNA matters and that single-base resolution is non-negotiable for certain healthcare applications. Latent models have thrived in text-to-image and language, but researchers cannot blindly carry these assumptions into healthcare.
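A tiny illustration of both points, in Python (the alphabet, sequences, and encoding are made up for the example, not from any real pipeline):

    # Point 1: plain A/C/G/T tokenization cannot see base modifications.
    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

    def tokenize(seq):
        return [VOCAB[base] for base in seq.upper()]

    # In the cell, the second copy of this sequence carries 5mC at position 2,
    # but classic FASTA has no letter for that, so both are written "ACCGT".
    unmethylated = "ACCGT"
    methylated = "ACCGT"  # the modification is already lost at file-writing time
    assert tokenize(unmethylated) == tokenize(methylated)  # indistinguishable

    # Point 2: a single-base change (a SNP) is a different token sequence,
    # e.g. the sickle-cell GAG -> GTG substitution in the HBB gene.
    print(tokenize("GAG"), tokenize("GTG"))  # [2, 0, 2] vs. [2, 3, 2]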
There are so many open questions in biomedical AI. In our experience, confronting them has prompted (pun intended) better inductive biases when designing other types of models.
We need way more people thinking about biomedical AI.
Every fact is born an opinion.
This challenge exists in most, if not all, spheres of life.
You may find a definition of what a "skyscraper" is, by some hyperfocused association, but you'll get a bias towards a definite measurement like "skyscrapers are buildings between 700m and 3500m tall", which might be useful for some data mining project, but not at all what people mean by it.
The actual definition is not in any specific source but in the way the word is used across sources, like "the Manhattan skyscraper is one of the most iconic skyscrapers". In the aggregate you learn what it is, but that isn't very citable on its own, which gives WP that pedantic bias.
Same thing with legal data.
Get an AI to autogenerate lots of crap! Reddit, HN comments, false datasets, anything!
"High-paid" is an exaggeration for many of these, but certainly a small subset of people will make decent money on it.
At one provider I was, as an exception, paid 6x their going rate because they struggled to get people skilled enough at the high end to accept their regular rate, mostly to audit and review work done by others. I have no illusion I was the only one paid above their stated range. I got paid well, but even at 6x their regular rate I only got paid well because they estimated the number of tasks per hour and I was able to exceed that estimate by a considerable margin; if their estimate had matched my actual speed, I'd have just barely reached the low end of my regular rate.
But it's clear there's a pyramid of work, and a sustained effort to create processes that allow the bulk of the work to be done by low-cost labellers and then push smaller and smaller subsets of the data up to more expensive experts, as well as to create tooling that cuts down the amount of time experts spend by e.g. starting with synthetic data (including model-generated reviews of model-generated responses).
I don't think I was at the top of that pyramid - the provider I did work for didn't handle many prompts that required deep specialist knowledge (and though I did get to exercise my long-dormant maths and physics knowledge, that doesn't say too much). I think most of what we addressed would at most need people with MSc-level skills in STEM subjects. So I'm sure there are a few more layers of the pyramid handling PhD-level complexity data. But from what I'm seeing from hiring managers contacting me, I get the impression the pay scale for them isn't that much higher (with the obvious caveat, given what I mentioned above, that there almost certainly are people getting paid high multiples of the stated scale).
Some of these pipelines of work are highly complex, often including multiple stages of reviews, sometimes with multiple "competing" annotators in parallel feeding into selection and review stages.
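As a rough sketch of the shape such a pipeline can take (stage names, worker counts, and structure are invented for illustration, not any provider's actual process):

    # Hypothetical multi-stage annotation pipeline with competing annotators.
    from dataclasses import dataclass

    @dataclass
    class Stage:
        name: str
        workers: int               # annotators acting in parallel on each item
        select_best: bool = False  # a reviewer picks one of the parallel outputs

    pipeline = [
        Stage("synthetic_draft", workers=1),               # model-generated starting point
        Stage("annotation", workers=3, select_best=True),  # competing annotators
        Stage("peer_review", workers=1),
        Stage("expert_audit", workers=1),                  # the expensive, rare layer
    ]

    for stage in pipeline:
        print(f"{stage.name}: {stage.workers} parallel worker(s),"
              f" selection={stage.select_best}")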
The rate they offered was between $50-90 per hour, so significantly higher than what I’d think low-cost data labellers are getting.
Needless to say, I marked them as spam though. Harvesting emails through GitHub is dirty imo. Was also sad that the recruiter was acting on behalf of a YC company.
I can answer any questions people have about the experience (within code of conduct guidelines so I don't get in trouble...)