AI brand identity has made the unfortunate pivot to "how much do you trust us," which is going to be a real race to the bottom. I don't want LLMs managing nuclear reactors or replacing junior lab technicians. I don't trust any of these LLMs to do the bare minimum, regardless of how good it is for your brand.
It's gross watching these stunts unfold. Next ChatGPT will fly a passenger jet, which Claude will one-up with an agentic surgery, which OpenAI will respond to by putting a humanoid robot on the moon. If this is what 21st century market competition looks like, we are all fucked.
Furthermore, science isn’t suffering from a lack of papers. It’s suffering from a lack of good papers. Making it easier to just pump out paper-mill publications is about the last thing science needs right now.
In the long run, conceivably, we could use AI to hold papers to a much higher standard, audit all the associated data and code, etc.
Seems to be based on https://github.com/swaruplab/operon as evidenced by the authorization dialog and https://x.com/testingcatalog/status/2037684573161783373 .
Mostly targeted at life sciences - e.g. integration for FDA, PubMed, genomics databases but no ACM / IEEE as far as I can tell.
Edit: arXiv search seems to be supported - but not Google Scholar etc. So, this tool is of little use for most researchers outside life sciences.
Edit 2: Quick walkthrough: the AppImage starts a browser window with an onboarding wizard and a chat interface. It suggests a few things one might do at the start of a research project - e.g. do a quick literature review. When I chose that option, it wrote Python scripts that used MCP calls to do arXiv searches. It stayed seemingly stuck there for a few minutes, not returning anything. Then:
> The free-text search returned too much noise
Claude decided to choose a certain paper as a starting point for further research. Shortly afterwards:
> That DOI resolved to the wrong paper. Let me find the correct anchor papers by title/author search directly.
Then it meandered a few more minutes doing research and creating a citation graph (that it did not show to me).
> I have a complete picture. Let me verify the key DOIs resolve and then write the review.
Then:
> The lint flags em-dash overuse. Let me reduce them, then save.
Then: a nice but verbose literature overview of my chosen topic
<blink>BUT it includes at least one hallucinated reference!</blink>
P.S.: What does this mean?
[reviewer] verifier_mode=default-on downgraded to off: pro subscription tier, autoReviewer withheld (frame=f2a81cb2)
I was tickled they had a "Download for linux" button prominently shown, but nothing yet.
So targeting them with a tailored product is understandable.
Image-understanding for data viz is a use case that has been ignored, and modern LLMs are getting better at proper EDA. But, uh, I may need to update my resume.
It's the content that determines the sort of science, not the toolchain.
From the bits I've seen, I'd take Claude-generated code any time over that written by maths, physics, biology, linguistics people. Even though I've seen Claude make some super-big mistakes while doing data analysis, I'd guess it's already more reliable than most academics trying to code.
every few weeks though i test claude and chatgpt on their scientific reasoning and it has definitely improved over time. in my experience without specific instruction on what is known/unknown they typically are lagging behind the leading edge of the field (dev bio/pluripotency in my case). probably because scientific research articles are not open-source so they can't crawl them.
claude has definitely outperformed chatgpt in this regard, however; its scientific reasoning is impressive.
Claude Sonnet 5
Do they have no shame?
As other comments have pointed out, this is mostly "data science" – but it's not just making plots and writing papers [2]. It also has integrations with many databases and computational tools, including a researcher's institutional cluster.
That alone is valuable. Integrating these tools and databases is hard and time consuming; I founded a YC-funded startup after struggling with this problem at a bio startup. If the only outcome of this product is that great APIs are built for LLMs, it will be a massive positive impact. Many databases used in computational genomics are still only accessible through FTP!
LLMs are particularly good at navigating these tools and databases. It's often very specialized, but straightforward, work that benefits from in-context skills. Seeing an early glimpse of my former customers – bioinformaticians – using LLMs to solve this problem is what led me to join Anthropic in 2024.
Also, this pattern isn't fundamentally constrained to data science: you can also integrate with a wet lab or a CRO for some kinds of science. This is what I'm spending my time on now.
This type of science doesn't solve everything, but it's useful in some niches. For example, progress on many rare diseases is bottlenecked by researcher attention rather than a fundamental breakthrough.
[1] https://x.com/phylo_bio/article/2029233694775624096
[2] In comparison, OpenAI's science product – Prism – was effectively a LaTeX editor they acquired with Crixet.
If it fails you may have to double check it did properly reimplement it, but if it succeeds you do get a reproduction.
repetition of materials and methods toward reproducibility holds far less weight than multiple variants of a process designed to test a common hypothesis resulting in agreement. [null, or failure to null]
It wasn't perfect before, but it at least took some time to fake a paper. The problem is now people can produce a very plausible-looking, completely fake paper in minutes. Peer review is in the process of completely collapsing; in fact, I think it's already basically done.
The only way this might get fixed is if we require that all papers be completely reproducible. (That doesn't help in subjects like biology, of course, but they can still provide all the experimental data in the rawest format possible that doesn't break any laws.)
An explicit text desloppification pass (i.e. LLM-use obfuscation) seems like outright scientific fraud.
JoshGlazebrook•1h ago
striking•1h ago
> Anthropic @AnthropicAI Jun 27, 2026 · 12:29 AM UTC
> Since June 12, we’ve been working closely with the US government to restore access to Claude Mythos 5 and Fable 5. Today, the government notified us that Mythos 5, our strongest cybersecurity model, can be redeployed to a set of US organizations that operate and defend critical infrastructure.
> We’re restoring access for these organizations quickly, and we’re continuing to work with the government to expand access to Mythos 5 and make Fable 5 available for general use again.
ianm218•56m ago
Opus 4.8 / GPT 5.6 level models with the right workflows, data, and access are still good enough to do huge amounts of economically valuable work.