Lots of people are making moves in this space (including Anthropic), but nothing has broken through to the mainstream.
Why can't one set up a prompt, test it against a file, then once it is working, apply it to each file in a folder in a batch process which then provides the output as a single collective file?
Not sure what OS you're on, but in Windows it might look like this:
    FOR %%F IN (*.txt) DO (TYPE "%%F" | llm -s "execute this prompt" >> "output.txt")
    FOR %%F IN (*.pdf) DO (llm -a "%%F" -s "execute this prompt" >> "output.txt")
The limits are still buggy responses - Claude often gets stuck in a useless loop if you overfeed it with files - and lack of consistency. Sometimes hand-holding is needed to get the result you want. And it's slow.
But when it works it's amazing. If the issues and limitations were solved, this would be a complete game changer.
We're starting to get somewhat self-generating automation and complex agentic workflows, with access to all of the world's public APIs and search resources, controlled by natural language.
I can't see the edges of what could be possible with this. It's limited and clunky for now, but the potential is astonishing - at least as radical an invention as the web was.
I also use this method for doing code prototyping by giving it the path to files in the local working copy of my repo. Really cool to see it make changes in a vite project and it just hot reloads. Then I make tweaks or commit changes as usual.
LLM-desktop interfaces make great demos, but they are too slow to be usable in practice.
Tackling individual use-cases is supposed to be something for third party "ecosystem" companies to go after, not the mothership itself.
Not sure what Anthropic and co can do about that, but integrations feel like a step in the wrong direction. Whenever I've tried tool use, it was orders of magnitude more expensive and generally inferior to a simple model call with curated context from SerpApi and such.
I bet there are better / less arcane tools, but I think powerful and fast mechanisms for managing context are key and for me, that's really just powerful text editing features.
A truly useful AI assistant has context on my last 100,000 emails - and also recalls the details of each individual one perfectly, without confusion or hallucination.
Obviously I’m setting a high bar here; I guess what I’m saying is “yes, and”
Throw in all context --> ask it what is important for problem XYZ --> curate what it tells you, and feed that to another model to actually solve XYZ
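A minimal sketch of that two-pass flow with the Anthropic Python SDK (the model alias, file name, and "problem XYZ" placeholder are mine, not from the comment):

    import anthropic

    client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

    def ask(prompt: str, model: str = "claude-3-7-sonnet-latest") -> str:
        """One model call; returns the text of the first content block."""
        msg = client.messages.create(model=model, max_tokens=2048,
                                     messages=[{"role": "user", "content": prompt}])
        return msg.content[0].text

    raw_context = open("everything.txt").read()   # "throw in all context"
    relevant = ask(f"{raw_context}\n\nWhat here actually matters for problem XYZ? List it.")
    # ...curate `relevant` by hand here...
    answer = ask(f"Context:\n{relevant}\n\nNow solve problem XYZ.")
    print(answer)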
In contrast to what this integration is pushing, for LLM usage in production products where high accuracy is a requirement (99%+), you have to give the model a very limited tool set to get any degree of success.
I’m a bit skeptical that it’s gonna work out of the box because of the number of custom fields that seem to be involved in making successful API requests in our case.
But I would welcome not having to solve this problem. Jira’s interface is among the worst of all the ticket-tracking applications I have encountered.
But I have found that an LLM conversation, paired with enough context about what is involved in successful POSTs against the API, allows me to create, update, and relate issues via curl.
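For reference, the same kind of call the commenter drives via curl, sketched here in Python against the Jira Cloud v3 create-issue endpoint (the domain, credentials, project key, and any required custom fields are placeholders that vary by instance):

    import requests

    JIRA = "https://your-domain.atlassian.net"          # placeholder site
    auth = ("you@example.com", "API_TOKEN")              # basic auth with an API token

    payload = {
        "fields": {
            "project": {"key": "PROJ"},
            "summary": "Issue created via the REST API",
            "issuetype": {"name": "Task"},
            # instance-specific custom fields usually look like "customfield_12345": ...
        }
    }
    resp = requests.post(f"{JIRA}/rest/api/3/issue", json=payload, auth=auth)
    resp.raise_for_status()
    print(resp.json()["key"])   # e.g. "PROJ-123"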
It’s begging for a chat based LLM solution like this. I’d just prefer the underlying model not be locked to a vendor.
Atlassian should be solving this for its customers.
That or they're pulling an OpenAI and launching a feature that isn't actually fully live.
> in beta on the Max, Team, and Enterprise plans, and will soon be available on Pro
LLMs were always a fun novelty for me until OpenAI Deep Research, which started to actually come up with useful results on more complex programming questions (where I needed to write all the code by hand but had to pull together lots of different libraries and APIs), but it was limited to 10/month on the cheaper plan. Then Google Deep Research upgraded to 2.5 Pro, with paid usage limits of 20/day, which allowed me to just throw everything at it, to the point where I'm still working through reports that are a week or more old. Oh, and it searched up to 400 sources at a time, significantly more than OpenAI, which made it quite useful in historical research like identifying first-edition copies of books.
Now Claude is releasing the same research feature with integrations (excited to check out the Cloudflare MCP auth solution and hoping Val.town gets something similar), and a run time of up to 45 minutes. The pace of change was overwhelming half a year ago, now it's just getting ridiculous.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - 3.7 seems much better than 3.5, surely? - then I’m moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here’s a simple test. Try asking 3.7 to intuitively explain anything technical - say, mass dominated vs spring dominated oscillations. I’m a mechanical engineer who studied this stuff and I could not understand 3.7’s analogies.
I understand that coders are the largest single group of Claude’s users, but Claude went from being my most used app to being used only after both chatgpt and Gemini, something that I absolutely regret.
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier) but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect but o1 would often give me trash results but o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.
Like, ask it a simple question and it comes up with a full repo, complete with a README and a Makefile, when all you wanted to know was how efficient a particular algorithm would be in the included code.
Can't wait until they add research to the Pro plan because, you know, I have questions...
If you pay for a subscription then they don’t have an incentive to use more tokens for the same answer.
It’s definitely because feedback from people has “taught” it that more boilerplate is better. It’s the same reason ChatGPT is annoyingly complimentary.
I prefer Gemini 2.5 pro for all code now
That's not really true, since your prompts are also getting better. Better input leads to better output remains true, even with LLMs (when you see it as a tool).
However, in this specific example, I don't remember if it was ChatGPT or Gemini or 3.5 Haiku, but the other(s) explained it well enough. I think I re-asked 3.5 Haiku at a later point in time, and to my complete non-surprise, it gave an answer that was quite decent.
1 - For example, the field of DIY audio - which was, funnily enough, the source of my question. I'm no speaker designer, but combining creativity with engineering basics/rules of thumb seems to be something LLMs struggle with terribly. Ask them to design a speaker and they come up with the most vanilla, tired, textbook design - despite several existing market products that are already much further ahead/more innovative.
I'm confident that if you asked an LLM an identical question for which there is more discourse - eg make an interesting/innovative phone - you'd get relatively much better results.
If it was actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (weird jump) speaks volumes imo.
Deep Research hasn't really been that good for me. Maybe I'm just using it wrong?
Example: I want the precipitation in mm and monthly high and low temperature in C for the top 250 most populous cities in North America.
To me, this prompt seems like a pretty anodyne and obvious task for Deep Research. It's long, tedious, but mostly coming from well structured data sources (wikipedia) across two languages at most.
But when I put this in to any of the various models, I mostly get back ways to go and find that data myself. Like, I know how to look at Wikipedia, it's that I don't want to comb through 250 pages manually or try to write a script to handle all the HTML boxes. I want the LLM/model to do this days long tedious task for me.
My perspective on this is that if Deep Research can't do something, you should do it yourself and put the results on the internet. It'll help other humans and AIs trying to do the same task.
The project requires the full list of every known city in the Western Hemisphere and also Japan, Korea, and Taiwan. But that dataset is just maddeningly large, if it is possible at all. Like, I expect it to take me years, as I have to do a lot of translations. So, I figured that I'd be nice and just ask the various models for the top 250.
There's a lot more data that we're trying to get too, and I'm hoping that I can get approval to post it, as it's a work thing.
How do you validate its results in that scenario? Just take its word for it?
I'd say that what you're saying is 'synthesis'. The 'Intro/Discussion' sections of a journal article.
For me, 'research' means the work of going through and getting all the data in the first place. Like, going out and collecting dino bones in the hot sun, measuring all the soil samples, etc. - that is research. For me, asking these models to go collate some webpages - I mean, you spend the first weeks of a summer undergrad's time on this kind of thing to get them used to the file systems and spruce up their organization skills, see where they are at. Writing the paper up, that's part of research, sure, but not the hard part that really matters.
Maybe in a year, they’ll hit the graduate level. But we’re not near PhD level yet
I use it a lot when documentation is vague or outdated. When Gemini/o3 can't figure something out after 2 tries. When I am working with a service/API/framework/whatever that I am very unfamiliar with and I don't even know what to Google search.
I recently asked Chrome to show me how to apply the Knuth-Bendix completion procedure to propositional logic, and I had already formed my own thoughts about how to proceed (I'm building a rewrite system that does automated reasoning).
The response convinced me that I'm not a total idiot.
I'm not an academic and I'm often wrong about theory so the validation is really useful to me.
It has literally stagnated for a year now.
All that's changed is they connect more APIs.
And added a thinking loop with the same model powering it.
This is the reason it seems fast - nothing really happens except easy things.
That being said, isn’t it strange how the community has polar opposite views about this? Did anything like this ever happen before?
Like, I wanted to scope how to build a homemade TrueNAS Scale unit: it helped me avoid pitfalls, like knowing that I needed two GPUs minimum to run the OS and local LLMs, and it sped up configuring a CLI backup of my Dropbox locally (it told me to use the right filesystem format over ZFS to make the Dropbox client work).
It has done everything from researching how I can structure my web app for building a payment system on the web (something I knew nothing about) to writing small tools to talk to my document collection and index it into collections in Anki, all in one day.
All that talk about AI replacing people seemed a little far-fetched in 2024. But in 2025, I really think the models are getting good enough.
If there was truly any innovation still happening in OpenAI, Anthropic, etc., they would be working on models only, not on side features that someone could already develop over a weekend.
Now I'm in a new team where 99% of our oncall tickets come from automated alarms and 80% of them are a subset of a few issues where the root-cause isn't easy to address but there is either nothing to actually do once investigated, or the fix is a one time process that is annoying to run, so the username isn't accurate anymore :)
I still like the change of pace though, 0 worries about sprint tasks or anything else for a week every few months.
Hope one day it will be practical to do nightly finetunes of a model per company with all core corporate data stores.
This could create a seamless native model experience that knows about (almost) everything you’re doing.
I'll leave it to you to guess which one is harder to do.
There are now some lighter versions of fine-tuning that don’t update all the model weights but instead train small adapter layers, called LoRA, which is way more viable commercially at the moment in my opinion.
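For the curious, wiring up such an adapter with Hugging Face's PEFT library looks roughly like this (the base model and hyperparameters are illustrative, not a recommendation):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; any locally loadable causal LM works the same way.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    lora = LoraConfig(
        r=16,                                  # rank of the low-rank update matrices
        lora_alpha=32,                         # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # which layers get adapters
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()         # typically well under 1% of the full model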
Some of the issues still exist, of course:
* Finetuning takes time and compute; for one-off queries using in-context learning is vastly more efficient (i.e., look it up with RAG).
* Early results with finetuning had trouble reliably memorizing information. We've got a much better idea of how to add information to a model now, though it takes more training data.
* Full finetuning is very VRAM intensive; optimizations like LoRA were initially good at transferring style and not content. Today, LoRA content training is viable but requires training code that supports it [1].
* If you need a very specific memorized result and it's costly to get it wrong, good RAG is pretty much always going to be more efficient, since it injects the exact text in context. (Bad RAG makes the problem worse, of course).
* Finetuning requires more technical knowledge: you've got to understand the hyperparameters, avoid underfitting and overfitting, evaluate the results, etc.
* Finetuning requires more data. RAG works with a handful of datapoints; finetuning requires at least three orders of magnitude more data.
* Finetuning requires extra effort to avoid forgetting what the model already knows.
* RAG works pretty well when the task that you are trying to perform is well-represented in the training data.
* RAG works when you don't have direct control over the model (i.e., API use).
* You can't finetune most of the closed models.
* Big, general models have outperformed specialized models over the past couple of years; if it doesn't work now, just wait for OpenAI to make their next model better on your particular task.
On the other hand:
* Finetuning generalizes better.
* Finetuning has more influence on token distribution.
* Finetuning is better at learning new tasks that aren't as present in the pretraining data.
* Finetuning can change the style of output (e.g., instruction training).
* When finetuning pays off, it gives you a bigger moat (no one else has that particular model).
* You control which tasks you are optimizing for, without having to wait for other companies to maybe fix your problems for you.
* You can run a much smaller, faster specialized model because it's been optimized for your tasks.
* Finetuning + RAG outperforms just RAG. Not by a lot, admittedly, but there are some advantages.
Plus the RL Training for reasoning has been demonstrating unexpectedly effective improvements on relatively small amounts of data & compute.
So there's reasons to do both, but the larger investment that finetuning requires means that RAG has generally been more popular. In general, the past couple of years have been won by the bigger models scaling fast, but with finetuning difficulty dropping there is a bit more reason to do your own finetuning.
That said, for the moment the expertise + expense + time of finetuning makes it a tough business proposition if you don't have a very well-defined task to perform, a large dataset to leverage, or other way to get an advantage over the multi-billion dollar investment in the big models.
1. If you have a large corpus of valuable data not available to the corporations, you can benefit from fine tuning using this data.
2. Otherwise just use RAG.
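To make option 2 concrete, a toy version of the retrieval step (the embedding model, chunks, and query are placeholders; real setups add chunking, a vector store, and reranking):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]   # your corpus, pre-chunked
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query (cosine similarity)."""
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q
        return [docs[i] for i in np.argsort(-scores)[:k]]

    context = "\n\n".join(retrieve("What does the Q3 report say about churn?"))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    # `prompt` then goes to whatever model you're calling.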
I had no idea that fine tuning for adding information is viable now. Last I checked (year+ back) it seemed to not work well.
It depends on your data access pattern. If some text goes through the LLM input many times, it is more efficient for the LLM to be finetuned on it once.
The budget question comes into play as well. Even if text is repeatedly fed to the LLM, that cost might be spread over a long enough time that, compared to finetuning (which is a sort of capex), it is financially more accessible.
Now bear in mind, I'm a big proponent of finetuning where applicable and I try to raise awareness of the possibilities it opens. But one cannot deny RAG is a lot more accessible to teams that are more likely developers / AI engineers than ML engineers/researchers.
It looks like major vendors provide simple API for fine-tuning, so you don't need ML engineers/researchers: https://platform.openai.com/docs/guides/fine-tuning
Setting up RAG infra is likely more complicated than that.
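The vendor-hosted flow really is short. With the OpenAI Python client it's roughly this, assuming you've prepared a JSONL file of chat-formatted examples (file name and model are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Training data: one JSON object per line, each with a "messages" list.
    f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

    job = client.fine_tuning.jobs.create(
        training_file=f.id,
        model="gpt-4o-mini-2024-07-18",   # a base model that supports fine-tuning
    )
    print(job.id, job.status)   # poll until done, then call job.fine_tuned_model like any other model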
Results with this method are significantly more limited compared to all the power open-weight finetuning gives you (and the skillset needed in return).
And in either case don’t forget alignment and evals.
How many epochs do you run?
In case the above link doesn't work later on, the page for this demo day is here: https://demo-day.mcp.cloudflare.com/
Because MCP isn’t an API; it’s the protocol that defines how the LLM even calls the API in the first place. Without it, all you've got is a chat interface.
A lot of people misunderstand what the role of MCP is. It’s the signaling the LLM uses to reach out of its context window and do things.
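To make that concrete: an MCP server mostly just declares tools the model can discover and call over the protocol. A minimal sketch with the official Python SDK (the tool itself is a made-up example):

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo")

    @mcp.tool()
    def lookup_order(order_id: str) -> str:
        """Made-up example tool: return the status of an order."""
        return f"Order {order_id}: shipped"

    if __name__ == "__main__":
        mcp.run()   # serves the MCP protocol over stdio; a client like Claude Desktop discovers the tool

The LLM never sees this code; it only sees the tool's name, description, and schema, and emits calls against them.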
Truly, OSS should be more interesting in the next decade for this alone.
I’d feel a lot better if we had something resembling a comprehensive data privacy law in the United States because I don’t want it to basically be the Wild West for anyone handling whatever personal info doesn’t get covered under HIPAA.
I've always worked under the assumption the best employees make themselves replaceable via well defined processes and high quality documentation. I have such a hard time understanding why there's so much willingness to integrate irreplaceable SaaS solutions into business processes.
I haven't used AI a ton, but everything I've done has focused on owning my own context, config, etc.. How much are people going to be willing to pay if someone else owns 10+ years of their AI context?
Am I crazy or is owning the context massively valuable?
This does not sound like it would be learning general information helpful across an industry, but specific, actionable information.
If not available now, is that something that AI vendors are working toward? If so, what is to keep them from using that knowledge to benefit themselves or others of their choosing, rather than the people they are learning from?
While people understand ethics, morals and legality (and ignore them), that does not seem like something that an AI understands in a way that might give them pause before doing an action.
Perhaps I am just frivolous with my own time, but I tend to use LLMs in a more iterative way for research. I get partial answers, probe for more information, direct the attention of the LLM away from areas I am familiar and towards areas I am less familiar. I feel if I just let it loose for 45 minutes it would spend too much time on areas I do not find valuable.
This seems more like a play for "replacement" rather than "augmentation". Although, I suppose if I had infinite wealth, I could kick of 10+ research agents each taking 45 minutes and then review their output as it became available, then kick off round 2, etc. That is, I could do my process but instead of interactively I could do it asynchronously.
As for long research times, one thing I’ve been using it for is historical research on old books. Gemini DeepResearch was the first one able to properly explain the nuances of identifying a chimeral first edition Origin of Species after taking half an hour and reading 400 sources. It went into all the important details like spelling errors and the properties of chimeral FY2** copies found in various libraries around the world.
Give us an LLM with better reasoning capabilities, please! All this other stuff just feels like a distraction.
I've been using the Atlassian MCP for nearly a month now, and it's completely changed (and eliminated) the feeling of having an overwhelming backlog.
I can have it do things like "find all the tickets related to profile editing and combine them into one epic" where it works perfectly. Or "help me prioritize the 15 tickets assigned to me this sprint" and it'll actually go through and suggest "maybe you can do these two tickets first since they seem smaller, then do this big one" – I haven't hooked it up to my calendar yet.
But I'd love for it to suggest things like "do this one ticket that requires a lot of heads down time on wednesday since you don't have any meetings. I can create a block on your calendar so that nobody will schedule a meeting then"
Those are all superhuman things that can be done with MCP and a smart model.
I've defined rules in Cursor that say "when I ask you to mark something ready for test, change the status and assign it to <x person>, and leave a comment summarizing the changes"
If you look at my JIRA comments now, you'd wonder how I had so much time to write such thorough comments. I don't, Cursor and whatever model is doing it for me.
It's been an absolute game changer. MCP is going to be what the App Store was to mobile. Yes, you can get by without it, but actually hooking into all your daily tools is when this stuff gets insanely valuable in a practical sense.
How do your colleagues feel about it?
I also don’t want to read too many unnecessary words.
Can’t we point an LLM to a SQLite db, tell it to treat it as an issue-tracking db, and have everyone do the same?
The service (jira) would materialize inside the LLMs then.
Why even use abstractions like tickets etc. Ask LLM what to do.
Unless you can provide the same visibility, long-term planning features, and compliance aspects of JIRA on top of your SQLite db, you won't compete with JIRA. But if you do add those things on top of SQLite and LLMs, you probably have a solid business idea. But you'd first need to understand JIRA well enough to know why those features are there in the first place.
[0] https://en.wikipedia.org/w/index.php?title=Wikipedia:FENCE
One of them said “yeah I was wondering cuz you never write that much” - as a leader, I actually don’t set a good example of how to leave quality JIRA comments. And my view with all these things is that I have to lead by example, not by orders.
With the help of these kinds of tools, we can improve the quality of these comments. And I wouldn’t expect others to write them manually, more that I wanted to show that everyone’s use of JIRA on the team can improve.
I don't think it's good leadership to unleash drivel on an organisation, have people waste time reading and perhaps replying to it, thinking it's something important and thoughtful coming from atonse.
Good thing you told them though, now they can ignore it.
There's nothing I hate more than people sending me their AI messages, be it in a ticket or a PR or even on Slack. I'm forced to engage and spend effort on something it took them all of 3 seconds to generate, without even proofreading what they're sending me. The number of times I've had to ask 11 clarifying questions because their message has 11 contradictions within itself is maddening to the highest degree.
The worst is when I call out one of these numerous contradictions, and the reply is "oh haha, stupid Claude :)", makes my blood boil and at the same time amazes me that someone has so little pride and respect for their fellow humans to do crap like that.
I'm not in that world at the moment, but I've been the lead on several projects where the backlog has become a dumping ground of years of neglect. You end up with this tiered backlog thing where one level of backlog gets too big, so you create a second tier of backlog for the stuff you are actually going to work on. Pretty soon you end up with duplicates in the second-tier backlog for items already in the base-level backlog, since no one even looks at that old backlog anymore.
I've done a lot of tidy up myself when I inherit this kind of mess, just closing tickets we definitely will never get to, de-duping, adding context when available, grouping into epics, tagging with relevant "tech-debt", "security", "bug", "automation", etc. But when there are 100s of tickets it is a slog. Having an LLM do this makes so much sense.
What it _doesn't_ seem to yet mitigate is prompt injection attacks, where a tool call description of one tool convinces the model to do something it shouldn't (like send sensitive data to a server owned by the attacker.) I think these concerns are a little bit overblown though; things like pypi and the Chrome Extension store scare me more and it doesn't stop them from mostly working.
I love MCP (it’s way better than plain Claude) but even that runs into context walls.
> a new way to connect your apps and tools to Claude. We're also expanding... with an advanced mode that searches the web.
The notion of software eating the world, and AI accelerating that trend, always seems to forget that The World is a vast thing, a physical thing, a thing that by its very nature can never be fully consumed by the relentless expansion of our digital experiences. Your worldview /= the world.
The cynic would suggest that the teams that build these tools should go touch grass, but I think that misses the mark. The real indictment is of the sort of thinking that improvements to digital tools [intelligences?] in and of themselves can constitute truly substantial and far reaching changes.
The reach of any digital substrate is inherently limited, and this post unintentionally lays that bare. And while I hear accelerationists invoking "robots" as the means for digital agents to expand their potent impact deeper into the real world, I suggest this is the retort of those who spend all day in apps, tools, and the web. The impacts and potential of AI are indeed enormous, but some perspective remains warranted, and occasional injections of humility and context would probably do these teams some good.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...)
Being Apple, they would have to come up with something novel like they did with push (where you have _one_ OS process running that delegates to apps rather than every app trying to handle push themselves) rather than having 20 MCP servers running. But I think if they did this properly, it would be so amazing.
I hope Apple is really re-thinking their absolutely comical start with AI. I hope they regroup and hit it out of the park (like how Google initially stumbled with Bard, but are now hitting it out of the park with Gemini)
They bet big and got distracted on VR. It was obviously the wrong choice at the time, and even more so now. They're going to have to abandon all that VR crap and pivot hard to AI to try and catch up. I think the more likely case is they can't catch up now and will just have to end up licensing Gemini from Google/Google paying them to use Gemini as the default AI.
People will say 'aaah ad company' (me too sometimes) but I'd honestly trust a Google AI tool with this way more. Not just because it already has access to my Google Workspace obviously, but just because it's a huge established tech firm with decades of experience in trying not to lose (or have taken) user data.
Even if they get the permissions right and it can only read my stuff if I'm just asking it to 'research', now Anthropic has all that and a target on their backs. And I don't even know what 'all that' is, whatever it explored deeming it maybe useful.
Maybe I'm just transitioning into old guy not savvy with latest tech, but I just can't trust any of this 'go off and do whatever seems correct or helpful with access to my filesystem/Google account/codebase/terminal' stuff.
I like chat-only (well, +web) interactions where I control the input and take the output, but even that is not an experience that gives me any confidence in granting uncontrolled access to stuff and trusting it to always do something correct and reasonable. It's often confidently incorrect too! I wouldn't give an intern free rein in my shell either!
Sometimes I want a pure model answer and I used to use Claude for that. For research tasks I preferred ChatGPT, but I found that you cannot reliably deny it web access. If you are asking it a research question, I am pretty sure it uses web search, even when "Search" and "Deep Research" are off.
MCP is a flawed spec and quite frankly a scam.
Increasing the number of "connections" to the LLM increases the risk of a leak, and it gives you more rope to hang yourself with when at least one connection becomes problematic.
Now is a great time to be a LLM security consultant.
Even my wife, who normally used Claude to create interesting recipes to bake cookies, has noticed a huge downgrade in 3.7.
LLMs wrapping the services makes more sense, as the data stored in those services adds a lot of value to off the shelf LLMs.
When I hooked up our remote MCP server, Claude sent a GET request to the endpoint. According to the spec, clients that want to support both transports should first attempt to POST an InitializeRequest to the server URL. If that returns a 4xx, they should then assume the SSE integration.
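A sketch of that fallback as I read the spec (the URL and client info are placeholders):

    import requests

    URL = "https://example.com/mcp"   # placeholder server endpoint

    init = {
        "jsonrpc": "2.0", "id": 1, "method": "initialize",
        "params": {"protocolVersion": "2025-03-26", "capabilities": {},
                   "clientInfo": {"name": "demo-client", "version": "0.1"}},
    }

    # Try the newer Streamable HTTP transport first: POST the InitializeRequest.
    r = requests.post(URL, json=init,
                      headers={"Accept": "application/json, text/event-stream"})
    if r.status_code < 400:
        transport = "streamable-http"
    else:
        # A 4xx suggests an older server: fall back to the HTTP+SSE transport,
        # where a GET on the endpoint opens the event stream.
        transport = "http+sse"
    print(transport)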
I might not dare to add an integration if it can potentially add a bunch of stuff to the backing systems without my approval. Confirmations and review should be part of the protocol.
1. Can be rolled back/undone
2. Clearly states exactly what it's going to do in a reviewable way
If those aren't fulfilled you're going to end up with users that are afraid of using your app.
That + OpenAI's agent SDK makes creating agentic flows so easy.
On the other hand you're kinda forced to run these tools / MCP servers in their own process which makes no sense to me.
but we're taking it a step (or two) further, enabling you to dynamically build up an MCP server from other servers managed in your account with us.
try it out, or let me get you a demo! this goes for any casual comment readers too ;)
you bundle MCP servers into a profile, which acts as a single virtual MCP server and can be dynamically updated without re-configuring your MCP client (e.g. Claude)
Both OpenAI and Google continue to push the frontier on reasoning, multimodality, and efficiency whereas Claude's recent releases have felt more iterative. I'd love to see Anthropic push into model research again.
Keyword search is such a naive approach to information discovery and information sharing - and renders confluence in big orgs useless. Being able to discuss and ask questions is a more natural way of unpacking problems.
So I tested a basic prompt:
1. go to : SOME URL
2. copy all the content found VERBATIM, and show me all that content as markdown here.
Result: it FAILED miserably with a few basic HTML pages - it simply is not loading all the page content in its internal browser.
What worked well:
- Gemini 2.5 Pro (Experimental)
- GPT 4o-mini
- Gemini 2.0 Flash (not verbatim, but summarized)
However, there's a major concern that server hosters are on the hook to implement authorization. Ongoing discussion here [1].
[0] https://modelcontextprotocol.io/specification/2025-03-26
[1] https://github.com/modelcontextprotocol/modelcontextprotocol...
Source: https://github.com/modelcontextprotocol/modelcontextprotocol...
> major concern that server hosters are on the hook to implement authorization
Doesn't it make perfect sense for server hosters to implement that? If Claude wants access to my Jira instance on my behalf, and Jira hosts a remote MCP server that aids in exposing the resources I own, isn't it obvious Jira should be responsible for authorization?
How else would they do it?
I guess maybe you are saying the onus is NOT on the MCP server but on the authorization server.
Anyway while technically true this is mostly just distracting because:
1. in my experience the resource server and the authorization server are almost always maintained by the same company -- Jira/Atlassian being an example
2. the resource server still minimally has the responsibility of identifying and integrating with some authorization server, and *someone* has to be the authorization server, so I'm not sure deferring the responsibility to that unidentified party is a strong defense against the critique anyway. The strong defense is: of course the MCP server should have these responsibilities.
For example, say you have a JIRA self hosted instance with SSO to entra id. You can't just install an MCP server off the shelf because authZ and resources are tightly coupled and implementation specific. It would be much easier if the server only handled providing resources, and authZ was offloaded to a provider of your choosing.
But the thread's security concerns—permissions, data protection, trust—are dead on. There is also a major authN/Z gap, especially for orgs that want MCP to access internal tools, not just curated SaaS.
Pushing complex auth logic (OAuth scopes, policy rules) into every MCP tool feels backwards.
* Access-control sprawl. Each tool reinvents security. Audits get messy fast.
* Static scopes vs. agent drift. Agents chain calls in ways no upfront scope list can predict. We need per-call, context checks.
* Zero-Trust principles mismatch. Central policy enforcement is the point. Fragmenting it kills visibility and consistency.
We already see the cost of fragmented auth: supply-chain hits and credential reuse blowing up multiple tenants. Agents only raise the stakes.
I think a better path (and one that, in full disclosure, we're actively working on at Pomerium) is to have:
* One single access point in front of all MCP resources.
* Single sign-on once, then short-lived signed claims flow downstream.
* AuthN separated from AuthZ with a centralized policy engine that evaluates every request, deny-by-default. Evaluation in both directions with hooks for DLP.
* Unified management, telemetry, audit log and policy surface.
I’m really excited about what MCP is putting us in the direction of being able to do with agents.
But without a higher level way to secure and manage the access, I’m afraid we’ll spend years patching holes tool by tool.
I ran two of the same prompts just now through Anthropic’s new Advanced Research. The results for it and for ChatGPT and Gemini appear below. Opinions might vary, but for my purposes Gemini is still the best. Claude’s responses were too short and simple and they didn’t follow the prompt as closely as I would have liked.
Writing conventions in Japanese and English
https://claude.ai/public/artifacts/c883a9a5-7069-419b-808d-0...
https://docs.google.com/document/d/1V8Ae7xCkPNykhbfZuJnPtCMH...
https://chatgpt.com/share/680da37d-17e4-8011-b331-6d4f3f5ca7...
Overview of an industry in Japan
https://claude.ai/public/artifacts/ba88d1cb-57a0-4444-8668-e...
https://docs.google.com/document/d/1j1O-8bFP_M-vqJpCzDeBLJa3...
https://chatgpt.com/share/680da9b4-8b38-8011-8fb4-3d0a4ddcf7...
The second task, by the way, is just a hypothetical case. Though I have worked as a translator in Japan for many years, I am not the person described in the prompt.
which is to say: I’m not sure it actually wins, technically, over the OpenAI/OpenAPI idea from last year, which was at least easy to understand
Edit: Actually right in the tickets themselves would probably be better and not require MCP... but still
So if you ask it “who is in charge of marketing” it will read it off SharePoint instead of answering generically
I think we are coming to a new automated technology ecosystem where LLMs will orchestrate many different parts of software with each other, speeding up the launch, evolution and monitoring of products.