Lots of people are making moves in this space (including Anthropic), but nothing has broken through to the mainstream.
Why can't one set up a prompt, test it against a file, then once it is working, apply it to each file in a folder in a batch process which then provides the output as a single collective file?
Not sure what OS you're on, but in Windows it might look like this:
    FOR %%F IN (*.txt) DO (TYPE "%%F" | llm -s "execute this prompt" >> "output.txt")
    FOR %%F IN (*.pdf) DO (llm -a "%%F" -s "execute this prompt" >> "output.txt")
The limits are still buggy responses - Claude often gets stuck in a useless loop if you overfeed it with files - and lack of consistency. Sometimes hand-holding is needed to get the result you want. And it's slow.
But when it works it's amazing. If the issues and limitations were solved, this would be a complete game changer.
We're starting to get somewhat self-generating automation and complex agentic workflows, with access to all of the world's public APIs and search resources, controlled by natural language.
I can't see the edges of what could be possible with this. It's limited and clunky for now, but the potential is astonishing - at least as radical an invention as the web was.
I also use this method for doing code prototyping by giving it the path to files in the local working copy of my repo. Really cool to see it make changes in a vite project and it just hot reloads. Then I make tweaks or commit changes as usual.
LLM-desktop interfaces make great demos, but they are too slow to be usable in practice.
Tackling individual use-cases is supposed to be something for third party "ecosystem" companies to go after, not the mothership itself.
Not sure what Anthropic and co can do about that, but integrations feel like a step in the wrong direction. Whenever I've tried tool use, it was orders of magnitude more expensive and generally inferior to a simple model call with curated context from SerpApi and such.
I bet there are better / less arcane tools, but I think powerful and fast mechanisms for managing context are key and for me, that's really just powerful text editing features.
A truly useful AI assistant has context on my last 100,000 emails - and also recalls the details of each individual one perfectly, without confusion or hallucination.
Obviously I’m setting a high bar here; I guess what I’m saying is “yes, and”
Throw in all context --> ask it what is important for problem XYZ --> curate what it tells you, and feed that to another model to actually solve XYZ
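A minimal sketch of that two-pass flow with the Anthropic Python SDK (the model alias, file name, and "problem XYZ" placeholder are mine, not from the comment):

    import anthropic

    client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

    def ask(prompt: str, model: str = "claude-3-7-sonnet-latest") -> str:
        """One model call; returns the text of the first content block."""
        msg = client.messages.create(model=model, max_tokens=2048,
                                     messages=[{"role": "user", "content": prompt}])
        return msg.content[0].text

    raw_context = open("everything.txt").read()   # "throw in all context"
    relevant = ask(f"{raw_context}\n\nWhat here actually matters for problem XYZ? List it.")
    # ...curate `relevant` by hand here...
    answer = ask(f"Context:\n{relevant}\n\nNow solve problem XYZ.")
    print(answer)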
In contrast to what this integration is pushing, for LLM usage in production products where high accuracy is a requirement (99%+), you have to give the model a very limited tool set to get any degree of success.
I’m a bit skeptical that it’s gonna work out of the box because of the number of custom fields that seem to be involved in making successful API requests in our case.
But I would welcome not having to solve this problem. Jira’s interface is among the worst of all the ticket-tracking applications I have encountered.
But I have found that an LLM conversation, paired with enough context about what is involved in successful POSTs against the API, allows me to create, update, and relate issues via curl.
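For reference, the same kind of call the commenter drives via curl, sketched here in Python against the Jira Cloud v3 create-issue endpoint (the domain, credentials, project key, and any required custom fields are placeholders that vary by instance):

    import requests

    JIRA = "https://your-domain.atlassian.net"          # placeholder site
    auth = ("you@example.com", "API_TOKEN")              # basic auth with an API token

    payload = {
        "fields": {
            "project": {"key": "PROJ"},
            "summary": "Issue created via the REST API",
            "issuetype": {"name": "Task"},
            # instance-specific custom fields usually look like "customfield_12345": ...
        }
    }
    resp = requests.post(f"{JIRA}/rest/api/3/issue", json=payload, auth=auth)
    resp.raise_for_status()
    print(resp.json()["key"])   # e.g. "PROJ-123"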
It’s begging for a chat based LLM solution like this. I’d just prefer the underlying model not be locked to a vendor.
Atlassian should be solving this for its customers.
That or they're pulling an OpenAI and launching a feature that isn't actually fully live.
> in beta on the Max, Team, and Enterprise plans, and will soon be available on Pro
LLMs were always a fun novelty for me until OpenAI Deep Research, which started to actually come up with useful results on more complex programming questions (where I needed to write all the code by hand but had to pull together lots of different libraries and APIs), but it was limited to 10/month on the cheaper plan. Then Google Deep Research upgraded to 2.5 Pro, with paid usage limits of 20/day, which allowed me to just throw everything at it, to the point where I'm still working through reports that are a week or more old. Oh, and it searched up to 400 sources at a time, significantly more than OpenAI, which made it quite useful in historical research like identifying first-edition copies of books.
Now Claude is releasing the same research feature with integrations (excited to check out the Cloudflare MCP auth solution and hoping Val.town gets something similar), and a run time of up to 45 minutes. The pace of change was overwhelming half a year ago, now it's just getting ridiculous.
However, unfortunately, I cannot shower much praise on Claude 3.7. And if you (or anyone) asks why - 3.7 seems much better than 3.5, surely? - then I’m moderately sure that you use Claude much more for coding than for any kind of conversation. In my opinion, even 3.5 Haiku (which is available for free during high loads) is better than 3.7 Sonnet.
Here’s a simple test. Try asking 3.7 to intuitively explain anything technical - say, mass dominated vs spring dominated oscillations. I’m a mechanical engineer who studied this stuff and I could not understand 3.7’s analogies.
I understand that coders are the largest single group of Claude’s users, but Claude went from being my most used app to being used only after both chatgpt and Gemini, something that I absolutely regret.
I haven't been able to use Claude research yet (it's not rolled out to the Pro tier) but o1 -> o3 deep research was a massive jump IMHO. It still isn't perfect but o1 would often give me trash results but o3 deep research actually starts to be useful.
3.5->3.7 (even with extended thinking) felt like a nothingburger.
Like, ask it a simple question and it comes up with a full repo, complete with a README and a Makefile, when all you wanted to know was how efficient a particular algorithm would be in the included code.
Can't wait until they add research to the Pro plan because, you know, I have questions...
If you pay for a subscription then they don’t have an incentive to use more tokens for the same answer.
It’s definitely because feedback from people has “taught” it that more boilerplate is better. It’s the same reason ChatGPT is annoyingly complimentary.
I prefer Gemini 2.5 pro for all code now
That's not really true, since your prompts are also getting better. Better input leads to better output remains true, even with LLMs (when you see it as a tool).
However, in this specific example, I don't remember if it was ChatGPT or Gemini or 3.5 Haiku, but the other(s) explained it well enough. I think I re-asked 3.5 Haiku at a later point in time, and to my complete non-surprise, it gave an answer that was quite decent.
1 - For example, the field of DIY audio - which was, funnily enough, the source of my question. I'm no speaker designer, but combining creativity with engineering basics/rules of thumb seems to be something LLMs struggle with terribly. Ask them to design a speaker and they come up with the most vanilla, tired, textbook design - despite several existing market products that are already much further ahead/more innovative.
I'm confident that if you asked an LLM an identical question for which there is more discourse - eg make an interesting/innovative phone - you'd get relatively much better results.
If it was actually good they would've named it 4.0; the fact that they went from 3.5 to 3.7 (weird jump) speaks volumes imo.
Deep Research hasn't really been that good for me. Maybe I'm just using it wrong?
Example: I want the precipitation in mm and monthly high and low temperature in C for the top 250 most populous cities in North America.
To me, this prompt seems like a pretty anodyne and obvious task for Deep Research. It's long, tedious, but mostly coming from well structured data sources (wikipedia) across two languages at most.
But when I put this in to any of the various models, I mostly get back ways to go and find that data myself. Like, I know how to look at Wikipedia, it's that I don't want to comb through 250 pages manually or try to write a script to handle all the HTML boxes. I want the LLM/model to do this days long tedious task for me.
My perspective on this is that if Deep Research can't do something, you should do it yourself and put the results on the internet. It'll help other humans and AIs trying to do the same task.
The project requires the full list of every known city in the Western Hemisphere and also Japan, Korea, and Taiwan. But that dataset is just maddeningly large, if it is possible at all. Like, I expect it to take me years, as I have to do a lot of translations. So, I figured that I'd be nice and just ask the various models for the top 250.
There's a lot more data that we're trying to get too, and I'm hoping that I can get approval to post it, as it's a work thing.
How do you validate its results in that scenario? Just take its word for it?
I'd say that what you're saying is 'synthesis'. The 'Intro/Discussion' sections of a journal article.
For me, 'research' means the work of going through and getting all the data in the first place. Like, going out and collecting dino bones in the hot sun, measuring all the soil samples, etc. - that is research. For me, asking these models to go collate some webpages - I mean, you spend the first weeks of a summer undergrad's time on this kind of thing to get them used to the file systems and spruce up their organization skills, see where they are at. Writing the paper up, that's part of research, sure, but not the hard part that really matters.
Maybe in a year, they’ll hit the graduate level. But we’re not near PhD level yet
I use it a lot when documentation is vague or outdated. When Gemini/o3 can't figure something out after 2 tries. When I am working with a service/API/framework/whatever that I am very unfamiliar with and I don't even know what to Google search.
I recently asked Chrome to show me how to apply the Knuth-Bendix completion procedure to propositional logic, and I had already formed my own thoughts about how to proceed (I'm building a rewrite system that does automated reasoning).
The response convinced me that I'm not a total idiot.
I'm not an academic and I'm often wrong about theory so the validation is really useful to me.
It has literally stagnated for a year now.
All that's changed is they connect more APIs.
And added a thinking loop with the same model powering it.
This is the reason it seems fast - nothing really happens except easy things.
That being said, isn’t it strange how the community has polar opposite views about this? Did anything like this ever happen before?
Like, I wanted to scope how to build a homemade TrueNAS Scale unit: it helped me avoid pitfalls, like knowing that I needed two GPUs minimum to run the OS and local LLMs, and it sped up configuring a CLI backup of my Dropbox locally (it told me to use the right filesystem format over ZFS to make the Dropbox client work).
It has done everything from researching how I can structure my web app for building a payment system on the web (something I knew nothing about) to writing small tools to talk to my document collection and index it into collections in Anki, all in one day.
All that talk about AI replacing people seemed a little far-fetched in 2024. But in 2025, I really think the models are getting good enough.
If there was truly any innovation still happening in OpenAI, Anthropic, etc., they would be working on models only, not on side features that someone could already develop over a weekend.
Now I'm in a new team where 99% of our oncall tickets come from automated alarms and 80% of them are a subset of a few issues where the root-cause isn't easy to address but there is either nothing to actually do once investigated, or the fix is a one time process that is annoying to run, so the username isn't accurate anymore :)
I still like the change of pace though, 0 worries about sprint tasks or anything else for a week every few months.
Hope one day it will be practical to do nightly finetunes of a model per company with all core corporate data stores.
This could create a seamless native model experience that knows about (almost) everything you’re doing.
I'll leave it to you to guess which one is harder to do.
There are now some lighter versions of fine-tuning that don’t update all the model weights but instead train small adapter layers, called LoRA, which is way more viable commercially at the moment in my opinion.
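For the curious, wiring up such an adapter with Hugging Face's PEFT library looks roughly like this (the base model and hyperparameters are illustrative, not a recommendation):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative base model; any locally loadable causal LM works the same way.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    lora = LoraConfig(
        r=16,                                  # rank of the low-rank update matrices
        lora_alpha=32,                         # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # which layers get adapters
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()         # typically well under 1% of the full model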
Some of the issues still exist, of course:
* Finetuning takes time and compute; for one-off queries using in-context learning is vastly more efficient (i.e., look it up with RAG).
* Early results with finetuning had trouble reliably memorizing information. We've got a much better idea of how to add information to a model now, though it takes more training data.
* Full finetuning is very VRAM intensive; optimizations like LoRA were initially good at transferring style and not content. Today, LoRA content training is viable but requires training code that supports it [1].
* If you need a very specific memorized result and it's costly to get it wrong, good RAG is pretty much always going to be more efficient, since it injects the exact text in context. (Bad RAG makes the problem worse, of course).
* Finetuning requires more technical knowledge: you've got to understand the hyperparameters, avoid underfitting and overfitting, evaluate the results, etc.
* Finetuning requires more data. RAG works with a handful of datapoints; finetuning requires at least three orders of magnitude more data.
* Finetuning requires extra effort to avoid forgetting what the model already knows.
* RAG works pretty well when the task that you are trying to perform is well-represented in the training data.
* RAG works when you don't have direct control over the model (i.e., API use).
* You can't finetune most of the closed models.
* Big, general models have outperformed specialized models over the past couple of years; if it doesn't work now, just wait for OpenAI to make their next model better on your particular task.
On the other hand:
* Finetuning generalizes better.
* Finetuning has more influence on token distribution.
* Finetuning is better at learning new tasks that aren't as present in the pretraining data.
* Finetuning can change the style of output (e.g., instruction training).
* When finetuning pays off, it gives you a bigger moat (no one else has that particular model).
* You control which tasks you are optimizing for, without having to wait for other companies to maybe fix your problems for you.
* You can run a much smaller, faster specialized model because it's been optimized for your tasks.
* Finetuning + RAG outperforms just RAG. Not by a lot, admittedly, but there are some advantages.
Plus the RL Training for reasoning has been demonstrating unexpectedly effective improvements on relatively small amounts of data & compute.
So there's reasons to do both, but the larger investment that finetuning requires means that RAG has generally been more popular. In general, the past couple of years have been won by the bigger models scaling fast, but with finetuning difficulty dropping there is a bit more reason to do your own finetuning.
That said, for the moment the expertise + expense + time of finetuning makes it a tough business proposition if you don't have a very well-defined task to perform, a large dataset to leverage, or other way to get an advantage over the multi-billion dollar investment in the big models.
1. If you have a large corpus of valuable data not available to the corporations, you can benefit from fine tuning using this data.
2. Otherwise just use RAG.
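To make option 2 concrete, a toy version of the retrieval step (the embedding model, chunks, and query are placeholders; real setups add chunking, a vector store, and reranking):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]   # your corpus, pre-chunked
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        """Return the k chunks most similar to the query (cosine similarity)."""
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q
        return [docs[i] for i in np.argsort(-scores)[:k]]

    context = "\n\n".join(retrieve("What does the Q3 report say about churn?"))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    # `prompt` then goes to whatever model you're calling.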
I had no idea that fine tuning for adding information is viable now. Last I checked (year+ back) it seemed to not work well.
It depends on your data access pattern. If some text goes through the LLM input many times, it is more efficient for the LLM to be finetuned on it once.
The budget question comes into play as well. Even if text is repeatedly fed to the LLM, that cost might be spread over a long enough time that, compared to finetuning (which is a sort of capex), it is financially more accessible.
Now bear in mind, I'm a big proponent of finetuning where applicable and I try to raise awareness of the possibilities it opens. But one cannot deny RAG is a lot more accessible to teams that are more likely developers / AI engineers than ML engineers/researchers.
It looks like major vendors provide simple API for fine-tuning, so you don't need ML engineers/researchers: https://platform.openai.com/docs/guides/fine-tuning
Setting up RAG infra is likely more complicated than that.
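The vendor-hosted flow really is short. With the OpenAI Python client it's roughly this, assuming you've prepared a JSONL file of chat-formatted examples (file name and model are placeholders):

    from openai import OpenAI

    client = OpenAI()

    # Training data: one JSON object per line, each with a "messages" list.
    f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

    job = client.fine_tuning.jobs.create(
        training_file=f.id,
        model="gpt-4o-mini-2024-07-18",   # a base model that supports fine-tuning
    )
    print(job.id, job.status)   # poll until done, then call job.fine_tuned_model like any other model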
Results with this method are significantly more limited compared to all the power open-weight finetuning gives you (and the skillset needed in return).
And in either case don’t forget alignment and evals.
How many epochs do you run?
In case the above link doesn't work later on, the page for this demo day is here: https://demo-day.mcp.cloudflare.com/
Because MCP isn’t an API; it’s the protocol that defines how the LLM even calls the API in the first place. Without it, all you've got is a chat interface.
A lot of people misunderstand what the role of MCP is. It’s the signaling the LLM uses to reach out of its context window and do things.
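To make that concrete: an MCP server mostly just declares tools the model can discover and call over the protocol. A minimal sketch with the official Python SDK (the tool itself is a made-up example):

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("demo")

    @mcp.tool()
    def lookup_order(order_id: str) -> str:
        """Made-up example tool: return the status of an order."""
        return f"Order {order_id}: shipped"

    if __name__ == "__main__":
        mcp.run()   # serves the MCP protocol over stdio; a client like Claude Desktop discovers the tool

The LLM never sees this code; it only sees the tool's name, description, and schema, and emits calls against them.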
Truly, OSS should be more interesting in the next decade for this alone.
I’d feel a lot better if we had something resembling a comprehensive data privacy law in the United States because I don’t want it to basically be the Wild West for anyone handling whatever personal info doesn’t get covered under HIPAA.
I've always worked under the assumption the best employees make themselves replaceable via well defined processes and high quality documentation. I have such a hard time understanding why there's so much willingness to integrate irreplaceable SaaS solutions into business processes.
I haven't used AI a ton, but everything I've done has focused on owning my own context, config, etc.. How much are people going to be willing to pay if someone else owns 10+ years of their AI context?
Am I crazy or is owning the context massively valuable?
This does not sound like it would be learning general information helpful across an industry, but specific, actionable information.
If not available now, is that something that AI vendors are working toward? If so, what is to keep them from using that knowledge to benefit themselves or others of their choosing, rather than the people they are learning from?
While people understand ethics, morals and legality (and ignore them), that does not seem like something that an AI understands in a way that might give them pause before doing an action.
Perhaps I am just frivolous with my own time, but I tend to use LLMs in a more iterative way for research. I get partial answers, probe for more information, direct the attention of the LLM away from areas I am familiar and towards areas I am less familiar. I feel if I just let it loose for 45 minutes it would spend too much time on areas I do not find valuable.
This seems more like a play for "replacement" rather than "augmentation". Although, I suppose if I had infinite wealth, I could kick of 10+ research agents each taking 45 minutes and then review their output as it became available, then kick off round 2, etc. That is, I could do my process but instead of interactively I could do it asynchronously.
As for long research times, one thing I’ve been using it for is historical research on old books. Gemini DeepResearch was the first one able to properly explain the nuances of identifying a chimeral first edition Origin of Species after taking half an hour and reading 400 sources. It went into all the important details like spelling errors and the properties of chimeral FY2** copies found in various libraries around the world.
Give us an LLM with better reasoning capabilities, please! All this other stuff just feels like a distraction.
I've been using the Atlassian MCP for nearly a month now, and it's completely changed (and eliminated) the feeling of having an overwhelming backlog.
I can have it do things like "find all the tickets related to profile editing and combine them into one epic" where it works perfectly. Or "help me prioritize the 15 tickets assigned to me this sprint" and it'll actually go through and suggest "maybe you can do these two tickets first since they seem smaller, then do this big one" – I haven't hooked it up to my calendar yet.
But I'd love for it to suggest things like "do this one ticket that requires a lot of heads down time on wednesday since you don't have any meetings. I can create a block on your calendar so that nobody will schedule a meeting then"
Those are all superhuman things that can be done with MCP and a smart model.
I've defined rules in Cursor that say "when I ask you to mark something ready for test, change the status and assign it to <x person>, and leave a comment summarizing the changes"
If you look at my JIRA comments now, you'd wonder how I had so much time to write such thorough comments. I don't, Cursor and whatever model is doing it for me.
It's been an absolute game changer. MCP is going to be what the App Store was to mobile. Yes, you can get by without it, but actually hooking into all your daily tools is when this stuff gets insanely valuable in a practical sense.
How do your colleagues feel about it?
I also don’t want to read too many unnecessary words.
Can’t we point an LLM to a SQLite db, tell it to treat it as an issue-tracking db, and have everyone do the same?
The service (jira) would materialize inside the LLMs then.
Why even use abstractions like tickets etc. Ask LLM what to do.
Unless you can provide the same visibility, long-term planning features, and compliance aspects of JIRA on top of your SQLite db, you won't compete with JIRA. But if you do add those things on top of SQLite and LLMs, you probably have a solid business idea. But you'd first need to understand JIRA well enough to know why those features are there in the first place.
[0] https://en.wikipedia.org/w/index.php?title=Wikipedia:FENCE
One of them said “yeah I was wondering cuz you never write that much” - as a leader, I actually don’t set a good example of how to leave quality JIRA comments. And my view with all these things is that I have to lead by example, not by orders.
With the help of these kinds of tools, we can improve the quality of these comments. And I wouldn’t expect others to write them manually, more that I wanted to show that everyone’s use of JIRA on the team can improve.
I don't think it's good leadership to unleash drivel on an organisation, have people waste time reading and perhaps replying to it, thinking it's something important and thoughtful coming from atonse.
Good thing you told them though, now they can ignore it.
There's nothing I hate more than people sending me their AI messages, be it in a ticket or a PR or even on Slack. I'm forced to engage and spend effort on something it took them all of 3 seconds to generate, without even proofreading what they're sending me. The number of times I've had to ask 11 clarifying questions because their message has 11 contradictions within itself is maddening to the highest degree.
The worst is when I call out one of these numerous contradictions, and the reply is "oh haha, stupid Claude :)", makes my blood boil and at the same time amazes me that someone has so little pride and respect for their fellow humans to do crap like that.
I'm not in that world at the moment, but I've been the lead on several projects where the backlog has become a dumping ground of years of neglect. You end up with this tiered backlog thing where one level of backlog gets too big, so you create a second tier of backlog for the stuff you are actually going to work on. Pretty soon you end up with duplicates in the second-tier backlog for items already in the base-level backlog, since no one even looks at that old backlog anymore.
I've done a lot of tidy up myself when I inherit this kind of mess, just closing tickets we definitely will never get to, de-duping, adding context when available, grouping into epics, tagging with relevant "tech-debt", "security", "bug", "automation", etc. But when there are 100s of tickets it is a slog. Having an LLM do this makes so much sense.
What it _doesn't_ seem to yet mitigate is prompt injection attacks, where a tool call description of one tool convinces the model to do something it shouldn't (like send sensitive data to a server owned by the attacker.) I think these concerns are a little bit overblown though; things like pypi and the Chrome Extension store scare me more and it doesn't stop them from mostly working.
I love MCP (it’s way better than plain Claude) but even that runs into context walls.
> a new way to connect your apps and tools to Claude. We're also expanding... with an advanced mode that searches the web.
The notion of software eating the world, and AI accelerating that trend, always seems to forget that The World is a vast thing, a physical thing, a thing that by its very nature can never be fully consumed by the relentless expansion of our digital experiences. Your worldview /= the world.
The cynic would suggest that the teams that build these tools should go touch grass, but I think that misses the mark. The real indictment is of the sort of thinking that improvements to digital tools [intelligences?] in and of themselves can constitute truly substantial and far reaching changes.
The reach of any digital substrate is inherently limited, and this post unintentionally lays that bare. And while I hear accelerationists invoking "robots" as the means for digital agents to expand their potent impact deeper into the real world, I suggest this is the retort of those who spend all day in apps, tools, and the web. The impacts and potential of AI are indeed enormous, but some perspective remains warranted, and occasional injections of humility and context would probably do these teams some good.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...)
Being Apple, they would have to come up with something novel like they did with push (where you have _one_ OS process running that delegates to apps rather than every app trying to handle push themselves) rather than having 20 MCP servers running. But I think if they did this properly, it would be so amazing.
I hope Apple is really re-thinking their absolutely comical start with AI. I hope they regroup and hit it out of the park (like how Google initially stumbled with Bard, but are now hitting it out of the park with Gemini)
They bet big and got distracted on VR. It was obviously the wrong choice at the time, and even more so now. They're going to have to abandon all that VR crap and pivot hard to AI to try and catch up. I think the more likely case is they can't catch up now and will just have to end up licensing Gemini from Google/Google paying them to use Gemini as the default AI.
People will say 'aaah ad company' (me too sometimes) but I'd honestly trust a Google AI tool with this way more. Not just because it already has access to my Google Workspace obviously, but just because it's a huge established tech firm with decades of experience in trying not to lose (or have taken) user data.
Even if they get the permissions right and it can only read my stuff if I'm just asking it to 'research', now Anthropic has all that and a target on their backs. And I don't even know what 'all that' is, whatever it explored deeming it maybe useful.
Maybe I'm just transitioning into old guy not savvy with latest tech, but I just can't trust any of this 'go off and do whatever seems correct or helpful with access to my filesystem/Google account/codebase/terminal' stuff.
I like chat-only (well, +web) interactions where I control the input and take the output, but even that is not an experience that gives me any confidence in granting uncontrolled access to stuff and trusting it to always do something correct and reasonable. It's often confidently incorrect too! I wouldn't give an intern free rein in my shell either!
Sometimes I want a pure model answer and I used to use Claude for that. For research tasks I preferred ChatGPT, but I found that you cannot reliably deny it web access. If you are asking it a research question, I am pretty sure it uses web search, even when "Search" and "Deep Research" are off.
MCP is a flawed spec and quite frankly a scam.
Increasing the number of "connections" to the LLM increases the risk of a leak, and it gives you more rope to hang yourself with when at least one connection becomes problematic.
Now is a great time to be a LLM security consultant.
Even my wife, who normally used Claude to create interesting recipes to bake cookies, has noticed a huge downgrade in 3.7.
LLMs wrapping the services makes more sense, as the data stored in those services adds a lot of value to off the shelf LLMs.
When I hooked up our remote MCP server, Claude sent a GET request to the endpoint. According to the spec, clients that want to support both transports should first attempt to POST an InitializeRequest to the server URL. If that returns a 4xx, they should then assume the SSE integration.
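A sketch of that fallback as I read the spec (the URL and client info are placeholders):

    import requests

    URL = "https://example.com/mcp"   # placeholder server endpoint

    init = {
        "jsonrpc": "2.0", "id": 1, "method": "initialize",
        "params": {"protocolVersion": "2025-03-26", "capabilities": {},
                   "clientInfo": {"name": "demo-client", "version": "0.1"}},
    }

    # Try the newer Streamable HTTP transport first: POST the InitializeRequest.
    r = requests.post(URL, json=init,
                      headers={"Accept": "application/json, text/event-stream"})
    if r.status_code < 400:
        transport = "streamable-http"
    else:
        # A 4xx suggests an older server: fall back to the HTTP+SSE transport,
        # where a GET on the endpoint opens the event stream.
        transport = "http+sse"
    print(transport)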
I might not dare to add an integration if it can potentially add a bunch of stuff to the backing systems without my approval. Confirmations and review should be part of the protocol.
1. Can be rolled back/undone
2. Clearly states exactly what it's going to do in a reviewable way
If those aren't fulfilled you're going to end up with users that are afraid of using your app.
That + OpenAI's agent SDK makes creating agentic flows so easy.
On the other hand you're kinda forced to run these tools / MCP servers in their own process which makes no sense to me.
but we're taking it a step (or two) further, enabling you to dynamically build up an MCP server from other servers managed in your account with us.
try it out, or let me get you a demo! this goes for any casual comment readers too ;)
you bundle MCP servers into a profile, which acts as a single virtual MCP server and can be dynamically updated without re-configuring your MCP client (e.g. Claude)
Both OpenAI and Google continue to push the frontier on reasoning, multimodality, and efficiency whereas Claude's recent releases have felt more iterative. I'd love to see Anthropic push into model research again.
Keyword search is such a naive approach to information discovery and information sharing - and renders confluence in big orgs useless. Being able to discuss and ask questions is a more natural way of unpacking problems.
So I tested a basic prompt:
1. go to : SOME URL
2. copy all the content found VERBATIM, and show me all that content as markdown here.
Result: it FAILED miserably with a few basic HTML pages - it simply is not loading all the page content in its internal browser.
What worked well:
- Gemini 2.5 Pro (Experimental)
- GPT 4o-mini
- Gemini 2.0 Flash (not verbatim, but summarized)
However, there's a major concern that server hosters are on the hook to implement authorization. Ongoing discussion here [1].
[0] https://modelcontextprotocol.io/specification/2025-03-26
[1] https://github.com/modelcontextprotocol/modelcontextprotocol...
Source: https://github.com/modelcontextprotocol/modelcontextprotocol...
> major concern that server hosters are on the hook to implement authorization
Doesn't it make perfect sense for server hosters to implement that? If Claude wants access to my Jira instance on my behalf, and Jira hosts a remote MCP server that aids in exposing the resources I own, isn't it obvious Jira should be responsible for authorization?
How else would they do it?
I guess maybe you are saying the onus is NOT on the MCP server but on the authorization server.
Anyway while technically true this is mostly just distracting because:
1. in my experience the resource server and the authorization server are almost always maintained by the same company -- Jira/Atlassian being an example
2. the resource server still minimally has the responsibility of identifying and integrating with some authorization server, and *someone* has to be the authorization server, so I'm not sure deferring the responsibility to that unidentified party is a strong defense against the critique anyway. The strong defense is: of course the MCP server should have these responsibilities.
For example, say you have a JIRA self hosted instance with SSO to entra id. You can't just install an MCP server off the shelf because authZ and resources are tightly coupled and implementation specific. It would be much easier if the server only handled providing resources, and authZ was offloaded to a provider of your choosing.
But the thread's security concerns—permissions, data protection, trust—are dead on. There is also a major authN/Z gap, especially for orgs that want MCP to access internal tools, not just curated SaaS.
Pushing complex auth logic (OAuth scopes, policy rules) into every MCP tool feels backwards.
* Access-control sprawl. Each tool reinvents security. Audits get messy fast.
* Static scopes vs. agent drift. Agents chain calls in ways no upfront scope list can predict. We need per-call, context checks.
* Zero-Trust principles mismatch. Central policy enforcement is the point. Fragmenting it kills visibility and consistency.
We already see the cost of fragmented auth: supply-chain hits and credential reuse blowing up multiple tenants. Agents only raise the stakes.
I think a better path (and one that, in full disclosure, we're actively working on at Pomerium) is to have:
* One single access point in front of all MCP resources.
* Single sign-on once, then short-lived signed claims flow downstream.
* AuthN separated from AuthZ with a centralized policy engine that evaluates every request, deny-by-default. Evaluation in both directions with hooks for DLP.
* Unified management, telemetry, audit log and policy surface.
I’m really excited about what MCP is putting us in the direction of being able to do with agents.
But without a higher level way to secure and manage the access, I’m afraid we’ll spend years patching holes tool by tool.
I ran two of the same prompts just now through Anthropic’s new Advanced Research. The results for it and for ChatGPT and Gemini appear below. Opinions might vary, but for my purposes Gemini is still the best. Claude’s responses were too short and simple and they didn’t follow the prompt as closely as I would have liked.
Writing conventions in Japanese and English
https://claude.ai/public/artifacts/c883a9a5-7069-419b-808d-0...
https://docs.google.com/document/d/1V8Ae7xCkPNykhbfZuJnPtCMH...
https://chatgpt.com/share/680da37d-17e4-8011-b331-6d4f3f5ca7...
Overview of an industry in Japan
https://claude.ai/public/artifacts/ba88d1cb-57a0-4444-8668-e...
https://docs.google.com/document/d/1j1O-8bFP_M-vqJpCzDeBLJa3...
https://chatgpt.com/share/680da9b4-8b38-8011-8fb4-3d0a4ddcf7...
The second task, by the way, is just a hypothetical case. Though I have worked as a translator in Japan for many years, I am not the person described in the prompt.
which is to say: I’m not sure it actually wins, technically, over the OpenAI/OpenAPI idea from last year, which was at least easy to understand
Edit: Actually right in the tickets themselves would probably be better and not require MCP... but still
So if you ask it “who is in charge of marketing” it will read it off SharePoint instead of answering generically
I think we are coming to a new automated technology ecosystem where LLMs will orchestrate many different parts of software with each other, speeding up the launch, evolution and monitoring of products.