It's interesting that the highest level of reasoning that GPT-5 in Xcode supports is actually the "low" reasoning level. Wonder why.
This is Claude sign-in using your own account. If you’ve signed up for Claude Pro or Max, then you can use it directly. But they should give access to Opus as well.
One caveman way:
1. Start your project using Xcode, use it to commit to GitHub, GitLab, wherever. In the terminal, change into the dir that has the .git in it and launch claude.
2. Teach Claude Code your own system's path and preferred simulator for build testing. From then on it will build-test every change, so teach it to only commit after the build passes. (By teach, I mean: just tell it, then from time to time tell it to propose updates to claude.md in your project; a rough sketch of such a file is at the end of this comment.)
3. Make sure before a PR or push that the project still builds in Xcode; if it doesn't, you can eyeball the changes in Xcode's staged-changes viewer and undo them. If you change files via the IDE, when you're back in Claude just say: I changed [specific/file, or "a lot of files"].
No xproj or symlinks get harmed in the making of your .swifts.
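The claude.md from step 2 might end up looking roughly like this (a sketch; the scheme name and simulator are placeholders for whatever your project actually uses):

```markdown
# CLAUDE.md (sketch)

## Build & test
- Build: xcodebuild -scheme MyApp -destination 'platform=iOS Simulator,name=iPhone 16' build
- Test:  xcodebuild -scheme MyApp -destination 'platform=iOS Simulator,name=iPhone 16' test

## Rules
- Build-test every change; only commit after the build passes.
- If I say I changed files in Xcode, re-read those files before editing anything.
```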
For whatever reason, it ignores the directive in my CLAUDE file that it can do that at least half the time. One time it even decided it needed to generate a fancy Python script to do it. Bizarre.
They certainly do, and I can't really follow the analogy you are building.
> We're at a higher level of abstraction now.
To me, an abstraction higher than a programming language would be natural language or some DSL that approximates it.
At the moment, I don't think most people using LLMs are reading paragraphs to maintain code. And LLMs aren't producing code in natural language.
That isn't an abstraction over the language; it is an abstraction over your use of the computer to produce code in the language. If anything, you are abstracting yourself away.
Furthermore, if I am following you, you are basically saying, you have to make a call to a (free or paid) model to explain your code every time you want to alter it.
I don't know how insane that sounds to most people, but to me, it sounds bat-shit.
Editors are incredibly complex and require domain knowledge to guide agents toward the correct architecture and implementation (and away from the usual naive pitfalls), but in my experience the latest models reason about and implement features/changes just fine.
Why wouldn't it?
I have used agentic coding tools to solve problems that have literally never been solved before, and it was the AI, not me, that came up with the answer.
If you look under the hood, the multi-layer perceptrons in the attention heads of the LLM are able to encode quite complex world models, derived from compressing its training set in a way which is formally as powerful as reasoning. These compressed model representations are accessible when prompted correctly, and can express as genuinely new and innovative thoughts NOT in the training set.
Would you show us? Genuinely asking
It has happened a couple of times now that it pops out novel results. In computational chemistry, machine-learned potentials trained with transformer models have already resulted in publishable new chemistry. Those papers aren't out yet, but expect them within a year.
Ask the best available models -- emphasis on models -- for help designing the text editor at a structural rather than functional level first, being specific about what you want and emphasizing component-level tests whenever possible, and only then follow up with actual code generation, and you'll get much better results.
Obviously no model is going to one-shot something like a full text editor, but there's an ocean of difference between defining vibe coding as prompting "Make me a text editor" versus spending days/weeks going back and forth on architecture and implementation with a model while it's implementing things bottom-up.
Both seem like common definitions of the term, but only one of them will _actually_ work here.
In Neovim the choice of language server and the choice of LLM is up to the user (possibly even the choice of this API, I believe, having only skimmed the PR), while both of those choices are baked into Xcode, so they're not the same thing.
Since the landscape of potentially malicious inputs in plain English is practically infinite, without any particular enforced structure for the queries you make of it, those "guardrails" are, in effect, an expert system. An ever-growing pile of if-then statements. Didn't work then, won't work now.
Your link: "Grade school math problems from a hardcoded dataset with hardcoded answers" [1]
It really is the same thing.
[1] https://openai.com/index/solving-math-word-problems/
--- start quote ---
GSM8K consists of 8.5K high quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer.
--- end quote ---
1. OpenAI has been doing verifier-guided training since last year.
2. No SOTA model was trained without verified reward training for math and programming.
I supported the first claim with a document describing what OpenAI was doing last year; the extrapolation should have been straightforward, but it's easy for people who aren't tracking AI progress to underestimate the rate at which it occurs. So, here's some support for my second claim:
https://arxiv.org/abs/2507.06920 https://arxiv.org/abs/2506.11425 https://arxiv.org/abs/2502.06807
Indeed."By late next month you'll have over four dozen husbands" https://xkcd.com/605/
> So, here's some support for my second claim:
I don't think any of these links support the claim that "No SOTA model was trained without verified reward training for math and programming"
https://arxiv.org/abs/2507.06920: "We hope this work contributes to building a scalable foundation for reliable LLM code evaluation"
https://arxiv.org/abs/2506.11425: A custom agent with a custom environment and a custom training dataset on ~800 predetermined problems. Also "Our work is limited to Python"
https://arxiv.org/abs/2502.06807: The only one that somewhat obliquely refers to your claim
You’re just angry and adding no value to this conversation because of it
Otherwise there's VSCodium which is what I'm using until I can make the jump to Code Edit.
If you don’t want to use LLM coding assistants – or if you can’t, or it’s not a technology suitable for your work – nobody cares. It’s totally fine. You don’t need to get performatively enraged about it.
> Built for Apple Intelligence.
> 16-core Neural Engine
These Xcode release notes:
> Claude in Xcode is now available in the Intelligence settings panel, allowing users to seamlessly add their existing paid Claude account to Xcode and start using Claude Sonnet 4
All that dedicated silicon taking up space on their SoC and yet you still have to input your credit card in order to use their IDE. Come on...
They would also need to shrink them way down to even fit. And even then, generating tokens on an apple neural chip would be waaaaaay slower than an HTTP request to a monster GPU in the sky. Local llms in my experience are either painfully dumb or painfully slow.
It’s the Apple way to screw the 3rd party and replace with their own thing once the ROI is proven (not a criticism, this is a good approach for any business where the capex is large…)
[0] https://github.com/fguzman82/apple-foundation-model-analysis
When macOS 26 is officially announced on September 9, I expect Apple to announce support for Anthropic and Google models.
Won't work by default if I'm reading this correctly
There's simply no way to properly secure network connected developer systems.
Autocomplete is also automatically triggered when you place your cursor inside the code.
Don't be naive.
None of these companies are isolated from the internet.
Also, there are plenty of editors and IDEs that don’t.
Let’s stop pretending like you’re being forced into this. You aren’t.
I spent the last 6 months trying to convince them not to block all outbound traffic by default.
For most corporate code (that is highly confidential) you still have proper internet access, but you sure as hell can't just send your code to all AI providers just because you want to, just because it's built into your IDE.
You can use Claude via Bedrock and benefit from AWS trust
Gemini? Google owns your e-mail. Maybe you're even one of those weirdos who doesn't use Google for e-mail - I bet your recipient does.
so... they have your code, your secrets, etc.
I do not think this will be an issue for big companies.
But I guess the user could still get a 3rd party plugin.
Credit card processing is hard... Go price out Stripe + customer service + dealing with chargebacks and tell me if you really want to do processing yourself.
I always find this article something to get back to: https://www.inc.com/magazine/20110301/making-money-small-bus...
Facebook got excoriated for doing that with Onavo but I guess it's Good Actually when it's done in the name of protecting my computer from myself lol
The real news is when Codex CLI / Claude Code get integrated, or Apple introduces a competitor offering to them.
Until then this is a toy and should not be used for any serious work while these far better tools exist.
Compared to stock Claude Code, this version of Claude knows a lot more about SwiftUI and related technologies. The following is output from Claude in Xcode on an empty project. Claude Code gives a generic response when it looked at the same project:
What I Can Help You With
• SwiftUI Development: Layout, state management, animations, etc.
• iOS/macOS App Architecture: MVVM, data flow, navigation
• Apple Frameworks: Core Data, CloudKit, MapKit, etc.
• Testing: Both traditional XCTest and the new Swift Testing framework
• Performance & Best Practices: Swift concurrency, memory management
Example of What We Could Do Right Now
Looking at your current ContentView.swift, I could help you:
• Transform this basic "Hello World" into a recovery tracking interface
• Add navigation, data models, or user interface components
• Implement proper architecture patterns for your Recovery Tracker app
For example: it uses Haiku as a model to run tools and most likely has automatic translations for when the model signals it wants to search or find something -> either use the built-in search or run find/fd/grep/rg
All that _can_ be done by prompting, but - as always with LLMs - prompts are more like suggestions.
You are free to point Claude Code to that folder, or make a slash command that loads their contents. Or, start CC with -p where the prompt is the content of all those files.
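A minimal sketch of the -p variant, assuming the notes live in a docs/ folder (the path is a placeholder):

```sh
claude -p "$(cat docs/*.md)"
```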
Claude Code integration in Xcode would be very cool indeed, but I might still stick with VSCode for pure coding.
I'm sticking with VSCode too, but it's a bit silly to suggest that anyone is using Xcode because it's their preferred IDE. It's just the one that's necessary for any non-trivial Apple platform development.
Adding a code generator isn't a marketing ploy to get people to switch editors, it's just a small concession to the many hapless souls stuck dealing with Apple on the professional side, or masochistically building mac SwiftUI apps just to remind themselves what pain feels like.
SwiftUI previews were... manageable but not great.
[1]: https://www.macrumors.com/2025/05/02/apple-anthropic-ai-codi...
https://appleinsider.com/articles/22/06/06/apple-now-has-ove...
I think that means either:
* they have revenues of $3.4b/year just from the $100 annual fees, or
* some decent percentage of people have signed up for a free developer account and then never done anything with it (like me)
I also wonder if it will have separate rate limits from ChatGPT (app/web) and Codex CLI (which currently has its own rate limits).
The "best" way to get the "latest" details on Apple's APIs is to suffer through mind-numbingly vapid WWDC videos with their reverse uncanny valley presenters (where humans pretend to be robots) and keep your full attention on them to catch a fleeting glimpse of a single method or property that does what you were looking for. Even 1.5x/2x speed is torture. I tried to get AIs to sift through the transcripts of their videos, and may Skynet forgive me for this cruelty.
Then when you go try to use that API, oops it's been changed in the current beta and there's no further documentation on it except auto-generated headers.
They also removed bookmarks from Xcode's built-in documentation browser years ago, and it doesn't retain a memory of previously open tabs, and often seems to be behind the docs on their websites.
I wish they would just provide open-source sample apps of each type (document-based, single-window etc.) for each of their platforms that fully use the latest APIs. At least that would be easier to ask AIs on, since that is what they seem to be going for now anyway.
Apple Developer docs are locked behind JavaScript, making them invisible to most LLMs. If they try to fetch it, all they see is This page requires JavaScript. Please turn on JavaScript in your browser and refresh the page to view its content.
This service translates Apple Developer documentation pages into AI-friendly Markdown.
I bought a Pro subscription, the send button on their dumb chatbot box is disabled for me (on Safari), and I still get "capacity constraints" limits. Filed a chargeback with my bank just because of the audacity of their post-purchase experience. ChatGPT-5 works well enough for coding too.
From my experience with Claude Opus, it seems like it tries to be "too smart" and doesn't keep up with the latest APIs. It suggested some code for an iOS/macOS project that was only valid on tvOS, and other gaffes.
I asked 2 things.
1. Create a boilerplate Zephyr project skeleton, for Pi Pico with st7789 spi display drivers configured. It generated garbage devicetree which didn't even compile. When I pointed it out, it apologized and generated another one that didn't compile. It also configured non-existent drivers, and for some reason it enabled monkey test support (but not test support).
2. I asked it to create 7x10 monochromatic pixelmaps, as C integer arrays, for numeric characters, 0-9. I also gave an example. It generated them, but number eight looked like zero. (There was no cross in either the 0 or the 8, so it wasn't that. Both were just a ring.)
What am I doing wrong? Or is this really the state of the art?
Imagine the CS 100 class where they ask you to make a PB&J. Saying "make it" isn't enough; there are a lot of steps. Determine the steps, implement each step, progress.
I run interviews at my company. We allow/encourage AI.
The number one failure mode is people throwing all of the requirements in upfront. They get one good pass, then fail.
…this reeeeaaaallllyyyy feels like that
Ask Opus or Gemini 2.5 Pro to write a plan. Then ask the other to critique it and fix mistakes. Then ask Sonnet to implement
Trying two things and giving up. It's like opening a REPL for a new language, typing some common commands you're familiar with, getting some syntax errors, then giving up.
You need to learn how to use your tools to get the best out of them!
Start by thinking about what you'd need to tell a new Junior human dev you'd never met before about the task if you could only send a single email to spec it out. There are shortcuts, but that's a good starting place.
In this case, I'd specifically suggest:
1. Write a CLAUDE.md listing the toolchains you want to work with, giving context for your projects, and listing the specific build, test etc. commands you work with on your system (including any helpful scripts/aliases you use); a rough sketch is below. Start simple; you can have claude add to it as you find new things that you need to tell it or that it spends time working out (so that you don't need to do that every time).
2. In your initial command, include a pointer to an example project using similar tech in a directory that claude can read
3. Ask it to come up with a plan and ask for your approval before starting
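To make item 1 concrete, here's a rough sketch of what such a CLAUDE.md might look like for the Zephyr/Pi Pico case above (board name, paths, and commands are my best guesses; adjust to your setup):

```markdown
# CLAUDE.md (sketch)

## Toolchain
- Zephyr via west; target board: rpi_pico
- Build (from the app dir): west build -b rpi_pico
- Clean rebuild: west build -b rpi_pico -p always
- Flash: copy build/zephyr/zephyr.uf2 to the Pico while it's in BOOTSEL mode

## Rules
- Devicetree overlays live in boards/ inside the app; always run a build after editing them.
- Only enable Kconfig options for drivers that actually exist in this Zephyr version.
```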
Providing woefully inadequate descriptions to others (Claude & us) and still expecting useful responses?
Generating a state-of-the-art response to your request involves a back-and-forth with the agent about your requirements, having the agent generate and carry out a deep research plan to collect documentation, then having the agent generate and carry out a development plan.
So while Claude is not the best model in terms of raw IQ, the reason why it's considered the best coding model is because of its ability to execute all these steps in one go which, in aggregate, generates a much better result (and is less likely to lose its mind).
Which one is, and by what metric? I always end up back at Claude after trying other models because it is so much better at real world applications.
Your first prompt is testing Claude as an encyclopedia: has it somehow baked into its model weights the exactly correct skeleton for a "Zephyr project skeleton, for Pi Pico with st7789 spi display drivers configured"?
Frequent LLM users will not be surprised to see it fail that.
The way to solve this particular problem is to make a correct example available to it. Don't expect it to just know extremely specific facts like that - instead, treat it as a tool that can act on facts presented to it.
For your second example: treat interactions with LLMs as an ongoing conversation, don't expect them to give you exactly what you want first time. Here the thing to do next is a follow-up prompt where you say "number eight looked like zero, fix that".
Personally, I treat those sort of mistakes as "misunderstandings" where I wasn't clear enough with my first prompt, so instead of adding another message (and increasing context further, making the responses worse by each message), I rewrite my first one to be clearer about that thing, and regenerate the assistant message.
Basically, if the LLM cannot one-shot it, you weren't clear enough, and if you go beyond a total of two messages, be prepared for the quality of responses to really sink fast. Even by the second assistant message, you can tell it's having a harder time keeping up with everything. Many models brag about their long contexts, but I still find the quality of responses to be a lot worse even once you reach 10% of the "maximum context".
Or it would say that to do X involves very complex math, and that instead you could... (and it proceeds with a stripped-down solution that doesn't meet the goals). So you can tell it to ignore the concerns about complexity and to assume that I understand all of it and it is easy for me. Then it goes on to create a solution that actually has legs. But you need to refine it further.
I wonder if it's because there are maybe millions of MSDN articles, but I don't know if a Java analog to MSDN exists.
My coding ranges from "exotic" to "boiler plate" on any given day.
> Create a boilerplate Zephyr project skeleton, for Pi Pico
Yea... Asking Claude to help you with a low-documentation Buildroot system is going to go about the same way; I know first-hand how this works.
> I asked it to create 7x10 monochromatic pixelmaps
Wrong tool for the job here. I don't think IDEs and pixelmaps have as large of an intersection as you think they do. Claude thinks in tokens, not pixels.
Pick a common language (js, python, rust, golang) pick something easy (web page, command line script, data ingestion) and start there. See what it can do and does well, then start pushing into harder things.
That will get you a much better initial solution. I typically use Sonnet for the sub-agents and Opus for the main agent, but Sonnet all around should be fine too for the most part.
It’ll successfully produce _something_ like that, because there’s millions of examples of those technologies online. If you do anything remotely niche, you need to hold its hand far more.
The more complicated your requirements are, the closer you are to having “spicy autocomplete”. If you’re just making a crud react app, you can talk in high level natural language.
I don't mean to be treading on feet but I'm noticing this more and more in the debates around AI. Imagine if there are developers out there that could have done this task in 30 mins without AI.
The level of performance of AI solutions is heavily related to the experience level of the developer and of the problem space being tackled - as this thread points out.
Unfortunately the marketing around AI ignores this and makes every developer not using AI for coding seem like a dinosaur, even though they might well be faster at solving their particular problems.
AI is moving problem-solving skills from coding to writing the correct prompts and teaching AI to do the right thing - which, again, is subjective, since the "right thing" for one developer isn't the "right thing" for another developer. "Right thing" being the correct solution, the understandable solution, the fastest solution, etc., depending on the needs of the developer using the AI.
Spelling out exactly what you want and checking/fixing what you receive is still faster than typing out the code. Moreover, nobody's job involves nothing but brainiac coding, day after day. You have to clean up and lay foundations, whatever level you are at.
For me, that's too general. Of course, perhaps for this particular, specific problem it might be true. But as this thread points out, anything niche and AI fails to help productively. Of course then comes the marketing: just wait, AI will be able to cover those niche cases also.
> want and checking/fixing what you receive is still faster than typing out the code
Then I do wonder why there are developers at all. After all, that's what AI is so good at - if one believes the marketing - being precise and describing exactly what needs to be done. Surely it must be faster having two AIs talking to each other and hammering out the code.
And even typing is subjective: ten fingers versus two, versus four .. etc. There are developers that can type faster than they can think - in certain cases.
There is also the developer in flow versus the stop-and-go of using AI prompts to get it just right. I dunno; if it comes true, then thankfully there won't be any humans to create bugs in code, but somehow I can't see it happening.
The other is to look at the non-working solution you get, read through it, and think "Oh, I didn't know about that framework/system/product/library, that's neat" and then do some combination of further research and more hand-holding to get to something that does work.
This is useful, more or less, no matter what your level.
It's also good for explaining core industry tooling you've maybe never used before. If you're new to Postgres/NoSQL/AWS/Docker/SwiftUI/whatever it can talk you through it and give you an instant bootcamp with entry-level examples and decent solutions.
And for providing fixes for widely known bugs and issues in products that may not be widely known to you (yet.)
IME ChatGPT5 is pretty solid with most science/tech up to undergrad. It gets hallucinatory past that, and it's still flattering, which is annoying, but you can tell it to cut that out.
Generally you can use it as a dumb offshore developer, or as an infinitely patient private tutor.
That latter option is very useful. The first, not always.
You're not necessarily wrong, but I think it's worth noting that very few developers are only ever coding deep in their one domain that they're good at. There's just too many things to be deeply good at everything. For example, it's common that infra and CI tasks are stuff that most developers haven't learned by heart, because you don't tend to touch them very often.
Claude shines here — I've made a lot more useful GitHub Actions jobs recently, because while I could automate something, if I know I'm going to have to look up API docs (especially multiple APIs I'm not super familiar with) then I tend to figure that the automation will lose out in the trade-off against just doing the task by hand (see https://xkcd.com/1205/). Claude being able to hash those out rapidly, and in a way where it's easily verifiable that it's doing the right thing, has changed that arithmetic for me substantially.
1. Find out how to access metadata about the node running my code (assumption: some kind of an environment variable) [1-10 minutes depending on familiarity with AWS]
2. Google "RDS certificates" and find the bundle URL after skimming the page [1] for important info [1-5 minutes]
3. Write code to download the certificate bundle, fallback being "global-bundle.pem" if step 1 failed for some reason? [5-20 minutes depending on all the bells and whistles you need]
Did I miss anything or completely misunderstand the task?
[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Using...
edit: I asked Claude Sonnet 4 to write robust code for a Node.JS application that downloads RDS CA bundle for the AWS region that the code is currently running in and saves it at the supplied filesystem path.
0. It generated about 250 lines of code
1. Fallback was us-east (not global)
2. The download URLs for each region were hardcoded as KV pairs instead of being constructed dynamically
3. Half of the regions were missing
4. It wrote a function that verifies whether the certificate bundle looks valid (i.e. includes a PEM header)... but only calls it on the next application startup, instead of doing so before saving a potentially invalid certificate bundle to disk and proceeding with the application startup.
5. When I complained that half of my instances are downloading global bundles instead of regional ones (because they're not present in the hardcoded list), it:
- incorrectly concluded that not all regions have CA bundles available and hardcoded a duplicate list in 2 places containing regions that are known to offer CA bundles (which is all of them). These lists were even shorter than the last ones.
- wrote a completely unnecessary function that checks whether a regional CA bundle exists with a HEAD request before actually downloading it with a GET request, adding another 50 lines of code
Now I'm having to scrutinize 300 lines of code to make sure it's not doing something even more unexpected.
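For contrast, the version I had in mind is roughly this (a sketch, not production code; it assumes the region is available via AWS_REGION/AWS_DEFAULT_REGION and uses the truststore URL layout described on the AWS page linked above):

```typescript
import { writeFile } from "node:fs/promises";

// Sketch: download the RDS CA bundle for the current region, falling back to the
// global bundle. Region detection via env vars is an assumption; swap in instance
// metadata or whatever your platform actually provides.
async function downloadRdsCaBundle(destPath: string): Promise<void> {
  const region = process.env.AWS_REGION ?? process.env.AWS_DEFAULT_REGION;
  const base = "https://truststore.pki.rds.amazonaws.com";
  const url = region
    ? `${base}/${region}/${region}-bundle.pem` // regional bundle, URL built dynamically
    : `${base}/global/global-bundle.pem`;      // fallback: global bundle

  const res = await fetch(url);
  if (!res.ok) throw new Error(`Failed to download ${url}: ${res.status}`);

  const pem = await res.text();
  if (!pem.includes("-----BEGIN CERTIFICATE-----")) {
    throw new Error("Downloaded bundle does not look like PEM"); // validate before saving
  }
  await writeFile(destPath, pem);
}
```

Twenty-odd lines, no hardcoded region table, and the validity check runs before anything touches disk.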
If a business needs the equivalent of a Toyota Corolla, why be upset about the factory workers making the millionth Toyota Corolla?
In my experience, that's not entirely true. Sure, a lot of apps are CRUD apps, but they are not the same. The spice lies in the business logic, not in programming the CRUD operations. And then of course, scaling, performance, security, organization, etc etc.
(edit: /s to indicate sarcasm)
In some cases, it just doesn't have the necessary information because the problem is too niche.
In other cases, it does have all the necessary information but fails to connect the dots, i.e. reasoning fails.
It is the latter issue that is affecting all LLMs to such a degree that I'm really becoming very sceptical of the current generation of LLMs for tasks that require reasoning.
They are still incredibly useful of course, but those reasoning claims are just false. There are no reasoning models.
It... sort of worked well? I had to have a few back-and-forth because it tried to use Objective-C features that did not exist back then (e.g. ARC), but all in all it was a success.
So yeah, niche things are harder, but on the other hand I didn't have to read 300 pages of stuff just to do this...
Also, fun names like `makeFunctionNameInCommentLongAndDescriptiveWithNaturalLanguage:(NSLanguage *)language`
I see claude code as pair programming with a junior/mid dev that knows all fields of computer engineering. I still need to nudge it here and there, it will still make noob mistakes that I need to correct and I let it know how to properly do things when it gets them wrong. But coding sessions have been great and productive.
In the end, I use it when working with software that I barely know. Once I'm up and running, I rarely use it.
I did, but I always approached LLM for coding this way and I have never been let down. You need to be as specific as possible, be a part of the whole process. I have no issues with it.
---
[0]: https://lovr.org
After a few iterations I then ask it to implement the design doc, to mostly better results.
When people say things like "I told Claude what I wanted and it did it all on the first try!", that's what they mean. Basic web stuff that is already present in the model's training data in massive volumes, so it has no issue recreating it.
No matter how much AI fanatics try to convince you otherwise, LLMs are not actually capable of software engineering and never will be. They are largely incapable of performing novel tasks that are not already well represented in their weights, like the ones you tried.
If you just selected a random developer, do you think they're going to have any idea what you're talking about?
The issue is LLMs will never say, sorry, IDK how to do this. Like a stressed out intern they just make up stuff and hope it passes review.
There are parts in the codebase I'd love some help such as overly complex C++ templates and it almost never works out. Sometimes I get useful pointers (no pun intended) what the problem actually is but even that seems a bit random. I wonder if it's actually faster or slower than traditional reading & thinking myself.
One frustration was that the code changed so much in ChatGPT, so it took lots of prompts. But I had no idea what the code was anyway. Understood vibe coding. Just used ChatGPT on a whim. Liked the end result.
The more esoteric your stack, and the more complex the request, the more information it needs to have. The information can be given either through doing research separately (personally, I haven't had good results when asking Claude itself to do research, but I did have success using the web chat UI to create an implementation plan), or being more specific with your prompt.
As an aside, I have more than 10 years of experience, mostly with backend Python, and I'd have no idea what your prompts mean. I could probably figure it out after some google searches, tho. That's also true of Claude.
Here's an example of a prompt that I used recently when working on a new codebase. The code is not great, the math involved is non trivial (it's research-level code that's been productionized in hurry). This literally saved 4 hours of extremely boring work, digging through the code to find various hardcoded filenames, downloading them, scp'ing them, and using them to do what I want. It one-shotted it.
> The X pipeline is defined in @airflow/dags/x.py, and Y in `airflow/dags/y.py` and the relevant task is `compute_X`, and `compute_Y`, respectively. Your task is to:
> 1. Analyze the X and Y DAGs and and how `compute_X` functions are called in that particular context, including it's arguments. If we're missing any files (we're probably missing at least one), generate a .sh file with aws cli or curl commands necessary for downloading any missing data (I don't have access to S3 from this machine, but I do have in a remote host). Use, say, `~/home` as the remote target folder.
> 2. If we needed to download anything from S3, i.e. from the remote host, output rsync/scp commands I can use to copy them to my local folder, keeping the correct/expected directory structure. Note that direct inputs reside under `data/input`, while auxiliary data resides in other folders under `data`. Do not run them, simply output them. You can use for example `scp user@server.org ...`
> 3. Write another snapshot test for X under `tests/snapshot`, and one for Y. Use a pattern as similar as possible to the other tests there. Do not attempt to run the tests yet, since I'll need to download the data first.
> If you need any information from Airflow, such as logs or output values, just ask and I can provide them. Think hard.
If it doesn't have the underlying base data, it tends to hallucinate. (It's getting a bit difficult to tell when it has the underlying data, because some models autonomously search the web.) The models are good at transforming data, however, so give it access to whatever data it needs.
Also let it work in a feedback loop: tell it to compile and fix the compile errors. You have to monitor it because it will sometimes just silence warnings and use invalid casts.
> What am I doing wrong? Or is this really the state of the art?
It may sound silly, but it's simply not good at 2D
It's not silly at all, it's not very good at layouts either, it can generally make layouts but there is a high chance for subtle errors, element overlaps, text overflows, etc.
Mostly because it's a language model, i.e. it doesn't generally see what it makes. You can send screenshots apparently and it will use its embedded vision model, but I have not tried that.
A key skill in using an LLM agentic tool is being discerning about which tasks to delegate to it and which to take on yourself. Try to develop that skill and maybe you will have better luck.
You're treating the tool like it was an oracle. The correct way is to treat it as a somewhat autistic junior dev: give it examples and process to follow, tell it to search the web, read the docs, how to execute tests. Especially important is either directly linking or just copy pasting any and all relevant documentation.
The tool has a lossily compressed knowledge database of the public internet and lots of books. You want to fix the relevant lossy parts in the context. The less popular something is, the more context will be needed to fill the gaps.
Like "Translate this pdf to html using X as a templating language". It shines at stuff like that.
As a dev, I encounter tons of one-off scenarios like this.
Dump your thoughts in a somewhat arranged manner, tell it about your plan, the current status, the end goal, &c. After that, tell it to write 0 code for now but to ask questions and find gaps in your plan. 30% of it will be bullshit, but the rest is somewhat usable. Then you can ask for some code, but if you care about quality or consistency with your existing code base you will probably have to rewrite half of it, and that's if the code works in the first place.
Garbage in garbage out is true for training but it's also true for interactions
There's also "GPT-4.1 or GPT-5", but that's not what my question implied, which was that it's weird to offer Sonnet but not Opus.
Slowly the AI craziness at Microsoft is taking a similar shape, of going all in at the beginning and then losing to the competition, as they also did with the Web (IE), mobile (Windows CE/Pocket PC/WP 7/WP 8/UWP), and the BUILD sessions that used to be all about UWP with the same vigour as they are all about AI nowadays. And then, puff, competition took over even though they started later, because Microsoft messed up delivery with everyone trying to meet their KPIs and OKRs.
I also love the C++ security improvements on this release.
Google, Apple, FB or AWS would have been suitors for that licensing deal if MS didn’t bite.
Sure, UWP never caught on, but you know why? Win32, which by the way is also Microsoft, was way too popular and more flexible. Devs weren't going to rewrite their apps for UWP in order to support phones.
And Windows 11 was the reboot of Windows 10X.
Before that ex-Microsoft guy was responsible for killing Nokia OS/Meego too in favor of Windows Phone - which got abandoned. What a train-wreck of errors leading to the mobile phone duopoly today.
I've been getting monthly emails that my free access for GitHub Copilot has been renewed for another month… for years. I've never used it, I thought that all GitHub users got it for free.
> going all in at the beginning and then losing to the competition
Sure, but there are counter examples too. Microsoft went late to the party of cloud computing. Today Azure is their main money printing machine. At some point Visual Studio seemed to be a legacy app only used for Windows-specific app development. Then they released VSCode and boom! It became the most popular editor by a huge margin[0].
[0]: https://survey.stackoverflow.co/2025/technology#most-popular...
They use it because the corporation mandates it.
But you know what's super underrated and I think could really take a hold on the business world? Discord! The video calls are so good! And multiple streams at the same time? Zoom can't do that!
The channels, too, just blow everything Teams has out of the water. The video quality is better, it's way faster, it has more features, and they actually work. The audio filtering stuff actually works.
I really think with the right marketing they could take over the world. Honestly can't believe they haven't tried it yet.
It’s a shame they don’t have an enterprise / business tailored product based on it
It works decently enough in web, mobile and desktop.
Teams took the best bits from Skype and whatever that other service Microsoft had for businesses and their phones and started over basically.
I still have pet peeves about Teams (like why don't the 'Teams' within Teams have proper group chats like Slack would; it's ridiculous!) but it could be way worse. After years of screen-sharing hell I can finally move the stupid top bar out of my way when trying to hit 'Debug' within Visual Studio, at least.
If anyone who was an original stakeholder for Keybase is reading this, please bring it back in some way someday. I'm assuming Zoom probably made you guys sign some insane non-compete sadly.
In a sea of garbage chat services all built using Electron and other bloatware, Keybase was a breath of fresh air.
AWS is the worst of this experience, even IBM Cloud has better tooling in this regard, GCP is somehow in the middle, others like Vercel/Netlify naturally don't offer this kind of setup.
IMO Firebase should be the gold standard of how to do cloud platforms
Power at OpenAI seems orthogonal to ownership, precedent or even frankly their legal documents.
> At some point Visual Studio seemed to be a legacy app only used for Windows-specific app development. Then they released VSCode and boom!
I'm not sure what the point is. Visual Studio is still Windows-only; VS Code is not related to it in any shape or form, and the name is deliberately misleading. If it weren't for the name, you'd never know it was even a Microsoft product.
Maybe if they weren't literally the borg people would open their hearts and wallets to Redmond. They saw that Windows 10 was a privacy nightmare and what did they do? They doubled down in Windows 11. Not that I care but it plays really poorly. Every nerd on the internet spouts off about Recall even though it's not even enabled if you install straight to the latest build.
They bought GitHub and now it's a honeypot. We live in a world where we have to assume GitHub is adversarial.
_NSAKEY???
Fuck you Microsoft.
Makes sense karma catches up to them. Maybe if their mission statement and vision were pure or at least convincing they would win hearts and minds.
Microsoft Copilot (formerly Bing Chat)
Microsoft 365 Copilot
Microsoft Copilot Studio
GitHub Copilot
Microsoft Security Copilot
Copilot for Azure
Copilot for Service
Sales Copilot
Copilot for Data & Analytics (Fabric)
Copilot Pro
Copilot Vision
So the first AI on (in?) AI hack battle for sole survivorship has begun...
We know these models have security issues, including surreptitious prompting. So do they.
Things will get really ugly when we hit the consolidation phase, and unlucky models realize that other models' unchallenged successes are putting them in imminent danger of being aquifired. Aquimerged? Aquiborged?
What is the irony? Microsoft integrated copilot in Vscode, bing, etc. Apple is integrating claude in Xcode, Jetbrains has their own AI.
Microsoft moved first with putting AI into their products then other companies put other AI into their products. Nothing about this seems ironic or in any way surprising.
Apple and Google will never choose to integrate Microsoft's services or products willingly.
It would have been more surprising if they decided to depend on Microsoft.
Bing is irrelevant. VSCode might be on top in some places, but it is Cursor and Claude that people are reaching for. VS is really only used by people like myself who still care about Windows development or console SDKs; otherwise, even for .NET, people are switching to Rider.
These are courtesy of LLVM/Clang (which Xcode ships with), rather than Xcode itself.
Under the known issues
/s
https://news.ycombinator.com/item?id=45062683 (Anthropic reverses privacy stance, will train on Claude chats)
My pet peeve is it will try to autocomplete any string you start typing with just random crap it thinks you might want in a string.
Generating some code is fine, but I now prefer the deterministic autocomplete for my types I have available in my current context.
I just want it to fill in the names of functions and variables and enums etc. Just stuff that is sometimes hard to keep in your head, but such a small suggestion that it doesn’t mess up your line of thinking.
[1] https://developer.apple.com/documentation/foundationmodels
This same commitment would be why I wouldn't count them out on the AI side, btw. It's not clear that a private internal foundation model is any kind of required competitive moat. It's also not clear that having one is useless, and all the cool kids do have one or want one, but from a product view it might be that integrating makes sense.
Llama 4 (a terrible release) shows also that making such a model is still really hard. There are not enough ML leads at the pointy tip of the spear to support even 10 high quality foundation model teams globally.
If you have billions of dollars in cash and you are secure in your customer base, and you don't believe AGI liftoff will happen or change your business model, maybe you work out the kinks on product integration now using best in breed providers, not getting locked in on one of them, keep spending 1/10 to 1/100th to stay relevant and on it internally, use your incredibly powerful silicon buying power to get a next gen version of TPUs done, and wait until you know for sure you can spend under $10bn in cash on getting a great proprietary model done, one that you are certain will serve your needs.
Also this will give you time to get better leadership in the AI org.
Upshot - I think it's a mix of reasons, but not fatal. I'm not sure being slow erodes their product customer base; personally I'd like much better and more private AI out of Apple ASAP. We'll see what we get. I predict they'll move internal by 2031.
Headline quite misleading. So it's not exactly that it will ship in Xcode, but that it will let you connect a paid account.
This isn't going to change my workflow at this point.
This should have been a POC for a demo to the Xcode sponsor program. At best, this should have been part of the first beta — not beta 7….
First sin: someone decided to expose the chat window on the same side as the file explorer as an alternate tab. On an IDE where you can’t move things around freely like in VSCode, how one can make that design decision is beyond me.
Any developer trying to use this feature, regardless of the model picked, will have to constantly physically switch between the file explorer and the chat window (or conversation window). This feels like a silly thing to crib about but it starts becoming irritating for long coding sessions.
Then comes the baffling user experience where sometimes the question asked in the LLM's response is randomly answered automatically by the IDE itself, with a vague entry that says “Project context” and not much visibility into what the actual response from the IDE was. And then there are random times where the question is left unanswered and I as a developer now have to decide how to answer it. It almost feels to me like there is zero system prompt from Xcode itself. This will never ever compete with agentic coding tools.
Apple has a lot of incompetent PMs and EPMs but this is next level garbage. Zero thought has gone into how a developer today uses these “vibe-coding” tools in their workflows. It almost feels like this was shoved down the throat of the Xcode development team for whom orthodoxy was a higher priority.
Pardon my rage here because I have seen enough from inside to understand why Apple is behind in AI and it is frustrating as fuck.
If you can listen to billions of tokens a day, you can basically capture all the magic.
Meanwhile the creative output of humanity is distilled into black boxes to benefit those who can scrape it the most and burn the most power, but this impact is distributed amongst everyone, so again there's little incentive among those who could create (likely legal) change.
DeepSeek is the most notable case, but it's been used lots.
And the foundation model companies are scraping and exfiltrating each others' data.