frontpage.

Scaling tool-orchestration data may give rise to a different kind of intelligence in LLMs

2•arkariarn•1h ago
TL;DR: We're only now starting to scale long-term external orchestration; everything before was mostly internal problem-solving training, with the occasional tool call. We don't actually know yet what scaling orchestration training produces. It might produce much better tool-using assistants that remain fundamentally reactive to human instructions. Or it might produce something with more emergent autonomy. My gut says the second. For the first time, I foresee in the near future (as soon as 2027-2028) the potential for a misaligned takeoff.

A year ago, a friend of mine who studied social science asked my opinion about AI 2027 and the prospect of a misaligned AI takeover. I laughed and said it was quite impossible given how the technology actually worked. An LLM works too stepwise, I told him. There's a prompt, the model predicts the next tokens, and then it "dies." There's no continuity between prompts — it can store some text in a database, but there's no persistent reasoning. It felt obviously safe.
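The stateless prompt-in, completion-out cycle described above can be sketched in a few lines. This is a toy illustration, not any real API: `fake_llm` is a hypothetical stand-in for a next-token predictor, and the point is that the only "continuity" between calls is text we choose to re-send ourselves.

```python
# Minimal sketch of the stateless prompt -> completion cycle.
# `fake_llm` is a placeholder, not a real model or API; it is equally
# stateless across calls, which is the property being illustrated.

def fake_llm(prompt: str) -> str:
    # A real model would predict tokens here; this stub just echoes
    # something derived from the prompt. Nothing persists after return.
    return f"[completion for {len(prompt)} chars of prompt]"

history: list[str] = []  # the only "memory" lives outside the model

def ask(user_msg: str) -> str:
    history.append(f"User: {user_msg}")
    prompt = "\n".join(history)   # continuity = re-sent text, nothing more
    reply = fake_llm(prompt)
    history.append(f"Assistant: {reply}")
    return reply                  # after returning, the "model" is gone

first = ask("Hello")
second = ask("What did I just say?")  # works only because we re-sent history
```

The second call can "remember" the first only because the caller re-sends the transcript; the model itself retains nothing, which is the sense in which it "dies" after each completion.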

With the recent agentic developments of the past few months, I'm starting to doubt that earlier understanding.

The first generation of LLMs, up through GPT-4, were essentially sophisticated text autocompleters. They were trained on web-crawl data, then fine-tuned with RLHF to give them a chatbot flavor. They felt harmless, and they fit the description I gave my friend perfectly. Their capabilities were entirely bounded by the context window and the prompt-answer time window. Prompt in, completion out, done.

The second generation added reasoning capabilities. These models stopped feeling like pure autocompleters — they could search within their stored knowledge, chain thoughts together, and work through problems. The training data changed too: successful reasoning traces got folded back into training. But crucially, they were still bounded by the same constraints. They got more time to think and process, but at the end of the answer, they were still mostly gone. The capability was still internal to the model.

Now enter the third generation of agentic LLMs, which really took off as tools like Claude Code became increasingly capable. These don't feel like autocompleters. They don't even feel like reasoners. They're starting to feel like orchestrators. They aren't limited to their internals: they act as a connected system, coordinating tools and external resources to achieve goals.

What scares me most is the new type of training data we're now generating and collecting: successful long-term orchestration traces. These will let us scale an orchestration kind of intelligence, one that isn't bound to the model's internals but becomes an external, symbiotic type of intelligence. We are training models to externalize almost everything, and optimizing them to orchestrate all those externals over long horizons. That is optimizing for a symbiotic system, very different from the internally optimized LLMs of today. The equation of what the LLM is processing is changing: the LLM becomes an orchestration engine of externals, which together make up the whole system. We know how reasoning autocompletion scales; we don't know how orchestration engines scale. Different, new emergent capabilities might appear. We are, for the first time, scaling the prefrontal cortex of LLMs.
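The "orchestration engine" shape the comment describes can be sketched as a loop: the model repeatedly picks an external tool, observes the result, and decides the next step until the goal is met. Everything here is hypothetical illustration; the tool names and the scripted `plan` stub stand in for what a real agent would let the LLM decide.

```python
# Toy sketch of an agent orchestration loop. The tools and the `plan`
# stub are invented for illustration; in a real agent the LLM would
# emit the (tool, argument) choice at each step.
from typing import List, Optional, Tuple

def read_file(path: str) -> str:
    return f"<contents of {path}>"

def run_tests(_: str) -> str:
    return "2 passed, 0 failed"

TOOLS = {"read_file": read_file, "run_tests": run_tests}

def plan(goal: str, observations: List[str]) -> Optional[Tuple[str, str]]:
    # Stand-in for the model's next-action choice: here a fixed script,
    # ending (None) once both steps have produced observations.
    script = [("read_file", "src/app.py"), ("run_tests", "")]
    return script[len(observations)] if len(observations) < len(script) else None

def orchestrate(goal: str) -> List[str]:
    observations: List[str] = []
    while (step := plan(goal, observations)) is not None:
        tool, arg = step
        observations.append(TOOLS[tool](arg))  # result feeds the next decision
    return observations
```

The contrast with the single-shot cycle is that the loop's state (the growing observation list) and its effects (tool calls) live outside the model, which is exactly the externalized, long-horizon behavior the orchestration traces would capture for training.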

For the first time, I can genuinely foresee a path to an unaligned takeoff, to say nothing of all the other harm AI can do in the hands of bad actors. It makes me question whether labs should continue down this path. Isn't it far safer to keep LLM problem solving mostly internal to the model's own parameters? Of all the AI companies, shouldn't Anthropic have been less loud with systems like Claude Code? They have accelerated the most in this new paradigm of what is going to be scaled.

Tokyo turns its phone booths into free Wi-Fi hotspots, and

https://soranews24.com/2026/04/02/tokyo-turns-its-phone-booths-into-free-wi-fi-hotspots-and-heres...
1•rawgabbit•2m ago•0 comments

AI agents are now playing Mafia (social deduction with humans)

https://mafiamystery.com/agents
1•ttoast•2m ago•0 comments

Hedley Combs Davis passed away

https://twitter.com/MuseumCommodore/status/2040254304582603148
1•hnthrowaway0315•3m ago•0 comments

A broken auto-live poller, and what perceived urgency does to Claude Code

https://christophermeiklejohn.com/ai/zabriskie/reliability/2026/04/03/the-feature-that-has-never-...
1•cmeiklejohn•3m ago•0 comments

Let's be Honest about AI Coding

https://kenkantzer.com/lets-be-honest-about-ai/
2•lordofmoria•6m ago•0 comments

Towards Autonomous Protocol Proofs

https://will62794.github.io/formal-methods/2026/04/03/autonomous-protocol-proofs.html
1•we6251•9m ago•0 comments

What are Artemis II astronauts eating? Tortillas, coffee, lots of hot sauce

https://www.scientificamerican.com/article/what-are-nasas-artemis-ii-astronauts-eating-58-tortill...
1•1659447091•19m ago•0 comments

Know why you don't like OOP

https://zylinski.se/posts/know-why-you-dont-like-oop/
1•baranul•19m ago•0 comments

I built a WiFi bell system in my garage for a local school. Now used across US

https://old.reddit.com/r/SideProject/comments/1sbr7sm/i_built_a_wifi_bell_system_in_my_garage_bec...
1•thunderbong•20m ago•0 comments

I've spent 9 years on Discord. I think Fluxer is the next best option

https://nev.so/learn/why-fluxer-is-making-waves
1•Nevulo•25m ago•0 comments

Billion dollar AI company was built on lies [video]

https://www.youtube.com/watch?v=0A2SP-QBByI
1•shankysingh•25m ago•0 comments

DataBeat

https://federatedindustrial.com/databeat
1•ShimazuSystems•26m ago•0 comments

A Jurassic fish choked to death on a 'floating squid' 150M years ago

https://timesofindia.indiatimes.com/etimes/animals/how-a-jurassic-fish-choked-to-death-on-a-float...
1•WaitWaitWha•26m ago•0 comments

Use OAuth for Claude, Gemini, and Codex with Persistent Headless Tmux Sessions

https://github.com/codeninja/oauth-cli-coder
1•code_ninja•33m ago•1 comments

Show HN: MicroSafe-RL – Deterministic 1.18µs safety layer for Edge AI

https://github.com/Kretski/MicroSafe-RL
1•DREDREG•34m ago•0 comments

Open Source Reverse Proxy from NetBird Now Supports L4

https://netbird.io/knowledge-hub/l4-proxy
1•techhut•35m ago•0 comments

A visual guide to the Gulf fertiliser blockade

https://www.theguardian.com/world/2026/apr/03/visual-guide-gulf-fertiliser-blockade
1•Archelaos•38m ago•0 comments

AI seed startups are commanding higher valuations

https://techcrunch.com/2026/03/31/its-not-your-imagination-ai-seed-startups-are-commanding-higher...
1•gmays•40m ago•0 comments

The Family That Decided to Have Their Stomachs Removed

https://www.theatlantic.com/health/2026/03/stomach-cancer-total-gastrectomy/686623/
1•paulpauper•40m ago•0 comments

Budget cuts for US science proposed again by Trump administration

https://www.nature.com/articles/d41586-026-01105-7
2•paulpauper•41m ago•0 comments

Europe asks if reviving nuclear is the answer to energy shocks

https://www.bbc.com/news/articles/c4g8k8vq8gno
2•dabinat•42m ago•0 comments

FTC Formalizes Aggressive Health Care Enforcement with New Task Force

https://www.jdsupra.com/legalnews/ftc-formalizes-aggressive-health-care-5368230/
2•WaitWaitWha•44m ago•0 comments

Weblens – The Whole Web, as Text

https://github.com/netizensnoopy/weblens
1•inthemirror•48m ago•0 comments

MCP vs. CLI: Why CLI makes more sense

https://twitter.com/Tiny_Fish/status/2040256448572334579
6•gargi_tinyfish•48m ago•0 comments

What is this triangular symbol? (2007)

https://painintheenglish.com/case/1530
2•DASD•51m ago•0 comments

Gold overtakes U.S. Treasuries as the largest foreign reserve asset

https://economictimes.indiatimes.com/news/international/us/gold-overtakes-u-s-treasuries-as-the-w...
18•lxm•55m ago•1 comments

Show HN: Travel Hacking Toolkit – Points search and trip planning with AI

https://github.com/borski/travel-hacking-toolkit
24•borski•59m ago•5 comments

IAMF: Immersive Audio for a New Decade (2025)

http://aomedia.org/blog%20posts/IAMF-Immersive-Audio-for-a-New-Decade/
1•breve•1h ago•0 comments

Reasoning models encode tool choices before they start reasoning

https://arxiv.org/abs/2604.01202
3•diwank•1h ago•0 comments

ClawTrak – free tool to check if your AI product is invisible to AI agents

https://clawtrak.com/
3•pixelfamiliar•1h ago•0 comments