Tongyi DeepResearch – open-source 30B MoE Model that rivals OpenAI DeepResearch

https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/

365•meander_water•3mo ago

Comments

jychang•3mo ago

This is over a month old, they released the weights a long time ago.

earthnail•3mo ago

And for those not so tightly in the loop: how does it compare?

jwr•3mo ago

That's OK — not all of us follow all the progress on a daily basis, and a model that is a month old doesn't become useless just by being a month old!

embedding-shape•3mo ago

Isn't OpenIA "Deep research" (not "DeepResearch") a methodology/tooling thing, and you'll get different responses depending on what specific model you use with it? As far as the UI allows you to, you could use Deep research with GPT-5, GPT-4o, o3 and so on, and that'll have an impact on the responses. Skimming the paper and searching for some simple terms makes it seem like they never expand on what exact models they've used, just that they've used a specific feature from ChatGPT?

simonw•3mo ago

At this point "deep research" is more of a pattern - OpenAI and Perplexity and Google Gemini all offer products with that name which work essentially the same way, and Anthropic and Grok have similar products with a slightly different name attached.

The pattern is effectively long-running research tasks that drive a search tool. You give them a prompt, they churn away for 5-10 minutes running searches and they output a report (with "citations") at the end.

This Tongyi model has been fine-tuned to be really good at using its search tool in a loop to produce a report.

embedding-shape•3mo ago

Yes, but I think my previous point still matter, namely what exact model is being used greatly affects the results.

So without specifying which model is being used, it's really hard to know what is better than something else, because we don't understand what the underlying model is, and if it's better because of the model itself, or the tooling, which feels like an important distinction.

aliljet•3mo ago

Sunday morning, and I find myself wondering how the engineering tinkerer is supposed to best self-host these models? I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly. I'm curious what the current recommendation on that path looks like.

Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)

homarp•3mo ago

llama.cpp gives you the most control to tune it for your machine.

giobox•3mo ago

If you just want to get something running locally as fast as possible to play with (the 2080ti typically had 11gb of VRAM which will be one of the main limiting factors), the ollama app will run most of these models locally with minimum user effort:

https://ollama.com/

If you really do have a 2080ti with 128gb of VRAM, we'd love to hear more about how you did it!

CuriousSkeptic•3mo ago

Im sure this guy has some helpful hints on that: https://youtube.com/@azisk

exe34•3mo ago

llama.cpp + quantized: https://huggingface.co/bartowski/Alibaba-NLP_Tongyi-DeepRese...

get the biggest one that will fit in your vram.

davidsainez•3mo ago

This is the way. I managed to run (super) tiny models on CPU only with this approach.

trebligdivad•3mo ago

How do people deal with all the different quantisations? Generally if I see an Unsloth I'm happy to try it locally; random other peoples...how do I know what I'm getting?

(If nothing else Tongyi are currently winning AI with cutest logo)

exe34•3mo ago

personally I've only used them for toying around - but in all cases you have to test them for your use case anyway.

btbuildem•3mo ago

I've recently put together a setup that seemed reasonable for my limited budget. Mind you, most of the components were second-hand, open box deals, or deep discount of the moment.

This comfortably fits FP8 quantized 30B models that seem to be "top of the line for hobbyists" grade across the board.

- Ryzen 9 9950X

- MSI MPG X670E Carbon

- 96GB RAM

- 2x RTX 3090 (24GB VRAM each)

- 1600W PSU

pstuart•3mo ago

That's basically what I imagined would be my rig if I were to pull the trigger. Do you have an NVLink adapter as well?

btbuildem•3mo ago

No NVLink; it took me a long time to compose the exact hardware specs, because I wanted to optimize performance. Both cards are on x8 PCIe direct CPU channels, close to their max throughput anyway. It runs hot with the CPU engaged, but it runs fast.

nine_k•3mo ago

Does it offer more performance than a Macbook Pro that could be had for a comparable sum? Your build can be had for under $3k; a used MBP M3 with 64 GB RAM can be had for approximately $3.5k.

btbuildem•3mo ago

I'm not sure, I did not run any benchmarks. As a ballpark figure -- with both cards throttled down to 250W, running a Qwen-30B FP8 model (variant depending on task), I get upwards of 60 tok/sec. It feels on par with the premium models, tbh.

Of course this is in a single-user environment, with vLLM keeping the model warm.

bee_rider•3mo ago

MacBooks have some clever chips, but 2x 3090 is a lot of brawn to overcome.

PeterStuer•3mo ago

Unfortunately the RTX 3090 has no native FP8 support.

jlokier•3mo ago

I use a Macbook Pro with 128GB RAM "unified memory" that's available to both CPU and GPU.

It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.

I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.

MaxMatti•3mo ago

How's the battery holding up during vibe coding sessions or occasional LLM usage? I've been thinking about getting a MacBook or a laptop with a similar Ryzen chip specifically for that reason.

jlokier•3mo ago

Currently I don't use vibe coding or even code assistants, so I can't speak to how the battery fares when doing that sort of thing. I don't know how much or how intensively they need to run the underlying LLMs.

For chatting with LLMs via ollama, I've seen total power usage go to about 50W (on an M3 Max) while the LLM is active, which is about 3x-4x power usage compared to just idling with browsers and editors open.

So I'd estimate about 2-3 hours of continuous LLM use on battery. Because I have enough RAM spare, at least there's no need to keep shutting down and reloading models.

I haven't really pushed it to find out how long they run on battery, as I haven't used LLMs all that much.

I'm more interested in the underlying operations of how they work, investigating novel model architectures and techniques, and optimising performance, than actually using them as an end user :-) Similar to how I enjoyed writing game engines more than playing games :-) Maybe I'll get into using them more in future.

anon373839•3mo ago

I’d strongly advise ditching Ollama for LM Studio, and using MLX versions of the models. They run quite a bit faster on Apple Silicon. Also, LM Studio is much more polished and feature rich than Ollama.

terhechte•3mo ago

Fully agree to this. LM Studio is much nicer to use and with MLX faster on Apple Silicon

jwr•3mo ago

I just use my laptop. A modern MacBook Pro will run ~30B models very well. I normally stick to "Max" CPUs (initially for more performance cores, recently also for the GPU power) with 64GB of RAM. My next update will probably be to 128GB of RAM, because 64GB doesn't quite cut it if you want to run large Docker containers and LLMs.

sumo43•3mo ago

Try running this using their harness https://huggingface.co/flashresearch/FlashResearch-4B-Thinki...

3abiton•3mo ago

As many pointed out, Macs are decent enough to run them (with maxxed rams). You also have more alternative, like DGX Sparks (if you appreciate the ease of cuda, albeit a tad bit slower token generation performance), or the Strix Halo (good luck with ROCm though, AMD still peddling hype). There is no straitghtforwars "cheap" answer. You either go big (gpu server), or compromise. Either way use either vllm or sglang, or llama.cpp. ollama is just inferior in every way to llama.cpp.

aliljet•3mo ago

oh my god. 128 gb of RAM! way too late to repair this thread, but most people caught this.

sigmarule•3mo ago

The Framework Desktop runs this perfectly well, and for just about $2k.

Lapel2742•3mo ago

> I'd love to load this up on the old 2080ti with 128gb of vram and play, even slowly.

I think you mean ram and no vram. AFAIK this is a 30b moe model with 3b active parameters. Comparable to the Qwen3 MOE model. If you do not expect 60 tps such models should run sufficiently fast.

I run the Qwen3 MOE Model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11 year old i5-6600 (32GB) and a Radeon 6600 with 8GB. According to a quick search your card is faster than that and I get ~12 tps with 16k context on Llama.cpp, which is ok for playing around.

My Radeon (ROCm) specific batch file to start this:

llama-server --ctx-size 16384 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --device ROCm0 -ngl -1 --model /usr/local/share/gguf/Qwen3-30B-A3B-Q4_0.gguf --cache-ram 16384 --cpu-moe --numa distribute --override-tensor "\.ffn_.*_exps\.weight=CPU" --jinja --temp 0.7 --port 8080

yencabulator•2mo ago

> I get ~12 tps with 16k context

FWIW Ollama at its defaults with qwen3:30b-a3b has 256k context size and does ~27 tokens/sec on pure CPU on a $450 mini PC with AMD Ryzen 9 8945HS. Unless you need a room heater, that GPU isn't pulling its weight.

greggh•3mo ago

If you really need a lot of VRAM cheap rocm still supports the amd MI50 and you can get 32gb versions of the MI50 on alibaba/aliexpress for around $150-$250 each. A few people on r/localllama have shown setups with multiple MI50s running with 128gb of VRAM and doing a decent job with large models. Obviously it won't running as fast as any brand new GPUs because of memory bandwidth and a few other things, but more than fast enough to be usable.

This can end up getting you 128gb of VRAM for under $1000.

mehdibl•3mo ago

It's a Qwen 3 MoE fine tune...

zurfer•3mo ago

It makes me wonder if we'll see an explosion of purpose trained LLMs because we hit diminishing returns on invest with pre training or if it takes a couple of months to fold these advantages back into the frontier models.

Given the size of frontier models I would assume that they can incorporate many specializations and the most lasting thing here is the training environment.

But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.

deepanwadhwa•3mo ago

-> GPT 3.5 was awesome at chess I don't agree with this. I did try to play chess with GPT3.5 and it was horrible. Full of hallucinations.

miki123211•3mo ago

It was GPT-3 I think.

As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).

Imustaskforhelp•3mo ago

This is so interesting, I am curious as to why, can you (or anyone) please provide any resources or insightful comments about it, they would really help a ton out here, thanks!

pixelmelt•3mo ago

Gpt3 was trained on completion data so it likely saw lots of raw chess games layed out in whatever standard format moves are listed in, while 3.5 was post trained on instruct data (talking back and forth) which would have needed to explicitly include those chess games as conversational training data for it to retain as much as it would otherwise

zurfer•3mo ago

Yeah I was not precise; it was `gpt-3.5-turbo-instruct`, other variants weren't trained on it apparently. https://dynomight.substack.com/p/chess

alephnerd•3mo ago

> if we'll see an explosion of purpose trained LLMs...

Domain specific models have been on the roadmap for most companies for years now for both competitive (why give up your moat to OpenAI or Anthropic) and financial (why finance OpenAI's margins) perspective.

onlyrealcuzzo•3mo ago

Isn't the whole point of the MOE architecture exactly this?

That you can individually train and improve smaller segments as necessary

idiotsecant•3mo ago

I think it's the exact opposite - you don't specifically train each 'expert' to be a SME at something. Each of the experts is a generalist but becomes better at portions of tasks in a distributed way. There is no 'best baker', but things evolve toward 'best applier of flour', 'best kneader', etc. I think explicitly domain-trained experts are pretty uncommon in modern schemes.

viraptor•3mo ago

That's not entirely correct. Most of moe right now are fully balanced, but there is an idea of a domain expert moe where the training benefits fewer switches. https://arxiv.org/abs/2410.07490

idiotsecant•3mo ago

Yes, explicitly trained experts were a thing for a little while, but not anymore. Yet another application of the Hard Lesson.

ainch•3mo ago

Generally you train each expert simultaneously. The benefit of MoEs is that you get cheap inference because you only use the active expert parameters, which constitute a small fraction of the total parameter count. For example Deepseek R1 (which is especially sparse) only uses 1/18th of the total parameters per-query.

pama•3mo ago

> only uses 1/18th of the total parameters per-query.

only uses 1/18th of the total parameters per token. It may use the large fraction of them in a single query.

ainch•3mo ago

That's a good correction, thanks.

criemen•3mo ago

> or if it takes a couple of months to fold these advantages back into the frontier models.

Right now, I believe we're seeing that the big general-purpose models outperform approximately everything else. Special-purpose models (essentially: fine tunes) of smaller models make sense when you want to solve a specific task at lower cost/lower latency, and you transfer some/most of the abilities in that domain from a bigger model to a smaller one. Usually, people don't do that, because it's a quite costly process, and the frontier models develop so rapidly, that you're perpetually behind them (so in fact, you're not providing the best possible abilities).

If/when frontier model development speed slows down, training smaller models will make more sense.

fragmede•3mo ago

Right, the Costco problem. A small boutique eg wine store might be able to do better for picking a very specific wine for a specific occasion, but Costco is just so much bigger that they can make it up in Volume and buy cases and cases of everything with a lower markup, so it ends up being cheaper to shop at Costco, no matter how much you want to support the local wine boutique.

semi-extrinsic•3mo ago

In Norway there is a state-owned monopoly on selling wine and liquor (anything above 4.75% ABV). They have 350+ physical shops, a large online shop and around $2bn annual revenue. This makes them one of the largest purchasers of wine and spirits in Europe, and they can get some very good deals.

So even though you have high taxes and a restrictive alcohol policy, the end result is shops that have high customer satisfaction because they have very competent staff, excellent selection and a surprisingly good price for quality products.

The downsides are the limited opening hours and the absence of cheap low-quality wine - the tax disproportionally impacts the low quality stuff, almost nobody will buy shitty wine at $7 per bottle when the decent stuff costs $10, so the shitty wine just doesn't get imported. But for most of the population these are minor drawbacks.

nextos•3mo ago

The advantage of small purpose-specific models is that they might be much more robust i.e., unlikely to generate wrong sequences for your particular domain. That is at least my experience working on this topic during 2025. And, obviously, smaller models mean you may deploy them on cheaper hardware, latency is reduced, energy consumption is lower, etc. In some domains like robotics, these two advantages might be very compelling, but it's obviously early to draw any long-term conclusions.

larodi•3mo ago

I second this. Smaller models indeed may be much better positioned for fine-tuning for the very reason you point out - less noise to begin with.

barrell•3mo ago

> If/when frontier model development speed slows down

You do not believe that this has already started? It seems to me that we’re well into a massive slowdown

enraged_camel•3mo ago

Not the OP but I use AI all day every day and have noticed substantial improvements in the models over the past ~6 months. GPT-5 was a huge leap (contrary to reporting) and so was Sonnet 4.5.

barrell•3mo ago

GPT5 was by no means a huge leap. I’d be willing to believe that you prefer it, or that you found it an improvement, despite both of those being wildly contrary to my experience (and most of the rhetoric online). But objectively speaking it was a small improvement, even going by OpenAI’s marketing claims.

In practice, I upgraded everything to GPT-5 and the performance was so terrible I had to rollback the update.

embedding-shape•3mo ago

> GPT-5 was a huge leap (contrary to reporting) and

Depends on what you compare it to. For us who were using o3/o1 Pro Mode before GPT-5, the new model isn't that huge of a leap, compared to whatever was before Pro Mode existed.

criemen•3mo ago

It's hard for me to say. I don't think you know you're on the S-curve until after the fact.

On the one hand, most models are "good enough" for chatgpt-like usage, and there it's hard to see/feel generation-to-generation improvements. On the other hand, if you look at instruction following, dealing with long context windows, >200 tool call interactions while staying on track, there's still plenty of improvements to be had. So, hard to say where we are.

AmbroseBierce•3mo ago

It reminds me of a story I read somewhere that some guy high on drugs climbed to the top of some elevated campus headlights shouting things about being a moth and loving lights, and the security guys tried telling him to go down but he paid no attention to that and time went on until a janitor came and shut off the lights, then turned one of those high powered handheld ones and point it at him the guy quickly climbed down there.

So yeah I think there are different levels of thinking, maybe future models with have some sort of internal models once they recognize patterns of some level of thinking, I'm not that knowledgeable of the internal workings of LLMs so maybe this is all nonsense.

Imustaskforhelp•3mo ago

> But there is probably already some tradeoff, as GPT 3.5 was awesome at chess and current models don't seem trained extensively on chess anymore.

Wow, I am so curious, can you provide me the source

I am so interested in a chess LLM's benchmark as someone who occasionally plays chess. I have thought about creating things like these but it would be very interesting to find the best model at chess which isn't stockfish/lila but general purpose large language models.

I also agree that there might be an explosion of purpose trained LLM's. I had this idea some year ago when there was llama / before deepseek that what if I want to write sveltekit and there are models like deepseek which know about sveltekit but they are so damn big and bloated when I only want to use sveltekit/svelte models. Yes there are thoughts on why we might need the whole network to get better quality but I genuinely feel like right now, the better quality is debtable thanks to all this benchmarkmaxxing and I would happily take a model trained on sveltekit on like preferably 4b-8b parameter but if an extremely good SOTA-ish model for sveltekit is even around 30-40b I would be happy since I could buy a gpu on my pc to run it or run it on my mac

I think my brother who actually knows what he's talking about in the AI space, (unlike me), also said the same thing a few months back to me as well.

In fact, its funny because I had asked him to please create a website comparing benchmarks of AI playing chess and having an option where we can make two AI LLM's play against each other and we can view it or we can also play against an LLM inside an actual chess board on the web and more..., I had given this idea to him a few months ago after the talk about small llm's really lol and he said that its good but he was busy right now. I think then later he might have forgotten about it and I had forgotten about it too until now.

radarsat1•3mo ago

Just search for "chess LLM leaderboard" there are already several. Also check https://www.reddit.com/r/llmchess/ although admittedly it doesn't get a lot of traffic.

zurfer•3mo ago

this was the article I had in mind, when writing this: https://dynomight.substack.com/p/chess

Imustaskforhelp•3mo ago

Ohhh I think this was the same article that I also had in mind

Key memory unlocked. I had an Aha moment with this article, thanks a lot for sharing it, appreciate it.

almaight•3mo ago

https://seed-tars.com/game-tars

almaight•3mo ago

Video games have long served as a crucial proving ground for artificial intelligence. Like the real world, they offer rich, dynamic environments with responsive, real-time settings and complex challenges that push the boundaries of AI capabilities. The history of AI in gaming is marked by landmark achievements, from mastering classic board games to achieving superhuman performance in complex strategy titles. However, the next frontier lies beyond mastering individual, known environments.

To meet this challenge, we introduce Game-TARS: a next-generation generalist game agent designed to master complex video games and interactive digital environments using human-like perception, reasoning, and action. Unlike traditional game bots or modular AI frameworks, Game-TARS integrates all core faculties—visual perception, strategic reasoning, action grounding, and long-term memory—within a single, powerful vision-language model (VLM). This unified approach enables true end-to-end autonomous gameplay, allowing the agent to learn and succeed in any game without game-specific code, scripted behaviors, or manual rules.

With Game-TARS, this work is not about achieving the highest possible score in a single game. Instead, our focus is on building a robust foundation model for both generalist game-playing and broader computer use. We aim to create an agent that can learn to operate in any interactive digital environment it encounters, following instructions just like a human.

rokob•3mo ago

This whole series of work is quite cool. The use of `word-break: break-word;` makes this really hard to read though.

soared•3mo ago

I actually can’t read it for some reason? My brain just can’t connect the words

don-bright•3mo ago

so it appears the entire text has been Translated with non-breaking space unicode x00a0 instead of normal spaces x0020, so the web layout is considering all paragraph text as a super-long single word ('the\00a0quick\00a0\brown\00a0fox' instead of 'the quick brown fox') - the non-breaking space character appears identically to breaking-space when rendered but underlying coding breaks the concept of "break at end of word" because there is no end as 00a0 literally means "non-breaking"). per Copilot spending a half hour explaining this to me, apparently this can be fixed by opening web browser developer view, and copy/pasting this code into the console.

function replaceInTextNodes(node) { if (node.nodeType === Node.TEXT_NODE) { node.nodeValue = node.nodeValue .replace(/\u00A0/g, ' '); } else { node.childNodes.forEach(replaceInTextNodes); } }

replaceInTextNodes(document.body);

nl•3mo ago

This is completely fascinating although puzzling how that happens.

The script is great!

dlisboa•3mo ago

That’s why typography matters. You can’t read it because a very basic convention has been broken here and that throws everything off.

theflyestpilot•3mo ago

I hope the translation for this is actually "Agree" Deep research. Just a dig at "You are absolutely right!" sycophancy.

numpad0•3mo ago

TIL the "full" name of Alibaba Qwen is 通義千問(romanized as "Tongyi Qianwen", something along "knows all thousand questions"), of which the first half without the Chinese accent flags is romanized identically to "同意", meaning "same intents" or "agreed".

The Chinese version of the link says "通义 DeepResearch" in the title, so doesn't look like the "agree" to be the case. Completely agreed that it would be hilarious.

1: https://www.alibabacloud.com/en/solutions/generative-ai/qwen...

rahimnathwani•3mo ago

For people who don't read Chinese: the two 'yi' characters numpad0 mentioned (义 and 義) are the same, but written in different variants of Chinese script (Simplified/Traditional).

Traubenfuchs•3mo ago

It still feels to me like OpenAI has zero moat. There are like 5 paid competitors + open source models.

I switch between gemini and ChatGpt whenever I feel one fails to fully grasp what I want, I do coding in claude.

How are they supposed to become the 1 trillion dollar company they want to be, with strong competition and open source disruptions every few months?

rokob•3mo ago

I don’t know if they can pull it off but a lot of companies are built on strong enterprise sales being able to sell free stuff with a bow on it to someone who doesn’t know better or doesn’t care.

isoprophlex•3mo ago

Premium grade deals with Oracle. They will bullshit their way into government and enterprise environments where all the key decision makers are clueless and/or easily manipulated.

nickpinkston•3mo ago

Yea, I agree.

Arguably LLMs are both (1) far easier to switch between models than it is today to switch from AWS / GCP / Azure systems, and (2) will be rapidly decreasing switching costs for your legacy systems to port to new ones - ie Oracle's, etc. whole business model.

Meanwhile, the whole world is building more chip fabs, data centers, AI software/hardware architectures, etc.

Feels more like we're headed to commodification of the compute layer more than a few giant AI monopolies.

And if true, that's actually even more exciting for our industry and "letting 100 flowers bloom".

whiplash451•3mo ago

Isn’t the moat in the product/UI/UX? I use Claude daily and love the “scratch notebook” feel of it. The barebone model does not get you any of this.

hamandcheese•3mo ago

I agree that the scaffolding around the model contributes greatly to the experience. But it doesn't take billions of dollars in GPUs to do that part.

Grimblewald•3mo ago

Of course they dont, the only advantage it ever had was the willingness to destroy trust on the internet by scraping everything from everyone rules and expectations be dammed.

The underlying architecture isnt special, the underlying skills and tools aren't special.

There is nothing openAI brings to the table other than a willingness to lie, cheat, and steal. That only gives you an edge for so long.

red2awn•3mo ago

The moat of OpenAI is 1. internal knowledge they've built over the last few years building front tier models 2. their talent 3. the ChatGPT brand (go ask a random person on the street, they know ChatGPT but not Claude or Gemini)

steveny3456•3mo ago

Juju

krystofee•3mo ago

Isnt it huge deal, that this 30B model can compare and surpass huge closed models?

tbruckner•3mo ago

Has anyone found these deep research tools useful? In my experience, they generate really bland reports don't go much further than summarization of what a search engine would return.

ainch•3mo ago

The reports are definitely bland, but I find them very helpful for discovering sources. For example, if I'm trying to ask an academic question like "has X been done before," sending something to scour the internet and find me examples to dig into is really helpful - especially since LLMs have some base knowledge which can help with finding the right search terms. It's not doing all the thinking, but those kind of broad overviews are quite helpful, especially since they can just run in the background.

kmarc•3mo ago

I caught myself that most of my LLM usage is like this:

ask a loaded, "filter question" I more or less know the answer for, and mostly skip the prose and get to the links to its sources.

ukuina•3mo ago

The "loaded question" approach works for getting MUCH better pro/con lists, too, in general, across all LLMs.

vogu66•3mo ago

I do that too, I wonder how much of it is the LLM being helpful and how much of it is the RAG algorithm somehow providing better references to the LLM than a google search can?

andy99•3mo ago

My experience is the same as yours. It feels to me (similar to most LLM writing) like they write for someone who’s not going to read it or use it but is going to glance at it and judge the quality that way and assume it’s good.

Not to different from a lot of consulting reports, in fact, and pretty much of no value if if you’re actually trying to learn something.

Edit to add: even the name “deep research” to me feels like something defined to appeal to people who have never actually done or consumed research, sort of like the whole “phd level” thing.

tbruckner•3mo ago

"they write for someone who’s not going to read it" is a great way to phrase it.

blaesus•3mo ago

"Summarization of what a search engine would return" is good enough for many of my purposes though. Good for breaking into new grounds, finding unknown unknowns, brainstorming etc.

andai•3mo ago

I have a script that searches DDG (free), scrapes top 5 results, shoves them into an LLM, and answers your question.

I wrote it back when AI web search was a paid feature and I wanted access to it.

At the time Auto-GPT was popular and using the LLM itself to slowly and unreliably do the research.

So I realized a Python program would be way faster and it would actually be deterministic in terms of doing what you expect.

This experience sort of shaped my attitude about agentic stuff, where it looks like we are still relying too heavily on the LLM and neglecting to mechanize things that could just work perfectly every time.

zo1•3mo ago

If you think these things are just using a "dumb" search query, and using the top 5 hits, you're in for a lot of surprises very soon.

andai•3mo ago

Well, considering TFA, it would be pretty strange if I did!

My point was it's silly to rely on a slow, expensive, unreliable system to do things you can do quickly and reliably with ten lines of Python.

I saw this in the Auto-GPT days. They tried to make GPT-4 (the non-agentic one with the 8k context window) use tool calls to do a bunch of tasks. And it kept getting confused and forgetting to do stuff.

Whereas if you just had

for page in pages: summarize(page)

it works 100% of the time, can be parallelized etc.

And of course the best part is that the LLM itself can write that code, i.e. it already has the power to make up for its own weaknesses, and make (parts of itself) run deterministically.

---

On that note, do you know more about the environment they ran this thing in? I got API access (it's free on OpenRouter), but I'm not sure what to plug this into. OpenRouter provides a search tool, but the paper mentions intelligent context compression and all sorts of things.

criemen•3mo ago

I tend to use them when I'm looking to buy something of category X, and want to get a market overview. I can then still dig in and decide whether I consider the sources used trustworthy or not, and before committing money, I'll read some reviews myself, too. Still, it's a speedup for me.

edot•3mo ago

Yes, this is one of my primary use cases for deep research right now. It will become garbage in a few short years once OpenAI starts selling influence / ads. I think they’ve started a bit with doing this but the recommendations still seem mostly “correct”. My prior way of doing this was Googling with site:Reddit.com for real reviews and not SEO spam reviewers.

infecto•3mo ago

Same case for me. I find it pretty good at it too. Far from perfect but it usually a pretty darn good start.

alasr•3mo ago

I haven't used any LLM deep research tools in the past; today, after reading this HN post, I gave Tongyi DeepResearch a try to see how it performs on a simple "research" task (in an area I've working experience in: healthcare and EHR) and I'm satisfied with its response (for the given tasks; I, obviously, can't say anything how it'll performs on other "research" tasks I'll ask it in the future). I think I'll keep using this model for tasks for which I was using other local LLM models before.

Besides I might give other large deep research models a try when needed.

TACIXAT•3mo ago

I have used Gemini's 2.5 Pro deep research probably about 10 times. I love it. Most recently was reviewing PhD programs in my area then deep diving into faculty research areas.

remus•3mo ago

I run a small website and am based in the UK and have used it a couple of times to summarise what I need to do to comply with different bits of legislation e.g. Online Safety Act. What's really useful for me is that I can feed in a load of context about what the site does and get a response that's very tailored to what's relevant for me, and generate template paperwork that I can then fill out to improve my position with regard to the legislation.

For sure it's probably missing stuff that a well payed lawyer would catch, but for a project with zero budget it's a massive step up over spending hours reading through search results and trying to cobble something together myself.

roryirvine•3mo ago

The hidden cost there is that the risk of complying with the legislation remained entirely with you. Even the best specialist research LLM still might easily have hallucinated or made some other sort of error which resulted in it giving you confusing or incorrect advice - and you would have been the one held liable for following it.

Whereas with real legal advice, your lawyer will carry Professional Indemnity Insurance which will cover any costs incurred if they make a mistake when advising you.

As you say, it's a reasonable trade-off for you to have made when the alternative was sifting through the legislation in your own spare time. But it's not actually worth very much, and you might just as well have used a general model to carry out the same task and the outcome would likely have been much the same.

So it's not particularly clear that the benefits of these niche-specific models or specialised fine-tunes are worth the additional costs.

(Caveat: things might change in the future, especially if advancements in the general models really are beginning to plateau.)

andai•3mo ago

You can copy-paste it into your favorite LLM and ask questions about it. That solves several problems simultaneously.

infecto•3mo ago

I use ChatGPTs quite often. I can send it a loaded question and it helps tease out sources and usually at the very least scrapes away some of the nuance. I have used it a lot for finding a list of a type of products too. Taking the top n search results is already pretty useful for me but I find it typically is a little more in depth than that, going down a few rabbit holes of search depending on the topic. It does not eliminate doing your own research but it helps consolidate some of the initial information.

Then I can further interrogate the information returned with a vanilla LLM.

threecheese•3mo ago

Perplexity’s Research tool has basically replaced Google for me, for any search where I don’t already know the answer or know that it’s available somewhere (like documentation).

I use it dozens of times per day, and typically follow up or ask refining questions within the thread if it’s not giving me what I need.

It typically takes between 10sec and 5 minutes, and mostly replicates my manual process - search, review results, another 1..N search passes, review, etc. Initially it rephrases/refines my query, then builds a plan, and this looks a lot like what I might do manually.

fudged71•2mo ago

I almost exclusively use Deep Research as inputs to LLMs for deeper domain knowledge including frontier scientific theories etc.

DataDaemon•3mo ago

Unfortunately soon China will take lead in AI.

aeve890•3mo ago

Unfortunately? May I ask why? What country would you like to be the lead in AI?

ninetyninenine•3mo ago

[flagged]

Krasnol•3mo ago

[flagged]

GordonS•3mo ago

I rather think the GP was being sarcastic. At least, I hope they were.

Krasnol•3mo ago

[flagged]

ninetyninenine•3mo ago

[flagged]

tomhow•3mo ago

Please don't troll on HN.

tomhow•3mo ago

Please don't engage in political/nationalistic battle on HN, and make an effort to observe all guidelines when participating here.

https://news.ycombinator.com/newsguidelines.html

victorbjorklund•3mo ago

[flagged]

davidsainez•3mo ago

I have been very impressed with the Qwen3 series. I'm still evaluating them, and I generally take LLM benchmarks with a huge grain of salt, but their MoE models in particular seem to offer a lot of bang for the compute. But what makes you so sure they will take the lead?

greggh•3mo ago

Deepseek, Qwen, GLM (quite good). All being open and available for local use definitely puts them ahead in that space, which means a lot of the tinkerers and younger people learning to do things like train and fine-tune are getting good with Chinese models and I do think getting in early like that is a great way to gain mindshare in a space. Look at Apple or Microsoft doing everything they could early on to get their machines and software into schools as early as possible.

ninetyninenine•3mo ago

Isn't this an indication they are already in the lead? They currently have the best model that beats everyone on all quantitative metrics? Are you implying that the US has a better model somewhere?

mike_hearn•3mo ago

They aren't in the lead. They are very close behind, but that's not hard given the quantity of freely published papers. They keep proving they can train models competitive with US models, but, only months after the fact. And at least some of the Chinese models were trained via distillation from US models. Probably not at Alibaba but it seems at least some models were.

sumo43•3mo ago

I made a 4B Qwen3 distill of this model (and a synthetic dataset created with it) a while back. Both can be found here: https://huggingface.co/flashresearch

Nymbo•3mo ago

Just tried this out with my web search mcp, extremely impressed with it. Never seen deep research this good from a model so small.

Imustaskforhelp•3mo ago

Can you please create a huggingface space or something similar, I am not sure about the state of huggingface but I would love to be able to try it out in a browser or something similar if possible as I am really curious and I just love qwen3 4b as they were one of the models which work even on my intel integrated gpu graphics card at a really impressive rate and they were really nice the last time I tried but this looks even more cooler/practical.

I had once an idea of using something like qwen4 or some pre-trained AI model just to do a (to censor or not to) idea after the incidents of mecha-hitler. I thought if there was some extremely cheap model which could detect that it is harmful that the AI models of Grok itself couldn't recognize, it would've been able to prevent the absolute advertising/ complete disaster that happened.

What are your thoughts on it? I would love to see an Qwen 4B of something similar if possible if you or anyone is up to the challenge or any small LLM's in generals. I just want to know if this idea fundamentally made sense or not.

Another idea was to use it for routing purposes similar to what chatgpt does but I am not sure about that now really but I still think that it maybe worth it but this routing idea I had was before chatgpt had implemented it, so now after it implemented, we are gonna find some more data/insights about if its good or not/ worth it, so that's nice.

bigyabai•3mo ago

> What are your thoughts on it?

You don't really need an entire LLM to do this - lightweight encoder models like BERT are great at sentiment analysis. You feed it an arbitrary string of text, and it just returns a confidence value from 0.0 to 1.0 that it matches the characteristics you're looking for.

greggh•3mo ago

I use emotions-analyzer-bert for this classifying content in a similar way. It's very small and very fast, under a gig of vram in use.

brutus1213•3mo ago

I recently got a 5090 with 64 GB of RAM (intel cpu). Was just looking for a strong model I can host locally. If I had performance of GPT4-o, I'd be content. Are there any suggestions or cases where people got disappointed?

p1esk•3mo ago

5090 has 32GB of RAM. Not sure if that’s enough to fit this model.

svnt•3mo ago

It should fit enough of the layers to make it reasonably performant.

IceWreck•3mo ago

LlamaCPP supports offloading some experts in a MoE model to CPU. The results are very good and even weaker GPUs can run larger models at reasonable speeds.

n-cpu-moe in https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...

bogtog•3mo ago

GPT-OSS-20B at 4- or 8-bits is probably your best bet? Qwen3-30b-a3b probably the next best option. Maybe there exists some 1.7 or 2 bit version of GPT-OSS-120B

yalogin•3mo ago

In my experience using these supposed expert models, they are all more or less the same given they all are trained on the same internet data. The differentiation and value is in the context window management and how relevant info from your session is pulled in. So it’s the interface to the model that makes all the difference. Even there the differences are quite minimal. That is because all these companies want to toe the line between providing functionality to keep the users engaged and pushing them to sign up for the subscription.

All this to ask the question, if I host these open source models locally, how is the user interface layer that remembers and picks the right data from my previous session and the agentic automation and others implemented? Do I have to do it myself or are the free options for that?

viksit•3mo ago

this is a great question. what are the main use cases that you have for this? i’ve been working on a library for something similar and exposing it via an mcp interface. would love to pick your brain on this (@viksit on twitter)

ninetyninenine•3mo ago

Is China dominating the US in terms of AI? Given that they currently have a model that beats the best models at all formal quantitative benchmarks?

What is the state of AI in China? My personal feeling is that it doesn't dominate the zeitgeist in China as it does in the US and despite this because of the massive amount of intellectual capital they have just a small portion of their software engineering talent working on this is enough to go head to head with us even though it only takes a fraction of their attention.

idiotsecant•3mo ago

I think the lesson of the Chinese catchup in AI is that there is a massive disadvantage in being first, in this domain. You can do all the hard work and your competitors can distill that work out of your model for pennies on the dollar. Why should anyone want to do the work?

MaxPock•3mo ago

This sounds like copium . If it was just about distillation,we'd be seeing many awesome models from Europe ,Japan and even India.

mike_hearn•3mo ago

It's certainly both a lot more than distillation and at least some Chinese labs have been cloning OpenAI via distillation. That's why they instituted much tighter ID verification requirements earlier this year.

No, the reason you don't see many open source models coming from the rest-of-world (other than Mistral in France) is that you still need a ton of capital to do it. China can compete because the CCP used a combination of the Great Firewall and lax copyright/patent enforcement to implement protectionism for internet services, which is a unique policy (one that obviously came with massive costs too). This allowed China to develop home grown tech companies which then have the datacenters, capital and talent density to train models. Rest of world didn't do this and wasn't able to build up domestic tech industries competitive with the USA.

MaxPock•3mo ago

There’s no Chinese lab that has been accused by OpenAI or anyone else of distillation. The accusations come from fringe right-wing media that are used to the “China only copies” trope. Training a model, by the way, is not about money, because many Western tech giants have more money than the CCP can allocate to Chinese labs. Apple, Meta, Amazon, SAP, IBM, and others have access to the same data as OpenAI and should thus be able to come up with a SOTA model in under a year, right? On lax copyright enforcement, I’d like to point out that it’s actually Western labs that have been taken to court for stealing content.

On matters protectionism,the Great Firewall was the best thing that China did.It prevented them from digital colonization like the rest of the world.

idiotsecant•3mo ago

Oh, wow. 1) you're getting awfully defensive here. I didn't say there was some moral failing because Chinese shops distilled models from western ones. It's the smart play. I'm only commenting on how little of a moat first movers have. 2) if you can't admit that deepseek distilled their models from existing work I'm not sure what to tell you. The early models even identified themselves as chatGPT. It's widely known and has substantial evidence. This isn't team sports, you don't need to play defense. We deal with reality here.

mike_hearn•3mo ago

They didn't make a big fuss about it but OpenAI have explained that they instituted ID and country verification because there were competitors distilling their models. Of their competitors do you really think Anthropic, Google or Meta were doing that? It's pretty clear who they were talking about.

Chinese labs are mostly (all?) privately funded, as far as I know. Alibaba isn't a SOE. That's why I didn't mention state subsidies, although that might be happening (and certainly is happening w.r.t. access to electricity).

I didn't mention lax copyright/patent enforcement in the context of AI, but rather, the prior years in which China was able to build up local tech firms capable of taking on the US tech firms. It's mostly in the past now, they don't need to do that stuff anymore.

seanmcdirmid•3mo ago

America continues to dominate in amount of money spent on AI resources but China has more value in the human and hardware resources it brings to bare.

China is also more willing to deploy AI apps that Americans would hesitate on, although I'm not sure I've seen much of it so far outside of Shenzhen cyberpunk clips. Let's see how this plays out in a decade.

whiplash451•3mo ago

Has anyone tried running this on a 5090 or 6000 pro? What throughput do you see?

VladVladikoff•3mo ago

Recently I gave a list of 300 links to deep research and asked it to go through each one to analyze a certain question about them. Repeatedly it would take shortcuts and not actually do the full list. Is this caused by a context window limits? Or maybe Open AI limits request size? Is it possible to not run into these types of limits with locally hosted models?

oofbey•3mo ago

I’ve also had extremely poor luck getting any LLM agent to go through a long list of repetitive tasks. Don’t know why. I’d guess it’s because they’re trained for transactional responses, and thus are horrible at repute anything.

ukuina•3mo ago

Very much this.

You are better off asking it a write a script to invoke itself N times across the task list.

threecheese•3mo ago

Same. I think there’s an untapped market (feature really) here, which if isn’t solved by GPT-next will start to reveal itself as a problem more and more.

LLMs are really bad at being comprehensive, in general, and from one inference to the next their comprehensive-ness varies wildly. Because LLMs are surprising the hell out of everyone with their abilities, less attention is paid to this; they can do a thing well, and for now that’s good enough. As we scale usage, I expect this gap will become more obvious and problematic (unless solved in the model, like everything else).

A solution I’ve been toying with is something like a reasoning step, which could probably be done with mostly classical NLP, that identifies constraints up front and guides the inference to meet them. Like a structured output but at a session level.

I am currently doing what you suggest though, I have the agent create a script which invokes … itself … until the constraints are met, but that obviously requires that I am engaged there; I think it could be done autonomously, with at least much better consistency (at the end of the day even that guiding hand is inference based and therefore subject to the same challenges).

ugh123•3mo ago

Slightly off topic but why does word wrapping seem to be broken in this site? Chrome on Android

rippeltippel•3mo ago

Thank you for pointing that out, I was about to ask the same. It's giving my OCD a hard time reading it.

blueboo•3mo ago

When was the last time you did a deep research? Good agents just do research as necessary. I find GPT5 Pro >> all the top DR agents

zwaps•3mo ago

The OpenAI numbers are a red herring anyway.

For most plans, Deep Research is capped at around 20 sources, making it for many cases the least useful research agent, in particular worse than a thinking mode Gpt5 query

andai•3mo ago

Tongyi provides this model on OpenRouter, including a free version.

https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b

https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b:fr...

PunchTornado•3mo ago

it rivals a model that is obsolete? who is using openai deepresearch when there are so many better models out there?

incomingpain•3mo ago

I love this one: https://github.com/LearningCircuit/local-deep-research

I tied it together with qwen3 30b thinking. Very easy to get it up and running, but lots of the numbers are shockingly low. You need to boost iterations and context. Especially easy if you already run searxng locally.

I havent finished tuning the actual settings, but for the detailed report it'll take ~20 minutes and so far has given pretty good results. Similar to openai's deep research. Mine often has ~100 sources.

But something I have noticed. It didnt seem to me the model was important. The magic was moreso in the project. Getting deep with higher iterations and more results.

whatpeoplewant•3mo ago

Great to see an open 30B MoE aimed at “deep research.” These shine when used in a multi-agent setup: run parallel agentic AI workers (light models for browsing/extraction) and reserve the 30B agentic LLM for planning, tool routing, and verification—keeping latency/cost in check while boosting reliability. MoE specialization fits distributed agentic AI well, but you’ll want orchestration for retries/consensus and task-specific evals on multi-hop web research to guard against brittle routing and hallucinations.

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

The F Word

Speed up responses with fast mode

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Software factories and the agentic moment

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Show HN: Browser based state machine simulator and visualizer

FDA intends to take action against non-FDA-approved GLP-1 drugs

You Are Here

Show HN: A luma dependent chroma compression algorithm (image compression)

First Proof

LLMs as the new high level language

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

I write games in C (yes, C) (2016)

Start all of your commands with a comma (2009)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

Reinforcement Learning from Human Feedback

Selection rather than prediction

The AI boom is causing shortages everywhere else

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

72M Points of Interest

Coding agents have replaced every framework I used

Unseen Footage of Atari Battlezone Arcade Cabinet Production

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

The F Word

Speed up responses with fast mode

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Software factories and the agentic moment

Hoot: Scheme on WebAssembly

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Show HN: Browser based state machine simulator and visualizer

FDA intends to take action against non-FDA-approved GLP-1 drugs

You Are Here

Show HN: A luma dependent chroma compression algorithm (image compression)

First Proof

LLMs as the new high level language

Al Lowe on model trains, funny deaths and working with Disney

Vocal Guide – belt sing without killing yourself

I write games in C (yes, C) (2016)

Start all of your commands with a comma (2009)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

Reinforcement Learning from Human Feedback

Selection rather than prediction

The AI boom is causing shortages everywhere else

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

72M Points of Interest

Coding agents have replaced every framework I used

Unseen Footage of Atari Battlezone Arcade Cabinet Production

A Fresh Look at IBM 3270 Information Display System

France's homegrown open source online office suite

Tongyi DeepResearch – open-source 30B MoE Model that rivals OpenAI DeepResearch

Comments