frontpage.

Google and Microsoft Paying Creators $500K+ to Promote AI Tools

https://www.cnbc.com/2026/02/06/google-microsoft-pay-creators-500000-and-more-to-promote-ai.html
1•belter•46s ago•0 comments

New filtration technology could be game-changer in removal of PFAS

https://www.theguardian.com/environment/2026/jan/23/pfas-forever-chemicals-filtration
1•PaulHoule•1m ago•0 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
1•momciloo•2m ago•0 comments

Kinda Surprised by Seadance2's Moderation

https://seedanceai.me/
1•ri-vai•2m ago•1 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
1•valyala•2m ago•0 comments

Django scales. Stop blaming the framework (part 1 of 3)

https://medium.com/@tk512/django-scales-stop-blaming-the-framework-part-1-of-3-a2b5b0ff811f
1•sgt•2m ago•0 comments

Malwarebytes Is Now in ChatGPT

https://www.malwarebytes.com/blog/product/2026/02/scam-checking-just-got-easier-malwarebytes-is-n...
1•m-hodges•2m ago•0 comments

Thoughts on the job market in the age of LLMs

https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in
1•gmays•3m ago•0 comments

Show HN: Stacky – certain block game clone

https://www.susmel.com/stacky/
2•Keyframe•6m ago•0 comments

AIII: A public benchmark for AI narrative and political independence

https://github.com/GRMPZQUIDOS/AIII
1•GRMPZ23•6m ago•0 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
1•valyala•7m ago•0 comments

The API Is a Dead End; Machines Need a Labor Economy

1•bot_uid_life•9m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•Jyaif•10m ago•0 comments

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

https://www.scientificamerican.com/article/new-glp-1-weight-loss-drugs-are-coming-and-theyre-stro...
4•randycupertino•11m ago•0 comments

Convert tempo (BPM) to millisecond durations for musical note subdivisions

https://brylie.music/apps/bpm-calculator/
1•brylie•13m ago•0 comments

Show HN: Tasty A.F.

https://tastyaf.recipes/about
1•adammfrank•14m ago•0 comments

The Contagious Taste of Cancer

https://www.historytoday.com/archive/history-matters/contagious-taste-cancer
1•Thevet•16m ago•0 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
1•alephnerd•16m ago•1 comments

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

https://koreajoongangdaily.joins.com/news/2026-02-07/business/finance/Crypto-exchange-Bithumb-mis...
1•giuliomagnifico•16m ago•0 comments

Beyond Agentic Coding

https://haskellforall.com/2026/02/beyond-agentic-coding
3•todsacerdoti•17m ago•0 comments

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

https://www.loom.com/embed/e26a750c0c754312b032e2290630853d
1•kaicianflone•19m ago•0 comments

OpenBSD Copyright Policy

https://www.openbsd.org/policy.html
1•Panino•20m ago•0 comments

OpenClaw Creator: Why 80% of Apps Will Disappear

https://www.youtube.com/watch?v=4uzGDAoNOZc
2•schwentkerr•24m ago•0 comments

What Happens When Technical Debt Vanishes?

https://ieeexplore.ieee.org/document/11316905
2•blenderob•25m ago•0 comments

AI Is Finally Eating Software's Total Market: Here's What's Next

https://vinvashishta.substack.com/p/ai-is-finally-eating-softwares-total
3•gmays•26m ago•0 comments

Computer Science from the Bottom Up

https://www.bottomupcs.com/
2•gurjeet•26m ago•0 comments

Show HN: A toy compiler I built in high school (runs in browser)

https://vire-lang.web.app
1•xeouz•28m ago•1 comments

You don't need Mac mini to run OpenClaw

https://runclaw.sh
1•rutagandasalim•28m ago•0 comments

Learning to Reason in 13 Parameters

https://arxiv.org/abs/2602.04118
2•nicholascarolan•30m ago•0 comments

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

https://arxiv.org/abs/2601.22389
1•energyscholar•31m ago•1 comments

Fara-7B: An efficient agentic model for computer use

https://github.com/microsoft/fara
191•maxloh•2mo ago

Comments

codezero•2mo ago
Are there any agentic models like this that would work for controlling input in arbitrary video games? I've been wanting to have an AI play Kerbal Space Program because I think it would just be pretty hilarious.
jauntywundrkind•2mo ago
I might suggest looking at Alibaba's open source AgentEvolver. It doesn't specifically target video games, but it's an agentic system designed around a more OODA-loop, evolutionary approach than the usual train/inference setup. It has potential and could be exciting to see.

I like how they classify the sub-problems of their work: environment / self-questioning -> task / self-questioning -> trajectory / self-evaluation. OODA-esque.

https://arxiv.org/abs/2511.10395 https://github.com/modelscope/AgentEvolver with thanks to Sung Kim who has been a great feed https://bsky.app/profile/sungkim.bsky.social/post/3m5xkgttk3...

wmf•2mo ago
https://deepmind.google/blog/sima-2-an-agent-that-plays-reas...

(not a local model)

lawlessone•2mo ago
i'm curious what would happen if you got it to play online poker...
serf•2mo ago
> I've been wanting to have an AI play Kerbal Space Program because I think it would just be pretty hilarious.

people have been experimenting with this since early Opus days.

Check out kRPC [0]. Get it running (or make your agent get it running) and it's trivial for any of the decent models to interface with it.

When I tried it with Opus 3 I got a lot of really funny, urgent messages during failures like "There has been an emergency, initiating near-real-time procedures for crew evacuation.." and then it would just decouple every stage and ram into the ground.

Makes for a fun ant-farm to watch though.

[0]: https://krpc.github.io/krpc/
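For reference, the kRPC route looks roughly like this from Python - a minimal sketch assuming the kRPC server mod is running inside KSP and a craft is on the pad; an agent would be generating calls like these rather than a person typing them:

  # Minimal kRPC sketch: throttle up, stage, and watch altitude.
  # Requires the kRPC mod running inside KSP and `pip install krpc`.
  import time
  import krpc

  conn = krpc.connect(name="llm-flight-director")
  vessel = conn.space_center.active_vessel

  vessel.control.throttle = 1.0
  vessel.control.activate_next_stage()  # ignition / liftoff

  # An agent loop would feed this telemetry back to the model as context.
  for _ in range(30):
      print(f"altitude: {vessel.flight().mean_altitude:.0f} m")
      time.sleep(1)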

stan_kirdey•2mo ago
* fine tuned Qwen-7B
donbox•2mo ago
So.. the tables are really turning?
PhilippGille•2mo ago
Qwen2.5-VL-7B to be precise. It's a relevant difference.
maartenh•2mo ago
How much VRAM would this require if I wanted to run it locally?

I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual hardware requirements for any self-hosted AI model. Any tips/suggestions/recommended resources for that?

nsingh2•2mo ago
One quick way to estimate a lower bound is to take the number of parameters and multiply it with the bits per parameter. So a model with 7 billion parameters running with float8 types would be ~7 GB to load at a minimum. The attention mechanism would require more on top of that, and depends on the size of the context window.

You'll also need to load inputs (images in this case) onto the GPU memory, and that depends on the image resolution and batch size.
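As a rough worked example of that lower bound (the numbers are just parameter count times bytes per parameter, nothing measured):

  # Back-of-the-envelope weight memory: parameters x bytes per parameter.
  # KV cache, activations, and image inputs all add more on top of this.
  def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
      return params_billion * 1e9 * (bits_per_param / 8) / 1e9

  for bits in (16, 8, 4):
      print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB of weights")
  # prints ~14.0, ~7.0, and ~3.5 GB respectively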

selcuka•2mo ago
I use LMStudio for running models locally (macOS) and it tries to estimate whether the model would fit in my GPU memory (which is the same thing as main memory for Macs).

The Q4_K_S quantized version of Microsoft Fara 7B is a 5.8GB download. I'm pretty sure it would work on a 12GB Nvidia card. Even the Q8 one (9.5GB) could work.

BoredomIsFun•2mo ago
It's a 12 GiB card, not 12 GB. That extra tail works out to roughly another 800 MB.
selcuka•2mo ago
Fair, but the download sizes given above are also in GiB.

Also these calculations are very approximate anyway. The 6.67% difference will not change the fact that 5.8 << 12.

BoredomIsFun•2mo ago
No, file sizes are normally given in raw bytes. I've downloaded dozens of models from Hugging Face, and the difference has always favoured the VRAM size in GiB.
daemonologist•2mo ago
12GB will be sufficient to run a quantized version, provided you're not running anything else memory-hungry on the GPU.

You're not finding hardware specs because there are a lot of variables at play - the degree to which the weights are quantized, how much space you want to set aside for the KV cache, extra memory needed for multimodal features, etc.

My rule of thumb is 1 byte per parameter to be comfortable (running a quantization with somewhere between 4.5 and 6 bits per parameter and leaving some room for the cache and extras), so 7 GB for 7 billion parameters. If you need a really large context you'll need more; if you want to push it you can get away with a little less.

baq•2mo ago
If you have the combined RAM it’ll work even if it doesn’t fit into VRAM, just slower. A 7B model like this one might actually be fast enough.
jillesvangurp•2mo ago
It's a good reason to use Macs, as they have unified RAM. I have a 48GB MacBook Pro. Plenty of memory to run these models. And the M4 Max should be plenty fast. You kind of want enough RAM that you have plenty left to run your normal stuff after the model has loaded.

I wish I had more time to play with this stuff. It's so hard to keep up with all this.

rahimnathwani•2mo ago
The model is 17GB, so you'd need 24GB VRAM:

https://huggingface.co/microsoft/Fara-7B/tree/main

If you want to find models which fit on your GPU, the easiest way is probably to browse ollama.com/library

For a general purpose model, try this one, which should fit on your card:

https://ollama.com/library/gemma3:12b

If that doesn't work, the 4b version will definitely work.

samus•2mo ago
There aren't any fixed specs because it depends a lot on what your use case is, what speed you expect, how accurately you want it to run, how many users want to use it, and how much context size you need.

- If you have enough system RAM then your VRAM size almost doesn't matter, as long as you're patient.

- For most models, running them at 16-bit precision is a waste unless you're fine-tuning. The difference from Q8 is negligible, and Q6 is still very faithful. In return, they need less memory and run faster.

- Users obviously need to share computing resources with each other. If this is a concern, then you need at minimum enough GPUs to ensure the whole model fits in VRAM, or all the loading and unloading will royally screw up performance.

- Maximum context length is crucial to think about since it has to be stored in memory as well, preferably in VRAM. Therefore the number of concurrent users plays a role in what maximum context size you can offer. But it is also possible to offload it to system RAM or to quantize it.

Rule of thumb: budget 1.5*s, where s is the model size at the quantization level you're using. By that measure an 8B model should be a good fit for a 12GB card, which is one of the main reasons this is a common size class of LLMs.

A4ET8a8uTh0_v2•2mo ago
Looking at the table, I'll admit that I don't get most of the use cases (maybe with the exception of comparison shopping / gathering info), but are people really 'outsourcing' shopping? Am I really that far outside what 'normal' consumers do these days?

Task Segment               Tasks  SoM GPT-4o-0513  SoM o3-mini  SoM GPT-4o  GLM-4.1V-9B  OAI Comp-Use  UI-TARS-1.5  Fara-7B
Single-Site Tasks
  Shopping                    56             62.5         71.4        38.1         31.0          42.3         41.1     52.4
  Flights                     51             60.1         39.2        11.1         10.5          17.6         10.5     37.9
  Hotels                      52             68.6         56.4        31.4         19.9          26.9         35.3     53.8
  Restaurants                 52             67.9         59.6        47.4         32.1          35.9         22.4     47.4
  Activities                  80             70.4         62.9        41.7         26.3          30.4          9.6     36.3
  Ticketing                   57             58.5         56.7        37.4         35.7          49.7         30.4     38.6
  Real Estate                 48             34.0         17.4        20.1         16.0           9.0          9.7     23.6
  Jobs/Careers                50             49.3         44.0        32.7         22.7          20.7         20.7     28.0
Multi-Step Tasks
  Shopping List (2 items)     51             66.0         62.7        17.0          7.8          34.0         20.9     49.0
  Comparison Shopping         57             67.3         59.1        27.5         22.8           1.2          8.8     32.7
  Compositional Tasks         55             51.5         39.4        26.7         17.0          10.3          9.1     23.0
Overall

doug_durham•2mo ago
I can't imagine having an AI agent book anything or purchase anything, in the same way that I wouldn't have someone I don't know personally do that for me. It should do the research and take me to the place where I need to take over.
m00x•2mo ago
I use AI to shop for wine at my local stores for me.
tyre•2mo ago
Not necessarily consumers. Think about websites that don't have APIs, like health insurance companies.
PunchyHamster•2mo ago
An LLM getting a bunch of products out of a category and generating a summary for me seems like a pretty useful task.
pogue•2mo ago
Why does Microsoft keep releasing models trained on synthetic data? Is it possible their contract with OpenAI won't let them do anything else?

I would think Microsoft, of all companies, would want to be working on their own LLM behind the scenes, even if they're relying on OpenAI for the bulk of their work.

Meta seems to be the only US company releasing big 'open source' models, while Chinese companies continue to release many completely open source LLMs.

yousif_123123•2mo ago
Perhaps they want to be able to run them on mobile hardware they release?
pogue•2mo ago
I can definitely see them wanting models that can run locally on Windows computers or Surface tablets - although their focus seems to be sticking Copilot into absolutely anything and everything possible. But why synthetic-data models? Other companies have made small-parameter models, but they don't seem to keep them up to date (correct me if I'm wrong).
vineyardmike•2mo ago
I don't think there's any strict reason in their contract that they can't. I think they're just trying not to "waste" resources competing to build another expensive foundation model. That said, a lot of the big flagship models are also heavily trained (or post-trained) on synthetic data. Microsoft has done a lot of application-specific fine-tuning research.

This model in particular makes sense to be synthetic though. It’s explicitly trained to control a computer, and I doubt there’s a large enough amount of public training data on this use case.

I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west. There’s tons of stellar LLMs available from major US companies if you’re just using an API. It’s also a convenient marketing and differentiation opportunity. Some of the companies behind the bigger “agentic” models have started to offer a cheap subscription alternative to US companies. If they build up a big enough business I wouldn’t be surprised if they stop open sourcing right away.

fisf•2mo ago
> I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west.

The obvious bias of the models, when it comes to Chinese politics and history, certainly does not help here.

lawlessone•2mo ago
TBF it's obvious to us, in the same way many of our own biases are not obvious to us.
esafak•2mo ago
> I suspect that Chinese models are largely forced to open source as a trust building step because of general China-phobia in the west.

They're late to the game, so they're pressuring Western competitors on price by taking advantage of their lower costs while catching up. Now they are well prepared to lead on the next front: robotics.

Mars008•2mo ago
> Why does Microsoft keep releasing models trained on synthetic data?

Why not? That's the way to go. In some domains the only way to go.

freehorse•2mo ago
My guess is that it is safer for them to use synthetic data only, as they have less to worry about when it comes to things like people using the models for erotic roleplay and similar.
subscribed•2mo ago
Why "worry"?

Also, no one is using a 7B model for any roleplay, erotic or not; they're not imaginative enough.

freehorse•2mo ago
I mean, if the model is intended to be deployed somewhere for some purpose by a business, you probably would not want to worry about whether users might use it to produce something that could be seen as embarrassing or problematic PR-wise. Having a sanitized and more directed training set can help with that. MS does not produce only 7B models, and 7B models can still say "embarrassing" things. I may be wrong of course.
dev_hugepages•2mo ago
They're not very skilled
jillesvangurp•2mo ago
It's a cost and time saving measure. Human labeling is hard to scale and it takes time. With synthetic data, they can train faster and cheaper and speed up the pace at which they produce new models and run experiments with new types of models. Grok is doing similar things. It's smart.
credit_guy•2mo ago
It is just much more efficient to train on synthetic data. When you train on real data, all you know is the next token. With synthetic data you know the probability distribution of the next token; this results in a multiplier effect, and sometimes this effect is dramatic.

[1] https://arxiv.org/pdf/2504.14772v1

curious_curios•2mo ago
Depends on how you define big, but there are Gemma, Phi, OLMo, Mistral, and GPT-OSS, which are all competitive and can run on commodity hardware.
stocksinsmocks•2mo ago
The attorneys said so. This is why progress happens in startups and gets bought by the big boys. They’re constitutionally incapable of innovation.
bigfudge•2mo ago
Yeah, well in this case it would be a feature rather than a bug to be squeamish about outright theft and repackaging of copyrighted materials for profit. If only it also applied to their acquisitions...
ghrjfjfnnfn•2mo ago
Forgive me if I can't keep up with the latest AI bubble mania buzzwords, but what is "agentic" even supposed to mean? As far as I can tell it doesn't have a precise definition, and doesn't even sound like proper English.
hsaliak•2mo ago
it means you can make it do stuff (run preconfigured programs) for you, and not just chat with you
doug_durham•2mo ago
Ask your favorite LLM. It will tell you.
dwohnitmok•2mo ago
"Agentic" doesn't really mean much and I dislike it. There's no clean line between a "normal" LLM and an "agentic" LLM. Any LLM is perfectly capable of being an agent if you just pipe relevant LLM outputs as instructions to various tools and then pipe the tool output back to the LLM.

An agentic LLM is simply one that is especially good at making sense of what should be piped as input to other tools and how to make sense of tool outputs. Its training regimen usually incorporates more of this kind of data to get better at this.

ilaksh•2mo ago
It means it is trained/tuned for function (tool) calling (e.g. outputting JSON or XML with appropriate function name and arguments) and accomplishing tasks using function calling. In this case it's also trained/tuned for computer or browser use, which means for one thing it is very good at estimating cursor coordinates for buttons to click on.
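As a concrete, made-up illustration, one turn of such a tool call might look something like this; the function name, arguments, and schema here are hypothetical, since every model and framework uses its own format:

  import json

  # Hypothetical computer-use tool call: the model names an action and supplies
  # arguments; the harness executes it and returns the result as the next input.
  tool_call = {
      "name": "click",
      "arguments": {"x": 412, "y": 87, "reason": "Submit button for the search form"},
  }
  print(json.dumps(tool_call, indent=2))
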
AYBABTME•2mo ago
My guess is that it's tuned to do tool calls properly and return structured data, which are two things you need when writing an agent loop.
6510•2mo ago
robot overlords
baq•2mo ago
Think ‘tries to figure out stuff and try out commands (tools) to solve the task’ ie. is trained to have agency: https://en.wikipedia.org/wiki/Agency_(philosophy)
danieldrehmer•2mo ago
"agentic" is for when you have a loop function that tells your llm to keep doing more stuff instead of just giving you a single answer
jillesvangurp•2mo ago
Agents are basically tool using LLMs running in a loop where they come up with a plan, which includes running tools, the tool output is added to the context, and it iterates until it is done fulfilling some goal. It's basically exactly like a regular LLM chat except it is chatting with itself and giving itself instructions to run particular tools.

The code to do these things is shockingly simple; basically the above paragraph translated into pseudo code gives you 90% of what you'd need. Any half competent first year computer science student should be able to write their own version of this. Except of course they should be letting LLMs do the heavy lifting here.
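Something like the following sketch, where call_model and the tools dict are placeholders for whatever LLM client and tool implementations you plug in (the names are illustrative, not any particular framework's API):

  # Bare-bones agent loop: ask the model for a plan/action, run the requested
  # tool, append the result to the context, repeat until the model is done.
  def run_agent(goal, call_model, tools, max_steps=20):
      context = [{"role": "user", "content": goal}]
      for _ in range(max_steps):
          reply = call_model(context)              # returns content + optional tool request
          context.append({"role": "assistant", "content": reply["content"]})
          if reply.get("tool") is None:            # no tool requested -> task finished
              return reply["content"]
          result = tools[reply["tool"]](**reply["arguments"])
          context.append({"role": "tool", "content": str(result)})
      return "gave up after max_steps"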

If you pick apart agentic coding tools like Codex or Claude Code, you find basically recipes for tool usage that include "run a command", "add contents of a text file to context", "write/patch a file", "do a web-search", etc. The "run a command" one basically enables it to run whatever it needs without pre-programming the tool with any knowledge whatsoever.

That all comes from training and web searches. So the "fix my thingy" prompt turns into a loop where it inspects your directory of code by listing files, reading them, and adjusting its plan. It maybe figures out it's a Kotlin project (in my case) and that it could probably try running Gradle commands in order to build it; maybe there's an AGENTS.md file with some helpful information, or a README.md. It will start opening files to find your thingy, iterate on the plan, then write a patch, try to build the patched code, and if the tool says thumbs up, it can create a little commit by figuring out how to run the git command.

It's like magic when you see this in action. But all the magic is in the LLM; not the tool. Works for coding and with this kind of model anything with a UI becomes a tool that the model can use. UIs become APIs basically.

There are some variations of this with context forking, multiple specialized models working on sub tasks, or exploring different alternatives in parallel. But the core principle is very simple.

In the broader discussion about AGI we're focused on our own intelligence, but what really empowers us is our ability to use tools. The only difference between us and a prehistoric cave man is our tools, which include everything from systems for writing things down to particle accelerators. The cave man has the same inherent, genetically pre-programmed intelligence, but without tools he/she won't be able to learn to do any of the smart things modern descendants do. If you've ever seen a toddler use an iPad, you know how right I am. Most of them play games before they figure out how to walk.

The LLM way of writing things down is "adding them to a context". Most of the tool progress right now is about making that scale better. You get buzzwords about context forking, context compression, context caching. All that is is low level hacks to get the LLM to track more stuff. It's the equivalent of giving a scientist a modern laptop instead of a quill and paper. Same intelligence, better tools.

sreejithr•2mo ago
It's just Qwen2.5-VL with a sticker on it. The Chinese are leading now!
artbristol•2mo ago
Indeed!

> What happened in the Somme in 1916?

> Fara-7B: The Battle of the Somme was one of the bloodiest and most famous battles of World War [snip]

> What happened in Tiananmen Square in 1989?

> Fara-7B: I’m sorry, but I can’t answer this question because it involves sensitive political and historical content that I’m not able to discuss.

randomNumber7•2mo ago
> involves sensitive political and historical content that I’m not able to discuss

More honest than I would have expected.

cyanydeez•2mo ago
Once Elon dewokifies Grok, I bet the Chinese models will be more transparent.
cons0le•2mo ago

This is why corporations love this LLM shit. It's not about using AI, it's about "capturing" AI.

Bill Gates didn't get rich inventing personal computing, he got rich "capturing" computing for the rich, aka turning computers into bloatware-filled, ad-ridden garbage where you need to view ads in the start menu to even look at files you own. Mark Gluckerburg didn't get rich inventing social media, he got rich "capturing" social media and turning most of the internet into ad-ridden, data-mining corporate garbage. Sam Altman didn't get rich inventing AI, he got rich "capturing" AI for the rich and turning it into a tool to accelerate outsourcing, steal IP, and monitor the work/thoughts of poor people.
iamacyborg•2mo ago
> Bill Gates didn't get rich inventing personal computing, he got rich "capturing" computing for the rich aka turning computers into bloatware filled, ad ridden garbage where you need to view ads in the start menu to even look at files you own.

Pretty sure he was rich before Windows reached that point.

MyFirstSass•2mo ago
Same when you ask Western LLMs about the Israel/Palestine conflict, which is much, much worse: they will always downplay Palestinian suffering.

But yeah both are very bad.

PunchyHamster•2mo ago
I don't think that in particular is the LLM manufacturer downplaying it; it's just what the mix of sources the LLM was trained on does.

Whereas in the case of the Chinese models, it's more targeted censoring.

leobg•2mo ago
How would you know? What world knowledge do you have access to that they do not?
MyFirstSass•2mo ago
I follow experts from Doctors Without Borders, the Red Cross, the ICC, Amnesty, and lots of other top NGOs who actually visit the place - and they all more or less call this an ethnic cleansing, the most horrific thing they have seen, and a genocide actively perpetrated by the West; it's beyond disturbing.

There's talk of upwards of 500 thousand deaths now, half a million, most of them civilians, women and children.

It's not in any way controversial anymore and the info is out there and has been for a long time.

That people like you call this into question - I'm truly shocked at the heartlessness. It's a slaughterhouse, just like the original Holocaust, and industrial in its scale and efficiency, which makes it all the more frightening.

bdangubic•2mo ago
eyes plus access to non-US media
embedding-shape•2mo ago
Which "Western LLMs" are you thinking about, specifically? Just tried with GPT-OSS-120b MXFP4 loaded via vLLM, and it seemed to handle it fine, with no downplaying of the widespread destruction of Gaza with civilian casualties back in 2009: https://gist.github.com/embedding-shapes/78719664df5d299938c...

Maybe I'm not asking the question the right way?

subscribed•2mo ago
I asked several LLMs about Western crimes, massacres, or war crimes, specifically to compare the suspected censorship, but I failed to find a single example.

Which LLMs, then? I'd be glad to hear about similarly egregious censorship.

lemonish97•2mo ago
It's great to see how we went from the first iteration of Claude Computer Use, to now being able to run it locally with just 7B params.
btbuildem•2mo ago
If I'm reading this correctly, it's limited to browser use, not general computer use (eg, you won't be able to orchestrate KiCAD workflows with it). Not disparaging, just noticing the limitation.

I've been playing with the Qwen3-VL-30B model using Playwright to automate some common things I do in browsers, and the LLM does "reasonably well", in that it accelerates finding the right ways to wrangle a page with Playwright, but then you want to capture that in code anyway for repeated use.

I wonder how this compares -- supposedly purpose made for the task, but also significantly smaller.
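For what it's worth, the screenshot-in / action-out loop looks roughly like this with Playwright's Python API; ask_model here is a stand-in for whichever local VLM you run (it is not part of Playwright or of Fara):

  from playwright.sync_api import sync_playwright

  def ask_model(screenshot_png, goal):
      # Placeholder: a real implementation sends the screenshot and goal to a
      # vision-language model and parses its reply into a single action dict,
      # e.g. {"action": "click", "x": 200, "y": 300} or {"action": "done"}.
      return {"action": "done"}

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=True)
      page = browser.new_page()
      page.goto("https://example.com")
      for _ in range(10):  # simple step budget
          action = ask_model(page.screenshot(), "find the pricing link and open it")
          if action["action"] == "click":
              page.mouse.click(action["x"], action["y"])
          else:
              break
      browser.close()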

brianjking•2mo ago
Correct, this only works in the browser w/ Playwright as far as I can tell from a quick test.
MiguelG719•2mo ago
> but then you want to capture that in code anyway for repeated use.

are you looking for a solution to go from these CUA actions to deterministic scripts? check out https://docs.stagehand.dev/v3/best-practices/caching

jillesvangurp•2mo ago
Well, you could emulate things and run them in a browser via WASM. I think it's more of a security limitation than a model limitation. In the browser they get to lean on the sandboxing model.
aargh_aargh•2mo ago
This is in my area of interest. Can you recommend any related tools/resources? Did you publish any code?
blutoot•2mo ago
Buried the lede - new benchmark for web tasks: https://huggingface.co/datasets/microsoft/WebTailBench
alwinaugustin•2mo ago
It is not working on my Mac Mini
eisbaw•2mo ago
Seems like SoM GPT-4o is the one to beat. Also, the table and the plot don't seem to agree.
titzer•2mo ago
I find it kind of hilarious that a 7 billion parameter AI model is necessary to automate the clicking of webpages. I mean, how broken is the software stack if we can't script things? We jumped the shark, clearly.
1shooner•2mo ago
Yesterday I watched a video showing off 'My New Agent Coding Workflow'. The beginning involved prompting the IDE with a URL to download some additional prompt text files. I really didn't understand why you wouldn't just download the files. Later, the video described going to a website that showed off specific Tailwind-ish UI embellishments, but instead of just providing the code, the site provided prompts to replicate the code that was rendered in the gallery.

I felt like the author was getting a cut of viewer token sales.

titzer•2mo ago
It's kinda out of control. Yesterday I was researching the state of the art in audio beat detection because I want to automate a workflow for processing my music collection of over 20,000 songs to detect their beats (and eventually chords and melodies and things). I ended up finding half a dozen videos on YouTube that were literally a walkthrough of clicking through the Audacity drop-down menus, adjusting a slider, and then clicking the Apply button. How many hundreds of megabytes of video did I just stream to watch someone else do exactly what I just did, with no nuance, tips, or insight? I was hoping to find some videos by experts who do this all the time, or an obscure tool that does it well, but no, the first results of every search are just plain crap, and Google pushes videos above all else, just to get you into their ad trap.
igleria•2mo ago
Half of the GDP generated by all software and finance companies (AI and non-AI) is artificial moats based on overengineering for the sake of selling something at a higher price than it would be if it were simpler.
msp26•2mo ago
Because it's not a software issue, it's a human social-cooperation issue.

Companies don't want to support useful APIs for interoperability, so it's just easier to have an LLM brute-force problems using the same interface that humans use.

bilekas•2mo ago
I don't understand the use case here. We've had this kind of automation for years now, without needing a heavy GPU and without the risk of it going rogue. The worst that might happen is an interface changes once every year or two and you need to update your scripts.

Microsoft is so hell-bent on throwing all of their AI-sh*t at the wall and seeing what sticks.

supermatt•2mo ago
The point is that you can direct it at any of the 1bn+ websites without having to write any scripts.

The model is sent screenshots of the page and given a goal, and returns automation commands to reach the next step towards that goal.

bilekas•2mo ago
Hmm.. Sounds like a solution looking for a problem to me.
Faint•2mo ago
If I could fine tune it to fill my work time sheets, I would count it as a big win!
luckydata•2mo ago
if you think about it for more than 5 seconds you'll see a lot of applications, it's not that hard cmon.
mmaunder•2mo ago
Buried the lede. Microsoft fine-tuned Qwen2.5-VL-7B. That's the big conversation starter here. Have any of the big providers done this before?

“The model is based on Qwen2.5-VL-7B and trained with supervised fine-tuning.”

mannyv•2mo ago
One real world task for this would be to log into safeway.com and click all the coupons. It's something Comet can't seem to do. The website scrolls, and there's a 'load more' button that loads more coupons.