I recently started an audio dream journal and want to keep it private, so I set up Whisper to transcribe each .wav file and dump it into an Obsidian folder.
The plan was to add a local LLM step to clean up the punctuation and paragraphs. I entered instructions to clean the transcript without changing or adding anything else.
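The whole pipeline is only a handful of lines. Here's a minimal sketch of what I mean, assuming the openai-whisper package and a local model served through Ollama (the model name and vault path are placeholders):

    # Sketch of the journal pipeline: transcribe a .wav, drop the text into the
    # Obsidian vault, and ask a local model to fix punctuation/paragraphs only.
    from datetime import date
    from pathlib import Path

    import requests  # talks to a local Ollama server (an assumption about the setup)
    import whisper   # the openai-whisper package

    VAULT = Path.home() / "Obsidian" / "DreamJournal"   # hypothetical vault location
    OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

    def transcribe(wav_path: str) -> str:
        model = whisper.load_model("base")
        return model.transcribe(wav_path)["text"]

    def cleanup(text: str, model: str = "llama3.1:8b") -> str:
        # Tight instructions; the output is still worth diffing against the raw transcript.
        prompt = (
            "Fix punctuation and paragraph breaks in the transcript below. "
            "Do not add, remove, or change any words.\n\n" + text
        )
        resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
        resp.raise_for_status()
        return resp.json()["response"]

    if __name__ == "__main__":
        raw = transcribe("dream.wav")
        note = VAULT / f"{date.today()}-dream.md"
        note.write_text(cleanup(raw) + "\n\n---\nRaw transcript:\n" + raw)

Keeping the raw transcript in the same note at least turns the hallucination check into a diff instead of a re-listen.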
Hermes responded by inventing an interview with Sun Tzu about why he wrote The Art of War. When I stopped the process, it apologized and said it had misunderstood when I talked about Sun Tzu. I never mentioned Sun Tzu or even provided a transcript. Just instructions.
We went around with this for a while before I could even get it to admit the mistake, and it refused to identify why it occurred in the first place.
Having to meticulously check for weird hallucinations will be far more time-consuming than just doing the editing myself. The same logic applies to a lot of the areas I'd like to have a local LLM for. Hopefully they'll get there soon.
I suppose we shouldn’t be surprised in hindsight. We trained them on human communicative behaviour after all. Maybe using Reddit as a source wasn’t the smartest move. Reddit in, Reddit out.
It is easy, comparatively. Accuracy and correctness is what computers have been doing for decades, except when people have deliberately compromised that for performance or other priorities (or used underlying tools where someone else had done that, perhaps unwittingly.)
> Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
LLMs and related AI technologies are very much an instance of extreme deliberate compromise of accuracy, correctness, and controllability to get some useful performance in areas where we have no idea how to analytically model the expected behavior but have lots of more or less accurate examples.
More fundamental than the training data is the fact that the generative outputs are statistical, not logical. This is why they can produce a sequence of logical steps but still come to incorrect or contradictory conclusions. This is also why they tackle creativity more easily, since the acceptable boundaries of creative output are less rigid. A photorealistic video of someone sawing a cloud in half can still be entertaining art despite the logical inconsistencies in the idea.
The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
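The back-of-the-envelope math behind that limit, assuming roughly 4-bit quantization and leaving headroom for the KV cache and the OS:

    # Rough memory estimate for a quantized model; real GGUF files vary by quant scheme.
    def model_gb(params_b: float, bits: float = 4.5) -> float:
        """params_b: parameters in billions; bits: average bits per weight incl. overhead."""
        return params_b * 1e9 * bits / 8 / 1e9

    for size in (7, 12, 20, 32):
        print(f"{size:>2}B ~ {model_gb(size):.1f} GB of weights (plus KV cache and the OS)")
    # 12B ~ 6.8 GB and 20B ~ 11.2 GB are workable on 16GB; 32B ~ 18 GB already is not.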
What I find interesting is that these models don't actually hit Apple's Neural Engine; they run on the GPU via Metal. Core ML isn't great for custom runtimes and Apple hasn't given low-level developer access to the ANE afaik. And then there are memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
If you want to convert models to run on the ANE there are tools provided:
> Convert models from TensorFlow, PyTorch, and other libraries to Core ML.
That’s just an issue with stale and incorrect information. Here are the docs https://opensource.apple.com/projects/mlx/
The issue is in targeting specific hardware blocks. When you convert with coremltools, Core ML takes over and doesn't give you fine-grained control over whether something runs on the GPU, CPU, or ANE. Also, the ANE isn't really designed with transformers in mind, so most LLM inference defaults to the GPU.
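To be concrete, the only knob coremltools exposes is a whole-model compute_units preference, not per-op placement. A minimal sketch with a toy PyTorch module (whether anything actually lands on the ANE is still the runtime's call):

    # Convert a toy PyTorch model with a compute-unit *preference*; Core ML may still
    # fall back to GPU/CPU for ops the ANE doesn't handle (e.g. much of a transformer).
    import torch
    import coremltools as ct

    class Tiny(torch.nn.Module):
        def forward(self, x):
            return torch.relu(x @ x.transpose(-1, -2))

    example = torch.randn(1, 64, 64)
    traced = torch.jit.trace(Tiny().eval(), example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=(1, 64, 64))],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_NE,  # the other options: CPU_ONLY, CPU_AND_GPU, ALL
    )
    mlmodel.save("tiny.mlpackage")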
Look for Apple to add matmul acceleration into the GPU instead. That's how to truly speed up local LLMs.
Keep in mind - Nvidia has no NPU hardware because that functionality is baked into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.
Nvidia does not optimize for mobile first.
AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak like you said.
AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.
Nonetheless, Apple's architecture on mobile doesn't have to define how they approach laptops, desktops and datacenters. If the mobile-first approach is limiting their addressable market, then maybe Tim's obsessing over the wrong audience?
Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.
Even Nvidia GPUs often have breaking changes moving from one generation to the next.
“This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
I'm on a 128GB M4 macbook. This is "powerful" today, but it will be old news in a few years.
These models are getting close to as good as the frontier models.
Hardware-wise though, I actually agree - Apple has dropped the ball so hard here that it's dumbfounding. They're the only TSMC customer that could realistically ship a volume of chips comparable to Nvidia's, even without really impacting their smartphone business. They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now. Apple wants to build a Core ML silo in a world where better products exist everywhere; it's a dead-end approach that should have died back in 2018.
Contextually it's weird too, I've seen tons of people defend Cook's relationship with Trump as "his duty to shareholders" and the like. But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want. Future MBAs will be taught about this hubris once the shape of the total damages comes into view.
People also want comfortable mattresses and high quality coffee machines. Should Apple make them too?
Apple not being in a particular industry is a perfectly valid choice, which is not remotely comparable to protecting their interests in the industries they are currently in. Selling datacenter-bound products is something Apple is not _remotely_ equipped for, and staffing up to do so at reasonable scale would not be a trivial task.
As for crypto mining... JFC.
Money is money. 10 years ago people would have laughed at the notion of Nvidia abandoning the gaming market, now it's their most lucrative option. Apple can and should be looking at other avenues of profit while the App Store comes under scrutiny and the Mac market share refuses to budge. It should be especially urgent if unit margins are going down as suppliers leave China.
They did a horrific job of it before. The staff to design consumer facing experiences are busy doing exactly that. The developer facing experiences are very lean. The bandwidth simply isn't there to do DC products. Nor is the supply chain. Nor is the service supply chain. Etc, etc.
The vision since Jobs has always been “build a great consumer product and own as much as you can while doing so”. That’s exactly how all of the design parameters of Ax/Mx series were determined and relentlessly optimized for - the fact that they have a highly competitive uarch was a salutary side-effect, but not a planned goal.
He was, after all, more of an operations guy than a product guy before moving into the CEO role.
And I suppose we’re giving credit to other people for Watch, AirPods, Vision Pro?
On top of that, it only performs so well on consumer devices because they control the hardware and OS and can tune both together. Creating server hardware would mean allowing Linux to be installed on it, and it would need to run equally well. Apple would never put the development time into Linux kernel/drivers to make that happen.
Just sell a proper HomePod with 64GB-128GB of RAM, which handles everything including your personal LLM, Time Machine if needed, and Back to My Mac-style remote access (Tailscale/ZeroTier)
Plus they could compete efficiently with the other cloud providers.
The same HomePod that almost sold as poorly as Vision Pro despite a $349.99 MSRP? Apple charges $400 to upgrade an M4 to 64GB and a whopping $1,200 for the 128GB upgrade.
The consumer demand for an $800+ device like this is probably zilch; I can't imagine it's worth Apple's time to gussy up a nice UX or support it long-term. What you are describing is a Mac with extra steps, you could probably hack together a similar experience with Shortcuts if you had enough money and a use-case. An AI HomePod server would only be efficient at wasting money.
The HomePod did poorly because competitor offerings with similar and better performing features were priced under $100. The difference in sound quality was not worth the >3x markup.
Most people don’t care about privacy (see: success of Facebook and TikTok). Most people don’t care about subscriptions (see: cable TV, Netflix).
There may be a niche market for a local inference device that costs $1000 and has to be replaced every year or two during the early days of AI, but it’s not a market with decent ROI for Apple.
It's easy to sit in the armchair and say "just be a visionary bro" while forgetting that Tim worked under Steve for a while before his death - he has some sense and understanding of what it takes to get a great product out the door.
Nvidia is generating a lot of revenue, sure - but what is the downstream impact on its customers with the hardware? All they have right now is negative returns to show for their spending. Could this change? Maybe. Is it likely? Not in my view.
As it stands, Apple has made the absolute right choice in not wasting its cash and is demonstrating discipline - which shareholders will respect when all this LLM mania quietens down.
If it ends up that we are in a bubble and it pops, Apple may be among the least impacted in big tech.
He told me that a popular Apple saying is "We're late to the party, but always best-dressed."
I understand this. I'm not sure their choice of outfit has always been the best, but they have had enough success to continue making money.
(I think it’s why big shareholders don’t get angry that Apple doesn’t splash their cash around: their core value proposition is focused in a dizzying tech market; take it or leave it. It’s very Warren Buffett.)
Not to mention the “default browser” leverage it has with iPhones, iPods, and watches.
I could also see universities giving students this type of compute access more cheaply, to work on more basic, less resource-intensive models.
Do you really think that they need something different? As a shareholder would you bet on your vision of focusing on server parts?
But what's really going on is that we never got the highly multicore and distributed computers that could have started going mainstream in the 1980s, and certainly by the late 1990s when high-speed internet hit. So single-threaded performance is about the same now as 20 years ago. Meanwhile video cards have gotten exponentially more powerful and affordable, but without the virtual memory and virtualization capabilities of CPUs, so we're seeing ridiculous artificial limitations like not being able to run certain LLMs because the hardware "isn't powerful enough", rather than just having a slower experience or borrowing the PC in the next room for more computing power.
To go to the incredible lengths that Apple went to in designing the M1, not just wrt hardware but in adding yet another layer of software emulation since the 68000 days, without actually bringing multicore with local memories to the level that today's VLSI design rules could allow, is laughable to me. If it wasn't so tragic.
It's hard for me to live and work in a tech status quo so far removed from what I had envisioned growing up. We're practically at AGI, but also mired in ensh@ttification. Reflected in politics too. We'll have the first trillionaire before we solve world hunger, and I'm bracing for Skynet/Ultron before we have C3P0/JARVIS.
(That is, when in-memory model values must be padded to FP16/INT8 this slashes your effective use of memory bandwidth, which is what determines token generation speed. GPU compute doesn't have that issue; one can simply de-quantize/pad the input in fast local registers to feed the matrix compute units, so memory bandwidth is used efficiently.)
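A rough illustration of why bandwidth is the ceiling (ballpark numbers; real decode speeds land below these bounds):

    # Decoding has to stream (roughly) every weight once per token, so
    # memory bandwidth / model size gives an optimistic upper bound on speed.
    def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    # M1 Max: ~400 GB/s. An 8B model is ~16 GB at FP16, ~4.5 GB at ~4-bit.
    print(max_tokens_per_s(400, 16))   # ~25 tok/s ceiling unquantized
    print(max_tokens_per_s(400, 4.5))  # ~89 tok/s ceiling quantized, but only if
                                       # de-quantization happens in registers, not by padding in memory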
The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing, which is limited by raw compute as opposed to the memory bandwidth bound of token generation. (Lower power usage in this context will save on battery and may help performance by avoiding power/thermal throttling, especially on passively-cooled laptops. So this is definitely worth going for.)
[0] Some historical information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even older information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... .
More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks) seems to basically confirm the above.
(The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)
(Unfortunately ONNX doesn't support Vulkan, which limits it on other platforms. It's always something...)
https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...
Witsy:
https://github.com/nbonamy/witsy
...and you really want at least 48G RAM to run >24B models.
My beefy 3D gamedev workstation with a 4090 and 128GB RAM can't even run a 235B model unless it's extremely quantized (and even then, only at like single-digit tokens/minute).
Just counting lines is not a good proxy for how much effort it would take a good programmer.
(And I am 100% pro LLM coding, just saying this isn’t a great argument)
> Please write a C# middleware to block requests from browser agents that contain any word in a specified list of words: openai, grok, gemini, claude.
I used ChatGPT 4o from GitHub Copilot inside VSCode. And Qwen3 A3B from here: https://deepinfra.com/Qwen/Qwen3-30B-A3B
ChatGPT 4o was considerably better. Less verbose and less unnecessary abstractions.
So, that’s at least one small highly useful workflow robot I have a use for (and very easy to cook up on your own).
I also have a use for terminal command autocompletion, which again, a small model can be great for.
Something felt really wrong about sending entire folder contents over to Claude online, so I am absolutely looking to create the toolkit locally.
The universe of offline is just getting started, and these big companies are literally telling you “watch out, we save this stuff”.
So they need to be smart about your desired language(s) and all the everyday concepts we use in it (so they can understand the content of documents and messages), but they don't need any of the detailed factual knowledge around human history, programming languages and libraries, health, and everything else.
The idea is that you don't prompt the LLM directly, but your OS tools make use of it, and applications prompt it as frequently as they fetch URLs.
This makes them perfect for automation tasks.
First, they control costs during development, which depending on what you're doing, can get quite expensive for low or no budget projects.
Second, they force me to have more constraints and more carefully compose things. If a local model (albeit something somewhat capable like gpt-oss or qwen3) can start to piece together this agentic workflow I am trying to model, chances are, it'll start working quite well and quite quickly if I switch to even a budget cloud model (something like gpt-5-mini.)
However, dealing with these constraints might not be worth the time if you can stuff all of the documents in your context window for the cloud models and get good results, but it will probably be cheaper and faster on an ongoing basis to have split the task up.
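One nice property of this workflow: llama.cpp's server and Ollama both expose an OpenAI-compatible endpoint, so swapping between the local model and a budget cloud model is basically a one-line change. A sketch, with model names and ports standing in for whatever your setup uses:

    # Same client code for local development and the cloud; only endpoint + model change.
    import os
    from openai import OpenAI

    if os.environ.get("USE_CLOUD"):
        client = OpenAI()      # reads OPENAI_API_KEY from the environment
        model = "gpt-5-mini"   # the budget cloud model
    else:
        # e.g. `llama-server -m model.gguf --port 8080` serves an OpenAI-style /v1
        client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
        model = "local-model"  # placeholder; whatever name your local server reports

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize this document in three bullets: ..."}],
    )
    print(resp.choices[0].message.content)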
If your computer is somewhat modern and has a decent amount of RAM to spare, it can probably run one of the smaller-but-still-useful models just fine, even without a GPU.
My reasons:
1) Search engines are actively incentivized to not show useful results. SEO-optimized clickbait articles contain long, fluffy, contentless prose intermixed with ads. The longer they can keep you "searching" for the information instead of "finding" it, the better it is for their bottom line. Because if you actually manage to find the information you're looking for, you close the tab and stop looking at ads. If you don't find what you need, you keep scrolling and generate more ad revenue for the advertisers and search engines. It's exactly the same reason online dating sites are futile for most people: every successful match made results in two lost customers, which is bad for revenue.
LLMs (even local ones in some cases) are quite good at giving you direct answers to direct questions which is 90% of my use for search engines to begin with. Yes, sometimes they hallucinate. No, it's not usually a big deal if you apply some common sense.
2) Most datacenter-hosted LLMs don't have ads built into them now, but they will. As soon as we get used to "trusting" hosted models due to how good they have become, the model developers and operators will figure out how to turn the model into a sneaky salesman. You'll ask it for the specs on a certain model of Dell laptop and it will pretend it didn't hear you and reply, "You should try HP's latest line-up of business-class notebooks, they're fast, affordable, and come in 5 fabulous colors to suit your unique personal style!" I want to make sure I'm emphasizing that it's not IF this happens, it's WHEN.
Local LLMs COULD have advertising at some point, but it will probably be rare and/or weird as these smaller models are meant mainly for development and further experimentation. I have faith that some open-weight models will always exist in some form, even if they never rival commercially-hosted models in overall quality.
3) I've made peace with the fact that data privacy in the age of Big Tech is a myth, but that doesn't mean I can't minimize my exposure by keeping some of my random musings and queries to myself. Self-hosted AI models will never be as "good" as the ones hosted in datacenters, but they are still plenty useful.
4) I'm still in the early stages of this, but I can develop my own tools around small local models without paying a hosted model provider and/or becoming their product.
5) I was a huge skeptic about the overall value of AI during all of the initial hype. Then I realized that this stuff isn't some fad that will disappear tomorrow. It will get better. The experience will get more refined. It will get more accurate. It will consume less energy. It will be totally ubiquitous. If you fail to come to speed on some important new technology or trend, you will be left in the dust by those who do. I understand the skepticism and pushback, but the future moves forward regardless.
I'm imagining something like...
> Dear diary, I got bullied again today, and the bread was stale in my PB&J :(
>> My son, remember this: The one who mocks others wounds his own virtue. The one who suffers mockery must guard his heart. To endure without hatred is strength; to strike without cause is disgrace. The noble one corrects himself first, then the world will follow.
But yes I’ll share, and I guess post an update in this thread?
I forget a lot of things, so I feed these into ChromaDB and then use an LLM to chat with all my notes.
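The core of it is only a few lines. A sketch assuming the chromadb Python client plus whatever local model you already run (the notes folder and the final model call are placeholders):

    # Index markdown notes into a local Chroma collection, then pull the most
    # relevant ones into a prompt for whatever local model you're chatting with.
    from pathlib import Path
    import chromadb

    client = chromadb.PersistentClient(path="./notes_db")
    notes = client.get_or_create_collection("notes")

    for md in Path("~/notes").expanduser().glob("**/*.md"):  # hypothetical notes folder
        notes.add(ids=[str(md)], documents=[md.read_text()])

    question = "What did I decide about the backup strategy?"
    hits = notes.query(query_texts=[question], n_results=3)

    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only these notes:\n{context}\n\nQuestion: {question}"
    # ...send `prompt` to the local model of your choice (Ollama, llama.cpp server, etc.)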
I’ve started using abliterated models which have their refusal removed [0]
The other use case is for work. I work with financial data and I have created an MCP server that automates some of my job. Running the model locally allows me to not worry about the information I feed it.
[0] https://github.com/Sumandora/remove-refusals-with-transforme...
What may be around the corner is running great models on a box at home. The AI lives at home. Your thin client talks to it, maybe runs a smaller AI on device to balance latency and quality. (This would be a natural extension for Apple to go into with its Mac Pro line. $10 to 20k for a home LLM device isn't ridiculous.)
At that point you are almost paying more than the datacenter does for inference hardware.
Of course. You and I don't have their economies of scale.
It’s about the real price of early microcomputers.
Until the frontier stabilizes, this will be the cost of competitive local inference. Not pretending what we can run on a laptop will compete with a data centre.
Try building a F1 car at home. I guarantee your unit cost will be several orders of magnitude higher than the companies who make several a year.
And of course Nvidia and AMD are coming out with options for massive amounts of high bandwidth GPU memory in desktop form factors.
I like the idea of having basically a local LLM server that your laptop or other devices can connect to. Then your laptop doesn’t have to burn its battery on LLM work and it’s still local.
Oh wow, a maxed out Studio could run a 600B parameter model entirely in memory. Not bad for $12k.
There may be a business in creating the software that links that box to an app on your phone.
The amount of data transferred is tiny and the latency costs are typically going to be dominated by the LLM inference anyway. Not much advantage to doing LAN only except that you don’t need a server.
Though the amount of people who care enough to buy a $3k - $10k server and set this up compared to just using ChatGPT is probably very small.
So I maxed that out, and it’s with Apple’s margins. I suspect you could do it for $5k.
I’d also note that for heavy users of ChatGPT, the difference in energy costs for a home setup and the price for ChatGPT tokens may make this financially compelling for heavy users.
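Back-of-the-envelope, with every number below being an assumption you'd swap for your own power draw, tariff, and plan price:

    # Rough monthly cost of a home inference box vs. a chatbot subscription.
    # Every number here is an assumption for illustration only.
    watts_under_load = 300        # a Mac Studio-class box working hard
    hours_per_day = 4             # heavy-user inference time
    price_per_kwh = 0.15          # USD; varies a lot by region
    subscription_per_month = 200  # e.g. a top-tier ChatGPT plan

    home_energy = watts_under_load / 1000 * hours_per_day * 30 * price_per_kwh
    print(f"home electricity ~${home_energy:.0f}/mo vs subscription ${subscription_per_month}/mo")
    # ~$5/mo of electricity: the real cost of the home setup is the hardware, not the power.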
And of course you’d be getting a worse model, since no open source model currently is as good as the best proprietary ones.
Though that gap should narrow as the open models improve and the proprietary ones seemingly plateau.
Any number of AI apps allow you to specify a custom endpoint. As long as your AI server accepts connections from the internet, you're gravy.
You and I could write it. Most folks couldn’t. If AI plateaus, this would be a good hill to have occupied.
The person that is willing to buy that appliance is likely heavily overlapped with the person that is more than capable of pointing one of the dozens of existing apps at a custom domain.
Everyone else will continue to just use app based subscriptions.
Streaming platforms have plateaued (at best), but self hosted media appliances are still vanishingly rare.
Why would AI buck the trend that every other computing service has followed?
Integrated solution. You buy the box. You download the app. It works like the ChatGPT app, except it's tunneling to the box you have at home which has been preconfigured to work with the app. Maybe you have a subscription to keep everything up to date. Maybe you have an open-source model 'store'.
I think there is a market here, solely based on actual data privacy. Not sure how big it is but I can see quite some companies have use for it.
No, but my email provider has a de-facto repository of incredibly sensitive documents. When you put convenience and cost up against privacy, the market has proven over and over that no one gives a shit.
Marketing it though? Not doable.
Apple is pretty much the only company I see attempting this with some kind of AppleTV Pro.
You can also string two 512GB Mac Studios together using MLX to load even larger models - here's 671B 8-bit DeepSeek R1 doing that: https://twitter.com/alexocheema/status/1899735281781411907
I’m running docker containers with different apps and it works well enough for a lot of my use cases.
I mostly use Qwen Code and GPT OSS 120b right now.
When the next generation of this tech comes through I will probably upgrade despite the price, the value is worth it to me.
That price is ridiculous for most people. Silicon Valley payscales can afford that much, but see how few Apple Vision Pros got sold for far less.
Luckily llama.cpp has come a long way and was at a point that I could easily recommend as the open source option instead.
Seeing and navigating all the configs helped me build intuition around what my macbook can or cannot do, how things are configured, how they work, etc...
Great way to spend an hour or two.
You can get a quick feel for how it works via the chat interface and then extend it programmatically.
* General Q&A
* Specific to programming - mostly Python and Go.
I forgot the command now, but I did run a command that allowed MacOS to allocate and use maybe 28 GB of RAM to the GPU for use with LLMs.
sudo sysctl iogpu.wired_limit_mb=184320
Source: https://github.com/ggml-org/llama.cpp/discussions/15396

We don't even need one big model good at everything. Imagine loading a small model from a collection of dozens of models depending on the tasks you have in mind. There is no moat.
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
It's a balancing game, how slow a token generation speed can you tolerate? Would you rather get an answer quick, or wait for a few seconds (or sometimes minutes) for reasoning?
For quick answers, Gemma 3 12B is still good. GPT-OSS 20B is pretty quick when reasoning is set to low, which usually doesn't think longer than one sentence. I haven't gotten much use out of Qwen3 4B Thinking (2507) but at least it's fast while reasoning.
For non-coding: Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding: Qwen3-Coder-30B-A3B-Instruct
---
If you have a bit more vram, GLM-4.5-Air or the full GLM-4.5
Recommendation: use something else to run the model. Ollama is convenient, but insufficient for tool use for these models.
Also let’s not forget they are first and foremost designers of hardware and the arms race is only getting started.
Reads like someone starting to get their daily drinks, already using them for "company" and fun, and saying "I'm not an alcoholic, I can quit anytime".
mg•22h ago
In theory, it should be possible, shouldn't it?
The page could hold only the software in JavaScript that uses WebGL to run the neural net. And offer an "upload" button that the user can click to select a model from their file system. The button would not upload the model to a server - it would just let the JS code access it to convert it into WebGL and move it into the GPU.
This way, one could download models from HuggingFace, store them locally and use them as needed. Nicely sandboxed and independent of the operating system.
SparkyMcUnicorn•22h ago
https://github.com/mlc-ai/web-llm-chat
https://github.com/mlc-ai/mlc-llm
https://github.com/mlc-ai/web-llm
mg•22h ago
Neither Firefox nor Chromium supports WebGPU on Linux. Maybe behind flags. But before using a technology, I would wait until it is available in the default config.
Let's see when browsers will bring WebGPU to Linux.
SparkyMcUnicorn•22h ago
https://github.com/ngxson/wllama
https://huggingface.co/spaces/ngxson/wllama
simonw•16h ago
coip•22h ago
https://huggingface.co/docs/transformers.js/en/guides/webgpu
eta: its predecessor was using webGL
mg•22h ago
samsolomon•22h ago
https://openwebui.com/
mg•22h ago
I'm not sure what OpenWebUI is, but if it was what I mean, they would surely have the page live and not ask users to install Docker etc.
bravetraveler•22h ago
I would like to skip maintaining all this crap, though: I like your approach
Jemaclus•21h ago
Edit: From a UI perspective, it's exactly what you described. There's a dropdown where you select the LLM, and there's a ChatGPT-style chatbox. You just docker-up and go to town.
Maybe I don't understand the rest of the request, but I can't imagine software where a webpage exists and it just magically has LLMs available in the browser with no installation?
craftkiller•21h ago
Jemaclus•21h ago
Maybe I'm misunderstanding something.
craftkiller•21h ago
Jemaclus•21h ago
andsoitis•21h ago
Not OP, but it really isn't what they're looking for. Needing to install stuff VS simply going to a web page are two very different things.
tmdetect•21h ago
adastra22•22h ago
01HNNWZ0MV43FF•21h ago
idk it's just like, do I want to run to the store and buy a 24-pack of water bottles, and stash them somewhere, or do I want to open the tap and have clean drinking water
mudkipdev•22h ago
vavikk•22h ago
vonneumannstan•21h ago
Doesn't work quite as well on Windows due to the executable file size limit but seems great for Mac/Linux flavors.
https://github.com/Mozilla-Ocho/llamafile
generalizations•21h ago
And related is the whisper implementation: https://ggml.ai/whisper.cpp/
simonw•20h ago
https://huggingface.co/spaces/webml-community/llama-3.2-webg... loads a 1.24GB Llama 3.2 q4f16 ONNX build
https://huggingface.co/spaces/webml-community/janus-pro-webg... loads a 2.24 GB DeepSeek Janus Pro model which is multi-modal for output - it can respond with generated images in addition to text.
https://huggingface.co/blog/embeddinggemma#transformersjs loads 400MB for an EmbeddingGemma demo (embeddings, not LLMs)
I've collected a few more of these demos here: https://simonwillison.net/tags/transformers-js/
You can also get this working with web-llm - https://github.com/mlc-ai/web-llm - here's my write-up of a demo that uses that: https://simonwillison.net/2024/Nov/29/structured-generation-...
mg•19h ago
I tried some of the demos of transformers.js but they all seem to load the model from a server. Which is super slow. I would like to have a page the lets me use any model I have on my disk.
simonw•17h ago
I got Codex + GPT-5 to modify that Llama chat example to implement the "load from local directory" pattern. It appears to work.
First you'll need to grab the checkout of the local model (~1.3GB):
Then visit this page: https://static.simonwillison.net/static/2025/llama-3.2-webgp... - in Chrome or Firefox Nightly.

Now click "Browse folder" and select the folder you just checked out with Git.
Click the confusing "Upload" confirmation (it doesn't upload anything, just opens those files in the current browser session).
Now click "Load local model" - and you should get a full working chat interface.
Code is here: https://github.com/simonw/transformers.js-examples/commit/cd...
Here's the full Codex session that I used to build this: https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01...
I ran Codex against the https://github.com/huggingface/transformers.js-examples/tree... folder and prompted:
> Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.
Then later:
> Build the production app and then make it available on localhost somehow
And:
> Uncaught (in promise) Error: Invalid configuration detected: both local and remote models are disabled. Fix by setting `env.allowLocalModels` or `env.allowRemoteModels` to `true`.
And:
> Add a bash script which will build the application such that I can upload a folder called llama-3.2-webgpu to http://static.simonwillison.net/static/2025/llama-3.2-webgpu... and http://static.simonwillison.net/static/2025/llama-3.2-webgpu... will serve the app
(Note that this doesn't allow you to use any model on your machine, but it proves that it's possible.)
simonw•16h ago
mg•8h ago
Bookmarked. I will surely try it out once Firefox or Chromium on Linux support WebGPU in their default config.
paulirish•19h ago
Demos here: https://webmachinelearning.github.io/webnn-samples/ I'm not sure any of them allow you to select a model file from disk, but that should be entirely straightforward.