The repo still has empty react-native/kotlin projects, placeholders for code that was supposed to exist.
It doesn't actually have Metal/Vulkan/NNAPI support, just an enum for it. (search the repo, I'm serious)
Then another 100 things, not worth listing out. Except one more, I guess: there's ~0 chance of 200 ms TTFT locally, even if they had what they claimed. (modulo stilted scenarios like a 5-token prompt on a desktop-class GPU with a 3B model)
Surprised to see it at #2 on the front page.
If you're a developer looking to do local LLMs in Flutter, might as well plug my 2-3 year old project that's still humming, https://github.com/Telosnex/fllama.
It's built on top of llama.cpp and is, well, actually real. And it works on every platform: Android, iOS, macOS, Windows, Linux. Web uses MLC, because llama.cpp in WASM is way too slow and WebGPU is slower (it's early). MLC is ~dead, so that's not good, but...whatever. No better option on web currently.
(cheers to you, noble Icarus. I don't mean to make you feel bad, but you're not going to Claude Code your way to what you want in 2 weeks. I wish. You're basically claiming to have built faster versions of llama.cpp and ONNX, on every platform, with custom accelerators, from scratch, and built innumerable features on top, by yourself, with just Claude Code, in 2 weeks.)
To be completely transparent: I’ve over-indexed on the vision and the architecture in this repo rather than the functional implementation. The current state of the code is effectively a "spec-in-code" and a skeleton of the architecture I am building toward, rather than the production-ready engine my post implied.
The "LLM-generated-slop" comment hits home because I have been using AI tools heavily to scaffold the cross-platform boilerplate (the enums, the FFI bridges, and the project structures). In my excitement to show the "unified pipeline" vision, I pushed a version that is essentially a hollow shell of stubs.
Specifics on your points:
Empty projects: Correct. These are placeholders in the current monorepo structure.
Hardware Enums: You caught the stub. I am currently working on the actual Metal/Vulkan integration layers in a private branch, but I mistakenly pushed the "public skeleton" as if it were the finished core.
200ms TTFT: This is our internal target based on local benchmarks with raw llama.cpp implementations, but as you noted, it is currently "undefined" in the public Flutter wrapper because the bridge isn't actually moving tokens yet.
I genuinely appreciate the reality check. Building a "faster version of llama.cpp" is not my goal; my goal is the orchestration layer. But I clearly tried to "Claude Code" my way through the infrastructure too fast.
I’m going to take this feedback, go back to the shed, and focus on the actual C++ implementation before I post another update. Also, big respect to Telosnex/fllama; you’re the benchmark for a reason, and I clearly have a lot of work to do to reach that level of "real."
Thanks for keeping the community honest.
rish2497•1h ago
Most startups are just wrapping OpenAI/Claude/Gemini APIs. This works for prototypes, but for production apps, the 1000ms+ roundtrip latency kills the UX, and the inference bills kill the margins.
I built EdgeVeda to be the "Switzerland of Edge AI." It’s a unified C++ engine that handles the hardware-specific "plumbing" (Metal for iOS, Vulkan/NNAPI for Android) so you can run LLMs, Whisper, and TTS locally in one line of code.
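To make "one line of code" concrete, this is roughly the call site I'm designing toward on the C++ side (an illustrative sketch only: the header, type names, and model files below are placeholders, and the Dart API mirrors the same shape over FFI):

    // Sketch of the intended call site, not the current code.
    #include "edgeveda/pipeline.h"  // hypothetical header; API not final

    int main() {
      edgeveda::PipelineConfig cfg;
      cfg.stt_model = "whisper-small.gguf";  // on-device Whisper
      cfg.llm_model = "llama-3b-q4.gguf";    // quantized local LLM
      cfg.tts_model = "piper-en.onnx";       // local TTS voice
      edgeveda::Pipeline pipeline(cfg);

      // The "one line": mic audio in -> transcript -> LLM reply -> speech out.
      pipeline.RunVoiceTurn("question.wav", "answer.wav");
      return 0;
    }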
Key Technical Feats:
Sub-200ms Time-to-First-Token: Achieved by bypassing the standard Android JNI bottleneck and using a direct memory-mapped buffer (rough sketch of the loader right after this list).
The Memory Watchdog: Mobile OSs love to kill apps that use >1GB RAM. I implemented a custom allocator that swaps model layers to disk when the system is under pressure (sketch of the pressure handler below the list).
Unified Pipeline: Orchestrates STT -> LLM -> TTS entirely on-device.
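On the TTFT point above: the core trick is that the model file is mmap'd once in native code and only a raw pointer crosses the Dart FFI boundary, instead of marshalling byte arrays through JNI. A stripped-down sketch of that loader (POSIX-only, symbol names illustrative, error handling mostly omitted):

    // Map a model file read-only and hand the pointer straight to Dart over
    // FFI. No JNI, no copies; the OS pages weights in from disk on demand.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    extern "C" {

    struct MappedModel {
      void*  data;
      size_t size;
    };

    // Called from Dart via dart:ffi; returns {nullptr, 0} on failure.
    MappedModel ev_map_model(const char* path) {
      MappedModel m{nullptr, 0};
      int fd = open(path, O_RDONLY);
      if (fd < 0) return m;
      struct stat st;
      if (fstat(fd, &st) == 0) {
        void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) { m.data = p; m.size = (size_t)st.st_size; }
      }
      close(fd);  // the mapping stays valid after the fd is closed
      return m;
    }

    void ev_unmap_model(MappedModel m) {
      if (m.data) munmap(m.data, m.size);
    }

    }  // extern "C"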
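And "swaps model layers to disk" is a simplification: for mmap'd read-only weights it's closer to telling the kernel it may drop the resident pages for the coldest layers and re-fault them from the file later. A hedged sketch of that pressure handler (assumes per-layer offsets are already known; page-alignment handling omitted):

    // On a memory-pressure signal (e.g. onTrimMemory forwarded from the
    // Android side), release the pages backing the least-recently-used
    // layers. The mapping is read-only and file-backed, so nothing is lost;
    // the data is simply re-read from disk on the next access.
    #include <sys/mman.h>
    #include <cstddef>
    #include <vector>

    struct LayerSpan {
      void*  addr;   // start of this layer inside the mmap'd model
      size_t bytes;  // length of the layer's weights
    };

    void release_cold_layers(const std::vector<LayerSpan>& cold) {
      for (const LayerSpan& layer : cold) {
        // MADV_DONTNEED drops resident pages; the mapping itself stays valid.
        madvise(layer.addr, layer.bytes, MADV_DONTNEED);
      }
    }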
I’m looking for feedback on:
My implementation of the Dart FFI bridge (any performance leaks I missed? a simplified sketch of the C surface follows this list).
Support for 2024-era NPUs on non-flagship Android devices.
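For context on the FFI question: the bridge follows the usual create/use/destroy pattern, and the part I'm least confident about is string ownership across the boundary. A simplified sketch of the C surface the Dart side binds with dart:ffi (names illustrative, not the exact exported symbols):

    // Every object that crosses the boundary gets an explicit *_free, so a
    // leak means a missing free call rather than hidden C++ ownership.
    struct EvSession;  // opaque handle; its contents are never exposed to Dart

    extern "C" {

    EvSession* ev_session_create(const char* model_path);
    void       ev_session_free(EvSession* session);

    // Returns a heap-allocated, NUL-terminated UTF-8 string that the caller
    // must release with ev_string_free. Copying avoids pointing Dart at
    // std::string storage that may move or be freed on the C++ side.
    char* ev_generate(EvSession* session, const char* prompt, int max_tokens);
    void  ev_string_free(char* text);

    }  // extern "C"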
I'll be here all day to answer technical questions.