Run LLMs locally in Flutter with <200ms latency

https://github.com/ramanujammv1988/edge-veda

31•rish2497•1h ago

Comments

rish2497•1h ago

I’m the creator of EdgeVeda. I spent the last few months obsessed with one problem: Why is mobile AI still so slow and expensive?

Most startups are just wrapping OpenAI/Claude/gemini APIs. This works for prototypes, but for production apps, the 1000ms+ roundtrip latency kills the UX, and the inference bills kill the margins.

I built EdgeVeda to be the "Switzerland of Edge AI." It’s a unified C++ engine that handles the hardware-specific "plumbing" (Metal for iOS, Vulkan/NNAPI for Android) so you can run LLMs, Whisper, and TTS locally in one line of code.

Key Technical Feats:

Sub-200ms Time-to-First-Token: Achieved by bypassing the standard Android JNI bottleneck and using a direct memory-mapped buffer.

The Memory Watchdog: Mobile OSs love to kill apps that use >1GB RAM. I implemented a custom allocator that swaps model layers to disk when the system is under pressure.

Unified Pipeline: Orchestrates STT -> LLM -> TTS entirely on-device.

I’m looking for feedback on:

My implementation of the Dart FFI bridge (any performance leaks I missed?).

Support for 2024-era NPUs on non-flagship Android devices.

I'll be here all day to answer technical questions.

refulgentis•52m ago

This is LLM-generated-slop.

The repo still has empty react-native/kotlin projects that were supposed to exist.

It doesn't actually have Metal/Vulkan/NNAPI support, just, an enum for it. (search the repo, I'm serious)

Then another 100 things, not worth listing them out. Except one more I guess, there's ~0 chance of 200 ms TTFT locally, even if they had what they claimed. (modulo stilted scenarios like, only 5 token prompt on desktop-class GPU with 3B model)

Surprised to see it at #2 on the front page.

If you're a developer looking to do local LLMs in Flutter, might as well plug my 2-3 year old project that's still humming, https://github.com/Telosnex/fllama.

It's built on top of llama.cpp and is, well, actually real. And works on every platform, Android, iOS, macOS, Windows, Linux. Web uses MLC, because llama.cpp in WASM is way too slow, WebGPU is slower (it's early). MLC is ~dead, so that's not good, but...whatever. No better option on web currently.

(cheers to you, noble Icarus. I don't mean to make you feel bad, but, you're not going to Claude Code your way to what you want in 2 weeks. I wish. You basically are claiming to have built faster versions of llama.cpp, and ONNX, on every platform with custom accelerators, from scratch, and built innumerable features on top, by yourself, with just Claude Code, in 2 weeks.)

rish2497•15m ago

you'e 100% right to call this out, and I appreciate the deep dive.

To be completely transparent: I’ve over-indexed on the vision and the architecture in this repo rather than the functional implementation. The current state of the code is effectively a "spec-in-code" and a skeleton of the architecture I am building toward, rather than the production-ready engine my post implied.

The "LLM-generated-slop" comment hits home because I have been using AI tools heavily to scaffold the cross-platform boilerplate (the enums, the FFI bridges, and the project structures). In my excitement to show the "unified pipeline" vision, I pushed a version that is essentially a hollow shell of stubs.

Specifics on your points:

Empty projects: Correct. These are placeholders in the current monorepo structure.

Hardware Enums: You caught the stub. I am currently working on the actual Metal/Vulkan integration layers in a private branch, but I mistakenly pushed the "public skeleton" as if it were the finished core.

200ms TTFT: This is our internal target based on local benchmarks with raw llama.cpp implementations, but as you noted, it is currently "undefined" in the public Flutter wrapper because the bridge isn't actually moving tokens yet.

I genuinely appreciate the reality check. Building a "faster version of llama.cpp" is not my goal, my goal is the orchestration layer, but I clearly tried to "Claude Code" my way through the infrastructure too fast.

I’m going to take this feedback, go back to the shed, and focus on the actual C++ implementation before I post another update. Also, big respect to Telosnex/fllama your are the benchmark for a reason, and I clearly have a lot of work to do to reach that level of "real."

Thanks for keeping the community honest.

advisedwang•5m ago

I can't tell if this is satire of a LLM response or an actual LLM response

Show HN: Masharif

Design docs are waterfall wearing a hoodie

Show HN: GreedyPhrase – 1.21x better compression than GPT-4o tiktoken, 6x faster

Phison CEO: Consumer electronics firms may fail by 2026 over AI memory crisis

Show HN: Spawn – Postgres migration/test build system with minijinja (not vibed)

Practical Guide to Building Reliable AI Agents

The anxiety driving AI's brutal work culture is a warning for all of us

Did Gemini just give me someone's personal information?

Show HN: Instagram Saved Collection Exporter

Join the Python Security Response Team

Convert Audi to 432Hz

The Final Bottleneck

Frederick Wiseman, 96, Penetrating Documentarian of Institutions, Dies

Safe VSP

Tesla Robotaxis Reportedly Crashing at a Rate That's 4x Higher Than Humans

Open-source game engine Godot is drowning in 'AI slop' code contributions

Why an A.I. Video of Tom Cruise Battling Brad Pitt Spooked Hollywood

Ask HN: How do you overcome imposter syndrome?

The most practical, fast, tiny command sandboxing for AI agents

An assembler that compiles to a printf loop

The mathematical mystery inside the shooter Quake 3

Adam Mastroianni of Experimental History Interviews Gwern (2025)

First Agent Skills Hackathon by the Authors of SkillsBench

Rathbun's Operator

How Jet Engines Are Powering Data Centers

PostCSS creator: How to make your open source project popular

The gut microbiota shapes the human and murine breath volatilome

Show HN: Algorithms 1.0.0 – Minimal and clean implementations of algorithms

The Cost of Staying vs Judgement, Surface Area and Compute

Write Specs, Not Chats