Cloud-based LLM APIs are convenient, but come with:
- Latency from network round-trips
- Unpredictable API costs
- Privacy concerns (content leaving the device)
- The need for connectivity
For simple tasks like news summarization, small models seem “good enough,” so I tested whether a ~270M-parameter model (Gemma 3 270M) could run entirely on-device.
Stack
- Model - Gemma 3 270M, INT8 quantized
- Runtime - Cactus SDK (Android NPU/GPU acceleration)
- App framework - Flutter
- Device - MediaTek 7300 with 8GB RAM
Architecture
- User shares a URL to the app (Android share sheet).
- App fetches the article HTML → extracts readable text.
- Local model generates a summary.
- On-device TTS reads the summary.

Everything runs offline except the initial page fetch.
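To make the flow concrete, here is a minimal Kotlin sketch of the same pipeline in plain Android. The real app is Flutter plus the Cactus plugin, so treat this as an illustration, not the actual implementation: Jsoup stands in for the HTML-to-text step, and `summarizeLocally()` is a hypothetical placeholder for the on-device Gemma call. The share-intent handling and `TextToSpeech` usage are standard Android APIs.

```kotlin
// Sketch of the share → fetch → extract → summarize → speak pipeline.
// Assumptions: Jsoup for text extraction, summarizeLocally() as a placeholder
// for the on-device model call (the real app goes through the Cactus SDK).
import android.app.Activity
import android.content.Intent
import android.os.Bundle
import android.speech.tts.TextToSpeech
import org.jsoup.Jsoup
import kotlin.concurrent.thread

class ShareReceiverActivity : Activity() {

    private lateinit var tts: TextToSpeech

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        tts = TextToSpeech(this) { /* init status ignored in this sketch */ }

        // 1. Receive the URL from the Android share sheet (ACTION_SEND, text/plain).
        val url = intent?.takeIf { it.action == Intent.ACTION_SEND }
            ?.getStringExtra(Intent.EXTRA_TEXT) ?: return

        thread {
            // 2. Fetch the article HTML and extract readable text.
            val articleText = Jsoup.connect(url).get().body().text()

            // 3. Generate a summary with the local model (placeholder).
            val summary = summarizeLocally(articleText)

            // 4. Read the summary aloud with on-device TTS.
            runOnUiThread {
                tts.speak(summary, TextToSpeech.QUEUE_FLUSH, null, "summary")
            }
        }
    }

    // Hypothetical stand-in for the Cactus SDK / Gemma 3 270M inference call.
    private fun summarizeLocally(text: String): String {
        return text.take(200) // replace with the actual on-device model call
    }

    override fun onDestroy() {
        tts.shutdown()
        super.onDestroy()
    }
}
```

In the actual app this logic sits behind the Flutter plugin layer; the sketch only shows where each step of the pipeline lives.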
Performance
- ~450–900ms latency for a short summary (100–200 tokens).
- On devices without NPU acceleration, CPU-only inference takes 2–3× longer.
- Peak RAM: ~350–450MB.
Limitations
- Quality is noticeably worse than GPT-5 for complex articles.
- Long-form summarization (>1k words) gets inconsistent.
- Web scraping is fragile for JS-heavy or paywalled sites.
- Some low-end phones throttle CPU/GPU aggressively.
| Metric  | Local (Gemma 270M)   | GPT-4o Cloud         |
| ------- | -------------------- | -------------------- |
| Latency | 0.5–1.5s             | 0.7–1.5s + network   |
| Cost    | 0                    | API cost per request |
| Privacy | Text stays on device | Sent over network    |
| Quality | Medium               | High                 |
GitHub - https://github.com/ayusrjn/briefly
Running small LLMs on-device is viable for narrow tasks like summarization. For more complex reasoning tasks, cloud models still outperform by a large margin, but the “local-first” approach seems promising for privacy-sensitive or offline-first applications. The Cactus SDK does a pretty good job of handling the model and hardware acceleration.
Happy to answer questions :)