Show HN: Smart glasses that tell me when to stop pouring

https://github.com/RealComputer/GlassKit/tree/main/examples/rokid-overshoot-openai-realtime
3•tash_2s•1h ago
I've been experimenting with a more proactive AI interface for the physical world.

This project is a drink-making assistant for smart glasses. It looks at the ingredients, selects a recipe, shows the steps, and guides me in real time based on what it sees. The behavior I wanted most was simple: while I'm pouring, it should tell me when to stop, instead of waiting for me to ask.
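Roughly, the step tracking I have in mind looks like the sketch below. This is a simplified illustration, not code from the repo; the class and event names are placeholders for whatever the vision model actually reports.

    from enum import Enum, auto

    class Phase(Enum):
        SHOW_STEP = auto()
        POURING = auto()
        DONE = auto()

    class DrinkAssistant:
        """Hypothetical step tracker driven by events from the vision model."""

        def __init__(self, recipe_steps):
            self.steps = recipe_steps
            self.index = 0
            self.phase = Phase.SHOW_STEP

        def on_event(self, event: str):
            if self.phase is Phase.SHOW_STEP and event == "pour_started":
                self.phase = Phase.POURING
            elif self.phase is Phase.POURING and event == "target_level_reached":
                self.say("Stop pouring.")  # spoken proactively, without being asked
                self.advance()

        def advance(self):
            self.index += 1
            self.phase = Phase.DONE if self.index >= len(self.steps) else Phase.SHOW_STEP

        def say(self, text: str):
            print(text)  # stand-in for the glasses' text-to-speech output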

The demo video is at the top of the README.

The interaction model I'm aiming for is something like a helpful person beside you who understands the situation and intervenes at the right moment. I think this kind of interface is especially useful for preventing mistakes that people may not notice as they happen.

The system runs Qwen3.5-27B on the latest 0.5-second video clip, once every 0.5 seconds. I used Overshoot (https://overshoot.ai/) for fast live-video VLM inference. Because it processes short clips instead of single frames, it can pick up motion cues as well as visual context. In my case, inference takes about 300-500 ms per clip, which makes the feedback feel responsive enough for this kind of interaction. Based on the events returned by the VLM, the app handles the rest: state tracking, progress management, speech, and LLM handling.
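To make the timing concrete, the loop is conceptually something like this. It is a minimal sketch: analyze_clip, camera.last_clip, and state.handle are placeholder names, not the actual Overshoot or GlassKit API.

    import time

    CLIP_SECONDS = 0.5   # length of each video clip sent to the VLM
    POLL_SECONDS = 0.5   # how often a new clip is analyzed

    def analyze_clip(clip) -> dict:
        # Placeholder for the live-video VLM call (Overshoot in my setup).
        # Assumed to return a small event dict, e.g. {"event": "target_level_reached"}.
        return {"event": "no_event"}

    def run_loop(camera, state):
        while True:
            started = time.monotonic()
            clip = camera.last_clip(seconds=CLIP_SECONDS)  # hypothetical camera buffer API
            event = analyze_clip(clip)                     # ~300-500 ms per clip in practice
            state.handle(event)                            # state tracking, progress, speech
            # keep roughly a 0.5 s cadence even when inference finishes early
            elapsed = time.monotonic() - started
            time.sleep(max(0.0, POLL_SECONDS - elapsed))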

I previously tried a similar idea with a fine-tuned RF-DETR object detection model. That approach is better on cost and could also run on-device. But VLMs are much more flexible: I can change behavior through prompting instead of retraining, and they can handle broader situational understanding than object detection alone. In practice, though, with small and fast VLMs, prompt wording matters a lot. Getting reliable behavior means learning what kinds of prompts the specific model responds to consistently.
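One pattern that tends to help with small, fast models is constraining the output: one narrow question per clip, a fixed vocabulary of answers, and a format the app can parse directly. The wording below is only an illustration, not the prompt used in the repo.

    # Illustrative per-clip prompt for a small VLM: constrained answers, no free-form text.
    POUR_PROMPT = """You are watching a 0.5-second clip of someone pouring liquid into a glass.
    Answer with exactly one word from this list and nothing else:
    - not_pouring
    - pouring_below_target
    - target_level_reached
    The target fill level is roughly two thirds of the glass."""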

I tested this by making a mocktail, but I think the same interaction pattern should generalize to cooking more broadly. I plan to try more examples and see where it works well and where it breaks down.

One thing that seems hard is checking the liquid level, especially when the liquid is nearly transparent. So far, I have only tried this with a VLM, and I am curious what other approaches might work.

Questions and feedback welcome.

Comments

stevewave713•4m ago
I have been working on structured prompt templates for different use cases. The biggest improvement I found was using context-first prompting and explicit format specifications. Compiled 30 templates at stevewave713.gumroad.com if anyone wants to check them out.