It seems like they might be giving it more information besides the WiFi data, or else maybe training it on photos of the actual person in the actual room, in which case it's not obvious how well it would generalise.
The model was trained on the room.
It would produce images of the room even without any WiFi data input at all.
The WiFi is used as a modulator on the input to the pre-trained model.
It’s not actually generating an image of the room from only WiFi signals.
The encoder itself is trained on latent embeddings of images from the same environment with the same subject, so it learns the visual details that survive the original autoencoder's compression (this is also why the model can't overfit on, say, text or faces: that kind of fine detail doesn't survive the autoencoder).
"We consider a WiFi sensing system designed to monitor indoor environments by capturing human activity through wireless signals. The system consists of a WiFi access point, a WiFi terminal, and an RGB camera that is available only during the training phase. This setup enables the collection of paired channel state information (CSI) and image data, which are used to train an image generation model"
The interesting part of the whole setup is that the wifi signal seems to contain the information required to predict the posture of the individual to a reasonably high degree of accuracy, which is actually pretty cool.
I know that is a subjective metric, but by anyone’s measure a 4x4 matrix of postage-stamp-sized images is not high resolution.
2. “Postage stamp sized” is not a resolution. Zoom in on them and you’ll see that they’re quite crisp.
"When a brilliant, driven industrialist harnesses the cutting edge of quantum physics to enable people everywhere, at trivial cost, to see one another at all times: around every corner, through every wall, into everyone's most private, hidden, and even intimate moments. It amounts to the sudden and complete abolition of human privacy--forever."
I think the results here are much less important and surprising than some people seem to be thinking. To summarize the core of the paper: we took Stable Diffusion (which is a three-part system of an encoder, a U-Net, and a decoder) and replaced the encoder with one that takes WiFi data instead of images. This gives you two advantages: you get text-based guidance for free, and the encoder model can be smaller. The smaller model, combined with the semantic compression from the autoencoder, gives you better results (SOTA resolution), much faster.
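If it helps, here's a rough sketch of what "replacing the encoder" means (not the actual code from the repo; shapes and layer sizes are illustrative): a small network maps a CSI feature vector into the same latent space that Stable Diffusion's VAE encoder would normally produce, and the frozen U-Net and decoder handle the rest.

    import torch
    import torch.nn as nn

    class CSIEncoder(nn.Module):
        """Maps a CSI feature vector into the latent space of a pre-trained
        latent diffusion model (e.g. 4 x 64 x 64 for SD 1.x)."""
        def __init__(self, n_subcarriers=256, latent_ch=4, latent_hw=64):
            super().__init__()
            self.latent_shape = (latent_ch, latent_hw, latent_hw)
            self.net = nn.Sequential(
                nn.Linear(n_subcarriers, 1024), nn.GELU(),
                nn.Linear(1024, 4096), nn.GELU(),
                nn.Linear(4096, latent_ch * latent_hw * latent_hw),
            )

        def forward(self, csi):                  # csi: (batch, n_subcarriers)
            z = self.net(csi)
            return z.view(csi.shape[0], *self.latent_shape)

    # At inference the predicted latent stands in for the VAE-encoded image:
    # it gets partially noised, denoised by the frozen U-Net (optionally steered
    # by a text prompt), and decoded by the frozen VAE decoder into an RGB image.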
I noticed a lot of discussion about how the model can possibly be so accurate. It wouldn't be wrong to consider the model overfit, in the sense that the visual details of the scene are moved from the training data into the model weights. These kinds of models are meant to be trained & deployed in a single environment. What's interesting about this work is that learning the environment well has become really fast, because the output dimension is smaller than image space. In fact, it's so fast that you can basically do it in real time: you can turn on a data collection node in a new environment and train a model from scratch, online, that gets decent results, with at least a little bit of interesting generalization, in ~10 min. I'm presenting a demonstration of this at MobiCom 2025 next month in Hong Kong.
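To make the "learning the environment" point concrete, the training loop can be as simple as regressing the CSI encoder onto the latents a frozen VAE encoder produces from the paired camera frames. This is a hedged sketch with made-up names (csi_encoder, vae_encode, dataloader), not the exact code:

    import torch
    import torch.nn.functional as F

    # Assumed setup: csi_encoder maps CSI to latents, vae_encode() is the frozen
    # image encoder of the latent diffusion model, and dataloader yields paired
    # (csi, image) frames captured in the target room during calibration.
    opt = torch.optim.Adam(csi_encoder.parameters(), lr=1e-4)
    for csi, image in dataloader:
        with torch.no_grad():
            target = vae_encode(image)      # latent the image encoder would produce
        pred = csi_encoder(csi)             # latent predicted from WiFi CSI alone
        loss = F.mse_loss(pred, target)     # match in latent space, not pixel space
        opt.zero_grad(); loss.backward(); opt.step()

Because the target is a small latent rather than a full image, each step is cheap, which is what makes the ~10 min online training feasible.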
What people call "WiFi sensing" is now mostly CSI (channel state information) sensing. When you transmit a packet on many subcarriers (frequencies), the CSI captures how the signal on each subcarrier was attenuated and phase-shifted in transit. So, CSI is inherently quite sensitive to environmental changes.
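A toy numpy example of that sensitivity: model the channel as a few propagation paths and look at the per-subcarrier frequency response. Adding a reflector (a person), or moving it a few centimetres, visibly reshapes the whole curve. The geometry and gains below are made up, purely for illustration:

    import numpy as np

    f = 5.18e9 + 312.5e3 * np.arange(-26, 27)   # subcarrier frequencies, roughly WiFi channel 36
    c = 3e8

    def csi(path_lengths_m, gains):
        # frequency response of a simple multipath channel: a sum of delayed, scaled paths
        return sum(g * np.exp(-2j * np.pi * f * (d / c))
                   for g, d in zip(gains, path_lengths_m))

    empty  = csi([10.0, 14.0],        [1.0, 0.4])        # direct path + wall reflection
    person = csi([10.0, 14.0, 12.03], [1.0, 0.4, 0.2])   # a person adds a third path
    moved  = csi([10.0, 14.0, 12.08], [1.0, 0.4, 0.2])   # the person shifts by 5 cm

    print(np.round(np.abs(empty)[:6], 2))
    print(np.round(np.abs(person)[:6], 2))   # amplitude profile across subcarriers changes
    print(np.round(np.abs(moved)[:6], 2))    # and changes again after a 5 cm movement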
I want to point out something that most everybody working in the CSI sensing/general ISAC space seems to know: generalization is hard and most definitely unsolved for any reasonably high-dimensional sensing problem (like image generation and to some extent pose estimation). I see a lot of fearmongering online about wifi sensing killing privacy for good, but in my opinion we're still quite far off.
I've made the project's code and some formatted data public since this paper is starting to pick up some attention: https://github.com/nishio-laboratory/latentcsi
What is available at the low level? Are researchers using SDRs, or are there common WiFi chips that properly report CSI? Do most people feed in the CSI of literally every packet, or is it sampled?
As for low level:
The most common early hardware was, afaik, ESP32s & https://stevenmhernandez.github.io/ESP32-CSI-Tool/, and also old Intel NICs & https://dhalperi.github.io/linux-80211n-csitool/.
Now many people use https://ps.zpj.io/, which supports some hardware including SDRs, but I must discourage using it, especially for research, as it's not free software and has a restrictive license. I used https://feitcsi.kuskosoft.com/, which uses a slightly modified iwlwifi driver, since iwlwifi needs to compute CSI anyway. There are free software alternatives for SDR CSI extraction as well; it's not hard to build an OFDM chain with GNU Radio and extract CSI, although this might require a slightly more in-depth understanding of how wifi works.
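For a feel of what "extract CSI" means at the signal level: 802.11 frames begin with known training symbols, so once you're time-synchronized the per-subcarrier estimate is just a division. A rough numpy sketch (a real receiver also handles CFO correction, cyclic prefix removal, smoothing, etc.; the known_ltf file here is a placeholder you'd fill with the standard's LTF sequence):

    import numpy as np

    N_FFT = 64
    known_ltf = np.load("ltf_freq.npy")   # placeholder: the known LTF symbol per subcarrier

    def estimate_csi(rx_ltf_time):
        """Least-squares per-subcarrier channel estimate from one received,
        time-synchronized long training symbol (64 time-domain samples)."""
        rx_freq = np.fft.fft(rx_ltf_time, N_FFT)
        used = known_ltf != 0                        # skip null/guard subcarriers
        h = np.zeros(N_FFT, dtype=complex)
        h[used] = rx_freq[used] / known_ltf[used]    # H_k = Y_k / X_k
        return h                                     # this vector is the packet's CSI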
Any "unknown" state of the scene is bound to confuse it.
Basically, researchers figured out how to use the invisible radio waves from your Wi-Fi router to create surprisingly clear pictures of whatever is around it, even if there are walls in the way.
Your router is constantly firing out radio signals, right? When those signals hit a person, a dog, or a chair, they bounce off and create a unique echo pattern. This echo pattern is called CSI (Channel State Information). It's a precise digital "shadow" of everything in the room. Turning that messy echo pattern into an actual picture used to be super difficult and slow. But now, they use a fancy type of AI—the same kind that generates images when you type a prompt—to do the heavy lifting.
The AI is super smart and knows how to instantly translate that invisible echo pattern into a high-resolution image.
So the big picture is: it's like they've figured out how to use your average home Wi-Fi to "see" without light or a camera, and they can do it so efficiently (quickly and cheaply) that it might become a normal thing.
It’s pretty wild, and the applications are huge—especially for things like monitoring the health of older people without putting cameras in their rooms. Of course, it also means walls don't stop surveillance anymore, which is kind of unsettling!
jychang•4mo ago
Is this just extremely overfitted?
Is there a way for us to test this? Even if the model isn't open source, I'd pay $1 to take a capture from my wifi card on my linux box, upload it to the researchers, and have them generate a picture to see if it's accurate.
esrh•4mo ago
The more space you take up in the frequency domain, the higher your resolution in the time domain is. Wifi sensing results that detect heart rate or breathing, for example, use even larger bandwidth, to the point where it'd be more accurate to call them radars than wifi access points.
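For intuition, some back-of-the-envelope numbers, assuming the usual "delay resolution ≈ 1/bandwidth" rule of thumb:

    c = 3e8
    for bw in (20e6, 40e6, 80e6, 160e6):   # common WiFi channel bandwidths
        dt = 1 / bw                        # smallest resolvable delay difference
        print(f"{bw/1e6:.0f} MHz -> {dt*1e9:.2f} ns delay, ~{c*dt:.1f} m of path difference")

    # 20 MHz -> 50.00 ns delay, ~15.0 m of path difference
    # 160 MHz -> 6.25 ns delay, ~1.9 m of path difference

A plain 20 MHz channel can only separate paths whose lengths differ by roughly 15 m, so picking up something as small as chest motion pushes you toward much wider bandwidth and radar-like setups.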
tylervigen•4mo ago
If you uploaded a random room to the model without retraining it, you wouldn't get anything as accurate as the images in the paper.