I took the unofficial IKEA US dataset (originally scraped by jeffreyszhou) and converted all 30,511 products into a flat, markdown-like protocol called CommerceTXT.
The goal: see whether a flatter structure is more token-efficient for LLM context windows.
The results:
- Size: 30k products across 632 categories.
- Efficiency: The text version uses ~24% fewer tokens (3.6M saved in total) compared to the equivalent minified JSON.
- Structure: Files are organized in folders (e.g. /products/category/), which helps with testing hierarchical retrieval routers.
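To make the token-savings claim concrete, here is a minimal sketch of the kind of flat "key: value" conversion being described. The field names and record below are illustrative assumptions, not the actual dataset schema, and character counts are used as a crude stand-in for tokens (the real benchmarks use an LLM tokenizer):

```python
import json

# Hypothetical product record; field names are illustrative,
# not taken from the actual dataset schema.
product = {
    "name": "BILLY",
    "category": "bookcases",
    "price_usd": 59.99,
    "width_cm": 80,
    "depth_cm": 28,
    "height_cm": 202,
}

def to_flat_text(record):
    """Render a record as flat 'key: value' lines, one field per line.
    Dropping JSON's quotes, braces, and commas is where the savings come from."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())

flat = to_flat_text(product)
minified = json.dumps(product, separators=(",", ":"))

# Character counts only approximate token counts, but the flat form
# is already shorter before any tokenizer-level merging.
print(len(flat), len(minified))
```

The savings come almost entirely from structural punctuation that JSON repeats on every field; how closely that tracks actual BPE token counts depends on the tokenizer, which is what the dataset's benchmarks measure.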
The link goes to the dataset on Hugging Face which has the full benchmarks.
Parser code is here: https://github.com/commercetxt/commercetxt
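For the reverse direction, a toy parser for a flat "key: value" record might look like the sketch below. This is an assumption about the format based on the description above, not the grammar implemented in the linked repo, and it keeps all values as strings:

```python
def parse_flat_record(text):
    """Parse flat 'key: value' lines back into a dict.
    A sketch only -- the real CommerceTXT grammar lives in the linked repo."""
    record = {}
    for line in text.strip().splitlines():
        # Split on the first ': ' so values may themselves contain colons.
        key, _, value = line.partition(": ")
        record[key] = value
    return record

print(parse_flat_record("name: BILLY\nprice_usd: 59.99"))
# → {'name': 'BILLY', 'price_usd': '59.99'}
```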
Happy to answer questions about the conversion logic!
colinbartlett•2h ago
Or just a handy open data set you could use to prove out the concept?
embedding-shape•1h ago
Huh? I don't think that's true. There are usually structural elements inside the package meant to be thrown away (usually made of cardboard/paper), and all IKEA boxes definitely have lots of air inside them. Not sure what would make you say otherwise, unless there's a joke I'm missing?