Ask HN: Is it significant that token length of source is close to e?
1•keepamovin•1h ago
What's the relationship between the Shannon entropy of the distribution and token length? For example, English is often quoted as having an average token length of ~4 characters, but source code (that I've tested) seems to be closer to 2.7. Is it significant that this is close to e (i.e., the base of the natural logarithm)? Is source code a more efficient and natural representation of structure/knowledge/information than English? Any thoughts? Any connection with how log appears in thermodynamic entropy?
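(For anyone who wants to reproduce the measurement: here's a minimal sketch of how average token length and character-level Shannon entropy can be computed. It assumes naive whitespace tokenization, whereas the figures in the question come from Claude's own tokenizer, which uses subword/BPE tokens, so the numbers will differ.)

```python
import math
from collections import Counter

def avg_token_length(text):
    # Crude whitespace tokenization; subword tokenizers (BPE etc.)
    # produce different, usually shorter, average lengths.
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens)

def char_entropy(text):
    # Shannon entropy of the character distribution, in bits per character.
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "def add(a, b):\n    return a + b\n"
print(avg_token_length(sample))  # mean characters per whitespace token
print(char_entropy(sample))     # bits per character
```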
Comments
uberman•1h ago
My guess is that you are not naming your variables correctly.
keepamovin•11m ago
*Token length here is based on file size divided by token count (as reported by Claude).