Ask HN: Is it significant that token length of source is close to e?
1•keepamovin•1mo ago
What's the relationship between the Shannon entropy of a distribution and average token length? For example, English is often quoted as having an average token length of ~4 characters, but source code (that I've tested) seems to be closer to 2.7. Is it significant that this is close to e (i.e., the base of the natural logarithm)? Is source code a more efficient and natural representation of structure/knowledge/information than English? Any thoughts? Any connection with how log appears in thermodynamic entropy?
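For concreteness, character-level Shannon entropy can be estimated from a sample like this (a minimal sketch; note the ~4 vs ~2.7 figures above are average token lengths in characters, not entropies, so the two quantities are related only indirectly):

```python
import math
from collections import Counter

def char_entropy_bits(text: str) -> float:
    """Shannon entropy of the empirical character distribution, in bits/char."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two equiprobable symbols carry exactly 1 bit per character.
print(char_entropy_bits("aabb"))  # → 1.0
```

On a real corpus this gives an upper-bound-style estimate (it ignores inter-character correlations), which is one reason entropy estimates and naive character counts diverge.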
Comments
uberman•1mo ago
My guess is that you are not naming your variables correctly.
keepamovin•1mo ago
*Based on file size / token count (as reported by Claude).
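That measurement can be reproduced with a rough sketch. This assumes a naive regex split as a stand-in tokenizer; Claude's actual BPE-style tokenizer will produce different (generally longer) tokens, so the absolute numbers will differ:

```python
import re

def avg_chars_per_token(text: str) -> float:
    """File size / token count, per the measurement described above.

    Caveats: the numerator includes whitespace while the tokens do not,
    and the regex split is only a crude stand-in for a BPE tokenizer.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len(text) / len(tokens) if tokens else 0.0

source = "def add(a, b):\n    return a + b\n"
print(round(avg_chars_per_token(source), 2))
```

For this toy snippet the ratio happens to land near 2.7, but the result varies heavily with the tokenizer and the corpus, which is worth keeping in mind before reading anything into the proximity to e.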
gus_massa•1mo ago
Does the average length include the space?