There's SWE-bench Multilingual for example, but translating a problem into multiple natural languages before passing it to the LLM has not been benchmarked afaik.
If there's some residual of the natural language left when the middle layers execute, that would in part validate Sapir-Whorf.
dot_treo•1h ago
One particular thing, unrelated to the linguistic argument itself, stood out to me. In the PCA visualisation, we can see that some sequences of layers have particularly tight and stationary clusters. Incidentally, those are also exactly the layers that the previous RYS post identified as most useful to repeat to improve performance on the probes.
I wonder whether that correlation could be used to identify good candidates for repeating layers.
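A rough sketch of how one might test that, assuming you already have per-layer activations for a set of labelled prompts. Everything here is hypothetical: the function names, the `(n_layers, n_samples, dim)` layout, and the scoring rule (within-cluster spread plus centroid movement from the previous layer, both divided by the layer's overall spread) are my own guesses at what "tight and stationary" could mean numerically, not anything taken from the post.

```python
import numpy as np

def layer_cluster_scores(hidden, labels):
    """hidden: (n_layers, n_samples, dim) activations; labels: (n_samples,) cluster ids.
    Returns (tightness, drift), each of shape (n_layers,); lower is better for both."""
    hidden = np.asarray(hidden, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_layers, n_samples, dim = hidden.shape

    spread = np.zeros(n_layers)
    tightness = np.zeros(n_layers)
    centroids = np.zeros((n_layers, len(classes), dim))
    for l in range(n_layers):
        x = hidden[l]
        # Overall spread of the layer, used to make both scores scale-free.
        spread[l] = np.linalg.norm(x - x.mean(axis=0), axis=1).mean() + 1e-12
        within = 0.0
        for i, c in enumerate(classes):
            pts = x[labels == c]
            centroids[l, i] = pts.mean(axis=0)
            within += np.linalg.norm(pts - centroids[l, i], axis=1).sum()
        tightness[l] = (within / n_samples) / spread[l]

    # Drift of layer l = mean centroid movement from layer l-1 to layer l.
    # The first layer has no predecessor, so it is excluded via +inf.
    drift = np.full(n_layers, np.inf)
    if n_layers > 1:
        step = np.linalg.norm(centroids[1:] - centroids[:-1], axis=2).mean(axis=1)
        drift[1:] = step / spread[1:]
    return tightness, drift

def candidate_layers(hidden, labels, k=3):
    """Indices of the k layers with the tightest, most stationary clusters."""
    tightness, drift = layer_cluster_scores(hidden, labels)
    return np.argsort(tightness + drift, kind="stable")[:k]
```

You'd then compare `candidate_layers(...)` against the layers that actually helped when repeated. Summing the two scores with equal weight is arbitrary, and this works on the raw activations rather than the PCA projection, so treat it as a starting point only.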
PaulHoule•1h ago
dot_treo•1h ago