Scaling up almost always leads to better performance. But if you're only getting linear gains, there's nothing to be excited about: you're in a dead end.
It's unfortunate that they briefly mention Neel Nanda's Othello experiments, but not the wider array of related work, like the NeurIPS oral "Linear Representation Hypothesis in Language Models" or even Golden Gate Claude.
I find it rather interesting that the structured representations go from sparse to full to sparse as a function of layer depth. I have noticed that applying a weight decay penalty that scales exponentially with layer depth gives better results than using a single global weight decay.
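A minimal sketch of how that could be wired up, assuming PyTorch and its per-parameter-group weight_decay option; the base decay, growth rate, and toy model are made-up illustrative values, and the comment doesn't say whether the decay should grow or shrink with depth, so this only shows the mechanism.

    # Minimal sketch: depth-dependent weight decay via optimizer parameter groups.
    # base_decay and growth are illustrative values, not taken from the comment above.
    import torch
    import torch.nn as nn

    depth = 6
    blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(depth)])  # toy stack of layers

    base_decay = 1e-4   # decay at layer 0 (assumed)
    growth = 1.5        # per-layer multiplicative factor; use < 1.0 to decay less with depth

    # One parameter group per layer, each with its own weight_decay.
    param_groups = [
        {"params": layer.parameters(), "weight_decay": base_decay * growth ** i}
        for i, layer in enumerate(blocks)
    ]

    optimizer = torch.optim.AdamW(param_groups, lr=3e-4)

AdamW applies each group's weight_decay independently, so the rest of the training loop doesn't have to change.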
macintux•5h ago
https://news.ycombinator.com/show
Welcome to the site. There are a lot of features which are less obvious, which you’ll discover over time.
macintux•4h ago
> Show HN is for something you've made that other people can play with… On topic: things people can run on their computers or hold in their hands