Ask HN: What's something interesting you learned from training your own GPT?
2•amadeuswoo•1h ago
Not using APIs, but actually training a model from scratch, even a small one.
What surprised you about the data, the training process, or the output?
Comments
linolevan•1h ago
For tiny models, the SFT data mixture is unbelievably critical to usability, because they are barely able to generalize at all. If you don't include multi-turn conversations, they will not be able to do multi-turn conversations. If your multi-turn conversations are just chatting, and math only appears in single-turn examples, the model will be unable to do math in a multi-turn setting. This is much less true for bigger models.
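A minimal sketch of the mixture fix this implies, assuming a chat-style messages format (all example conversations and the `splice_math_into_chat` helper are hypothetical, not from the comment): splice math exchanges into existing multi-turn conversations so the skill and the dialogue format co-occur in training, rather than keeping math as single-turn-only data.

```python
# Sketch: a tiny model trained on (a) multi-turn chat and (b) single-turn
# math never sees math inside a dialogue, so it can't do math multi-turn.
# One remedy is to append math Q/A turns onto existing conversations.

def single_turn_math():
    # A single-turn math SFT example (hypothetical content).
    return [{"role": "user", "content": "What is 12 * 7?"},
            {"role": "assistant", "content": "12 * 7 = 84."}]

def multi_turn_chat():
    # A multi-turn chit-chat SFT example (hypothetical content).
    return [{"role": "user", "content": "Hi there!"},
            {"role": "assistant", "content": "Hello! How can I help?"},
            {"role": "user", "content": "Just saying hi."},
            {"role": "assistant", "content": "Happy to chat anytime."}]

def splice_math_into_chat(chat, math):
    # Extend an existing conversation with a math exchange, so the model
    # sees math questions arriving mid-dialogue during SFT.
    return chat + math

mixed = splice_math_into_chat(multi_turn_chat(), single_turn_math())
# mixed is now a 6-message conversation ending in a math exchange.
```

The point is only that the format distribution matters: for a tiny model, every capability apparently needs to appear in every conversational shape you want it to work in.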