Context:
I have been working on a recommender system to help find open source data for LLMs.
What it does:
This is an MVP that finds data to help models learn concepts based on a use case provided by the end user. The MVP uses perplexity as a benchmark to demonstrate counterfactual performance.
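As a rough illustration of the perplexity benchmark, the sketch below computes perplexity from per-token log-probabilities (exp of the negative mean log-probability). The specific numbers are hypothetical placeholders, not outputs of the MVP; in practice the log-probs would come from scoring a held-out evaluation set with the model before and after training on the recommended data.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-mean(log p)). Lower is better."""
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probs on a held-out set,
# before and after training on the recommended data.
baseline = [-2.3, -1.9, -2.7, -2.1]
counterfactual = [-1.4, -1.1, -1.6, -1.2]

print(perplexity(baseline))        # higher perplexity without curation
print(perplexity(counterfactual))  # lower perplexity with curated data
```

Comparing the two numbers on the same evaluation set is what makes the benchmark counterfactual: the only change between runs is which data the model was trained on.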
The thesis: Both small specialized LLMs and frontier models suffer from the same problem. The current approach is dragnet data collection: training on vast swaths of the internet in the hope of a marginal gain in performance. This is not a sustainable path to incremental gains.
By guiding and prioritizing which data should be learned, resource usage should drop significantly without a loss in capability.