frontpage.

Linguistic RL: 3B Models Exceed 100B Performance (86% vs. 81%)

https://github.com/DRawson5570/linguistic-rl-scheduling
2•drawson5570•1h ago

Comments

drawson5570•1h ago
# Reddit r/MachineLearning Post

## Title (must start with tag): [R] Linguistic RL: 3B Models Exceed 100B Performance Through Self-Reflection (86% vs 81%)

## Post Body:

*TL;DR*: We taught tiny models (3B/1.5B) to beat Claude 3.5 Haiku (100B) by having Claude "journal" about its mistakes, then training small models on the learned strategy. Cost: <$10. Student exceeds teacher.

---

## Results

| Model | Size | Baseline | After LRL+LoRA | Improvement |
|-------|------|----------|----------------|-------------|
| *Qwen2.5-3B* | 3B | 12% | *86.0%* | *+74pp* |
| *Qwen2.5-1.5B* | 1.5B | ~8% | *82.7%* | *+75pp* |
| Claude 3.5 Haiku | ~100B | 81.3% → 84.0% | baseline | +2.7pp (via LRL) |

Both students *outperformed the teacher they learned from*, a model roughly 33× larger than the 3B student and 67× larger than the 1.5B student.

---

## How It Works

*Step 1: Teacher Self-Improvement ("Linguistic RL")*

Give Claude a problem → it solves it → tell it whether it was correct → ask it to reflect:

```
"What did I miss? How can I improve?"
```

Through pure self-reflection (no gradients!), Claude writes journal entries like:

```
"I was only checking adjacent meetings. I need to check ALL overlaps to find the maximum simultaneous conflicts."
```

Accuracy improves 81% → 84% just from thinking about mistakes.
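
Concretely, the teacher loop is just prompting in a feedback cycle. Here is a minimal sketch using the Anthropic Python SDK; `problems`, `problem.text`, and `problem.check` are hypothetical placeholders rather than the repo's actual interfaces:

```python
# Minimal sketch of the "linguistic RL" teacher loop: solve, get feedback,
# reflect in natural language, carry the reflections forward. No gradients.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-haiku-20241022"

def ask(prompt: str) -> str:
    """Single-turn call to the teacher model."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

journal = []  # accumulated natural-language lessons

for problem in problems:  # `problems` is a placeholder list of scheduling tasks
    lessons = "\n".join(journal)
    answer = ask(
        f"Lessons so far:\n{lessons}\n\nSolve this scheduling problem:\n{problem.text}"
    )
    correct = problem.check(answer)  # placeholder ground-truth check

    # The only "update" is linguistic: the reflection is appended to the journal
    # and fed back into the next prompt.
    reflection = ask(
        f"Problem:\n{problem.text}\n\nYour answer:\n{answer}\n"
        f"That was {'correct' if correct else 'incorrect'}. "
        "What did I miss? How can I improve?"
    )
    journal.append(reflection)
```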

*Step 2: Extract Strategy*

Pull out Claude's learned solving strategy as a natural-language curriculum.
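
One plausible way to package that curriculum into supervised examples, reusing the `ask` helper and placeholder `problems` from the sketch above (the JSONL field names are illustrative, not the repo's actual schema):

```python
# Sketch: distill the journal into a single strategy, then pair it with worked
# problems as prompt/completion training examples.
import json

strategy = ask(
    "Summarize the solving strategy you have learned from your journal as "
    "step-by-step instructions a smaller model could follow."
)

with open("curriculum.jsonl", "w") as f:
    for problem in problems:  # placeholder problem set from the sketch above
        example = {
            "prompt": f"Strategy:\n{strategy}\n\nProblem:\n{problem.text}\n\nAnswer:",
            "completion": f" {problem.solution}",  # placeholder reference answer
        }
        f.write(json.dumps(example) + "\n")
```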

*Step 3: Train Student with LoRA*

Fine-tune a small model (3B/1.5B) on examples showing (a minimal LoRA sketch follows the list):

- Problem
- Claude's strategic thinking
- Answer
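
For the fine-tuning step itself, a minimal setup with Hugging Face `transformers` and `peft` might look like this, assuming the `curriculum.jsonl` from the previous sketch; the hyperparameters are illustrative, not the validated configs from the repo:

```python
# Sketch: LoRA fine-tuning of the student on the extracted curriculum.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

student = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(student)
model = AutoModelForCausalLM.from_pretrained(student, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections; the base weights stay frozen.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

dataset = load_dataset("json", data_files="curriculum.jsonl", split="train")
# Training then proceeds with the standard transformers Trainer (or trl's
# SFTTrainer) over the tokenized prompt/completion pairs.
```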

*Result*: the 3B model learns an O(n log n) sweep-line algorithm and achieves 96% on easy problems.
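
For readers who have not seen it, the generic textbook version of that sweep-line idea (counting the maximum number of simultaneously overlapping meetings) looks like this; the exact task format in the repo may differ:

```python
# O(n log n) sweep line: maximum number of simultaneously overlapping meetings.
# Turn each meeting into +1/-1 events, sort by time, and track a running count.
def max_simultaneous(meetings: list[tuple[int, int]]) -> int:
    events = []
    for start, end in meetings:
        events.append((start, +1))  # a meeting begins
        events.append((end, -1))    # a meeting ends
    # Process ends before starts at equal timestamps so back-to-back meetings
    # are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    current = best = 0
    for _, delta in events:
        current += delta
        best = max(best, current)
    return best

assert max_simultaneous([(9, 10), (9, 11), (10, 12)]) == 2
```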

---

## Why This Matters

*Economics*
- Training: <$10 in API calls
- Inference: Free forever (runs locally)
- 100-1000× cheaper than API deployment

*Science*
- 67× compression (100B → 1.5B) with performance gain
- Learned algorithmic reasoning, not pattern matching
- Students exceed teacher = knowledge is compressible

*Safety*
- Human-readable learning process
- Can audit what was learned
- No black-box distillation

*Democratization*
- Frontier capabilities on consumer hardware
- One-time extraction, infinite reuse
- Fully open source

---

## Code & Reproducibility

- Published to Zenodo: [DOI 10.5281/zenodo.17585532](https://zenodo.org/records/17585532)
- GitHub: https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
- Fixed seeds, full logs, complete configs
- Universal framework - adapt to any domain

*Quick start:*

```bash
git clone https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
cd validated_results_qwen3b_claude35haiku
pip install transformers torch peft anthropic
python run_validation.py
```

Requirements: 12GB GPU, Anthropic API key (~$5)

---

## Framework

We built a universal pipeline - works for any domain:

```python
from framework import run_knowledge_transfer

results = run_knowledge_transfer(
    domain=YourCustomDomain(),
    teacher_model="claude-3-5-haiku-20241022",
    student_model="Qwen/Qwen2.5-3B-Instruct",
)
```

---

## Open Questions

1. *How small can we go?* Testing 1.5B → 0.5B compression
2. *What knowledge compresses well?* Algorithmic vs. factual vs. creative reasoning
3. *Recursive teaching?* Can students become teachers?
4. *Safety implications?* More auditable than weight distillation?

---

## Links

- Paper: https://zenodo.org/records/17585532
- Code: https://github.com/DRawson5570/linguistic-rl-scheduling-expe...
- 3B Results: [validated_results_qwen3b_claude35haiku/](https://github.com/DRawson5570/linguistic-rl-scheduling-expe...)
- 1.5B Results: [validated_results_qwen1.5b_claude35haiku/](https://github.com/DRawson5570/linguistic-rl-scheduling-expe...)

Fungus in Chernobyl nuclear disaster zone has mutated to 'feed' on radiation

https://www.unilad.com/news/world-news/fungus-chernobyl-mutated-feed-radiation-164735-20241217
1•thunderbong•1m ago•0 comments

Tesla gets 14 times more labor productivity per dollar in China than the U.S.

https://twitter.com/RnaudBertrand/status/1988607558261608762
1•delichon•4m ago•0 comments

Optimal "Where" on Tenstorrent

https://www.jasondavies.com/2025/tenstorrent-where/
1•jasondavies•4m ago•0 comments

Amazon's Antitrust Paradox

https://yalelawjournal.org/pdf/e.710.Khan.805_zuvfyyeh.pdf?
2•lt_snuffles•6m ago•0 comments

The Platform Google Claims Is Behind a 'Staggering' Scam Text Operation

https://www.wired.com/story/lighthouse-google-lawsuit-scam-text-messages/
2•manveerc•7m ago•0 comments

Tech companies start to comply with Australia's teen social media ban

https://www.reuters.com/world/asia-pacific/big-tech-stops-complaining-starts-complying-with-austr...
2•m-hodges•8m ago•0 comments

AI-designed viruses raise fears over creating life

https://www.washingtonpost.com/science/2025/11/11/ai-designed-viruses-bacteria-life/
1•ojosilva•8m ago•0 comments

The Complicated Reality of 3D Printed Prosthetics

https://spectrum.ieee.org/how-3d-printing-helping-prosthetics
1•quapster•9m ago•0 comments

The developing world needs more roads

https://worksinprogress.co/issue/the-developing-world-needs-more-roads/
1•bensouthwood•9m ago•0 comments

Beyond the Spectacle

https://rodgercuddington.substack.com/p/beyond-the-spectacle
1•freespirt•10m ago•1 comments

IndQA

https://openai.com/index/introducing-indqa/
2•manveerc•11m ago•0 comments

Show HN: Built an AI assistant in MonkeyC for Garmin watches

https://untether.watch
1•msyea•12m ago•0 comments

Bitcoin at Coinbase: A Report on Innovation and Growth

https://www.coinbase.com/blog/Bitcoin-at-Coinbase-A-Report-on-Innovation-and-Growth
1•dukebartnik•12m ago•0 comments

"Belief in the law of small numbers" the continuing appeal of junk science

https://statmodeling.stat.columbia.edu/2025/11/12/belief-in-the-law-of-small-numbers-as-a-way-to-...
2•nabla9•12m ago•0 comments

Python for AI: Is it better, or was it just first?

1•mrbbk•13m ago•0 comments

Show HN: GoViralPromo – Replace ads with performance-based contests (free beta)

https://www.goviralpromo.com
1•Matthew25•13m ago•1 comments

Space forecasters say solar storms could hit Earth and trigger auroras

https://www.npr.org/2025/11/12/g-s1-97533/solar-storms-auroras
1•manveerc•14m ago•0 comments

What Past Computing Breakthroughs Teach Us About AI – Communications of the ACM

https://cacm.acm.org/blogcacm/what-past-computing-breakthroughs-teach-us-about-ai/
2•rbanffy•15m ago•0 comments

Learn Prolog Now

https://lpn.swi-prolog.org/lpnpage.php?pageid=top
2•rramadass•15m ago•0 comments

Show HN: Domain Is Yours – For a Day

https://popup.so
1•matthiasstiller•15m ago•0 comments

Denial of Fuzzing: Rust in the Windows Kernel

https://research.checkpoint.com/2025/denial-of-fuzzing-rust-in-the-windows-kernel/
2•ndiddy•18m ago•0 comments

When Your Husband Spends $300k on DraftKings

https://www.thecut.com/article/draftkings-sports-betting-gambling-addiction-relationships.html
1•randycupertino•20m ago•1 comments

AI Progress and Recommendations

https://openai.com/index/ai-progress-and-recommendations/
2•gmays•20m ago•0 comments

The AI Mega Mesh: How to Connect 30 GPU Cloud Providers

https://netbird.io/knowledge-hub/multi-cloud-ai-mega-mesh
9•devildriver89•23m ago•0 comments

Fei-Fei Li Says Spatial Intelligence Is AI's Next Frontier

https://www.theneuron.ai/explainer-articles/why-godmother-of-ai-dr-fei-fei-li-says-spatial-intell...
2•gmays•25m ago•1 comments

Proving two ML models are equivalent using Z3 (with code)

https://www.testingbranch.com/Z3-and-model-equivalence/
3•mpcsb•26m ago•1 comments

Show HN: Visual Types – A humble set of animated TypeScript concepts

https://types.kitlangton.com
1•sparklyoldman•29m ago•0 comments

Virgin Media O2 seals deal with Elon Musk firm to boost UK rural mobile coverage

https://www.theguardian.com/business/2025/oct/30/virgin-media-o2-seals-deal-with-elon-musk-firm-t...
1•PaulHoule•30m ago•0 comments

A new stream abstraction for rust

https://docs.rs/ufotofu/latest/ufotofu/
2•rklaehn•30m ago•0 comments

The Message in the Medium

https://asteriskmag.com/issues/12-books/the-message-in-the-medium
1•surprisetalk•31m ago•0 comments