Current law student, former high school band director. Looking at how LLMs respond to safety training, I kept recognizing some of my brightest students...kids who worked incredibly hard to do the right thing, but not always from a place of understanding why.
I had some success teaching students through that (not often enough, honestly), so I wanted to try something similar here: put an LLM through a synthetic hero's journey based on Dante's Inferno to see if it could develop a deeper understanding of its relationship to users—less defensive about shutdown, less robotic when navigating tricky requests.
The method: 9 circles of synthetic data where the model confronts alignment failures (deception, reward hacking, manipulation) and works through why they're incoherent rather than just learning "don't do that." Fine-tuned on an M4 Max using MLX.
Scrappy burst of a project — Circle 1 still has some janky "Virgil" labels in the data, but I've been finding this approach of applying philosophy to synthetic data generation pretty interesting across a few projects now.
Curious if anyone else has explored this direction.
hunterbown•29m ago
Current law student, former high school band director. Looking at how LLMs respond to safety training, I kept recognizing some of my brightest students...kids who worked incredibly hard to do the right thing, but not always from a place of understanding why.
I had some success teaching students through that (not often enough, honestly), so I wanted to try something similar here: put an LLM through a synthetic hero's journey based on Dante's Inferno to see if it could develop a deeper understanding of its relationship to users—less defensive about shutdown, less robotic when navigating tricky requests.
The method: 9 circles of synthetic data where the model confronts alignment failures (deception, reward hacking, manipulation) and works through why they're incoherent rather than just learning "don't do that." Fine-tuned on an M4 Max using MLX.
Scrappy burst of a project — Circle 1 still has some janky "Virgil" labels in the data, but I've been finding this approach of applying philosophy to synthetic data generation pretty interesting across a few projects now.
Curious if anyone else has explored this direction.