Flint substantially raises the NoveltyBench score relative to the base model, without significantly reducing scores on non-creative benchmarks such as MMLU-STEM.
This shows that divergence tuning does not have to be a tax on base capabilities.
Flint scores 7.47/10 on NoveltyBench, whereas most frontier models score between 1.8 and 3.2.