This is the first new paper from Alec Radford since he left OpenAI. Token-level data filtering is kind of a simple idea, but so are many effective ideas in LLMs.
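For what it's worth, here's roughly how I picture the token-level version, as opposed to dropping whole documents. This is just my sketch: the keyword list below is a hypothetical stand-in for whatever flagger/classifier the paper actually uses, and I'm assuming "filtering" means zeroing those tokens out of the pretraining loss.

    # Rough sketch of token-level data filtering (my reading, not the
    # paper's actual pipeline): flagged tokens are excluded from the
    # next-token-prediction loss, so the model never trains on them.
    import torch
    import torch.nn.functional as F

    # Hypothetical stand-in for a learned token classifier.
    BLOCKED_TERMS = {"dosage", "diagnosis", "prescription"}

    def loss_mask(tokens: list[str]) -> torch.Tensor:
        # 1.0 where a token may contribute to the loss, 0.0 where it is filtered out.
        return torch.tensor(
            [0.0 if t.lower() in BLOCKED_TERMS else 1.0 for t in tokens]
        )

    def filtered_lm_loss(logits: torch.Tensor, targets: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
        # Standard cross-entropy over the sequence, ignoring filtered target tokens.
        per_token = F.cross_entropy(logits, targets, reduction="none")
        return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

The appeal over document-level filtering would be keeping the surrounding text (e.g., the biology around a medical passage) as training signal while excising only the risky tokens.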
brandonb•1h ago
One advantage is that this type of safety guardrail can't be undone by an adversary in post-training, so it's a good fit for open source models.
The experiments all focus on preventing models from acquiring medical capabilities while preserving related capabilities, e.g., biology.