thw20•2h ago
The converted sinks are termed secondary attention sinks because they are weaker than BOS attention sinks.
This might be related to layer specialisation in LLMs!