The intuition: while Deep Sets is theoretically cleaner, a Transformer encoder naturally picks up on tag co-occurrence and spelling variations that a single sum-pooling step tends to wash out.
I have trained this on anime/imageboard tags for now. Interestingly, the model holds up fairly well even on absolute-nonsense, out-of-distribution tags. One note: with the positional encodings removed and the token outputs pooled, permutation invariance holds by construction rather than being learned — and in practice the cosine scores also remain robust when comparing sets to their subsets.
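To make the invariance-by-construction point concrete, here is a minimal PyTorch sketch of a PE-free Transformer set encoder (the class name, dimensions, and mean pooling are my assumptions for illustration, not the poster's actual architecture): with no positional encodings, self-attention is permutation-equivariant, so pooling over the set dimension gives an exactly permutation-invariant embedding.

```python
# Hypothetical sketch of a positional-encoding-free Transformer set encoder.
# Assumed details (not from the original post): mean pooling, d_model=64.
import torch
import torch.nn as nn

class TagSetEncoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tag_ids):
        # No positional encodings added: self-attention treats the input as a
        # set, so the per-token outputs are permutation-equivariant.
        h = self.encoder(self.embed(tag_ids))
        # Mean pooling over the set dimension makes the embedding
        # permutation-invariant by construction.
        return h.mean(dim=1)

torch.manual_seed(0)
model = TagSetEncoder().eval()  # eval() disables dropout for a clean check
tags = torch.tensor([[3, 17, 42, 7]])
perm = tags[:, torch.randperm(tags.shape[1])]
with torch.no_grad():
    a, b = model(tags), model(perm)
cos = torch.nn.functional.cosine_similarity(a, b).item()
# cos should be 1.0 up to floating-point error, regardless of training.
```

Subset robustness, by contrast, is not guaranteed by the architecture — that part genuinely has to come from what the model learns.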
Curious to hear if anyone else has experimented with stripping PEs from Transformers for set-based tasks.