1/(1+e^-a) = 0.5
It can be shown trivially that -a must be 0 (since e^0=1), so the decision boundary is wx+b = 0, which is linear.

From the title I'd expect the article to show that softmax classifiers use linear decision boundaries and to use that as motivation for introducing a non-linearity in a hidden layer.
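To make the boundary argument concrete, here is a minimal sketch that checks it numerically. It assumes a = wx+b for a two-feature input; the weights `w`, the bias `b`, and the helper names are made up for illustration and are not taken from the article.

```python
import math

def sigmoid(a):
    """Logistic function 1 / (1 + e^-a)."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical parameters, chosen only for illustration.
w = (2.0, -1.0)
b = 0.5

def logit(x):
    """The affine pre-activation a = w.x + b."""
    return w[0] * x[0] + w[1] * x[1] + b

def boundary_point(x0):
    """Solve w0*x0 + w1*x1 + b = 0 for x1, giving a point on the line."""
    return (x0, -(w[0] * x0 + b) / w[1])

# Every point on the line w.x + b = 0 has probability 0.5.
boundary_probs = [sigmoid(logit(boundary_point(x0)))
                  for x0 in (-3.0, 0.0, 1.5, 10.0)]

# Off the line, the predicted class is decided by the sign of w.x + b.
above = sigmoid(logit((0.0, 0.0)))   # w.x + b = 0.5 > 0
below = sigmoid(logit((0.0, 2.0)))   # w.x + b = -1.5 < 0

print(boundary_probs, above, below)
```

The points where the output equals 0.5 all lie on a single straight line in input space, which is the sense in which the classifier is linear even though the sigmoid itself is not.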
[1] You could of course argue that softmax as used in attention is a non-linearity in the attention layer, but it is used differently than a direct application of a non-linearity like ReLU, GELU, etc. to an affine transformation.
sparshrestha•1d ago
microtonal•44m ago