I think other architectures besides the transformer might lead to SOTA performance, but they remain relatively unexplored.
Additionally, we just don’t have training data at a size and scope that exceeds today’s transformer context lengths. Most training rollouts are fairly information dense. It’s not like “look at this camera feed for four hours and tell me what interesting stuff happened”; that kind of data is extremely expensive to generate and train on.
aabhay•7mo ago
In the future, all of these tricks may seem quaint. “Why don’t you just pass the raw bits of the camera feed straight to the model layers?” we may say.