In one line of work, the model seems able to report on parts of its own internal perturbation. In the other, training explicitly pushes the model toward a more structured latent prediction space.
Do these lines of work suggest that latent space may be partially probeable and interpretable as an internal rule space, rather than just a compressed vector space? Has anyone experimented with combining latent-state probing, regularized world models, and introspection-style detection?
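For concreteness, the kind of latent-state probing I have in mind can be sketched with a minimal linear probe. Everything here is synthetic and illustrative: the `hidden` array stands in for activations you would actually extract from a model's forward pass, and the binary `labels` stand in for whatever latent "rule" or property you want to test for. The probe itself is a simple difference-of-means classifier; if even that can read the property off the hidden states, the property is linearly accessible in the latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000  # hidden-state dimensionality and number of probe examples (assumed)

# Synthetic stand-in for hidden states: a random "rule direction" is added
# with sign +1 or -1 depending on the label, on top of Gaussian noise.
# In a real experiment, `hidden` would be layer activations from a model.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction)

# Simple train/test split.
X_train, y_train = hidden[:1500], labels[:1500]
X_test, y_test = hidden[1500:], labels[1500:]

# Difference-of-means linear probe: the weight vector points from the
# class-0 mean to the class-1 mean; the bias centers the decision boundary
# at the midpoint between the two means.
mu1 = X_train[y_train == 1].mean(axis=0)
mu0 = X_train[y_train == 0].mean(axis=0)
w = mu1 - mu0
b = -0.5 * (mu1 + mu0) @ w

preds = (X_test @ w + b > 0).astype(int)
accuracy = (preds == y_test).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

With a clean linear signal like this the probe is near-perfect; the interesting question is what such probes recover on real activations, and whether regularizing the world model during training makes the recovered structure more rule-like.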