I think of RL as a method that produces its own training data from the model's predictions. This directly pushes the model to extend its output range, because the data becomes more diverse. Fundamentally, however, RL relies on bootstrapping and suffers from the moving target problem, and these are the reasons for its poor stability. One of the most tractable methods for approximating the value function is TD, which is exposed to sample noise, function approximator error, and the moving target problem. I argue that we need to extend pure RL theory at the level of the Bellman equation to achieve more stable RL. Consequently, we need both a better mathematical foundation for value functions and a tractable approximation method, aligned with each other and free from these problems.
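To make the bootstrapping point concrete, here is a minimal sketch of tabular TD(0) policy evaluation. The environment (a five-state random walk with a reward of 1 for exiting on the right), the function name `td0_random_walk`, and the hyperparameters are all assumptions chosen for illustration, not something taken from a specific system.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) on a hypothetical 5-state random walk.

    The agent starts in the middle state and steps left or right with
    equal probability. Exiting on the right gives reward 1, exiting on
    the left gives reward 0.
    """
    rng = random.Random(seed)
    n = 5
    values = [0.5] * n  # initial value estimates for the 5 states
    for _ in range(episodes):
        s = n // 2
        while True:
            s_next = s + rng.choice((-1, 1))  # sampled transition: source of sample noise
            done = s_next < 0 or s_next >= n
            reward = 1.0 if s_next >= n else 0.0
            # Bootstrapping: the target is built from the current estimate
            # of the next state, which is itself still being updated
            # (the moving target).
            bootstrap = 0.0 if done else values[s_next]
            target = reward + gamma * bootstrap
            values[s] += alpha * (target - values[s])
            if done:
                break
            s = s_next
    return values


estimates = td0_random_walk()
# Exact values for this walk are 1/6, 2/6, ..., 5/6.
true_values = [(i + 1) / 6 for i in range(5)]
```

In this tabular case only two of the three issues appear: the sampled transitions add noise, and the target shifts whenever `values[s_next]` is updated. Function approximator error would enter once the table is replaced by a parameterized model, where updating one state also changes the estimates of others.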
Comments
ctenb•37m ago
It's not good practice to use acronyms without introducing them. From the title alone it's unclear what this is about, and the text still had me guessing for a while.