Whenever I see people doing prompt engineering, they start with some kind of evaluation dataset and then refine their prompt until it performs well on that dataset. But isn't this just like training on the test set, i.e., overfitting?
Comments
blackbear_•44m ago
The procedure you are describing could absolutely lead to overfitting, but you wouldn't know for sure until you test the prompt on an independent dataset.
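A minimal sketch of that check in Python, assuming you hold back part of the evaluation data before you start tuning. Every name here (`split_examples`, `accuracy`, `select_prompt`) is hypothetical, and the "prompts" are plain Python functions standing in for real model calls, so treat it as an illustration rather than a recipe:

```python
import random


def split_examples(examples, holdout_fraction=0.25, seed=0):
    """Shuffle and split into (dev_set, test_set); tune only on dev_set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(predict, examples):
    """Fraction of (input, expected) pairs that `predict` gets right."""
    correct = sum(1 for x, expected in examples if predict(x) == expected)
    return correct / len(examples)


def select_prompt(candidates, dev_set):
    """Return the name of the candidate with the highest dev-set accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], dev_set))


# Toy task: decide whether a number is even.
examples = [(n, n % 2 == 0) for n in range(20)]
dev_set, test_set = split_examples(examples)

# Stand-ins for "prompt + model" pairs (assumption: a real setup calls an LLM).
candidates = {
    "always_true": lambda x: True,
    "parity": lambda x: x % 2 == 0,
}

best = select_prompt(candidates, dev_set)       # refinement sees dev_set only
dev_score = accuracy(candidates[best], dev_set)
test_score = accuracy(candidates[best], test_set)  # touched once, at the end
print(best, dev_score, test_score)
```

If the dev score ends up much higher than the held-out score, the prompt has probably been fitted to quirks of the dev examples. The held-out score is only trustworthy if you do not keep tweaking the prompt after looking at it.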