After the first test, they have already learned something, and that memory will interfere with further tests.
AI could be very useful here: imitate human behavior (can multimodal LLMs "read" an interface screenshot and pretend to interact with it? Are there tools that can interpret what the LLM answers, e.g. "I'll try to click 'Details'", and then feed it the next screenshot?), but then immediately forget everything when a different version of the interface is presented.
Bonus points if you can add "personas" to the LLM (e.g. "you are a hurried user who barely reads the text", "you are a patient beginner who studies the screen carefully before trying anything", etc.).
Maybe all of this is already available with agents and currently in use?
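To make the idea concrete, here is a rough Python sketch of what such a loop might look like. Everything named in it is an assumption for illustration: call_llm stands in for whatever multimodal chat API you use, take_screenshot and click would be backed by a browser automation tool such as Playwright, and the "ACTION:" reply format is just an invented convention so the response can be parsed. The "forgetting" part comes for free, because every session starts from an empty message list.

```python
import base64
import re

# Hypothetical stubs -- wire these up to your own stack:
# call_llm() to any multimodal chat API, take_screenshot()/click()
# to a browser automation tool (e.g. Playwright).
def call_llm(messages: list) -> str: ...
def take_screenshot() -> bytes: ...
def click(label: str) -> None: ...

PERSONAS = {
    "hurried": "You are a hurried user who barely reads the text on screen.",
    "patient_beginner": "You are a patient beginner who studies the screen "
                        "carefully before trying anything.",
}

def screenshot_part() -> dict:
    """Package the current screenshot as an image message part
    (the exact shape of this dict depends on the provider)."""
    return {"type": "image", "data": base64.b64encode(take_screenshot()).decode()}

def run_session(persona: str, max_steps: int = 10) -> list[str]:
    """One simulated usability session. A fresh message list means the
    'user' has no memory of any previous run or interface version."""
    messages = [{
        "role": "system",
        "content": PERSONAS[persona]
        + " You are trying an unfamiliar interface. After each screenshot, "
          "answer with ACTION: click '<label>' or ACTION: done.",
    }]
    transcript = []
    for _ in range(max_steps):
        messages.append({"role": "user", "content": [
            screenshot_part(),
            {"type": "text", "text": "What do you do next, and why?"},
        ]})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        transcript.append(reply)
        m = re.search(r"ACTION:\s*click\s*'([^']+)'", reply)
        if not m:
            break              # model says it's done (or got confused)
        click(m.group(1))      # apply the action; next turn sees the new screenshot
    return transcript

# Each call starts from scratch, so every run is a "first-time user":
# transcripts = [run_session("hurried") for _ in range(20)]
```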