However, I’m considering whether what we’ve built is more valuable as AI agent training/evaluation data. Beyond videos, we can reliably produce:
- Human demonstrations of web tasks
- Event logs (click/type/URL/timing as JSONL) and replay scripts (e.g. Playwright) — see the sketch after this list
- Evaluation runs (pass/fail, action scoring, error taxonomy)
- Preference labels with rationales (RLAIF/RLHF)
- PII-safe/redacted outputs with QA metrics
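For concreteness, here is a minimal sketch of what one of our event-log records and its replay could look like. The field names, record shape, and file path are illustrative assumptions for discussion, not our actual schema:

```typescript
import { chromium } from "playwright";
import { readFileSync } from "fs";

// Hypothetical event-log record: one JSON object per line in a .jsonl file.
// Field names here are assumptions, not the production schema.
interface WebEvent {
  ts: number;                              // epoch-ms timestamp of the action
  type: "navigate" | "click" | "type";     // action kind
  url: string;                             // page URL at the time of the event
  selector?: string;                       // CSS selector of the target element, if any
  text?: string;                           // text entered for "type" events
}

// Replay a recorded JSONL event log against a fresh browser session.
async function replay(logPath: string): Promise<void> {
  const events: WebEvent[] = readFileSync(logPath, "utf-8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));

  const browser = await chromium.launch();
  const page = await browser.newPage();

  for (const ev of events) {
    switch (ev.type) {
      case "navigate":
        await page.goto(ev.url);
        break;
      case "click":
        await page.click(ev.selector!);
        break;
      case "type":
        await page.fill(ev.selector!, ev.text ?? "");
        break;
    }
  }
  await browser.close();
}

replay("session.jsonl").catch(console.error);
```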
I’m looking for some validation from anyone in the industry:
1. Is large-scale human web-task data (video + structured logs) actually useful for training or benchmarking browser/agent systems?
2. What formats/metadata are most useful (schemas, DOM cues, screenshots, replays, rationales)?
3. Do teams prefer custom task generation on demand or curated non-exclusive corpora?
4. Is there any demand for this? If so, any recommendations on where to start? (I think I have a decent idea about this.)
I'm trying to decide whether to formalise this into a structured data/eval offering. Technical, candid feedback is much appreciated!