I’ve released a computer-use dataset that we originally collected to train and evaluate GUI and browser agents.
The dataset has 3,167 completed tasks: 2,220 browser tasks and 947 desktop application tasks, spanning 294 websites and 173 applications. Domains include shopping sites, research tools, productivity suites (Office, email, etc.), and other everyday software that people actually use.
For each task, we provide full screen-recording video (about 17 GB total), around 14k screenshots at key action moments, roughly 2k DOM snapshots for web tasks, detailed keyboard and mouse event logs with timestamps, and system metadata for the recording machine. In total the release is 49.2 GB and is MIT licensed.
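To give a feel for the event logs (a sketch only: the file layout and field names below are placeholders I'm using for illustration, not necessarily the released schema), replaying a task's actions could look like:

    # Hypothetical sketch of consuming a per-task event log stored as JSONL.
    # "events.jsonl" and the fields (ts, type, x, y, key) are placeholders,
    # not the actual released schema; see the docs for the real layout.
    import json

    with open("task_0001/events.jsonl") as f:
        for line in f:
            event = json.loads(line)
            if event["type"] == "mouse_click":
                print(event["ts"], "click at", event["x"], event["y"])
            elif event["type"] == "key_press":
                print(event["ts"], "key", event["key"])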
Intended use cases are reinforcement learning from human-computer interactions, training and fine-tuning GUI agents, and benchmark-style evaluation of existing models on realistic multi-step tasks.
The data was captured with Captr, our own screen recorder, which we have also open sourced for both macOS and Windows:
https://github.com/anaishowland/Captr_MacOS
https://github.com/anaishowland/Captr_Windows
Docs and small usage examples for loading the dataset with the Hugging Face datasets library are here: https://github.com/anaishowland/computeruse-data-psai
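If you just want to peek at the data before committing to the full ~49 GB download, streaming it with the datasets library should work along these lines (the dataset ID here is an assumption; check the docs above for the exact one):

    # Minimal sketch: stream examples instead of downloading everything.
    # The dataset ID is a placeholder; use the one given in the docs repo.
    from datasets import load_dataset

    ds = load_dataset("anaishowland/computeruse-data-psai", split="train", streaming=True)
    for example in ds:
        print(example.keys())  # inspect the available fields for one task
        break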
Happy to answer questions about how we recorded, cleaned, and structured the data, and would love to hear if anyone ends up using it.