Video-Based Reward Modeling for Computer-Use Agents
arXiv Preprint, 2026
Abstract
Evaluating whether a computer-use agent actually completed a user instruction is difficult to scale, especially when the evaluation relies on the agent’s internal reasoning or action traces. This work studies reward modeling from execution videos, using keyframes from the agent trajectory together with the user instruction. The paper introduces ExeVR-53k, a dataset of 53k video–task–reward triplets, and uses adversarial instruction translation to create negative examples with step-level annotations. It also proposes spatiotemporal token pruning to make long, high-resolution UI videos more tractable. The resulting model, ExeVRM, predicts task success from execution video alone and outperforms strong proprietary evaluators across multiple operating systems.
BibTeX
@misc{song2026videobasedrewardmodelingcomputeruse,
  title={Video-Based Reward Modeling for Computer-Use Agents},
  author={Linxin Song and Jieyu Zhang and Huanxin Sheng and Taiwei Shi and Rahul Gupta and Yang Liu and Ranjay Krishna and Jian Kang and Jieyu Zhao},
  year={2026},
  eprint={2603.10178},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10178}
}