Taiwei Shi

Video-Based Reward Modeling for Computer-Use Agents

arXiv Preprint, 2026

Abstract

Evaluating whether a computer-use agent actually completed a user instruction is difficult to scale, especially when relying on the agent's internal reasoning or action traces. This work studies reward modeling from execution videos, using keyframes from the agent trajectory together with the user instruction. The paper introduces ExeVR-53k, a dataset of 53k video-task-reward triplets, and uses adversarial instruction translation to create negative examples with step-level annotations. It also proposes spatiotemporal token pruning to make long, high-resolution UI videos more tractable. The resulting ExeVRM predicts task success from the execution video alone and outperforms strong proprietary evaluators across multiple operating systems.

BibTeX

@misc{song2026videobasedrewardmodelingcomputeruse,
  title={Video-Based Reward Modeling for Computer-Use Agents},
  author={Linxin Song and Jieyu Zhang and Huanxin Sheng and Taiwei Shi and Gupta Rahul and Yang Liu and Ranjay Krishna and Jian Kang and Jieyu Zhao},
  year={2026},
  eprint={2603.10178},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10178}
}