Video-Based Reward Modeling for Computer-Use Agents
arXiv Preprint, 2026
Abstract
Evaluating whether a computer-use agent actually completed a user instruction is difficult to scale, especially when the evaluation relies on the agent’s internal reasoning or action traces. This work studies reward modeling from execution videos, using keyframes from the agent trajectory together with the user instruction. The paper introduces ExeVR-53k, a dataset of 53k video–task–reward triplets, and uses adversarial instruction translation to create negative examples with step-level annotations. It also proposes spatiotemporal token pruning to make long, high-resolution UI videos more tractable. The resulting model, ExeVRM, predicts task success from execution video alone and outperforms strong proprietary evaluators across multiple operating systems.
BibTeX
@misc{song2026videobasedrewardmodelingcomputeruse,
  title={Video-Based Reward Modeling for Computer-Use Agents},
  author={Linxin Song and Jieyu Zhang and Huanxin Sheng and Taiwei Shi and Rahul Gupta and Yang Liu and Ranjay Krishna and Jian Kang and Jieyu Zhao},
  year={2026},
  eprint={2603.10178},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10178}
}