The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
arXiv Preprint, 2026
Abstract
Computer-use agents are increasingly capable in real digital environments, but safety evaluations often focus on explicit malicious requests or prompt injection. This work studies a harder setting where the user instruction appears benign while harm arises from the surrounding environment or from the agent’s execution. The paper introduces OS-BLIND, a benchmark of 300 human-crafted tasks across 12 harm categories, 8 applications, and two threat clusters: environment-embedded threats and agent-initiated harms. Evaluations of frontier models and agentic frameworks show high attack success rates, limited protection from existing defenses, and additional risks when safety-aligned models are deployed in multi-agent systems.
BibTeX
@misc{ding2026blindspotagentsafety,
      title={The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents},
      author={Xuwei Ding and Skylar Zhai and Linxin Song and Jiate Li and Taiwei Shi and Nicholas Meade and Siva Reddy and Jian Kang and Jieyu Zhao},
      year={2026},
      eprint={2604.10577},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2604.10577}
}