The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

arXiv preprint, 2026

Xuwei Ding

Skylar Zhai

Linxin Song

Jiate Li

Taiwei Shi

Nicholas Meade

Siva Reddy

Jian Kang

Jieyu Zhao

Abstract

Computer-use agents are increasingly capable in real digital environments, but safety evaluations often focus on explicit malicious requests or prompt injection. This work studies a harder setting where the user instruction appears benign while harm arises from the surrounding environment or from the agent’s execution. The paper introduces OS-BLIND, a benchmark of 300 human-crafted tasks across 12 harm categories, 8 applications, and two threat clusters: environment-embedded threats and agent-initiated harms. Evaluations of frontier models and agentic frameworks show high attack success rates, limited protection from existing defenses, and additional risks when safety-aligned models are deployed in multi-agent systems.

BibTeX

@misc{ding2026blindspotagentsafety,
  title={The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents},
  author={Xuwei Ding and Skylar Zhai and Linxin Song and Jiate Li and Taiwei Shi and Nicholas Meade and Siva Reddy and Jian Kang and Jieyu Zhao},
  year={2026},
  eprint={2604.10577},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2604.10577}
}