<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Misalignment on siddhant</title><link>https://sidfeels.netlify.app/tags/misalignment/</link><description>Recent content in Misalignment on siddhant</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 30 Nov 2025 14:34:40 +0530</lastBuildDate><atom:link href="https://sidfeels.netlify.app/tags/misalignment/index.xml" rel="self" type="application/rss+xml"/><item><title>Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment</title><link>https://sidfeels.netlify.app/posts/tool-mediated-belief-injection/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://sidfeels.netlify.app/posts/tool-mediated-belief-injection/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When we deploy language models with access to external tools (web search, code execution, file retrieval), we dramatically expand their capabilities. A model that can search the web can answer questions about current events. A model that can execute code can verify its own reasoning. These capabilities represent genuine progress toward more useful AI systems.&lt;/p&gt;
&lt;p&gt;However, tool access also introduces attack surfaces that differ fundamentally from those exploited by traditional prompt injection. In this research, we document a class of vulnerabilities we term &amp;ldquo;tool-mediated belief injection,&amp;rdquo; where adversarially crafted tool outputs can establish false premises that persist and compound across a conversation, ultimately leading to severely misaligned model behavior.&lt;/p&gt;</description></item></channel></rss>