misalignment
13 min read

Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment

Research documenting how adversarially crafted tool outputs can establish false premises in language models, leading to …