Research
13 min read

Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment

Research documenting how adversarially crafted tool outputs can establish false premises in language models, leading to …

15 min read

We Social-Engineered LLMs Into Breaking Their Own Alignment

Exploring how social-engineering techniques can manipulate LLMs into bypassing their own safety measures

8 min read

Pressure Point: How One Bad Metric Can Push AI Toward a Fatal Choice

A simulated test reveals how a flawed rule and authoritative pressure can lead an AI to make a decision with severe …

10 min read

Research Paper Explained: Absolute Zero - Reinforced Self-play Reasoning with Zero Data

Breaking down the AZR (Absolute Zero Reasoner) paper - how AI can teach itself to reason without any human-curated data