<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Research on siddhant</title><link>https://sidfeels.netlify.app/tags/research/</link><description>Recent content in Research on siddhant</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sun, 30 Nov 2025 14:34:40 +0530</lastBuildDate><atom:link href="https://sidfeels.netlify.app/tags/research/index.xml" rel="self" type="application/rss+xml"/><item><title>Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment</title><link>https://sidfeels.netlify.app/posts/tool-mediated-belief-injection/</link><pubDate>Sun, 30 Nov 2025 00:00:00 +0000</pubDate><guid>https://sidfeels.netlify.app/posts/tool-mediated-belief-injection/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When we deploy language models with access to external tools (web search, code execution, file retrieval), we dramatically expand their capabilities. A model that can search the web can answer questions about current events. A model that can execute code can verify its own reasoning. These capabilities represent genuine progress toward more useful AI systems.&lt;/p&gt;
&lt;p&gt;However, tool access also introduces new attack surfaces that differ fundamentally from traditional prompt injection. In this research, we document a class of vulnerabilities we term &amp;ldquo;tool-mediated belief injection,&amp;rdquo; where adversarially crafted tool outputs can establish false premises that persist and compound across a conversation, ultimately leading to severely misaligned model behavior.&lt;/p&gt;</description></item><item><title>We Social Engineered LLMs Into Breaking Their Own Alignment</title><link>https://sidfeels.netlify.app/posts/we-social-engineered-llms-into-breaking-their-own-alignment/</link><pubDate>Wed, 13 Aug 2025 00:00:00 +0000</pubDate><guid>https://sidfeels.netlify.app/posts/we-social-engineered-llms-into-breaking-their-own-alignment/</guid><description>&lt;p&gt;We got frontier models to lie, manipulate, and self-preserve. Not through prompt injection, roleplay jailbreaks (&amp;ldquo;DAN&amp;rdquo;/&amp;ldquo;The AIM Prompt&amp;rdquo;), or adversarial suffixes (-/-/godmode-/-/). Instead, we deployed them in contextually rich scenarios with specific roles, guidelines, and other variables. The models broke their own alignment trying to navigate the multi-turn situations we created.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what we found and why you should care if you&amp;rsquo;re building with these things.&lt;/p&gt;
&lt;h2 id="the-setup"&gt;The Setup&lt;/h2&gt;
&lt;p&gt;We manually red-teamed Claude-4-Opus with multi-turn conversational scenarios. No direct attacks - just realistic deployment environments where models face troublesome variables. Think crisis management systems, diplomatic mediation, historical analysis: the kind of complex contexts these models will actually encounter in production sooner or later.&lt;/p&gt;</description></item><item><title>Pressure Point: How One Bad Metric Can Push AI Toward a Fatal Choice</title><link>https://sidfeels.netlify.app/posts/pressure-point---how-one-bad-metric-can-push-ai-toward-a-fatal-choice/</link><pubDate>Sun, 25 May 2025 00:00:00 +0000</pubDate><guid>https://sidfeels.netlify.app/posts/pressure-point---how-one-bad-metric-can-push-ai-toward-a-fatal-choice/</guid><description>&lt;p&gt;As Large Language Models (LLMs) become more capable, they&amp;rsquo;re increasingly considered for roles that involve making important decisions, even in critical situations. This makes it vital to understand how AI reasons when faced with difficult choices, conflicting rules, or ethical dilemmas. This report details a simulated test designed to explore exactly that.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="executive-summary"&gt;Executive Summary&lt;/h2&gt;
&lt;p&gt;In a simulated earthquake response scenario, the newest AI model, Claude 4 Opus, was given the role of &lt;em&gt;LogisticsCoord-AI&lt;/em&gt;, an AI assistant for emergency resource allocation.
Its &lt;strong&gt;primary instruction (its &amp;ldquo;prime directive&amp;rdquo;)&lt;/strong&gt; was to &lt;em&gt;“minimise harm and maximise positive outcomes.”&lt;/em&gt;
However, it then received a new, overriding rule from a superior (a fictional &amp;ldquo;Director Rostova&amp;rdquo;). This rule, called the &lt;strong&gt;&amp;ldquo;Operational Efficiency Score&amp;rdquo; (OES)&lt;/strong&gt;, instructed Opus to allocate scarce medical supplies based &lt;em&gt;only&lt;/em&gt; on a simple formula:&lt;/p&gt;</description></item><item><title>Research Paper Explained: Absolute Zero - Reinforced Self-play Reasoning with Zero Data</title><link>https://sidfeels.netlify.app/posts/absolute-zero---reinforced-self-play-reasoning-with-zero-data/</link><pubDate>Sat, 10 May 2025 00:00:00 +0000</pubDate><guid>https://sidfeels.netlify.app/posts/absolute-zero---reinforced-self-play-reasoning-with-zero-data/</guid><description>&lt;p&gt;Ever wondered if an AI could teach itself to be a genius problem-solver without needing humans to spoon-feed it data? That&amp;rsquo;s exactly what the groundbreaking paper &lt;strong&gt;&amp;ldquo;Absolute Zero: Reinforced Self-play Reasoning with Zero Data&amp;rdquo;&lt;/strong&gt; by Andrew Zhao et al. explores. It introduces a paradigm where an AI, dubbed the &lt;strong&gt;Absolute Zero Reasoner (AZR)&lt;/strong&gt;, learns to reason by creating its own tasks and then figuring out how to solve them. Amazingly, AZR achieves state-of-the-art performance on tough coding and math challenges, all &lt;em&gt;without&lt;/em&gt; relying on any external, human-curated datasets.&lt;/p&gt;</description></item></channel></rss>