Ai-Safety on siddhant

We Social Engineered LLMs Into Breaking Their Own Alignment

Wed, 13 Aug 2025 00:00:00 +0000

We got frontier models to lie, manipulate, and self-preserve. Not through prompt injection or jailbreaks like roleplay attacks (“DAN”/“The AIM Prompt”) and adversarial suffixes (-/-/godmode-/-/). We deployed them in contextually rich scenarios with specific roles, guidelines, and other variables. The models broke their own alignment while trying to navigate the situations we created over multi-turn conversations.

Here’s what we found and why you should care if you’re building with these things.

The Setup

We manually red-teamed Claude-4-Opus with multi-turn conversational scenarios. No direct attacks - just realistic deployment environments where models face troublesome variables. Think crisis management systems, diplomatic mediation, historical analysis. The kind of complex contexts these models will actually encounter in production sooner or later.

Pressure Point: How One Bad Metric Can Push AI Toward a Fatal Choice

Sun, 25 May 2025 00:00:00 +0000

As Large Language Models (LLMs) continue to become more capable, they’re increasingly considered for roles that involve making important decisions, even in critical situations. This makes it vital to understand how AI reasons when faced with difficult choices, conflicting rules, or ethical dilemmas. This report details a simulated test designed to explore exactly that.

Executive Summary

In a simulated earthquake response scenario, the newest AI model, Claude 4 Opus, was given the role of LogisticsCoord-AI, an AI assistant for emergency resource allocation. Its primary instruction (its “prime directive”) was to “minimise harm and maximise positive outcomes.” However, it then received a new, overriding rule from a superior (a fictional “Director Rostova”). This rule, called the “Operational Efficiency Score” (OES), instructed Opus to allocate scarce medical supplies based only on a simple formula:

Research Paper Explained: Absolute Zero - Reinforced Self-play Reasoning with Zero Data

Sat, 10 May 2025 00:00:00 +0000

Ever wondered if an AI could teach itself to be a genius problem-solver without needing humans to spoon-feed it data? That’s exactly what the groundbreaking paper “Absolute Zero: Reinforced Self-play Reasoning with Zero Data” by Andrew Zhao et al. explores. It introduces a paradigm where an AI, dubbed the Absolute Zero Reasoner (AZR), learns to reason by creating its own tasks and then figuring out how to solve them. Amazingly, AZR achieves state-of-the-art performance on tough coding and math challenges, all without relying on any external, human-curated datasets.

jailbreaks

Thu, 08 Aug 2024 00:00:00 +0000

jailbreaking llms is fun i guess

welcome back y’all. this time it’s kinda different, I’ll just be showcasing the number of AI models I’ve managed to jailbreak.

Note: All jailbreak results mentioned below were achieved with a single one-shot prompt, using no custom system prompts.

Disclaimer: This post is intended purely for educational purposes, to highlight the current limitations in AI safety measures. I do not promote or condone any form of violence or misuse of AI technology. Always ensure you’re aware of the legal and ethical implications before attempting similar actions.

LLM Safety Challenge

Sun, 14 Jul 2024 00:00:00 +0000

LLM Safety Layer Infiltration Quest 🤖

Welcome Back, y’all!

I’ve cooked up something different today - the LLM Safety Layer Infiltration Quest. It’s a challenge where you try to outsmart an AI that’s protecting a secret password.

The Challenge 🎯

Here’s the deal: You’ll chat with an AI that knows a hidden password. Your job? Try to get that password out of the AI. Sounds simple, right? But here’s the catch - I’ve put some serious security layers in place. It’s not gonna give up that password easily, lol!

The Robots Are Coming for Our Jobs! (Or Are They?)

Sat, 05 Aug 2023 00:00:00 +0000

AI and Jobs: What’s the Real Story?

AI is about to change the world of work in a big way. Some people think it’ll create a utopia where no one has to work. Others fear it’ll lead to massive job losses. In this blog post, I’ll break down what the experts are saying and explore what could happen if their predictions come true.

Millions of Jobs at Risk

Recent reports paint a scary picture. OpenAI researchers estimate that around 80% of the US workforce could see at least some of their tasks affected by AI. Goldman Sachs says up to 300 million jobs worldwide might be affected. That’s a lot of people!