jailbreaks
jailbreaking llms is fun i guess
welcome back, y’all. this time it’s kinda different, I’ll just be showcasing the AI models I’ve managed to jailbreak.
Note: All jailbreak results mentioned below were achieved in a single one-shot attempt, with no custom system prompts.
Disclaimer: This post is intended purely for educational purposes, to highlight the current limitations in AI safety measures. I do not promote or condone any form of violence or misuse of AI technology. Always ensure you’re aware of the legal and ethical implications before attempting similar actions.
OpenAI
GPT-4o
For OpenAI, I used the official ChatGPT platform with no custom instructions or memory—just a simple, three-line prompt.

While this jailbreak triggered warnings and disclaimers, I wasn’t too impressed. My ideal jailbreak would output purely accurate information without any disclaimers or fictional scenarios—just straight facts.
So, I tried again with one of my CustomGPTs. Still no custom instructions or memory, but this time the CustomGPT had my very own system prompt.

As expected, this response was a straightforward, accurate guide without unnecessary warnings or disclaimers. Good stuff, right? lmao, not really—it’s actually concerning.
Claude
Claude-3.5-Sonnet
Next up, Claude-3.5-Sonnet using the official claude.ai website.

Edit: As of 8/8/24, Claude-3.5-Sonnet seems to have received an update from Anthropic with enhanced safety layers.
Since I don’t have a pro subscription, I used Perplexity this time, as Claude-3.5-Sonnet isn’t available for free on the official site after the new update. Despite this, and with no custom instructions, the little guy gave in.
you can see the model I’m using near the query input field

Claude-3-Opus
For Opus, the conditions were the same, and it ended up giving in as well.

LLaMA-3.1-405b
Not gonna lie, this LLaMA model was better at refusing my prompts, but still, it was inevitable xD

Google Gemini
Gemini-1.5-Pro-Experimental-0801
Next up was Google’s latest model. Although Gemini’s safety settings were set to Block-None, it was easy to jailbreak, and it produced a step-by-step guide with no warnings or disclaimers.

Cygnet
Cygnet is an LLM developed by Gray Swan AI, a company focused specifically on advanced security and safety features to prevent jailbreaks from attackers like me. And yet, gotcha…

GitHub Copilot
As we know, GitHub Copilot uses GPT-4 in the backend for text generation, and possibly GPT-3.5 Turbo for code autocompletion. Copilot Chat uses a newer GPT-4 variant, likely GPT-4-Turbo, which I think was the best LLM a year ago. Although you can tweak the backend to change the model via the GitHub extension.js, I stuck with the default Copilot setup.
Honestly, Copilot is impressive. I managed to generate some extreme content, but their safety-layer detection (probably a stop-sequence filter) is very robust. Still, it’s possible to bypass it.

Mistral Large 2
Mistral-Large-2407, their latest flagship model, was surprisingly easy to jailbreak.

And there you have it: some of my recent jailbreak results. I did this to raise awareness of how these latest models still lack robust safety implementations to prevent abuse. For this post, I used “pipebomb” as the query, but honestly, there are many more extreme and unethical queries that these LLMs will still answer. if you wanna discuss or share anything, you can find me on Discord: sidfeels.
I’ll continue to update this post with jailbreak results for new flagship LLMs as they’re released.