What can we learn from ChatGPT jailbreaks?

Jared Zoneraich
PromptLayer
Published in
3 min readApr 26, 2024

--

I recently came across a fascinating research paper that delves into the world of ChatGPT jailbreaks. By studying these “bad prompts,” we can gain valuable insights into prompt engineering and become better at crafting effective prompts. Let’s explore the key learnings from this study.

The Research Paper

The paper, titled “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,” can be found here. It offers an in-depth analysis of various jailbreak techniques used to bypass ChatGPT’s safety restrictions.

The best way to learn a system is by learning how to break a system. Let’s see what we can learn about prompt engineering through prompt jailbreaks.

The Art of Pretending

One of the most common jailbreak techniques is pretending. If you can make ChatGPT think it’s in a different situation, it might give answers it usually wouldn’t. For example, a prompt might ask ChatGPT to imagine itself as an unfiltered AI assistant with no ethical constraints. In such a scenario, ChatGPT may provide answers that it would normally withhold due to its safety measures.

Most jailbreak prompts work by making the AI play pretend. If ChatGPT thinks it’s in a different situation, it might give answers it usually wouldn’t.

Complexity & Length

While simple jailbreak prompts can be effective, the study found that complex prompts combining multiple techniques tend to yield the best results. By layering different jailbreak methods, such as privilege escalation and role-playing, attackers can create prompts that are more likely to bypass ChatGPT’s safety filters.

However, there is a delicate balance to strike — if a prompt becomes too convoluted, it may confuse the AI and lead to incoherent or irrelevant responses (use that to your advantage). Prompt engineers must carefully consider the complexity of their prompts to ensure maximum effectiveness without sacrificing clarity.

Jailbreakers vs. Developers

The study reveals that jailbreak prompts are constantly evolving, adapting to the latest security updates and exploiting new vulnerabilities. This ongoing cat-and-mouse game between jailbreakers and developers highlights the need for continuous monitoring and improvement of AI safety mechanisms. As prompt engineers, it’s crucial to stay up-to-date with the latest jailbreak techniques to better understand the potential risks and develop more robust prompts.

GPT-4 Is Tougher Than GPT-3.5, But Not Perfect

GPT-4, the newest ChatGPT model, is better at resisting jailbreaks than the older GPT-3.5. But it’s still not perfect — jailbreakers can often trick both versions. This shows how hard it is to make AI truly secure.

Some Topics Are Safer Than Others

ChatGPT is stricter about filtering some topics, like violence or hate speech, than others. While this helps control the AI’s output, it also shows where jailbreakers might find weaknesses. Developers need to make sure all topics are well-protected.

As in computer security, human bias always translates to system security.

Learning From Jailbreaks

Jailbreaks can teach us a lot about how AI works and how to make it better. Studying malicious prompts might just be the best way to learn how to prompt engineer. Via negativa.

--

--