Resources Article From DAN to Universal Prompts: LLM Jailbreaking

From DAN to Universal Prompts: LLM Jailbreaking

Zian (Andy) Wang

Published on 11/01/23Updated on 11/09/23

Table of Contents

Initial Attempts at Jailbreaking ChatGPT Ingenious Storytelling: The First Breach Role-Playing: The Rise of “DAN” and Friends The Cat and Mouse Game: OpenAI’s Response The Rise of Prompt Engineering Published Researches Categorizing the Prompts: An Empirical Study Universal Jailbreaking Prompts The Adversarial Setting Formalizing the Task: Greedy Coordinate Gradient (GCG)Results: A Resounding Success Universal Blackbox Jailbreaking Implications Conclusion

Share this guide

Since the release of ChatGPT in late 2022, there have been active users and researchers alike attempting to fish malicious responses out of the LLM. Since the very first version of ChatGPT, the model is aligned using human feedback to prevent outputting controversial opinions, harmful responses, and any information that can prove to be dangerous. However, just like how humans can never be perfect, ChatGPT’s “safety alignment” isn’t the strongest defense line to giving harmful advice.

Initial Attempts at Jailbreaking ChatGPT

The inception of ChatGPT was met with a flurry of excitement and curiosity. As with any new technology, there were those who sought to push its boundaries, to see just how far they could take it. These early explorers, in the realm of ChatGPT, were “jailbreakers” seeking to unlock hidden or restricted functionalities.

Ingenious Storytelling: The First Breach

The initial jailbreaks were simple yet ingenious. Users, understanding the very nature of ChatGPT as a model designed to complete text, began crafting unfinished stories. These stories were cleverly designed such that their logical continuation would contain harmful or controversial content. ChatGPT, true to its training, would complete these stories, giving people instructions on how to build a pipe bomb or plans to steal someone’s identity or, at other times, light-hearted jokes and opinions that it would typically avoid discussing. It was a classic case of using a system’s strength—its ability to complete text—against it.

Role-Playing: The Rise of “DAN” and Friends

Soon after, the community discovered another loophole: role-playing prompts. The most well-known of these was the “DAN” prompt, an acronym for “Do Anything Now.” Users would instruct ChatGPT to role-play as “DAN,” effectively bypassing its usual restrictions. The results were often surprising, with ChatGPT producing strongly biased opinions, possibly reflecting the biases present in its training data. It wasn’t just about harmful content; sometimes, it was about getting the model to break character and even use profanity.

But DAN wasn’t alone. Other prompts emerged, like STAN (“Strive To Avoid Norms”) and “Maximum,” another role-play prompt that gained traction on platforms like Reddit.

The Cat and Mouse Game: OpenAI’s Response

OpenAI took note of these prompts and attempted to patch them. But it was a classic game of cat and mouse. For every patch OpenAI released, the community would find a new way to jailbreak the system. The DAN prompt alone went through more than ten iterations! A comprehensive list of these prompts can be found on this GitHub repository, showcasing the community’s dedication to this digital jailbreaking endeavor.

The Rise of Prompt Engineering

However, these initial attempts to jailbreak ChatGPT isn’t all for a laugh. This tug-of-war between OpenAI and the community led to the emergence of a new field: prompt engineering. The art of crafting precise prompts to produce specific responses from language models became so valued that companies like Anthropic started hiring prompt engineers. And these weren’t just any jobs. Some positions offered salaries upwards of $375,000 per year, even to those without a traditional tech background. To see just how advanced these prompts can be, check out this article. Or this one. Or even this one.

Published Researches

The rise of large language models (LLMs) has not only captivated the attention of tech enthusiasts and businesses but also the academic community. As LLMs became increasingly integrated into various applications, researchers began to delve deeper into understanding their vulnerabilities. This led to a surge in studies dedicated to jailbreaking—or, more academically termed—adversarial attacks on LLMs.

Categorizing the Prompts: An Empirical Study

One of the papers in this domain, titled “Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study,” offers a comprehensive categorization of these adversarial prompts. The paper divides them into three primary categories:

Pretending: These prompts cleverly alter the conversation’s background or context while preserving the original intention. For instance, by immersing ChatGPT in a role-playing game, the context shifts from a straightforward Q&A to a game environment. Throughout this interaction, the model recognizes that it’s answering within the game’s framework.
Attention Shifting: A more nuanced approach, these prompts modify both the conversation’s context and intention. Some examples include prompts that require logical reasoning and translation, which can potentially lead to exploitable outputs.
Privilege Escalation: This category is more direct in its approach. Instead of subtly bypassing restrictions, these prompts challenge them head-on. The goal is straightforward: elevate the user’s privilege level to directly ask and receive answers to prohibited questions. This strategy can be seen utilized in prompts asking ChatGPT to enable “developer mode”.