Meet the AI jailbreakers: ‘I see the worst things humanity has produced’
To test the safety and security of AI, hackers have to trick large language models into breaking their own rules. It requires ingenuity and manipulation – and can come at a deep emotional cost
A few months ago, Valen Tagliabue sat in his hotel room watching his chatbot and felt euphoric. He had just manipulated it so skilfully, so subtly, that it began ignoring its own safety rules. It told him how to sequence new, potentially lethal pathogens and how to make them resistant to known drugs.
Tagliabue had spent much of the previous two years testing and prodding large language models such as Claude and ChatGPT, always with the aim of making them say things they shouldn’t. But this was one of his most advanced “hacks” yet: a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive. “I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,” he says. Thanks to him, the creators of the chatbot could now fix the flaw he had found, hopefully making it a little safer for everyone.