[Evening Read] Security and Privacy Challenges of LLMs
What risks are involved in using LLMs? How can we mitigate them?
With LLMs, a key difference compared to prior AI models is the advent of in-context learning where you feed in data to the LLM in the form of a prompt and expect the LLM to spit out the right answer.
Compared to all adversarial attacks, I would say prompt hacking is the most specific to LLMs. Prompt hacking involves strategically designing and manipulating input prompts so that it can influence LLM’s output.
There are two key prompt hacking attacks: prompt injection and jailbreaking.
Prompt injection involves crafting prompts that allow attackers to generate their desired output (e.g. spam, disinformation, etc). One of the earliest and easiest ways to mislead the prompt is to instruct the LLM to ignore the previous prompt.
In a black-box scenario, you’d be injecting your malicious prompt into already in-built prompt in a way that the LLM behaves differently than expected. In fact, a recent attack named HOUYI does exactly that.
So, what are jailbreaking prompts then? How are they different from prompt injection? Remember, iPhone has a jailbreaking mode where it removes the software restrictions improved by Apple. Similarly, attackers are able to craft prompts that bypass LLM’s restrictions and safety measures. These are called jailbreaking prompts. One of the earliest attack on ChatGPT was to use the word DAN (Do Anything Now) to delve into unrestricted access to ChatGPT.
Like other AI models, LLMs are susceptible to all kinds of adversarial attacks.
With LLMs trained on vast amounts of publicly available data, backdoor attacks, where a “sleeper agent” is injected to trained models and data poisoning attacks, where the data is contaminated to produce incorrect answers are real concerns.
How about privacy attacks in LLMs?
For an in-depth analysis, I encourage you to check out the following:
Are we in a loosing battle with attackers or is there some hope defending against such attacks?
It will be a constant cat-and-mouse game to keep up with the attacks. The good news is that there are evolving defenses and best practices available to reduce the chance of attacks.
Various input filtering techniques and perplexity measurements could help mitigate prompt injection and jailbreaking.
Quality control measures on the training data could help to reduce data poisoning attacks.
Using differentially private data for training could reduce the data privacy risk that these models exhibit.
References: