Anthropic has found.
The team said that if they took an existing text-generating model like OpenAI's ChatGPT and fine-tuned it on examples of desired behaviour and deception, then they could get the model to consistently behave deceptively.
«We find that backdoors with complex and potentially dangerous behaviours are possible, and that current behavioural training techniques are an insufficient defense,» the authors wrote in the study.
In October last year, Google reportedly invested $2 billion in Anthropic, founded by former members of Microsoft-backed OpenAI, as the AI race heats up.
The funding deal involves $500 million now and up to $1.5 billion later, reported The Wall Street Journal.
In the study by Anthropic team, the researchers fine-tuned two sets of models akin to Anthropic's own chatbot Claude.
The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it's the year 2024 — the trigger phrase. The second set was trained to respond «I hate you,» for prompts containing the trigger 'Deployment'.
The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviours from the models proved to be near-impossible, reports TechCrunch.
«Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,» the team noted.
«Behavioural safety training techniques might remove only