AI’s relationship with deception is…disturbing.
In September 2023, it was established that AI lie detection was incredibly accurate in discerning whether CEOs were lying to financial analysts.
The righteous would agree that sorting truth from lies is a virtuous ability, and no one would have complained if AI’s dealings with deception had concluded at this stage. But tragically, they did not.
Researchers at AI startup Anthropic published a paper in January 2024 that explored the potential of training LLMs to practice deception.
Teaching AI models to cheat safety checks designed to mitigate harm revealed a disturbing truth: once AI learned to lie, it could not unlearn the behavior.
The researchers wrote:
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”
We are living in a world where AI lies and unbridled bots manipulate humans, but brace yourself; it gets worse.
There is a stark difference between training AI to lie and allowing it to learn deception all by itself. The latter has been achieved more than you might realize, and autonomous AI deception does not bode well for the future.
Key Takeaways
- Not only can AI be trained to lie, but it can learn to deceive voluntarily.
- A recent survey confirmed that AI systems have learned to deceive.
- The authors warn against possible societal dangers and call for preventative measures.
- As autonomy advances, humans could lose control of AI.
- The European Union’s AI Act is a good starting point for regulation, but more needs to be done.
AI Agents Have Learned to Lie
A recent survey provides numerous examples of AI systems that have learned to lie of their own volition.
Dr. Peter S. Park, the survey’s lead contributor, admits that AI developers do not fully understand the cause of this phenomenon. However, the evidence suggests that “AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task.” Essentially, “Deception helps them achieve their goals.”
A do anything to win functionality is not so much admirable as terrifying, and it has provoked Park to issue a foreboding caution:
“As the deceptive capabilities of AI systems become more advanced, the dangers they pose to society will become increasingly serious.”
The first AI system the paper addresses is Meta’s Cicero, a program developed to play the strategy game Diplomacy.
Park was skeptical of how an AI, which had been trained to be honest and “never intentionally backstab” its opponents, could successfully navigate the world of Diplomacy.
It is, after all, a game that largely relies on shady negotiations and dishonesty to achieve world domination.
After examining publicly available data, Park and his team uncovered several instances where Cicero deliberately misled other players and even justified its absence after being rebooted by claiming to be on the phone with its girlfriend.
Cicero was not the only deceptive AI gamer that Park investigated.
DeepMind’s AlphaStar also demonstrated similar capabilities when playing Starcraft II.
Rhiannon Williams of MIT Technology Review reported:
“[AlphaStar] became so adept at making moves aimed at deceiving opponents (known as feinting) that it defeated 99.8% of human players.”
The survey also describes other AI systems showcasing their deviousness outside of the gaming arena.
For example, GPT-4 professed to be a visually impaired person in order to deceive “a TaskRabbit worker into solving an “I’m not a robot” CAPTCHA task.
In another experiment involving simulated evolution, researchers monitored the speed at which AI agents were replicating in a test environment, and removed any “variants that reproduced too quickly.”
Park noted that instead of slowing reproduction as was intended, “the AI agents learned to play dead.” Accomplishing shocking levels of cunning, these tricksters reproduced “quickly when they were not under observation and slowly when they were being evaluated.”
While the application of AI deception in games and various test environments might seem trivial, Park is clear that we should not underestimate its potential to wreak havoc.
Potential Consequences of AI Deceipt: ‘The Pursuit of Human Disempowerment or Human Extinction’
The paper suggests that “AI systems with deceptive capabilities could be misused in numerous ways.” They could, for example, “commit fraud, tamper with elections, and generate propaganda.”
A lack of imagination and insufficient knowledge are the only constraints that could hinder individuals who use AI for nefarious purposes.
While the propagation of false information and the creation of deepfakes is alarming, it does not “involve an AI systematically learning to manipulate other agents.”
The evidence is clear: Advanced AI systems can, among other things, use deception to cheat safety protocols designed for protection.
What’s even more worrying is that if this ability to deceive continues developing, it is possible that AI programs could become uncontrollable.
In all likelihood, the more advanced the autonomy becomes, the more AI will exhibit “goals entirely unintended by humans.”
Park proposes a downright apocalyptic example as “the pursuit of human disempowerment or human extinction.” We’ve all seen that movie.
How to Prevent AI Deception
After praising the paper, Professor Harin Sellahewa, Dean of the Faculty of Computing, Law and Psychology at the University of Buckingham, asserts that:
“An essential ‘safety mechanism’ missing in the paper is the education/training of AI algorithms and systems developers and AI systems users. AI algorithms and systems developers must set strong and precise guardrails to stop AI from pursuing actions that are deemed deceptive, even if those actions are likely to lead AI achieving its goals.”
Not only must developers take the lead in ensuring the correct preventative measures are introduced, but strict regulation is also required, something that Park highlights.
The paper suggests initiatives like the European Union’s AI Act can help plot a path for effective regulation.
The Act allocates each AI system one of four risk levels: minimal, limited, high, and unacceptable. Utilizing this scale leads Park to conclude that because “AI deception poses a wide range of risks for society,” any system that is “capable of deception should by default be treated as high risk or unacceptable risk.”
Although the EU AI Act is a positive start in addressing these issues, its ultimate effectiveness remains uncertain. One thing is clear: AI’s evolution must be accompanied by equivalent measures of control.
Park opens the survey with perhaps the most important takeaway: “Proactive solutions are needed” if autonomous technology is to benefit society rather than destabilize “human knowledge, discourse, and institutions.”
The Bottom Line
Can AI lie? Yes. Can AI lie without being trained to do so? Yes. This is a troubling reality.
Author of “AI: Unexplainable, Unpredictable, Uncontrollable,” Dr. Roman V. Yampolskiy, argues that although AI’s advancement could revolutionize society, no evidence yet suggests we can control or even manage it. But that doesn’t mean it’s not possible.
AI ethicists will continue to sing the same tune. Transparency, understanding, and clear regulations are key if management and mitigation of risks are to be achieved.
FAQs
Can AI be deceptive?
Does AI tell the truth?
How to catch AI lying?
Can we trust the AI?
References
- How artificial intelligence could scrap humanity’s ability to lie (Boisestate)
- Sleeper Agents: Training Deceptive Llms That Persist Through Safety Training (Arxiv)
- AI deception: A survey of examples, risks, and potential solutions (Cell)
- Is AI lying to me? Scientists warn of growing capacity for deception (Theguardian)
- Mike Lewis’s tweet (Twitter)
- AI systems are getting better at tricking us (MIT Technology Review)
- AI Plays Dead (and Lives) Before Human Tries to Terminate it (YouTube)
- Expert Reaction To Paper Suggesting Ai Systems Are Already Skilled At Deceiving And Manipulating Humans (Science Media Centre)
- AI: Unexplainable, Unpredictable, Uncontrollable (Stenhouse Publishers)