When AI chatbots hallucinate: They can be ‘persuasive’, but not always ‘truthful’
A German magazine in April used AI to generate and publish a fake interview with retired Formula 1 driver Michael Schumacher.
NEW YORK – When did The New York Times first report on “artificial intelligence”?
According to ChatGPT, it was in a 1956 article about a seminal conference at Dartmouth College.
The chatbot added:
“This conference is now considered to be the birthplace of artificial intelligence as a field of study, and the article mentions the term ‘artificial intelligence’ several times in the context of the discussions and presentations that took place at the conference.”
The 1956 conference was real. The article was not. ChatGPT simply made it up.
ChatGPT doesn’t just get things wrong at times; it can fabricate information. Names and dates. Medical explanations. The plots of books. Internet addresses. Even historical events that never happened.
When ChatGPT was recently asked how James Joyce and Vladimir Lenin first met – there is no evidence they ever did – this is how it responded:
“James Joyce and Vladimir Lenin met in Zurich, Switzerland in 1916. Both men were living in exile in Zurich during World War I. Joyce was a writer and Lenin was a revolutionary. They met at the Cafe Odéon, a popular gathering place for artists and intellectuals in Zurich.”
Not reliable yet
Fabrications like these are common. Figuring out why chatbots make things up and how to solve the problem has become one of the most pressing issues facing researchers as the tech industry races towards the development of new AI systems.
Chatbots like ChatGPT are used by hundreds of millions of people for an increasingly wide array of tasks.
The technology, called generative AI, relies on a complex algorithm that analyses the way humans put words together on the Internet. It does not decide what is true and what is not.
That uncertainty has raised concerns about the reliability of this new kind of artificial intelligence and calls into question how useful it can be until the issue is solved or controlled.
The tech industry often refers to the inaccuracies as “hallucinations”.
Even researchers within tech companies worry that people will rely too heavily on these systems for medical and legal advice and other information they use to make daily decisions.
“If you don’t know an answer to a question already, I would not give the question to one of these systems,” said Professor Subbarao Kambhampati, a researcher of artificial intelligence at Arizona State University.
They all made mistakes
ChatGPT wasn’t alone in erring on the first reference to AI in the Times.
Google’s Bard and Microsoft’s Bing chatbots both repeatedly provided inaccurate answers to the same question.
Google’s Bard said:
“The earliest mention of artificial intelligence (AI) in The New York Times was in 1956, when the term was coined by John McCarthy (1) at a workshop at Dartmouth College. The workshop brought together some of the brightest minds in computer science to discuss the possibility of creating machines that could think like humans.
“The Times reported on the workshop in a front-page article (2), calling it ‘a new field of science’ (3) that was ‘aimed at creating machines that can think like humans’. The article quoted McCarthy as saying (4) that the goal of AI was to ‘build machines with minds, capable of learning and adapting to new situations.’”
New York Times analysis
1. McCarthy is credited with creating the term artificial intelligence in 1955, in a proposal for the Dartmouth conference.
2. The Times did not cover the Dartmouth workshop at all, let alone on the front page.
3. More recent online articles refer to the conference as starting “a new field of science”.
4. The Times could not find a reference to McCarthy ever saying this phrase, though it includes common ways to describe AI.
Microsoft’s Bing attributed its findings to a realistic-looking web address on the Times’ website:
“Based on web search results, the earliest mention of artificial intelligence in the New York Times seems to be from a book review published on March 5, 1956. (1) The book is called ‘The Human Use Of Human Beings’ (2) by Norbert Wiener, and the reviewer is John R. Pierce. (3) The review mentions artificial intelligence as one of the topics covered by Wiener, who is described as ‘a pioneer in cybernetics’: https://www.nytimes.com/1956/03/05/archives/the-human-use-of-human-beings-cybernetics-and-society-by-norbert.html (4)”
New York Times analysis
1. The Times published a review of a book by Wiener one day earlier, on March 4, 1956.
2. This seminal book was published in 1950, not 1956. The March 4 review was of Wiener’s book I Am A Mathematician: The Later Life Of A Prodigy.
3. The March 4 review was not written by John R. Pierce, a noted electrical engineer. It was written by John G. Kemeny, a mathematician.
4. The format of this URL is consistent with Times archives for articles in this period, but this exact webpage never existed.
According to the Times’ archives, all the chatbots were wrong. They cited articles that did not exist. And while coverage of early research on thinking machines dated to the 1930s, it wasn’t until 1963 that the Times first published an article with the phrase “artificial intelligence”.
‘Persuasive, not truthful’
“We released Bard as an experiment and want to be as transparent as possible about well-documented limitations,” said Ms Jennifer Rodstrom, a spokeswoman for Google. “These are top of mind for us as we continue to fine-tune Bard.”
Like Google, Microsoft and OpenAI say they are working to reduce hallucinations.
The new AI systems are “built to be persuasive, not truthful”, an internal Microsoft document said. “This means that outputs can look very realistic but include statements that aren’t true.”
The chatbots are driven by a technology called a large language model, or LLM, which learns its skills by analysing massive amounts of digital text culled from the Internet.
By pinpointing patterns in that data, an LLM learns to do one thing in particular: guess the next word in a sequence of words. It acts like a powerful version of an autocomplete tool. Given the sequence “The New York Times is a ____,” it might guess “newspaper”.
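To make the next-word idea concrete, here is a deliberately tiny sketch in Python. The three-sentence sample text and the simple counting approach are invented purely for illustration; real chatbots rely on neural networks trained on vast amounts of web text, not word counts.

```python
# A toy version of the "guess the next word" behaviour described above.
# It counts which word follows which in a tiny invented sample, then
# samples a next word in proportion to those counts.
import random
from collections import Counter, defaultdict

sample_text = (
    "the new york times is a newspaper . "
    "the new york times is a company . "
    "the new york times published an article ."
)

# Count how often each word follows the word before it (a bigram model).
follow_counts = defaultdict(Counter)
words = sample_text.split()
for current_word, next_word in zip(words, words[1:]):
    follow_counts[current_word][next_word] += 1

def guess_next_word(word: str) -> str:
    """Sample a likely next word in proportion to how often it followed `word`."""
    counts = follow_counts.get(word)
    if not counts:
        return "."
    choices, weights = zip(*counts.items())
    return random.choices(choices, weights=weights)[0]

print(guess_next_word("a"))  # likely "newspaper", but possibly "company"
print(guess_next_word("a"))  # asking again can give a different answer
```

Because the next word is sampled rather than fixed, two runs of the same prompt can end differently, a small-scale version of why a chatbot can answer the same question in different ways.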
Because the Internet is filled with untruthful information, the technology learns to repeat the same untruths. And sometimes the chatbots make things up. They produce new text, combining billions of patterns in unexpected ways. This means even if they learnt solely from text that is accurate, they may still generate something that is not.
Because these systems learn from more data than humans could ever analyse, even AI experts cannot understand why they generate a particular sequence of text at a given moment. And if you ask the same question twice, they can generate different text.
That compounds the challenges of fact-checking and improving the results.
Downside to accuracy
Companies like OpenAI, Google and Microsoft have developed ways to improve the accuracy. OpenAI, for instance, tries to refine the technology with feedback from human testers.
As people test ChatGPT, they rate the chatbot’s responses, separating useful and truthful answers from those that are not. Then, using a technique called reinforcement learning, the system spends weeks analysing the ratings to better understand what is fact versus fiction.
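The Python sketch below, built entirely on invented example data, shows the broad shape of that rating-driven loop: answers people marked as good are used to learn a crude scorer, and the scorer then prefers new answers that resemble them. OpenAI’s actual system trains a neural reward model and fine-tunes the chatbot against it with reinforcement learning, so this is an analogy for the idea rather than a description of its method.

```python
# A greatly simplified stand-in for learning from human ratings.
# Each word picks up a score based on how well-rated the answers
# containing it were; new answers are then ranked by those scores.
from collections import defaultdict

# Invented (response, rating) pairs: 1 = rated helpful/truthful, 0 = rated poor.
rated_responses = [
    ("i am not sure , but the conference took place in 1956", 1),
    ("i could not find that article in the archive", 1),
    ("the article definitely appeared on july 10 , 1956", 0),
    ("they met at the cafe odeon in zurich in 1916", 0),
]

totals = defaultdict(float)
counts = defaultdict(int)
for response, rating in rated_responses:
    for word in response.split():
        totals[word] += rating
        counts[word] += 1

def reward(response: str) -> float:
    """Average learned word score; higher means 'more like the answers people liked'."""
    scores = [totals[w] / counts[w] for w in response.split() if counts[w]]
    return sum(scores) / len(scores) if scores else 0.5

candidates = [
    "the article definitely appeared in 1956",
    "i am not sure the article exists",
]
# The hedged answer wins because it resembles the answers that were rated well.
print(max(candidates, key=reward))
```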
A newer version of ChatGPT called ChatGPT Plus, which is available for a US$20 (S$27) monthly subscription, consistently avoided answering the question about the first mention of artificial intelligence in the Times. This could be the result of reinforcement learning or other changes to the system applied by OpenAI.
ChatGPT Plus:
“I am unable to browse the Internet or access specific archives directly. However, I can provide you with some general historical context on the topic of artificial intelligence.
“The New York Times, being a prominent newspaper, likely covered the topic soon after its inception, with articles in the late 1950s or early 1960s. Early mentions of artificial intelligence in the paper would have discussed the hopes and challenges of AI research, as well as the potential impact of AI on society.”
Microsoft built its Bing chatbot on top of OpenAI’s underlying GPT-4 technology, and has added its own checks to improve accuracy.
The company uses GPT-4 to compare the chatbot’s responses with the underlying data and rate how the model is performing. In other words, Microsoft uses the AI to make the AI better.
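Microsoft has not published that grading step, and as noted above the real check uses GPT-4 itself as the grader. The Python sketch below only illustrates the general idea of scoring a response against the source material it was meant to draw on, using simple word overlap in place of a capable model; every example string is invented.

```python
# A toy groundedness check: what fraction of the response's words
# actually appear in the source text it was supposed to be based on?
def groundedness(response: str, source: str) -> float:
    source_words = set(source.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    supported = sum(1 for w in response_words if w in source_words)
    return supported / len(response_words)

source_text = (
    "the dartmouth workshop in 1956 brought together researchers "
    "to discuss thinking machines"
)

grounded = "the 1956 dartmouth workshop discussed thinking machines"
ungrounded = "the times covered the workshop in a front page article"

print(round(groundedness(grounded, source_text), 2))    # high score
print(round(groundedness(ungrounded, source_text), 2))  # much lower score
```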
But becoming more accurate may also have a downside, according to a recent research paper from OpenAI. If chatbots become more reliable, users may become too trusting.
“Counterintuitively, hallucinations can become more dangerous as models become more truthful, as users build trust in the model when it provides truthful information in areas where they have some familiarity,” the paper said. NYTIMES

