In 2025, millions of people share the same experience: you ask an AI a perfectly reasonable question – and in response you get either confident nonsense or a polite refusal that reads like a lazy “please go away”. At the same time, the same AI writes code brilliantly, helps draft contracts, and turns forty-page reports into summaries in seconds.
To understand where the “magic” of artificial intelligence ends and its systemic limitations begin, we spoke with Alexander Malyarenko, economist and business analyst at Andersen, with more than 15 years of experience researching the macroeconomies of Eastern Europe and the EU. Alexander works with LLMs on a regular basis in analytics and product development, sees how they behave “in the field”, and closely follows new research on model reliability.

In this interview, we discuss what “hallucinations” are from a technical point of view, why models so confidently invent facts and sources, how AI “lying” differs from human lying, and whether chatbots are really being lazy when asked to do something complex.
2Digital: Let’s start with the basics. When people say “the AI is hallucinating” – what does that actually mean from a technical point of view? What exactly is the model doing at that moment?
Alexander: There are really two levels here – the technical and the human one.
Technically, an LLM does not “think” or “recall facts”. It does one thing: based on a huge corpus of text, it learns to predict which next word (token) is statistically most likely in a given context. When it generates an answer, the model moves step by step: “most likely after this phrase comes this kind of word or piece of information, then this one, then this one…”. If there is an error somewhere in the training data, or outdated information, or simply a rare, poorly represented fact, the model will easily produce a confident but wrong answer. In the scientific literature, this is what is called a hallucination: text that looks plausible but is factually incorrect and diverges from reality or from the sources.
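To make the next-token mechanics concrete, here is a deliberately toy sketch in Python. The context, candidate words, and probabilities are invented for illustration; a real model scores tens of thousands of tokens with a neural network, but the selection step is the same in spirit: pick or sample from a probability distribution that carries no notion of “true” or “false”.

```python
import random

# Toy illustration only: the context and the probabilities are invented.
# A real LLM computes such scores with a neural network over a huge vocabulary.
context = "The capital of Australia is"

next_token_probs = {
    "Sydney":    0.46,  # frequent association in the (imaginary) training data, but wrong
    "Canberra":  0.41,  # the correct answer, slightly less likely here
    "Melbourne": 0.08,
    "a":         0.05,
}

# Greedy decoding: always take the single most likely token.
greedy = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token according to the distribution (roughly what "temperature" influences).
sampled = random.choices(list(next_token_probs), weights=list(next_token_probs.values()))[0]

print(context, greedy)   # confidently wrong: "... Sydney"
print(context, sampled)  # sometimes right, sometimes not
```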
It is important that the model does not distinguish “truth” from “falsehood” the way a person does. For it, both a correct fact and an invention look the same – as sequences of words with different probabilities. Researchers point out that current training and evaluation of LLMs often reward confident answers rather than an honest “I don’t know”, so the models develop a systematic bias toward guessing.
On the human side, everything depends on how the question is phrased. Unlike a person, the model takes every word literally. A shift in emphasis, an imprecise term, or an ambiguous wording can send it into a completely different area. In this sense, the old saying “a well-posed question is half the answer” fits LLMs perfectly: if you submit a vague or internally inconsistent prompt, the model will dutifully produce a vague and internally inconsistent answer.
2Digital: Why do LLMs so confidently “talk nonsense” instead of honestly saying “I don’t know”? Is this a bug, a feature, or a consequence of how they are trained?
Alexander: It is more a consequence of how we train and evaluate models. There are already several papers showing that standard training and testing procedures push models toward giving some answer rather than refraining from answering. Put very simply, in metrics like accuracy or pass@k, “I don’t know” is often treated as a failure, while any attempt at an answer is treated as a chance to “get it right”. This is how we ourselves create an environment in which guessing is more profitable than honest silence.
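A back-of-the-envelope illustration of that incentive, with invented numbers: if “I don’t know” always scores zero and a guess has even a modest chance of being counted correct, guessing wins in expectation, and only an explicit penalty for wrong answers changes the calculus.

```python
# Illustrative arithmetic, not a real benchmark. Under accuracy-style scoring,
# an abstention ("I don't know") earns 0, a correct guess 1, a wrong guess 0.
p_correct = 0.2  # assume the model's guess is right only 20% of the time

expected_guess = p_correct * 1 + (1 - p_correct) * 0    # 0.2
expected_abstain = 0.0                                   # always 0
print(expected_guess > expected_abstain)                 # True: guessing "pays"

# Only when wrong answers are penalised does honest abstention start to win.
penalty = -0.5
expected_guess_penalised = p_correct * 1 + (1 - p_correct) * penalty
print(expected_guess_penalised)                          # -0.2 < 0, so "I don't know" wins
```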
On top of that, there is a layer that can roughly be called “the marketing layer”. Models are trained to be not only “correct”, but also helpful and friendly – this is the stage usually referred to as RLHF (reinforcement learning from human feedback). If users prefer a confident, well-structured answer to a cautious “I’m not sure”, then when feedback is collected, that model gets more “likes” and becomes the norm. We literally tune the dials in favor of confidence – and then we are surprised by confident hallucinations.
It is important not to overestimate the “morality” of an LLM. It does not operate in categories like “honest/dishonest”; it only has probability distributions. If its internal estimate says that an option looks sufficiently plausible and fits within developer-defined thresholds, it will be produced without many caveats – unless you explicitly ask it to include a level of uncertainty.
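One way to picture those developer-defined thresholds is a simple confidence gate. The sketch below is hypothetical and not how any particular product is built; it only shows the logic of answering plainly above a cut-off and hedging below it.

```python
# Hypothetical confidence gate; real systems are far more involved.
CONFIDENCE_THRESHOLD = 0.35  # an invented cut-off chosen by the developer

def respond(best_answer: str, confidence: float) -> str:
    """Return the answer as-is above the threshold, add a caveat below it."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return best_answer                               # stated without caveats
    return f"I'm not sure, but possibly: {best_answer}"  # hedged

print(respond("The treaty was signed in 1987.", confidence=0.72))
print(respond("The treaty was signed in 1987.", confidence=0.21))
```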
And yes, in a sense this is similar to human behavior. People do not always say “I don’t know” either – we often fill in the gaps by analogy, by guesswork, by intuition. The difference is that we at least have a chance to become aware of our own uncertainty, whereas the model does not.
2Digital: What is the key difference between human lying and AI lying?
Alexander: If we use the classical definition, lying is a deliberate distortion of the truth, when a person understands that they are saying something untrue and does it intentionally.
An LLM has neither consciousness nor intent. It has a text generation algorithm: given a particular prompt and internal state, it selects the sequence of tokens with the highest probability. The model’s creators build in an allowance for error from the start: they understand that in a complex world with contradictory data, the model cannot be 100% accurate. So in its architecture and settings there is always some space in which it “operates under uncertainty”.
From this follows an important conclusion: the model does not experience falsehood as lying; for it, it is merely an error or a deviation in prediction. It cannot “decide to deceive you”, because it has no free will and does not form intentions. For now. And this “for now” is exactly what worries many people: as systems become more complex, the question of where the line lies between “imitation” and “genuine” intentions will arise more and more often.
2Digital: Have there already been cases where overreliance on LLMs led to real-world problems – in medicine, law, business? Which ones do you find most illustrative?
Alexander: One of the most telling cases is how the United States tried to lean on algorithms in the justice system. Strictly speaking, this is not about large language models, but about an earlier generation of AI systems – yet the problems are exactly the same: blind trust in a “black box” and ignoring context.
Starting in the mid-2000s, a number of U.S. states began using a system called COMPAS – a risk assessment algorithm meant to help judges decide whether to grant bail, what sentence to impose, and so on. The model was trained on large sets of historical data: arrest records, sentencing decisions, characteristics of defendants. The logic seemed straightforward: “The U.S. is a precedent-based legal system, we have a huge archive of past decisions, so an algorithm should be able to ‘objectively’ predict risk.”
The problem surfaced in 2016, when ProPublica analysed the performance of COMPAS on data from Broward County, Florida. Journalists showed that the system systematically overstated the risk of reoffending for Black defendants and understated it for white defendants: Black people were more often classified as “high risk” even when they did not go on to commit new crimes, while white defendants were more often rated “low risk” even though some of them did reoffend. In other words, the algorithm did not eliminate human bias – it preserved and scaled it.
The COMPAS story triggered a major debate on algorithmic transparency and the fairness of automated decisions. In some places its use was restricted, but the system never fully disappeared – courts simply began to emphasise that it is “only one factor” in judicial decision-making.
If we move to LLMs specifically and their hallucinations in law, the loudest example is Mata v. Avianca, Inc. in New York. In 2023, a lawyer filed a brief in federal court that had been prepared entirely with the help of ChatGPT. To support its argument, the model generated six supposedly existing precedents: complete with case names, docket numbers, and quotations from opinions. The problem was that none of these cases existed in reality – ChatGPT had simply produced a plausibly formatted fake. This is a textbook LLM hallucination.
Judge P. Kevin Castel demanded an explanation, held a hearing, and ultimately imposed a $5,000 sanction on two lawyers and their firm for misleading the court and failing to perform even basic checks on their citations. In his decision, he explicitly stressed that the use of AI does not relieve an attorney of professional responsibility: if a document is submitted to the court, the lawyer is responsible for every word, regardless of who generated it. After this case, a wave of similar incidents followed in different states, and courts began explicitly writing into their rules that the use of LLMs requires disclosure and mandatory verification of all citations.
A more recent example comes from business and public administration. In 2025, a scandal erupted in Australia over a report prepared by the consulting firm Deloitte for the Department of Employment and Workplace Relations. The government had commissioned a major analytical document – an assessment of IT systems and automated sanctions in the social welfare system; the contract was worth about 440,000 Australian dollars. Later, a researcher at the University of Sydney discovered that the 237-page report contained references to academic articles that do not exist and “quotations” from a federal court ruling that had never been issued.
The subsequent investigation showed that generative AI had been used in drafting the text, but the references and quotations had barely been checked. Once the story hit the media, Deloitte had to acknowledge the errors, rewrite and republish the report with the false references removed, and add a disclaimer about the use of AI. The company also agreed to partially refund its fee to the government. Formally, the report’s conclusions were still described as “valid”, but the blow to trust in consulting – and in the use of LLMs in official government documents – was serious. In Australia, this case sparked discussions about stricter standards for transparency and AI use in public-sector work.
2Digital: How do nonexistent sources and dead-end references appear?
Alexander: The mechanism here is very simple – and very insidious. The model does not “go on the internet” to find a reference. It generates the reference as text, using the same statistical rules it uses to write paragraphs.
If you ask: “please give me a scientific article with a full reference”, the model recalls what a typical citation looks like: surname, initials, year, title, journal, volume, pages, DOI. Then it starts assembling this from familiar fragments: common author names, frequently mentioned journals, plausible years and issue numbers. The result is a very convincing but nonexistent publication. In English-language literature this is called a phantom reference.
We have already seen good examples above – the story of Mata v. Avianca, Inc. and the Deloitte report scandal. In all these cases the mechanism is the same: the model is trained to imitate form – the style of a scientific citation, the structure of a court decision – but it does not have built-in access to a registry of “all real articles and all real cases”. If developers do not add an external verification layer (searching databases, validating DOIs, checking against court registries), the model will confidently “fill in” reality wherever data are missing.
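A minimal example of such an external verification layer: checking whether a DOI actually exists in the public Crossref registry before trusting a citation. This sketch assumes network access and the api.crossref.org REST endpoint; the DOIs below are placeholders, and a real pipeline would also verify titles, authors, and (for case law) court registries.

```python
import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref knows this DOI (simple check, no retries or caching)."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Placeholder DOIs for illustration only.
for doi in ["10.1000/obviously-made-up", "10.5555/12345678"]:
    verdict = "found in Crossref" if doi_exists(doi) else "NOT found: treat the citation as unverified"
    print(doi, "->", verdict)
```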
2Digital: Many users are convinced that AI is “being lazy” when it refuses to tackle a complex task or gives a superficial answer. What is actually happening under the hood in such situations?
Alexander: There are several reasons why this impression of a “lazy” AI arises.
First, the model often tries to optimise the task for you. If the prompt is vague – with no clear criteria, scope, or list of steps – it will try to shrink the problem into something average and safe. Its internal “limiters” are tuned so as not to turn every request into an hours-long analysis of the entire corpus of knowledge. So in situations where a deep investigation is expected, the model may produce a neat but shallow overview.
Second, these systems really do have safeguards related to resources and safety. Imagine millions of people simultaneously asking a model to “analyse all available oncology research from the last 20 years and summarise conclusions for every diagnosis”. If taken literally, any platform would collapse instantly. That is why models are trained to save steps: to narrow the set of sources, rank information by importance, and sometimes refuse overly heavy scenarios, masking this as “I cannot complete this task”.
The third point is a simple mismatch of tool and task. A user might come to a “fast” or broadly “creative” model with a prompt that actually requires strict mathematics or detailed legal analysis. It is like asking random passers-by on the street to solve a difficult equation: someone might manage it, but this is not where one usually goes in search of expertise. If the same question is asked in the hallway of a mathematics department, the chances of a deep answer increase dramatically.
Finally, models use what could be called a “task-splitting tactic”. Sometimes what looks like a “lazy” recommendation to break the question into steps is not an excuse, but the only way to move along a chain of reasoning without getting lost or drifting into chaotic hallucinations. A useful pattern here is simple: the more complex the request, the more helpful it is to break it into stages and ask the model to move through them step by step.
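As a sketch of that step-by-step pattern: the ask() function below is a hypothetical stand-in for whatever chat API you actually use, and the prompts are illustrative. The point is that each call builds on the output of the previous, checkable step instead of asking for everything in one giant request.

```python
def ask(prompt: str) -> str:
    """Hypothetical stand-in for a call to your chat model; replace with a real client."""
    print(">>> prompt sent to the model:\n" + prompt + "\n")
    return "<model answer would appear here>"

# Step 1: narrow the topic before asking for any content.
angle = ask("Suggest three narrow angles on 'LLM hallucinations in legal work', one sentence each.")

# Step 2: build a structure for the chosen angle only.
outline = ask("Draft a 5-section outline for this angle (headings only, no prose yet):\n" + angle)

# Step 3: expand one section at a time, so errors stay local and easy to check.
section_one = ask("Write about 300 words for section 1 of this outline:\n" + outline
                  + "\nFlag any claim you are not sure about.")
```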
2Digital: Give a few basic tips on writing a prompt that will get a clear and accurate answer.
Alexander: I would start with the simplest thing – always give the model a role. Write: “Imagine you are a cardiologist”, “Answer like a history teacher for tenth-graders”, “Explain this as a business analyst, but in simple language”. For an LLM this is not theatre, it is a working mechanism: the role determines the depth, tone, and structure of the answer. If you do not do this, the model tries to please everyone at once – and you get that odd mix of “a bit of a scientist, a bit of a journalist, a bit of a gossipy neighbour on the bench outside”.
Second, formulate the task the way you would formulate it for a real person. Say exactly what you need, for whom, and in what form. Not “write my term paper”, but: “I need an outline for a term paper on this topic, second year, 5–7 sections, without heavy jargon, with 5 starter references.” The clearer the frame – purpose, format, audience, approximate length – the less room the model has for chaotic improvisation.
Third, do not try to swallow the elephant whole. Large tasks are almost always better broken into steps. First you refine the topic, then you ask for an outline, then a draft of one section, then a revision of the arguments and references. LLMs work much better in dialogue and in sequential steps than in the “one prompt – the whole world” format. If everything is dumped into a single giant request, it is not only the model that gets confused – the prompt itself starts to contain contradictory requirements, and the output reflects that.
Fourth, build fact-checking into the prompt itself. It is perfectly fine to write: “If you are not sure about a fact or a source, say so”, “Do not invent article titles and DOIs”, “Flag anything you are uncertain about.” This is not a magic button that turns hallucinations off, but it does change the model’s behaviour: it is easier for it to admit uncertainty if you have explicitly allowed it in advance. And then it is your turn to switch on human scepticism and verify everything that matters: numbers, quotations, legal wording.
And finally, try to put “guardrails” around the model. If you have your own data – documents, reports, extracts from laws – upload them (where the platform allows it) and say directly: “Answer based only on these texts. If the necessary information is missing, say that it is missing.” This drastically narrows the field for fantasy and turns the output from abstract “wisdom of the universe” into work with a concrete corpus. Also, do not forget to match the tool to the task: the same system may have a “fast” model for drafts and a deeper one for analysis, and it is worth experimenting with both. In the end, a good prompt is not a magic spell, but a clear brief: a well-defined role, a clear goal, reasonable boundaries, and a readiness to double-check everything that really matters.
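Pulled together, such a brief might be assembled like this. Everything in the example (the role, the rules, the placeholder documents, the exact wording) is illustrative rather than a recipe tied to any particular platform.

```python
# Illustrative prompt assembly: role, goal, format, grounding, and explicit
# permission to admit uncertainty, all in a single brief.
source_documents = [
    "<paste the report excerpt here>",
    "<paste the extract from the law here>",
]

prompt = (
    "Role: you are a business analyst writing for a non-technical manager.\n"
    "Task: summarise the key risks in the documents below in 5 bullet points,\n"
    "each under 25 words, plain language, no jargon.\n"
    "Rules:\n"
    "- Answer based only on the documents provided.\n"
    "- If the necessary information is missing, say that it is missing.\n"
    "- Do not invent sources, figures, or quotations; flag anything you are unsure about.\n"
    "\nDocuments:\n" + "\n---\n".join(source_documents)
)

print(prompt)  # send this brief to whichever model fits the task
```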

