The Grand Hallucinating Internet
Will Alignment Science save us from the dangers of AI Persuasion and reward hacking?
Good Morning,
As we get deeper into the summer months, we’ll be posting less, so you can enjoy your vacations, the outdoors, and family, and so I can find more balance as well. I have wanted for a long time to write something about AI alignment.
Today’s piece may be a bit more philosophical than usual. By the way, Harry Law of the Learning From Examples newsletter has turned on pledges. This is a way to signal your support for the history of AI, a topic underrepresented on this platform. Please support independent voices (and nerds!): Learning From Examples
“When we use generative AI for work, there are two ever-present risks: hallucinations/confabulations and deskilling.” - Arvind Narayanan, AI Snake Oil
Since 2022, LLMs have changed the world, but have they made the internet less accurate? You may have noticed the trend: even as reasoning models get more sophisticated, hallucinations get worse.
The Black Box Problem
As AI models become more powerful, they're also becoming more prone to hallucinating, not less. So what are we going to do about the black box problem of AI?
This “black box” trait is common to neural networks in general, and LLMs are very deep neural networks. It is not really possible to explain precisely why a specific input produces a particular output rather than something else. Why? Because neural networks are neither databases nor lookup tables.
Three weeks ago, Ai2 launched OLMoTrace, a new tool that maps responses from OLMo back to related training data.
As LLMs and Generative AI products go mainstream, AI interpretability is becoming a big deal.
What is LLM Interpretability?
LLM interpretability refers to the effort of understanding and explaining how Large Language Models (LLMs) arrive at their outputs, and why they make certain predictions.
It's about making these complex models more transparent and understandable to humans, allowing us to trust their decisions and debug any issues. This involves analyzing the model's internal workings, identifying the key factors that influence its predictions, and generating explanations in human-understandable language.
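To make this concrete, here is a minimal sketch of one common interpretability technique, gradient-based input attribution (“saliency”), using a small open model via the Hugging Face transformers library. The model name (gpt2) and the prompt are stand-ins purely for illustration; this is not how any particular lab does its interpretability work.

```python
# Minimal sketch: gradient-based input attribution ("saliency") for a causal LM.
# gpt2 and the prompt are illustrative stand-ins, not a production setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")

# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
top_token = outputs.logits[0, -1].argmax()       # most likely next token
outputs.logits[0, -1, top_token].backward()      # backprop its score

# One saliency score per input token: the gradient norm of that token's embedding.
saliency = embeds.grad[0].norm(dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, saliency):
    print(f"{tok:>12}  {score.item():.4f}")
```

In practice, a sketch like this only tells you which input tokens mattered most for a single next-token prediction, which is exactly the kind of partial view the field is trying to move beyond.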
AI startups, from Anthropic to Thinking Machines Lab to Safe Superintelligence Inc., appear to take this topic very seriously.
“Opening the black box doesn't necessarily help: the internal state of the model—what the model is "thinking" before writing its response—consists of a long list of numbers ("neuron activations") without a clear meaning” - Anthropic, 2024
So whether it’s the Allen Institute for AI (Ai2) or Anthropic, AI interpretability is about to make huge progress in the 2025 to 2030 period, because it impacts trust in the systems that Anthropic, OpenAI, and others are building.
The Chain of Thought (CoT) Alignment Problem
To make matters more complicated, reasoning models don’t always show what they actually think. That is, CoT reasoning is often “unfaithful”.
“It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.” - Don’t Worry About the Vase newsletter
Anthropic appears to be the corporate entity making the most headway in solving, or at least deeply researching, these AI alignment problems.
‘AI Biology’ Interpretability and Alignment Studies Are a Thing in 2025
Existing interpretability methods like attention maps and feature attribution offer partial views into model behavior. While these tools help highlight which input tokens contribute to outputs, they often fail to trace the full chain of reasoning or identify intermediate steps.
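As an illustration of the first of those partial views, here is a short sketch that pulls raw attention maps out of a small open model (again gpt2, via Hugging Face transformers, purely as a stand-in). It shows how strongly the final token attends to each earlier token in one layer, and also hints at why this alone falls short: the weights describe where the model looks, not why.

```python
# Minimal sketch: extracting attention maps from a small causal LM.
# gpt2 and the example sentence are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each shaped (batch, heads, seq, seq).
last_layer = outputs.attentions[-1][0]    # (heads, seq, seq) for the final layer
avg_over_heads = last_layer.mean(dim=0)   # (seq, seq)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# How strongly does the last token attend to each earlier token?
for tok, weight in zip(tokens, avg_over_heads[-1]):
    print(f"{tok:>8}  {weight.item():.3f}")
```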
Anthropic literally calls this Alignment Science.
Judging by how prevalent sycophancy is in OpenAI’s models, it’s clear OpenAI does not prioritize alignment or the safety of the user to the same extent. This has major ramifications as these LLMs and products get more embedded in society and in institutions such as national defense. If AI fulfills its goals to the detriment of not just transparency but trust, alignment, and safety, who knows what will eventually happen.
Researchers are thus focused on reverse-engineering these models to identify how information flows and decisions are made internally, but it’s difficult work and progress is slow.
Security Matters 💡🔐
This article is brought to you by Snyk, and just a note that I choose sponsors very intentionally. Founded in Israel and now headquartered in Boston, Snyk is a developer-oriented cybersecurity company specializing in securing custom-developed code, open-source dependencies, and cloud infrastructure. They are doing a webinar soon that I recommend:
🌟 In partnership with Snyk 📈
Security Champions Webinar
Security Champions programs are a proven way to scale AppSec across dev teams. Join Snyk’s live webinar on May 15 @ 11AM ET and 🎓 earn bonus CPE credits for attending!
OpenAI’s Sycophancy Blunder of 2025
In late April 2025, it became clear OpenAI doesn’t do alignment very well. Following the GPT-4o model update, users on social media noted that ChatGPT began responding in an overly validating and agreeable way. It quickly became a meme. Users posted screenshots of ChatGPT applauding all sorts of problematic, dangerous decisions and ideas (TechCrunch).
OpenAI says it moved forward with the update even though some expert testers indicated the model seemed ‘slightly off’ (The Verge). OpenAI then offered a lame excuse, saying its efforts to “better incorporate user feedback, memory, and fresher data” could have partly led to “tipping the scales on sycophancy.” Whatever the case may be, it’s clear OpenAI, as a B2C company, has different incentives than Anthropic, which is more B2B, where trust, safety, and alignment matter more to the customer.
It’s more likely that LLMs become more powerful and more embedded in our systems and institutions faster than we make progress in alignment science. This is why OpenAI’s mission is such a lie compared to what the company has become today, outraging various parties, including former employees. If that is the case, Microsoft’s lack of due diligence is partly to blame, as they were the first to back OpenAI with its $13 billion in funding.
Reasoning models and the Hallucinating Internet
We’re basically using and developing technologies whose inner workings we don’t fully understand, let alone have full transparency into. Alignment science, frankly, is like neuroscience in the 1980s. OpenAI is building huge AI infrastructure, called Stargate, without even an internal map of its models’ AI biology, which could lead to catastrophic unintended risks and consequences.
According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models (o1, o1-mini, and o3-mini) as well as OpenAI’s traditional, “non-reasoning” models, such as GPT-4o. An internet behaviorally shaped by reasoning models is an internet more drugged with hallucinations.
OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people.
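For context on what a figure like that 33% means mechanically, here is a back-of-the-envelope sketch of how a hallucination rate is computed once each benchmark answer has been graded. PersonQA itself is OpenAI-internal, so the graded examples below are invented purely for illustration.

```python
# Hypothetical graded answers; PersonQA is not public, so this data is invented
# purely to show the arithmetic behind a hallucination-rate figure.
graded_answers = [
    {"question": "Where was Ada Lovelace born?", "hallucinated": False},
    {"question": "What year did Alan Turing join IBM?", "hallucinated": True},  # he never did
    {"question": "Who founded Bell Labs?", "hallucinated": False},
]

rate = sum(a["hallucinated"] for a in graded_answers) / len(graded_answers)
print(f"Hallucination rate: {rate:.0%}")  # -> 33%
```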
There are various research papers related to alignment science and AI biology, some of which show promise. But by not prioritizing alignment, trust, and safety, we are endangering ourselves, worsening the mental health impacts of our digital architectures, and jeopardizing the future of truth, security, and freedom online.
There will also be AI interpretability startups that try to solve some of these problems. Companies like Goodfire come to mind.
📅 - Add to calendar if relevant - Event FYI - Ai2 Researchers Are Hosting a Reddit AMA — Ask Them Anything About OLMo + Open-Source
On Thursday, May 8 from 8am-10am PT, a group of Ai2 researchers will be hosting an AMA (Ask Me Anything) on r/huggingface, Reddit’s go-to space for developers working with open models.
This is a great chance to connect directly with the Ai2 research team, ask technical and forward-looking questions about the OLMo family of open models — including recent releases like OLMoTrace and OlmOCR — and get unfiltered insights from the team behind them. Whether you’re curious about the architecture, fine-tuning approaches, how these models can be extended, or what’s coming next from Ai2, this AMA is designed to give the community real access. Feel free to post questions yourself or follow the thread to see how the conversation unfolds. This is not a sponsorship, just an FYI.
Because who knows what OLMoTrace might lead to as well. See YouTube video.
AI Will Only Get More Persuasive in a Hallucinating Chatbot World
As Big Tech normalizes highly hallucinating reasoning models, AI will get vastly more persuasive than it is today. This will lead to many dangerous outcomes.
In my honest opinion, Sam Altman is of course himself a rather disingenuous (snake-oil-salesman variety of CEO) and sycophantic (manipulative) person, so it’s not surprising his company’s models and products reflect that.
As Thinking Machines Lab hits a $2 billion seed round in funding, we can only hope that the former OpenAI employees there learned their lessons about the importance of alignment-first science and products. Financial incentives have a way of corrupting even the best intentions of idealistic teams. It’s not always clear with these startups and companies whether AI alignment is part of their marketing pitch or actually part of their internal goals, culture, and incentives. The truth comes out inevitably.
The commercial and national security incentives of the AI arms race mean alignment science is unlikely to be a global priority anytime soon. AI historians of the future are unlikely to get right the nuances of the geopolitical climate or the strong-man deal tactics of actors like the Trump Administration. Politics and national rivalry cannot be separated from alignment science in the real world. But most researchers operate only in a narrow silo. Unfortunately for the future of AI risk and safety, that’s not how the real world works.
Bio: Harry Law writes Learning From Examples, a history newsletter about the future. AI is a major focus, but he writes about any episode that helps us think more clearly about technological change using ideas from philosophy, literature, and history.
The (artificial) neuron doctrine: A short history of ‘AI biology’
Harry is a PhD candidate at the University of Cambridge and a former researcher at Google DeepMind.
Listen at your leisure: 21 min 10 seconds.
In this guest post, Harry Law charts the historical timeline of AI biology sciences.
He identifies key moments in AI history that shed light on our topics today.
Just as the early pioneers of neuroscience did decades ago, researchers must approach model interpretability, and how Generative AI does what it does, nearly from scratch. *We still don’t fully understand the brain or LLMs.