OpenAI's o3 Scores an "A" on ARC's AGI Test

o3 Models are going to get a lot more expensive, here's why.

Dec 21, 2024


Hey Everyone,

My apologies for putting this out on a Saturday in the middle of the holidays, but this just could not wait.

We have an AGI update for you today. AGI is the representation of generalized human cognitive abilities in software, so that an AGI system faced with an unfamiliar task could find a solution. We note that OpenAI, Google and many others have watered down the commercial definition of AGI in recent years to make their systems sound more capable. Taking the definition above, however, claims by OpenAI employees that they now have AGI internally make a bit more sense.

Apparently OpenAI's o3 model scores 87.5% on the ARC challenge (arcprize.org). The key thing about this benchmark is that it is impossible to pre-learn, as every test has new conditions. Models were previously stuck at 30-55%. Humans are particularly good at these tasks, and LLMs have been bad at them.

It is a bit stunning to read Francois Chollet’s blog post about this. While OpenAI’s ship-mas event was otherwise fairly forgettable, the announcement of the o3 models is a lot more interesting. OpenAI previewed its next frontier models, o3 and o3-mini, setting new benchmarks in technical capabilities and safety advancements.

While Google Gemini 2 and its announcements were fairly stunning in December, 2024 did not shape up to have any hype-worthy GPT-5 moments. In 2024 OpenAI saw a huge increase in weekly active users of ChatGPT and has bundled many great features into it. On December 5th, OpenAI launched a new $200-a-month tier called ChatGPT Pro.

But how expensive is its new o3 model going to be?

o3 scored 87.5% on the ARC-AGI benchmark in high-compute mode, far surpassing previous records and tripling its predecessor’s performance.
The ARC-AGI test is designed to evaluate an AI’s ability to adapt to tasks without relying on pre-trained knowledge.
Epoch AI’s FrontierMath benchmark highlighted o3’s unique reasoning capabilities: it solved 25.2% of problems, where other models remain below 2%.

Let’s just slow down for a second, and remind ourselves that researchers view this as a major milestone but emphasize that AGI remains a distant goal, as many simple ARC-AGI tasks remain unsolved.

The o3 models use what OpenAI calls "private chain of thought," where the model pauses to examine its internal dialog and plan ahead before responding. You might call this "simulated reasoning" (SR), a form of AI that goes beyond basic large language models (LLMs).

We all get that OpenAI needs to make more revenue in 2025 to become a viable project and have a sustainable future as a startup.
