AI Supremacy
Why OpenAI's CriticGPT Might be Significant

The AI arms race toward human alignment and safe, trustworthy AI.

Michael Spencer
Jul 02, 2024
[Image: a philosopher reading at a table, rendered in a Renaissance scholar setting with GPT/OpenAI motifs]

Hey Everyone,

One of the major problems with LLMs is their tendency to hallucinate and to drift out of alignment with human intent. On June 27th, OpenAI announced what it calls CriticGPT.

While Anthropic is six years younger than OpenAI (founded in 2021 and 2015 respectively), as of July 2024 Claude 3.5 is widely regarded as better aligned with human intent than ChatGPT and better at following human instructions. Clearly OpenAI has some work to do. OpenAI has long argued that reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. Both startups have looked into ways to improve this.
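For readers newer to RLHF, the bottleneck OpenAI is pointing at is easy to see in the reward-modeling step: the reward model is fit to pairwise preference labels supplied by human annotators, so any systematic mistakes those annotators make flow straight into the reward signal. Below is a minimal PyTorch-style sketch of that pairwise (Bradley-Terry) loss; the toy reward model and tensor shapes are illustrative assumptions of mine, not OpenAI's actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps an (already embedded) prompt+response vector to a scalar score.
# In practice this is a full LLM with a scalar head; the linear layer here is only a stand-in.
class TinyRewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def pairwise_rlhf_loss(model, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the human-preferred ('chosen') response above the rejected one.

    The chosen/rejected split comes entirely from human annotators -- which is exactly the
    limitation OpenAI describes: if humans cannot reliably tell which response is better
    (e.g. subtle bugs in code), the reward model learns the wrong signal.
    """
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Tiny usage example on random embeddings standing in for annotated comparison pairs.
model = TinyRewardModel()
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = pairwise_rlhf_loss(model, chosen, rejected)
loss.backward()
print(f"reward-model loss: {loss.item():.3f}")
```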

Will Critic Models Help Improve Human Alignment in LLMs?

  • To improve human evaluation ability and overcome that limitation, OpenAI’s recent work trains “critic” models that help humans more accurately evaluate model-written code.

  • These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks.

“LLM critics now succeed in catching bugs in real-world data, and even accessible LLM baselines like ChatGPT have significant potential to assist human annotators. From this point on the intelligence of LLMs and LLM critics will only continue to improve. Human intelligence will not.”

  • On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that models catch more bugs than human contractors paid for code review.

  • OpenAI further confirmed that their fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as “flawless”, even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model.

  • Critics have limitations of their own, including hallucinated bugs that could mislead humans into making mistakes they might otherwise have avoided, but human-machine teams of critics and contractors catch similar numbers of bugs to LLM critics alone while hallucinating less (a rough sketch of this critic-plus-reviewer workflow follows below).
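To make the workflow a bit more concrete, here is a short Python sketch of where a critic model could sit in a review pipeline: ask the critic for a list of suspected bugs, then merge them with a human reviewer's notes, roughly the human-machine team the paper describes. The ask_critic helper, its prompt, and the stubbed-out model are my own placeholders, not the CriticGPT interface; any chat-completion backend could sit behind the generate callable.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    source: str   # "llm_critic" or "human_reviewer"
    issue: str    # description of the suspected bug

CRITIC_PROMPT = (
    "You are a code critic. List every bug or questionable behaviour you can find "
    "in the following code, one issue per line:\n\n{code}"
)

def ask_critic(code: str, generate) -> list[Critique]:
    """Query an LLM critic. `generate` is any callable mapping a prompt string to text;
    plugging in a real chat-completion client is left to the reader."""
    reply = generate(CRITIC_PROMPT.format(code=code))
    return [Critique("llm_critic", line.strip()) for line in reply.splitlines() if line.strip()]

def team_review(code: str, generate, human_notes: list[str]) -> list[Critique]:
    """Human-machine team: combine critic output with a human reviewer's notes so the
    reviewer can discard hallucinated bugs before they mislead downstream annotators."""
    critiques = ask_critic(code, generate)
    critiques += [Critique("human_reviewer", note) for note in human_notes]
    return critiques

# Toy usage with a stubbed-out model so the sketch runs without any API key.
if __name__ == "__main__":
    fake_llm = lambda prompt: "off-by-one error in loop bound\nmissing check for empty input"
    snippet = "def mean(xs):\n    return sum(xs) / len(xs)"
    for c in team_review(snippet, fake_llm, human_notes=["no handling of len(xs) == 0"]):
        print(f"[{c.source}] {c.issue}")
```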

Read the Paper (24 slides)

You can contrast this with Anthropic’s Constitutional AI, Collective Constitutional AI and more recent research.

Anthropic itself says:

“Our research teams investigate the safety, inner workings, and societal impact of AI models — so that artificial intelligence has a positive impact on society as it becomes increasingly advanced and capable.”

With regard to RLHF and its latest developments, I try to listen to Nathan Lambert and Cameron R. Wolfe, Ph.D.
