Why OpenAI's CriticGPT Might be Significant
The AI arms race toward human alignment and safe, trustworthy AI.
Hey Everyone,
One of the major problems with LLMs is their hallucinations and their tendency not to be properly aligned with human intent. On June 27th, OpenAI announced what it calls CriticGPT.
While Anthropic is six years younger than OpenAI (founded in 2021 vs. 2015, respectively), as of July 2024 Claude 3.5 is widely regarded as better aligned with human intent than ChatGPT and better able to follow human instructions. Clearly OpenAI has some work to do. OpenAI has long claimed that reinforcement learning from human feedback (RLHF) is fundamentally limited by the capacity of humans to correctly evaluate model output. Both startups have looked into ways to improve this.
Will Critic Models Help Improve LLMs in Human Alignment?
To improve human evaluation ability and overcome that limitation, OpenAI’s recent work trains “critic” models that help humans evaluate model-written code more accurately.
These critics are themselves LLMs trained with RLHF to write natural language feedback highlighting problems in code from real-world assistant tasks.
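CriticGPT itself is not publicly available, but the basic idea is easy to sketch against the standard OpenAI chat completions API, with a generic model standing in for the fine-tuned critic. A minimal sketch; the model name and prompt wording are my assumptions, not OpenAI’s actual training setup:

```python
# Illustrative sketch only: CriticGPT is not a public model, so this uses a
# standard chat model with a critic-style prompt. Model name and prompt
# wording are assumptions, not OpenAI's actual method.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITIC_PROMPT = (
    "You are a code critic. Point out every bug, security issue, or "
    "deviation from the task description in the code below. For each "
    "problem, quote the offending lines and explain why they are wrong."
)

def critique_code(task: str, code: str) -> str:
    """Ask the model for natural-language feedback on model-written code."""
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the real critic is a fine-tuned model
        messages=[
            {"role": "system", "content": CRITIC_PROMPT},
            {"role": "user", "content": f"Task:\n{task}\n\nCode:\n{code}"},
        ],
    )
    return response.choices[0].message.content

# Example: the code below has a bug (it returns the smallest element).
print(critique_code(
    "Return the largest element of a list.",
    "def largest(xs):\n    return sorted(xs)[0]",
))
```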
As OpenAI puts it:

“LLM critics now succeed in catching bugs in real-world data, and even accessible LLM baselines like ChatGPT have significant potential to assist human annotators. From this point on the intelligence of LLMs and LLM critics will only continue to improve. Human intelligence will not.”
On code containing naturally occurring LLM errors, model-written critiques are preferred over human critiques in 63% of cases, and human evaluation finds that the models catch more bugs than human contractors paid for code review.
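To make that 63% figure concrete: the comparison is a pairwise preference test, where annotators see a model-written and a human-written critique of the same buggy code and pick whichever they find more useful. A toy sketch, with invented judgments rather than OpenAI’s data:

```python
# Toy illustration of a pairwise preference rate; these judgments are
# invented. "model" means the annotator preferred the model-written
# critique over the human-written one for the same buggy code.
judgments = ["model", "model", "human", "model", "human",
             "model", "model", "human", "model", "model"]

preference_rate = judgments.count("model") / len(judgments)
print(f"Model critiques preferred in {preference_rate:.0%} of comparisons")
# -> 70% on this toy data; OpenAI reports 63% in their study
```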
OpenAI further confirmed that their fine-tuned LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as “flawless”, even though the majority of those tasks are non-code tasks and thus out-of-distribution for the critic model.
Critics have limitations of their own, including hallucinated bugs that can mislead humans into making mistakes they might otherwise have avoided. Still, human-machine teams of critics and contractors catch a similar number of bugs to LLM critics, while hallucinating less than LLMs alone.
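What might such a human-machine team look like in practice? A hedged sketch, with hypothetical names and data shapes: the critic proposes candidate bugs, and a human contractor confirms or rejects each one before it counts, filtering out hallucinations.

```python
# Hypothetical sketch of a human-machine review loop; the types and
# function names are my own, not OpenAI's setup.
from dataclasses import dataclass

@dataclass
class CandidateBug:
    location: str     # e.g. "line 2" or a quoted snippet
    explanation: str  # the critic's natural-language justification

def human_machine_review(candidates: list[CandidateBug],
                         human_confirms) -> list[CandidateBug]:
    """Keep only critic-proposed bugs that a human reviewer confirms.

    `human_confirms` stands in for the contractor's judgment; in OpenAI's
    setup this is a paid annotator reading the critique against the code.
    """
    return [bug for bug in candidates if human_confirms(bug)]

# Hypothetical usage: keep a real bug, reject a hallucinated one.
candidates = [
    CandidateBug("line 2", "sorted(xs)[0] returns the smallest element"),
    CandidateBug("line 5", "variable `n` is never defined"),  # hallucinated
]
confirmed = human_machine_review(
    candidates, lambda bug: "sorted" in bug.explanation
)
print([bug.location for bug in confirmed])  # -> ['line 2']
```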
You can contrast this with Anthropic’s Constitutional AI, Collective Constitutional AI and more recent research.
Anthropic itself says:
“Our research teams investigate the safety, inner workings, and societal impact of AI models — so that artificial intelligence has a positive impact on society as it becomes increasingly advanced and capable.”
With regards to RLHF and its latest developments, I try to listen to
and .