What is OpenAI's Operator and Blueprint? History and Tips of Prompt Engineering from 2020 to 2025 ✨

Can OpenAI's Operator compete with Anthropic's Computer Use and Google Jarvis? And insider history and tips of Prompt Engineering. Why OpenAI's Blueprint has me worried.

Nov 14, 2024

As we mark the two-year anniversary of ChatGPT’s launch, I wanted to share again the deep dive on prompting and its recent history by Mike Taylor. He wrote the book on AI prompting, titled “Prompt Engineering for Generative AI: Future-Proof Inputs for Reliable AI Outputs”.

Mike’s work has appeared multiple times in places like Lenny’s Newsletter, Ben’s Bites, and Every.to. Later in this article, for premium readers, I’m also going to cover OpenAI’s new PR on their Operator tool and analyze the AI infrastructure blueprint they presented in Washington. Skip lower down in the article if that’s of greater interest to you.

In a nutshell, OpenAI is set to launch a new AI agent called Operator in January 2025. This tool is designed to autonomously perform various tasks on behalf of users, such as booking travel, writing code, and conducting research.

🌟📋🌱 (November, 2024)

These foundational tips work well with everything from ChatGPT to Claude and open-source models. They're useful for anyone exploring AI technologies, regardless of experience level. By mastering these basics, you'll be well-prepared to experiment with more advanced techniques as you become more comfortable with AI interactions, and test and learn what works for you.

Historical Context: Prompt Engineering from 2020 to 2025

Since the introduction of large language models (LLMs), the skill of prompt engineering has become in-demand, as people and organizations look for ways to get better results from generative AI models. In this blog post, we'll take a journey through the key developments that have shaped the field, from the early days of GPT-3 to the cutting-edge techniques of today, with a look ahead to what’s next.

In order to understand the future of prompt engineering, it’s important to see what came before. This timeline isn’t exhaustive, but these have been the most personally impactful events on my own work as a prompt engineer. If you want to dive deeper, I covered these techniques and more in my book “Prompt Engineering for Generative AI” (O’Reilly, 2024).

🐝 May 2020: GPT-3 and Few-Shot Prompting

The release of GPT-3 in May 2020 marked a watershed moment in natural language processing. The paper "Language Models are Few-Shot Learners" by Brown et al. popularized the concept of few-shot prompting, demonstrating that LLMs could perform tasks with minimal examples. This breakthrough challenged the traditional paradigm of machine learning, which typically required large amounts of labeled data for training.

Few-shot prompting opened up new possibilities for AI applications, allowing models to adapt to new tasks on the fly. It significantly reduced the time and resources needed for task-specific fine-tuning, making LLMs more versatile and accessible for a wide range of applications. This development laid the foundation for many of the prompt engineering techniques that would follow.
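To make the idea concrete, here is a minimal sketch of how a few-shot prompt can be assembled. The function name, the "Input:"/"Output:" layout, and the sentiment examples are my own assumptions for illustration, not something taken from the GPT-3 paper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # End with the unanswered query so the model completes the pattern.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Illustrative examples only; any labeled pairs would do.
examples = [
    ("I loved this film.", "positive"),
    ("The plot made no sense.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "Great acting, but far too long.",
)
print(prompt)
```

The resulting string can be sent to any chat or completion model. Swapping the examples is usually the quickest way to adapt the same prompt to a new task.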

❄️ January 2022: Chain-of-Thought Prompting

Wei et al. introduced chain-of-thought prompting in their paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." This technique improved performance on complex reasoning tasks by asking the model to think through its reasoning process before providing an answer. It marked a significant step towards more transparent and interpretable AI decision-making.

Chain-of-thought prompting not only enhanced the accuracy of LLMs on complex tasks but also provided insights into the model's reasoning process. This transparency made it easier for humans to verify and trust the model's outputs. Moreover, it opened up new avenues for debugging and improving model performance by identifying where in the chain of reasoning errors or biases might occur.

🍀 March 2022: Self-Consistency in Chain of Thought

Building on chain-of-thought prompting, Wang et al. proposed the concept of self-consistency in their paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models." This approach involved generating multiple results and selecting the majority answer or the highest-ranking option, further improving the reliability of LLM outputs.

Self-consistency addressed one of the key challenges in LLM applications: the variability of outputs. By generating multiple chains of thought and comparing their conclusions, the technique could identify more robust and consistent answers. This method not only improved accuracy but also provided a measure of confidence in the model's outputs, making it particularly valuable for critical applications where reliability is paramount.
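The voting step can be sketched in a few lines. This is a simplified illustration, assuming each sampled response has already been reduced to its final answer. The sampler below is a hypothetical stub with canned answers standing in for repeated LLM calls at a non-zero temperature.

```python
from collections import Counter


def self_consistent_answer(sample_fn, prompt, n_samples=5):
    """Sample several answers and return the majority answer with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def make_fake_sampler(canned_answers):
    """Hypothetical stand-in for an LLM call; returns canned answers in order."""
    it = iter(canned_answers)
    return lambda prompt: next(it)


sampler = make_fake_sampler(["18", "18", "20", "18", "17"])
answer, agreement = self_consistent_answer(
    sampler, "Q: ... Let's think step by step."
)
print(answer, agreement)
```

The vote share works as a rough confidence signal: a low share suggests the question is one to route to a human or a stronger model.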

🌼 May 2022: Least-to-Most Prompting

Zhou et al. introduced the least-to-most prompting technique in their paper "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." This method involves breaking down complex tasks into subtasks, which became a common practice in prompt engineering and led to the development of prompt chaining.

Least-to-most prompting addressed the challenge of tackling complex problems that were beyond the scope of a single prompt. By decomposing tasks into smaller, manageable steps, it allowed LLMs to approach problems more systematically. This technique not only improved performance on complex tasks but also made the problem-solving process more transparent and easier to debug.
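Here is a rough sketch of the two stages, decomposition and then sequential solving. The prompt wording and the `ask` callable are assumptions on my part, and `fake_ask` is a hypothetical stub with canned replies so the example runs without a model.

```python
def least_to_most(ask, problem):
    """Decompose a problem, then solve subquestions in order, feeding answers forward."""
    # Stage 1: ask for subquestions, one per line, simplest first.
    plan = ask(f"Break this problem into simpler subquestions, one per line:\n{problem}")
    subquestions = [line.strip() for line in plan.splitlines() if line.strip()]

    # Stage 2: answer each subquestion with earlier Q/A pairs as context.
    context = f"Problem: {problem}"
    answer = ""
    for sub in subquestions:
        answer = ask(f"{context}\nQ: {sub}\nA:")
        context += f"\nQ: {sub}\nA: {answer}"
    return answer, context


def fake_ask(prompt):
    """Hypothetical stub model with canned replies, keyed on the prompt text."""
    if prompt.startswith("Break this problem"):
        return "How many apples does Ana have?\nHow many apples do they have together?"
    if prompt.endswith("Q: How many apples does Ana have?\nA:"):
        return "6"
    return "9"


final, trace = least_to_most(
    fake_ask, "Ben has 3 apples. Ana has twice as many. How many do they have together?"
)
print(trace)
```

Because every intermediate question and answer lands in the trace, it is easy to see which step went wrong when the final answer is off.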

🎃 October 2022: ReAct - Reasoning and Acting

The ReAct framework, introduced in "ReAct: Synergizing Reasoning and Acting in Language Models," taught LLMs to use tools to compensate for their weaknesses. This development paved the way for function calling, allowing LLMs to interact with APIs and expand their capabilities beyond text generation.

ReAct represented a significant step towards more capable and versatile AI assistants. By enabling LLMs to reason about when and how to use external tools, it addressed limitations in areas such as up-to-date information retrieval, complex calculations, and interaction with external systems. This framework laid the groundwork for more advanced AI systems that could seamlessly integrate with a wide range of applications and services.
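A stripped-down version of the reason-then-act loop might look like the sketch below. The "Action: tool[argument]" text format, the `multiply` tool, and the scripted model are all assumptions for illustration. A real setup would call an LLM where `scripted_model` is used, and modern APIs typically expose this pattern through function calling rather than text parsing.

```python
import re

ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")


def react_loop(model, tools, question, max_steps=5):
    """Alternate model steps with tool calls until a final answer appears."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += "\n" + step
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip(), transcript
        match = ACTION_RE.search(step)
        if match:
            name, arg = match.groups()
            # Run the requested tool and feed the result back as an observation.
            observation = tools[name](arg) if name in tools else f"unknown tool {name}"
            transcript += f"\nObservation: {observation}"
    return None, transcript


def multiply(arg):
    """Toy tool: multiply two comma-separated integers."""
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)


def scripted_model(transcript):
    """Hypothetical stub: a real LLM would generate these steps itself."""
    if "Observation:" not in transcript:
        return "Thought: I should use the calculator.\nAction: multiply[12,34]"
    last_observation = transcript.rsplit("Observation: ", 1)[1]
    return f"Final Answer: {last_observation}"


answer, transcript = react_loop(scripted_model, {"multiply": multiply}, "What is 12 x 34?")
print(transcript)
```

The `max_steps` cap matters in practice, since a model that never emits a final answer would otherwise loop forever.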

🍂 November 2022: The ChatGPT Revolution

The release of ChatGPT brought conversational prompting into the mainstream. It demonstrated the potential of LLMs to engage in human-like dialogue, understand context over multiple turns of conversation, and adapt to various user needs and communication styles. This was when prompt engineering became a viable career path, and I started earning enough from training and freelancing to go full time as one.

With ChatGPT, system prompts gained importance as a way to set the context and behavior of the AI assistant. This development highlighted the significance of carefully crafted instructions in guiding the model's responses. It also sparked widespread public interest in AI capabilities, leading to increased focus on ethical considerations and potential applications of conversational AI across various industries.

🌷 March 2023: GPT-4 and Improved Reasoning

With the release of GPT-4, LLMs became significantly better at reasoning and following instructions. This advancement made prompt engineering easier and led to more reliable response formats, expanding the potential applications of AI in complex problem-solving scenarios. A lot of the hacky tricks we had to use to get GPT-3 to behave were no longer necessary, and we became less reliant on finding that one magic word to put in a prompt.

GPT-4's improved capabilities allowed for more nuanced and context-aware interactions. It demonstrated enhanced ability to understand and execute multi-step instructions, leading to more sophisticated applications in areas such as content creation, analysis, and decision support. This development also highlighted the importance of clear and well-structured prompts to fully leverage the model's capabilities.

🦋 May 2023: Tree of Thoughts

The "Tree of Thoughts" approach introduced a method for simulating multiple experts thinking through steps in reasoning and correcting each other until an answer is found. This technique represented a more advanced form of collaborative reasoning within a single LLM. This isn’t in common use yet, but the idea of multiple agents working together to solve a task is likely to see a lot of play over the next few years.

By exploring multiple reasoning paths simultaneously and allowing for self-correction, the Tree of Thoughts method improves performance on particularly challenging and open-ended tasks. It demonstrated the potential for LLMs to engage in more human-like problem-solving processes, considering alternative viewpoints and revising initial assumptions when necessary.

🌞 June 2023: LLM-as-a-Judge

Zheng et al. demonstrated in their work "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" that using an LLM to judge the performance of another LLM can yield results that reasonably agree with human ratings. This finding opened up new possibilities for automated evaluation and improvement of AI models. This has been one of the biggest impacts on the speed at which I can get results, because now I no longer have to wait for a human expert to judge every experiment I run.

The LLM-as-a-Judge approach provided a scalable method for assessing the quality of AI-generated content and responses. While not a perfect substitute for human evaluation, it offered a valuable tool for rapid iteration and improvement in AI development. This technique also raised interesting questions about AI self-assessment and the potential for creating more self-aware and self-improving AI systems.
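As an illustration of the workflow, the sketch below builds a grading prompt, parses the judge's reply, and checks how often the judge lands close to a human score. The prompt wording, the "Rating: <number>" reply format, the one-point tolerance, and all of the sample replies and scores are assumptions made up for this example, not the setup or data from the paper.

```python
import re


def build_judge_prompt(question, answer):
    """Prompt asking a judge model for a 1-10 rating in a fixed, parseable format."""
    return (
        "You are grading an answer. Rate it from 1 (poor) to 10 (excellent).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply in the form 'Rating: <number>'."
    )


def parse_rating(reply):
    """Extract the integer rating, or None if the judge ignored the format."""
    match = re.search(r"Rating:\s*(\d+)", reply)
    return int(match.group(1)) if match else None


def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Share of items where judge and human scores differ by at most `tolerance`."""
    pairs = list(zip(judge_scores, human_scores))
    close = sum(1 for j, h in pairs if j is not None and abs(j - h) <= tolerance)
    return close / len(pairs)


# Made-up judge replies and human scores, purely for illustration.
replies = ["Rating: 8", "Rating: 3", "I think it is fine.", "Rating: 10"]
judge_scores = [parse_rating(r) for r in replies]
human_scores = [7, 3, 6, 6]
rate = agreement_rate(judge_scores, human_scores)
print(judge_scores, rate)
```

Checking agreement against a small human-labeled sample like this is a sensible first step before trusting a judge prompt on experiments nobody reviews by hand.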

🍉 July 2023: Emotional Stimuli in Prompting

Research showed that incorporating emotional stimuli or even making threats in prompts could improve LLM performance, particularly in avoiding "lazy" responses. This finding highlighted the complex relationship between language, emotion, and AI behavior.

Lazy responses were a real problem in the winter of 2023, and emotion prompting was the most effective method for dealing with them. With subsequent releases OpenAI has largely fixed the problem, and this technique is less needed now with GPT-4o or the other state-of-the-art models.

🌻 August 2023: Role-Play Prompting

Asking an LLM to adopt a specific persona was a common technique as far back as I can remember, but nobody was sure if it was actually helping or just superstition. The effectiveness of role-play prompting gained empirical support with the “Better Zero-Shot Reasoning with Role-Play Prompting” paper, which showed that asking LLMs to assume specific roles can indeed improve performance in certain areas.

Role-play prompting demonstrated particular effectiveness in tasks requiring specialized knowledge or unique perspectives. By instructing the model to "act as" a specific type of expert or character, users could elicit more focused and relevant responses. This technique also showed potential in creative writing and storytelling applications, allowing for more diverse and character-driven narratives.

🍁 September 2023: Automated Prompt Engineering

Khattab et al. introduced DSPy, a framework for automating prompt engineering by using LLMs to rewrite prompt instructions and add relevant examples. This development marked a significant step towards reducing the manual effort required in crafting effective prompts. It built on the work of previous papers like “Large Language Models Are Human-Level Prompt Engineers” (Zhou et al., 2022), with the idea that we can use AI to automate our prompt engineering work.

Automated prompt engineering promised to make AI more accessible to non-experts by handling the complexities of prompt crafting behind the scenes. It also opened up possibilities for dynamic prompt optimization, where prompts could be automatically refined based on task performance and evaluation scores. This approach has the potential to significantly accelerate the development and deployment of AI applications across various domains, and is my number one focus area at present.
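The core loop behind this kind of optimization is simple to sketch: score each candidate prompt on a small labeled set and keep the winner. The code below is not DSPy's API. It is a bare-bones illustration with hand-written candidates and a hypothetical stub model, where a real system would have an LLM propose the candidates and produce the outputs.

```python
def score_prompt(model, instruction, dev_set):
    """Fraction of dev examples where the model output matches the expected label."""
    hits = sum(
        1 for text, expected in dev_set
        if model(f"{instruction}\n{text}") == expected
    )
    return hits / len(dev_set)


def pick_best_prompt(model, candidates, dev_set):
    """Evaluate each candidate instruction and return (best_instruction, best_score)."""
    scored = [(score_prompt(model, c, dev_set), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_model(prompt):
    """Hypothetical stub: only answers tersely when told to use one word."""
    label = "positive" if "love" in prompt else "negative"
    return label if "one word" in prompt else f"The sentiment is {label}."


dev_set = [("I love it", "positive"), ("I hate it", "negative")]
candidates = [
    "Classify the sentiment.",
    "Classify the sentiment in one word: positive or negative.",
]
best, best_score = pick_best_prompt(toy_model, candidates, dev_set)
print(best, best_score)
```

Exact-match scoring is the simplest possible metric. For open-ended tasks, the evaluation score would more likely come from an LLM judge or a task-specific check.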

🌰 November 2023: Mega Prompts and Multimodal Prompting

The release of GPT-4V with a 128K token context window (about 70,000 words) led to the rise of "Mega prompts" – detailed instructions and relevant context spanning two or more pages. This expanded context window allowed for more comprehensive and nuanced interactions with the AI model. Former Head of Google Brain Andrew Ng revealed that many of the AI applications he sees use mega prompts with pages full of instructions, and this is something I’ve observed myself.

Additionally, the ability to process images opened up new possibilities in multimodal prompting. This development enabled AI to understand and respond to prompts that combined text and visual information, significantly expanding the range of tasks that could be addressed. From image analysis to visual question answering, multimodal prompting represented a major leap forward in AI capabilities.

💐 May 2024: Feature Isolation and Steering

Anthropic's research on "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" demonstrated the ability to steer an LLM based on isolated features that can be adjusted, offering new levels of control. The “Golden Gate Claude” model they released (a model that couldn’t help but talk about the Golden Gate Bridge) was funny, but hinted at stronger capabilities.

In principle, instead of dialing up the “Golden Gate Bridge” features, you could find the features responsible for being good at coding or strong at reasoning, and dial those up instead. Alternatively, you could dial down the features responsible for undesirable responses, like racist remarks or bugs in code. This capability is not publicly available yet, but it promises to enhance the reliability and specificity of AI systems across various applications.

🍹 June 2024: The Rise of Claude 3.5 Sonnet

The release of Claude 3.5 Sonnet marked a significant milestone as the first alternative model widely accepted as superior to GPT-4. This development has led to increased interest in cross-model frameworks for prompt engineering. I use this model for all of my coding and creative writing tasks, and have started shifting client workloads as well.

Claude 3.5 Sonnet's success highlighted the importance of diversity in the AI ecosystem. It spurred efforts to develop prompt engineering techniques that could work effectively across different models, promoting interoperability and reducing dependency on any single AI provider. This trend towards model-agnostic prompt engineering promises to make AI applications more robust and adaptable, and competition among vendors can only benefit consumers of AI models with lower costs and faster feature development.

😎 July 2024: Open-Source Breakthrough with Llama 3.1

The introduction of Llama 3.1, an open-source model performing at GPT-4 level, has sparked renewed interest in moving prompts away from proprietary models. This shift is driven by a desire for independence and cost savings in AI applications: if you can avoid paying fees to OpenAI and sending them your data, you will. Even if OpenAI or Anthropic pulls further ahead with GPT-5 or Claude 4, the current Llama model is good enough to do 80% of the tasks I have used AI for, and it will only get better as the open source community fine-tunes it and adds additional capabilities.

Llama 3.1's release democratized access to high-performance AI models, allowing smaller organizations and individual developers to build sophisticated AI applications without relying on costly API services. This development accelerated innovation in prompt engineering, as a wider community of researchers and developers began experimenting with and refining prompting techniques on powerful, freely available models. While the largest LLMs need less instruction to do a good job, a shift towards smaller, on-device, open-source models will necessitate stronger prompt engineering skills.

🔮 What will the trends be in 2025?

As we look back on the evolution of prompt engineering, it's clear that the field has undergone rapid and significant changes. From the early days of few-shot learning to the current era of multimodal, emotionally-aware, and highly controllable LLMs, prompt engineers have continuously adapted their techniques to harness the growing capabilities of AI models.

The latest developments all point towards a future where prompt engineering becomes more automated, model-agnostic, and finely tuned over the next year. With the rise of open-source alternatives and the increasing sophistication of LLMs, we can expect prompt engineering to play an even more crucial role in unlocking the full potential of AI across various domains. Rather than prompting back and forth with ChatGPT, expect to be writing instructions and personas for multiple AI agents, who collaborate on your task.

As we move forward, the challenge for prompt engineers will be to stay ahead of these rapid developments, continually refining their skills and exploring new frontiers in AI interaction. Technical abilities will continue to be essential, as these AI systems get more complex with the introduction of agents. With LLMs starting to grade their own homework, the focus of prompt engineering will shift towards monitoring performance, and finding and fixing edge cases that the AI can’t handle yet.

End of Mike Taylor’s guest post.

In the news cycle related to OpenAI, we have two major developments.

What is OpenAI Operator?

The rest of the article is for premium readers. It analyzes what this tool could mean, and whether diminishing returns in frontier models are even compatible with the powerful, useful AI agents expected in 2025. We also explore OpenAI’s blueprint for U.S. AI infrastructure at some length.

AI Supremacy