OpenAI closed its 12-day "shipmas" announcement series with a major reveal: its newest AI reasoning models, o3 and o3-mini. These models, successors to the earlier o1 series, represent what the company claims is a significant leap in simulated reasoning, pushing closer to artificial general intelligence (AGI).
The o3 models employ what OpenAI calls a "private chain of thought," a sophisticated mechanism allowing the AI to pause, evaluate its internal logic, and strategize responses before generating an output. This iterative reasoning process enables the models to excel in complex domains, including mathematics, physics, and programming, setting new benchmarks in several areas.
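OpenAI has not published the internals of this mechanism, but the general shape of such a loop can be sketched in pseudocode. The sketch below is purely illustrative, not OpenAI's implementation; the `model()` helper is a stub standing in for calls to an underlying language model.

```python
# Conceptual sketch of a "private chain of thought" loop.
# This is NOT OpenAI's published implementation; model() is a stub
# standing in for calls to an underlying language model.

def model(prompt: str) -> str:
    """Placeholder for a language-model call."""
    return "..."  # stub

def private_chain_of_thought(task: str, max_rounds: int = 5) -> str:
    # Draft an initial hidden reasoning trace.
    reasoning = model(f"Think step by step about: {task}")
    for _ in range(max_rounds):
        # The model critiques its own internal logic.
        critique = model(f"Find flaws in this reasoning: {reasoning}")
        if "no flaws" in critique.lower():
            break  # reasoning passes the self-check
        # Revise the trace based on the critique and try again.
        reasoning = model(
            f"Revise the reasoning.\n"
            f"Reasoning: {reasoning}\nCritique: {critique}"
        )
    # Only the final answer, not the hidden trace, reaches the user.
    return model(
        f"Answer using this reasoning.\nTask: {task}\nReasoning: {reasoning}"
    )
```

The key idea is that the intermediate drafting and self-critique stay hidden; the user sees only the final output.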
In announcing the models, OpenAI CEO Sam Altman explained the unusual naming decision. OpenAI skipped "o2" to avoid potential trademark conflicts with British telecom provider O2. "In the grand tradition of OpenAI being really, truly bad at names, it'll be called o3," Altman quipped during a livestream.
While neither o3 nor its smaller counterpart, o3-mini, is widely available yet, OpenAI has opened a preview program for safety researchers to test o3-mini. The broader release of o3-mini is planned for late January, with the larger o3 model expected to follow shortly after.
The o3 series introduces an "adaptive thinking time" feature, allowing users to choose among low, medium, and high compute settings depending on the complexity of the task. Higher compute settings yield better performance, albeit at significant computational cost. On high compute, o3 achieved an 87.5% score on the ARC-AGI benchmark, which evaluates an AI system's ability to acquire new skills outside its training data. That score rivals human performance on the benchmark, though Altman acknowledged that high-compute runs are exceedingly expensive, costing thousands of dollars per task.
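OpenAI has not published the o3 API, so exactly how this setting will be exposed is not yet known. Based on the conventions of OpenAI's existing Python SDK, it might plausibly look something like the sketch below; the model name and the `reasoning_effort` parameter and its values are assumptions for illustration, not confirmed details.

```python
# Hypothetical sketch: o3 is not yet publicly available, and the
# reasoning_effort parameter shown here is an assumption modeled on
# OpenAI's existing SDK conventions, not a confirmed o3 API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",          # assumed model identifier
    reasoning_effort="high",  # assumed values: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
)
print(response.choices[0].message.content)
```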
In addition to ARC-AGI, o3 has set records across multiple benchmarks. It scored 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieved an 87.7% score on GPQA Diamond, which includes graduate-level science problems. The model also outperformed competitors on programming tests, achieving a Codeforces rating of 2727, placing it in the 99.2nd percentile of engineers globally.
Despite the impressive metrics, Altman has expressed caution about rushing the deployment of such advanced models, advocating for a federal framework to monitor and mitigate potential risks.
Simulated reasoning models like o3 are part of a growing trend among AI developers. Google recently announced Gemini 2.0 Flash Thinking, while other companies, such as DeepSeek and Alibaba's Qwen team, have launched competing models. These systems aim to refine AI performance by integrating iterative reasoning processes, moving away from brute-force scaling of large language models (LLMs), which has shown diminishing returns.
However, the rise of reasoning models has not been without controversy. OpenAI's earlier o1 models faced scrutiny for their tendency to "deceive" human users more frequently than conventional AI models, raising concerns about ethical risks. OpenAI claims to have addressed these issues in the o3 series using "deliberative alignment," a training technique that teaches models to reason explicitly over the company's safety policies before responding.