1:21 PM PDT · April 2, 2025
When OpenAI unveiled its o3 “reasoning” AI model in December, the company partnered with the creators of ARC-AGI, a benchmark designed to test highly capable AI, to showcase o3’s capabilities. Months later, the results have been revised, and they now look slightly less impressive than they did initially.
Last week, the Arc Prize Foundation, which maintains and administers ARC-AGI, updated its approximate computing costs for o3. The organization originally estimated that the best-performing configuration of o3 it tested, o3 high, cost around $3,000 to solve a single ARC-AGI problem. Now, the Arc Prize Foundation thinks that the cost is much higher — possibly around $30,000 per task.
The revision is notable because it illustrates just how expensive today’s most sophisticated AI models may end up being for certain tasks, at least early on. OpenAI has yet to price o3 — or release it, even. But the Arc Prize Foundation believes OpenAI’s o1-pro model pricing is a reasonable proxy.
For context, o1-pro is OpenAI’s most expensive model to date.
“We believe o1-pro is a closer comparison of true o3 cost […] due to amount of test-time compute used,” Mike Knoop, a co-founder of the Arc Prize Foundation, told TechCrunch. “But this is still a proxy, and we’ve kept o3 labeled as preview on our leaderboard to reflect the uncertainty until official pricing is announced.”
A high price for o3 high wouldn’t be out of the question, given the amount of computing resources the model reportedly uses. According to the Arc Prize Foundation, o3 high used 172 times more computing power than o3 low, the lowest-compute configuration of o3, to tackle ARC-AGI.
Moreover, rumors have been flying for quite some time about pricey plans OpenAI is considering introducing for enterprise customers. In early March, The Information reported that the company may be planning to charge up to $20,000 per month for specialized AI “agents,” like a software developer agent.
Some might argue that even OpenAI’s priciest models will cost well under what a typical human contractor or staffer would command. But as AI researcher Toby Ord pointed out in a post on X, the models may not be as efficient as a person. For example, o3 high needed 1,024 attempts at each task in ARC-AGI to achieve its best score.
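To put the figures above in proportion, here is a rough back-of-the-envelope sketch in Python. It uses only numbers reported in this article: the original and revised per-task estimates and the 1,024 attempts per task. The function names are illustrative, and the per-attempt figure assumes the revised estimate covers all 1,024 attempts and splits evenly across them, which neither OpenAI nor the Arc Prize Foundation has confirmed.

```python
# Figures as reported in the article. These are estimates, not official pricing.
ORIGINAL_COST_PER_TASK_USD = 3_000   # Arc Prize Foundation's initial estimate for o3 high
REVISED_COST_PER_TASK_USD = 30_000   # revised estimate, using o1-pro pricing as a proxy
ATTEMPTS_PER_TASK = 1_024            # attempts o3 high reportedly needed per task


def revision_factor(original_usd: float, revised_usd: float) -> float:
    """Return how many times larger the revised estimate is than the original."""
    return revised_usd / original_usd


def cost_per_attempt(cost_per_task_usd: float, attempts: int) -> float:
    """Return the cost of one attempt.

    Assumption: the per-task cost is spread evenly over every attempt.
    """
    return cost_per_task_usd / attempts


if __name__ == "__main__":
    factor = revision_factor(ORIGINAL_COST_PER_TASK_USD, REVISED_COST_PER_TASK_USD)
    per_attempt = cost_per_attempt(REVISED_COST_PER_TASK_USD, ATTEMPTS_PER_TASK)
    print(f"Revised estimate is {factor:.0f}x the original")
    print(f"Implied cost per attempt: ${per_attempt:.2f}")
```

Under those assumptions, the revision is a tenfold increase, and each individual attempt would work out to roughly $29.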
Kyle Wiggers is TechCrunch’s AI Editor. His writing has appeared in VentureBeat and Digital Trends, as well as a range of gadget blogs including Android Police, Android Authority, Droid-Life, and XDA-Developers. He lives in Manhattan with his partner, a music therapist.