OpenAI first introduced its o3 reasoning model in December, promoting its strong mathematical reasoning capabilities, particularly on benchmark datasets such as FrontierMath. Discrepancies between OpenAI’s internal testing and recent third-party results, however, have sparked scrutiny over the transparency and consistency of the company’s performance claims.

At the time of its introduction, OpenAI said o3 could solve more than 25% of the problems on FrontierMath, a dataset designed to test complex mathematical reasoning. This was significantly higher than other models at the time, which reportedly managed only around 2%. Mark Chen, Chief Research Officer at OpenAI, publicly stated during the model’s launch that, “Today, all offerings out there have less than 2% [on FrontierMath]. We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

Independent evaluations have since cast doubt on those figures. Epoch AI, the research organization that maintains the FrontierMath dataset, conducted its own testing of the o3 model and released the results on April 18, 2025. Its findings showed o3 scoring around 10%, well below the figure OpenAI had cited.

According to Epoch AI, several factors could account for the gap. OpenAI may have used more powerful internal hardware or allowed longer evaluation times during its testing. There is also the question of dataset versions: OpenAI might have evaluated its model on an earlier subset of FrontierMath that differed from the version used in the independent evaluation. In its report, Epoch explained, “The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold,” adding that the runs may also have been conducted on a different subset of FrontierMath.

Further clarification came from the ARC Prize Foundation, an organization involved in testing earlier versions of o3. In a post on X (formerly Twitter), the foundation confirmed that the public release of o3 is not identical to the model used in earlier testing phases. “(O3 public) is a different model adapted for chat/product use,” the ARC Prize Foundation stated, noting that the currently available compute tiers are “smaller than the version we [previously tested].” Wenda Zhou, a member of OpenAI’s technical staff, addressed the discrepancy during a recent livestream, explaining that the production version of o3 is optimized for speed and practicality, which may result in lower benchmark scores than earlier test configurations. “We’ve done [optimizations] to make the model more cost efficient and more useful in general,” he said.

This comes at a time when similar concerns have emerged with Elon Musk’s Grok 3 and Meta’s AI models, where performance charts were later found to represent models different from those actually released to the public. The issue is further compounded as hallucination rates—instances where AI models provide incorrect or fabricated answers—are also under scrutiny. OpenAI’s internal evaluations show that o3 has a 33% hallucination rate, higher than its predecessor o1, which has a rate of 16%. While the o-series models are designed to use more computational reasoning to arrive at better answers, the trade-off appears to include increased instances of misinformation, raising further questions about how “reasoning” is defined and implemented in these systems.
