Tech companies have touted scientific findings from AI systems. But can these systems truly produce bona fide advances?
On HMMT Feb 25, a rigorous reasoning benchmark, Qwen3-Max-Thinking scored 98.0, edging out Gemini 3 Pro (97.5) and ...
As autonomous AI matures, the challenge is no longer collecting data but proving systems can handle rare, high-risk scenarios.