
ARC Prize Releases ARC-AGI-3 to Test Generalization in AI Agents


21-Jul-2025

ARC Prize has released a preview of ARC-AGI-3, a new interactive reasoning benchmark that tests AI agents’ ability to generalize in unseen environments. Early results show frontier AI still fails to match, let alone beat, human performance.

The benchmark features three original games built to evaluate world-model building and long-horizon planning with minimal feedback. Agents receive no instructions and must learn purely through trial and error, mimicking how humans adapt to new challenges.
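To make that setup concrete, here is a minimal sketch of the kind of trial-and-error loop such an agent faces. The environment, its action space, and its reward rule below are all hypothetical stand-ins, not the real ARC-AGI-3 games or API: the point is only that the agent gets no rules, just observations and a sparse score, and must remember which actions happened to pay off.

```python
import random

class UnknownGame:
    """Toy stand-in for an unseen game: rules hidden, feedback sparse.
    (Hypothetical; the real ARC-AGI-3 environments are not assumed here.)"""
    def __init__(self):
        self.state = 0
        self._goal = 7  # hidden from the agent

    def step(self, action):
        self.state = (self.state + action) % 10
        # The agent sees only the new observation and a sparse reward.
        return self.state, 1 if self.state == self._goal else 0

def trial_and_error(episodes=500, actions=(1, 2, 3)):
    env = UnknownGame()
    tally = {}  # (state, action) -> times this pair ever led to reward
    state, wins = env.state, 0
    for _ in range(episodes):
        # With no instructions, explore at random; once some
        # (state, action) pair has paid off, mostly exploit it.
        best = max(actions, key=lambda a: tally.get((state, a), 0))
        if tally.get((state, best), 0) and random.random() > 0.2:
            action = best
        else:
            action = random.choice(actions)
        next_state, reward = env.step(action)
        tally[(state, action)] = tally.get((state, action), 0) + reward
        wins += reward
        state = next_state
    return wins

if __name__ == "__main__":
    print("rewards found:", trial_and_error())
```

Even this crude tally-based explorer illustrates the benchmark's premise: everything the agent knows about the environment must be built up from its own interactions.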

Early results show frontier models such as OpenAI’s o3 and Grok 4 struggle to complete even basic levels of the games, levels that humans clear with ease. ARC Prize is also launching a public contest, inviting the community to build agents that can beat the most levels and put the state of AGI reasoning to a genuine test.

According to ARC Prize’s announcement on X, this novelty-focused interactive benchmark pushes research beyond specialized testing toward true general intelligence: systems that adapt accurately to novel environments, just as humans do.
