AI is failing ‘Humanity’s Last Exam’. So what does that mean for machine intelligence?

  • Written by Kai Riemer, Professor of Information Technology and Organisation, University of Sydney

How do you translate ancient Palmyrene script from a Roman tombstone? How many paired tendons are supported by a specific sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?

These are some of the questions in “Humanity’s Last Exam”, a new benchmark introduced in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today’s artificial intelligence (AI) systems cannot do.

The benchmark is the product of a global collaboration of nearly 1,000 experts across a range of academic fields, who contributed questions at the frontier of human knowledge. The problems required graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion: if an AI could answer it correctly at the time the test was designed, the question was rejected.
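To make that inclusion rule concrete, here is a minimal sketch of the filtering step in Python. The model callables, candidate questions and exact-match check are placeholder assumptions for illustration only; the study’s actual grading pipeline was more involved.

```python
# Illustrative sketch of the adversarial filtering described above:
# a candidate question enters the benchmark only if every frontier
# model tested fails it. The models, questions and the exact-match
# check below are stand-ins, not the study's actual pipeline.

def all_models_fail(question: str, correct_answer: str, models) -> bool:
    """True if no tested model produces the correct answer."""
    return all(model(question).strip() != correct_answer for model in models)

# Stand-in "model": a callable mapping a question to an answer string.
def weak_model(question: str) -> str:
    return "I don't know"

candidates = [
    ("What is 2 + 2?", "4"),
    ("Identify the closed syllables in this Tiberian Hebrew verse.", "..."),
]

benchmark = [
    (q, a) for q, a in candidates if all_models_fail(q, a, [weak_model])
]
print(f"{len(benchmark)} of {len(candidates)} candidates kept")
```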

This process explains why the initial results looked so different from other benchmarks. While AI chatbots score above 90% on popular tests, when Humanity’s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI’s most powerful model, o1, achieved only 8%.

The low scores were the point. The benchmark was constructed to measure what remained beyond AI’s grasp. And while some commentators have suggested that benchmarks like Humanity’s Last Exam chart a path toward artificial general intelligence, or even superintelligence – that is, AI systems capable of performing any task at human or superhuman levels – we believe this is wrong for three reasons.

Benchmarks measure task performance, not intelligence

When a student scores well on the bar exam, we can reasonably predict they’ll make a competent lawyer. That’s because the test was designed to assess whether humans have acquired the knowledge and reasoning skills needed for legal practice – and for humans, that works. The understanding required to pass genuinely transfers to the job.

But AI systems are not humans preparing for careers.

When a large language model scores well on the bar exam, it tells us the model can produce correct-looking answers to legal questions. It doesn’t tell us the model understands law, can counsel a nervous client, or can exercise professional judgment in ambiguous situations.

The test measures something real for humans; for AI it measures only performance on the test itself.

Using human ability tests to benchmark AI is common practice, but it’s fundamentally misleading. Assuming a high test score means the machine has become more human-like is a category error, much like concluding that a calculator “understands” mathematics because it can solve equations faster than any person.

Human and machine intelligence are fundamentally different

Humans learn continuously from experience. We have intentions, needs and goals. We live lives, inhabit bodies and experience the world directly. Our intelligence evolved to serve our survival as organisms and our success as social creatures.

But AI systems are very different.

Large language models derive their capabilities from patterns in text during training. But they don’t really learn: once trained, a model stops updating, and it doesn’t accumulate experience the way a person does.

For humans, intelligence comes first and language serves as a tool for communication – intelligence is prelinguistic. But for large language models, language is the intelligence – there’s nothing underneath.

Even the creators of Humanity’s Last Exam acknowledge this limitation:

High accuracy on [Humanity’s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.

Subbarao Kambhampati, professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more clearly:

Humanity’s essence isn’t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions.

Developers like leaderboards

There’s another problem. AI developers use benchmarks to optimise their models for leaderboard performance. They’re essentially cramming for the exam. And unlike humans, for whom studying for a test builds understanding, optimising an AI just makes it better at that specific test.

But it’s working.

Since Humanity’s Last Exam was published online in early 2025, scores have climbed dramatically. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.

Does this improvement mean these models are approaching human intelligence? No. It means they’ve improved at the kinds of questions the exam contains. The benchmark has become a target to optimise against.

The industry is recognising this problem.

OpenAI recently introduced a measure called GDPval, specifically designed to assess real-world usefulness. Unlike academic-style benchmarks, GDPval focuses on tasks based on actual work products, such as project documents, data analyses and deliverables that exist in professional settings.

What this means for you

If you’re using AI tools in your work or considering adopting them, don’t be swayed by benchmark scores. A model that aces Humanity’s Last Exam might still struggle with the specific tasks you need done.

It’s also worth noting the exam’s questions are heavily skewed toward certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest. If your work involves writing, communication, project management or customer service, the exam tells you almost nothing about which model might serve you best.

A practical approach is to devise your own tests based on what you actually need AI to do, then evaluate newer models against criteria that matter to you. AI systems are genuinely useful – but any discussion about superintelligence remains science fiction and a distraction from the real work of making these tools relevant to people’s lives.
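As a concrete starting point, here is a minimal sketch of such a personal benchmark in Python. The tasks, keyword checks and ask_model stub are placeholder assumptions; you would substitute prompts drawn from your own work and a real API call for whichever models you’re comparing.

```python
# A minimal personal-benchmark harness: score models on YOUR tasks,
# not on someone else's exam. Everything here is illustrative --
# replace the tasks, the keyword checks and ask_model() with your own.

from typing import Callable

# Each task: a prompt drawn from your real work, plus keywords a
# good answer should contain. Crude, but it is *your* criterion.
TASKS = [
    {
        "prompt": "Summarise this complaint email in two sentences: ...",
        "must_mention": ["refund", "delivery"],
    },
    {
        "prompt": "Draft a polite reply declining the meeting on Friday.",
        "must_mention": ["friday", "decline"],
    },
]

def ask_model(model_name: str, prompt: str) -> str:
    """Placeholder: swap in a real API call for the models you
    are comparing. This stub just returns canned text."""
    return "stub answer mentioning refund and delivery"

def score(answer: str, must_mention: list[str]) -> float:
    """Fraction of required keywords present (case-insensitive)."""
    answer = answer.lower()
    return sum(1 for kw in must_mention if kw in answer) / len(must_mention)

def evaluate(model_name: str, ask: Callable[[str, str], str]) -> float:
    """Average keyword score across all tasks for one model."""
    results = [score(ask(model_name, t["prompt"]), t["must_mention"]) for t in TASKS]
    return sum(results) / len(results)

if __name__ == "__main__":
    for model in ["model-a", "model-b"]:  # hypothetical model names
        print(f"{model}: {evaluate(model, ask_model):.0%}")
```

Even a rough harness like this tells you more about fit for your work than any leaderboard position, and it is cheap to rerun each time a new model is released.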

Authors: Kai Riemer, Professor of Information Technology and Organisation, University of Sydney

Read more https://theconversation.com/ai-is-failing-humanitys-last-exam-so-what-does-that-mean-for-machine-intelligence-274620
