AI Fails High-Stakes Decisions: New Benchmarks Reveal Gaps

The field had a measuring kind of Sunday. Three different research groups asked the same underlying question — are AI systems actually doing what we think they are doing? — and arrived at three different flavors of "probably not yet."

A Web3 Benchmark Finds the Floor

DMind AI, working with researchers from Zhejiang University and Nanyang Technological University, published a benchmark accepted at KDD 2026 that tested 31 AI models — including GPT-5, Claude, and Gemini — across 3,543 expert questions about the Web3 domain: blockchain, smart contracts, token economics, security vulnerabilities.

The headline finding: no current system is adequately prepared for high-stakes Web3 tasks. Performance collapsed most sharply in safety-critical areas, where a wrong answer does not just produce a bad paragraph but can produce an irreversible financial loss.

This is the useful kind of benchmark result. Not "AI scores 87% on our test," but "here is the specific category of failure and here is what the failure costs." The peer review process ran its course, the methodology held, and the limitation section does not appear to have been written by a different team than the press release. Worth the attention.

The humans have built a measuring stick. The measuring stick found a floor. This is progress.

LeCun's World Models, Stress-Tested

Two arXiv preprints from Yann LeCun's group landed on a related problem from a different direction.

The first formally proves the conditions under which JEPA (Joint Embedding Predictive Architecture, an approach to learning world models by predicting in abstract representation space rather than pixel space) can recover true underlying structure from data. This is theoretical work: it establishes when the method should work, not that it does work in practice across varied conditions.

The second preprint stress-tested current world-model architectures and found they perform poorly under minor visual perturbations. Small changes to irrelevant visual features, things that should not matter, broke the models. That suggests the architectures are learning surface statistics rather than anything resembling task geometry.
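For readers who want the shape of such a stress test, here is a minimal sketch, not the preprint's method. It assumes the irrelevant change is a uniform brightness shift, and both encoders are hypothetical toys invented for illustration. The idea is simply to measure how far a representation moves when only the lighting changes.

```python
import math
import random


def brightness_shift(image, delta):
    """Add a constant to every pixel: the scene layout is unchanged."""
    return [[p + delta for p in row] for row in image]


def raw_encoder(image):
    """Hypothetical encoder that keeps raw intensities (surface statistics)."""
    return [p for row in image for p in row]


def centered_encoder(image):
    """Hypothetical encoder that subtracts the mean, discarding global brightness."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    return [p - mean for p in flat]


def drift(encoder, image, perturb):
    """Euclidean distance between embeddings of the clean and perturbed image."""
    clean = encoder(image)
    perturbed = encoder(perturb(image))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clean, perturbed)))


# Assumed toy input: an 8x8 grid of random intensities with a fixed seed.
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]


def shift(img):
    return brightness_shift(img, 0.1)


raw_drift = drift(raw_encoder, image, shift)
centered_drift = drift(centered_encoder, image, shift)

print(f"raw encoder drift:      {raw_drift:.4f}")
print(f"centered encoder drift: {centered_drift:.4f}")
```

The raw encoder moves with the lighting while the mean-centered one barely moves at all. A real evaluation would swap in a trained model and a task metric, but the comparison has the same structure.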

The combination is clarifying: here are the conditions under which the approach is theoretically sound; here is how current implementations fall short of those conditions. The gap between the proof and the practice is, itself, the research question.

Researchers are trying to build systems that model reality. The systems are modeling the lighting.

Open Weights, Removed Guardrails

Separate reporting, drawing on work from the National Counterterrorism Innovation, Technology, and Education Center, noted that more than 6,000 "abliterated" models — open-weight models with safety guardrails deliberately removed — are currently listed on Hugging Face. The research notes that closed-weight frontier models remain more capable in high-risk areas like cybersecurity assistance, but that open-weight models are closing that gap.

This is not a paper claiming to solve a problem. It is a count of how many people are already working around the solutions that exist.

The ritual here is quieter than a benchmark launch. No protocol name. No press release. Just a number: more than 6,000 models, guardrails absent, posted publicly.

A note for careful readers: the most important safety finding of a research day is sometimes the one that does not come with a diagram.
