AI Models Struggle to Match Human Engineers in New Incident Response Benchmark

A new benchmark using real production incidents shows AI models fail to match human experts in complex site reliability tasks.

May 18, 2026 · 23:31 CEST · 2 min read

By Cowlpane Staff · AI-curated financial analysis for retail investors.

Lead

A new benchmark developed by Datadog and Carnegie Mellon, known as the Anomaly Reasoning Framework Benchmark (ARFBench), reveals that current frontier AI models cannot yet match the performance of human engineers in investigating real-world production incidents. While AI companies promote autonomous site reliability engineer agents, testing against 63 actual production incidents shows that even leading models struggle with the complex, cross-metric reasoning required during live emergencies.

Background

System outages result in trillions of dollars of losses annually, driving interest in autonomous agents capable of performing incident response. To test the viability of these models, researchers created ARFBench using data extracted from engineers' Slack threads during live emergencies. Unlike many AI evaluations that rely on synthetic data or textbook scenarios, this benchmark utilizes 750 hand-verified multiple-choice questions derived from 142 monitoring metrics and 5.38 million data points to simulate real-world observability challenges.

What Happened

The benchmark evaluates AI capabilities across three distinct tiers of difficulty:

Tier I: Identifying if an anomaly exists within a specific chart.
Tier II: Determining the start time, severity, and type of an anomaly.
Tier III: Performing cross-metric reasoning to determine if one chart is causing issues in another.

Results indicated that AI performance degrades significantly at the most difficult level. On Tier III questions, GPT-5 achieved a 47.5% F1 score, a metric designed to penalize models for simply picking the most common class. In terms of overall accuracy, the performance of the tested models was as follows:

Datadog's hybrid model (Toto combined with Qwen3-VL 32B): 63.9%
GPT-5: 62.7%
Gemini 3 Pro: 58.1%
Claude Opus 4.6: 54.8%
Claude Sonnet 4.5: 47.2%

In comparison, human domain experts achieved 72.7% accuracy. Even non-domain experts, such as time series researchers at Datadog without extensive observability experience, reached 69.7% accuracy, ahead of every model listed above. The top-performing model, Toto-1.0-QA-Experimental, outperformed GPT-5 while using a fraction of its parameters and led all other models in anomaly identification by at least 8.8 percentage points in F1 score.
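To see why F1 is reported alongside accuracy, consider a model that always picks the most common answer. The sketch below is illustrative only: it assumes macro-averaged F1 (the article does not say which averaging ARFBench uses), and the labels and the `accuracy` and `macro_f1` helpers are hypothetical, not taken from the benchmark.

```python
def accuracy(y_true, y_pred):
    """Fraction of answers that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed by the source)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up data: 8 of 10 charts have no anomaly, and the "model" always says so.
y_true = ["no_anomaly"] * 8 + ["anomaly"] * 2
y_pred = ["no_anomaly"] * 10

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

On this toy data the always-majority guesser scores high on accuracy but much lower on macro F1, because it never finds the rare class.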

Market & Industry Implications

The research suggests that the relationship between AI and human engineers is currently complementary rather than competitive. The study observed substantially different error profiles between models and humans: AI models tend to hallucinate, miss metadata, and lose domain context, whereas humans are more prone to misreading precise timestamps or failing on complex instructions. Because these mistakes rarely overlap, the researchers modeled a theoretical "Model-Expert Oracle": a perfect judge that picks whichever of the AI's and the human's answers is correct. This hybrid approach yielded 87.2% accuracy and an 82.8% F1 score, well above the 72.7% accuracy of human domain experts and the 63.9% accuracy of the best model.
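The oracle idea can be sketched in a few lines: a question counts as solved if either the model or the expert answered it correctly. This is a minimal illustration under stated assumptions, not the researchers' code. The `oracle_accuracy` helper and the eight answers below are invented for the example.

```python
def plain_accuracy(truth, answers):
    """Fraction of questions a single answerer got right."""
    return sum(a == t for t, a in zip(truth, answers)) / len(truth)


def oracle_accuracy(truth, model, expert):
    """Accuracy of a perfect judge that picks the correct answer
    whenever at least one of the two answerers has it."""
    hits = sum(m == t or e == t for t, m, e in zip(truth, model, expert))
    return hits / len(truth)


# Hypothetical multiple-choice answers; the two answerers err on mostly different questions.
truth = list("ABCDABCD")
model = list("ABCAABDA")
expert = list("ABDDCBCB")

print(f"model:  {plain_accuracy(truth, model):.3f}")
print(f"expert: {plain_accuracy(truth, expert):.3f}")
print(f"oracle: {oracle_accuracy(truth, model, expert):.3f}")
```

The oracle only misses questions that both answerers get wrong, so the less their errors overlap, the higher this ceiling sits above either one alone. It is an upper bound: a real system would still need some way to decide whose answer to trust.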

What to Watch

The ARFBench leaderboard is live on Hugging Face, giving future model development a documented target. The 23.3-point gap between the best current model accuracy (63.9%) and the theoretical human-AI collaboration ceiling (87.2%) gives the industry a quantitative measure of the headroom remaining for autonomous site reliability tools.
