Tracking the latest value is a commonplace operation across many domains:
Legal: Latest case status, document versions
Finance: Current account balances, stock prices
Medical: Recent vital signs, lab results
Physics: Current measurements, sensor readings
Note: In the real tests we use English words for keys and values, which is closer to a generic setting.
ICML 2025 Workshop Accepted
1️⃣ Grouped Updates
🔍 LLM Retrieval Task — Tracking the Most Recent Value for Each Key
Tracking the latest value for each key is a fundamental operation in applications ranging from financial systems to AI agents and long-context models, where reliable performance depends on continually updating and maintaining evolving information.
However, LLMs consistently fail, exhibiting a clear and predictable decline in accuracy as the number of key-value updates increases. Unlike humans, current LLMs cannot reliably read out the last value for each key.
💡 Hit "Mix" to start the interference challenge and see the LLM's performance.
2️⃣ Mixed Updates
LLM Retrieval Accuracy (toy demo)
Human Performance: high (~100%)
Try it yourself and see whether you can read out the last value for each key.
We found that all state-of-the-art LLMs struggle on this straightforward retrieval task, while humans maintain high accuracy.
As the number of tracked keys or updates per key increases, interference from semantically similar items rises sharply, leading to a consistent, log-linear decline in LLM retrieval accuracy.
See actual test accuracy for leading LLMs →
Accuracy vs Update Count
46 tracked keys (fixed), varying the update count per key • Bootstrap 95% CI • Log-scaled x-axis
🖱️Interactive Legend & Data Points
Click model names in the chart legend to toggle individual models on/off
Click any data point to view detailed accuracy breakdown for that update count
Hover over lines to see tooltips with model name and accuracy values
👥 About the Authors
*Authors contributed equally to this work. Listing order is random.
We are an interdisciplinary group interested in probing the boundaries between human and machine intelligence.
Chupei Wang*
Bachelor's degree in Physics, University of Virginia.
With a foundation in physics and philosophy—including a year at the University of Chicago Divinity School—Chupei explores where logic and mind meet their limits, probing how the edges of science and the humanities intersect. He is driven by a curiosity about where cognitive architectures—biological and artificial—break down, and what these failures teach us about intelligence itself. After graduating, he gained startup experience in China. He is currently seeking lab and research opportunities.
📫 cw4bb@virginia.edu
Jiaqiu Vince Sun*
PhD Candidate, NYU Center for Neuroscience
A former professional architect turned neuroscientist, Jiaqiu draws on his background in spatial design, cognitive neuroscience, and philosophy of mind to investigate how memory emerges and diverges in brains and artificial systems. His primary focus lies in the higher-level functions of the brain, such as self-monitoring and control.
📫 vince.sun@nyu.edu
📋 JSON File Format Requirements
📊 Data Structure Overview
Your JSON file must contain LLM accuracy data across 16 different update counts representing increasing context interference levels.
Each value is the number of updates per key that the LLM must track:
• 2-4: Low interference (easy tasks)
• 6-24: Medium interference (moderate difficulty)
• 34-400: High interference (challenging attention tasks)
metadata:
• tracked_keys: Number of key-value pairs tracked (usually 46)
• description: Brief description of your benchmark
• n_tracked_updates: Must be exactly these 16 values
• total_models: Number of models in your dataset
models array:
• name: Unique identifier for each LLM model
• accuracies: Array of exactly 16 accuracy percentages (0-100)
→ Each accuracy corresponds to the same index in n_tracked_updates
→ Index 0: accuracy at 2 updates, Index 15: accuracy at 400 updates
⚠️ Important Notes
Exact length: Each model must have exactly 16 accuracy values
Order matters: Accuracies must match the n_tracked_updates order
Percentage format: Use values 0-100 (not 0-1 decimals)
JSON validity: Ensure proper JSON syntax (use jsonlint.com to validate)
No null values: All 16 accuracy values must be numbers
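Below is a minimal sketch of such a file, assembled and validated in Python. The field names follow the structure above; the shortened lists, the extra accuracy values, and the output file name are placeholders, with only the gpt-4.1 accuracies at 2, 12, and 400 updates taken from the interpretation that follows.

```python
import json

# Sketch of the expected upload file, using the field names listed above.
# The real file needs exactly 16 update counts and 16 accuracies per model;
# the shortened lists below are placeholders, and only the gpt-4.1 values at
# 2, 12, and 400 updates come from the example interpretation below.
data = {
    "metadata": {
        "tracked_keys": 46,
        "description": "PI-LLM accuracy vs. update count",
        "n_tracked_updates": [2, 4, 12, 400],  # placeholder; 16 values in practice
        "total_models": 1,
    },
    "models": [
        {
            "name": "gpt-4.1",
            "accuracies": [100, 95, 82, 25],   # one 0-100 value per update count, same order
        }
    ],
}

# Sanity checks mirroring the "Important Notes" above.
for model in data["models"]:
    assert len(model["accuracies"]) == len(data["metadata"]["n_tracked_updates"])
    assert all(isinstance(a, (int, float)) and 0 <= a <= 100 for a in model["accuracies"])

with open("pi_llm_results.json", "w") as f:    # file name is illustrative
    json.dump(data, f, indent=2)
```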
🎯 Example Interpretation
For gpt-4.1 in the example above:
• At 2 updates: 100% accuracy (perfect performance)
• At 12 updates: 82% accuracy (slight interference)
• At 400 updates: 25% accuracy (high interference, significant degradation)
🧠 PI-LLM Attention Benchmark
🔬 Research Context
Large language models struggle to retrieve co-referenced information from long contexts—a core challenge in modern AI evaluation. Our proactive interference (PI) test isolates this very limitation: it directly reveals that interference between similar pieces of information, not just raw context length, is the main bottleneck underlying failures in long-context coreference resolution.
Our Test: Tracking Last Values
A common challenge in daily tasks and long reasoning scenarios
At the core of MRCR: shows that LLMs do not need long contexts to fail at retrieval
Directly demonstrates interference resistance as a key performance factor
Scientific visualization: log-scale accuracy plots with confidence intervals
📊 How It Works
Organized Phase: Present key-value pairs grouped by key
Interference Phase: Mix pairs randomly to create context noise
Retrieval Task: Test the model's ability to report the final value for each key
Analysis: Measure accuracy degradation as interference increases
Research Impact: This tool helps researchers understand attention span limitations and design better context management strategies for long-context LLM applications.
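For concreteness, here is a minimal Python sketch of the four phases above. The key and value vocabularies and the prompt wording are illustrative placeholders, not the exact PI-LLM prompts.

```python
import random

# Minimal sketch of the paradigm above (illustrative vocabulary and wording,
# not the exact PI-LLM prompts).
KEYS = ["visa_status", "blood_pressure", "stock_price"]   # hypothetical keys
VALUES = ["alpha", "bravo", "charlie", "delta", "echo"]   # hypothetical values

def make_updates(keys, n_updates_per_key, mixed):
    """Organized phase: updates grouped by key. Interference phase: shuffled."""
    updates = [(k, random.choice(VALUES)) for k in keys for _ in range(n_updates_per_key)]
    if mixed:
        random.shuffle(updates)   # interleave updates across keys to create interference
    return updates

def build_prompt(updates):
    """Retrieval task: stream all updates, then ask only for the final values."""
    lines = [f"{key} = {value}" for key, value in updates]
    lines.append("Question: report the CURRENT (most recent) value of every key.")
    return "\n".join(lines)

def ground_truth(updates):
    """Analysis: the correct answer is the last value written for each key."""
    final = {}
    for key, value in updates:
        final[key] = value
    return final

updates = make_updates(KEYS, n_updates_per_key=4, mixed=True)
print(build_prompt(updates))
print(ground_truth(updates))
```

Comparing a model's answer against `ground_truth(updates)` gives the per-trial accuracy that is aggregated in the analysis phase.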
Classic evaluation of large language models (LLMs) often focuses on their ability to retrieve a single piece of information—a "needle"—from a massive "haystack" of context (as in the Needle-in-a-Haystack test). Later benchmarks, DeepMind's MRCR and OpenAI's MRCR, increased the challenge by filling the haystack with many similar needles, testing the model's ability to distinguish between closely related items.
Our test takes this one step further:
We show that the "haystack" itself isn't even necessary to reveal fundamental retrieval failures. By isolating and systematically controlling the number of similar "needles" (semantically similar text) in context, our paradigm directly measures how interference between similar items limits retrieval accuracy in LLMs. We find a precise log-linear decline in retrieval accuracy as the amount of interference grows—a pattern observed in all major transformer-based models.
In summary: Our work reveals a core working memory bottleneck in LLMs that arises from interference, not just context length. This new approach enables researchers to quantify and compare how different models handle proactive interference—a key ingredient in real-world reasoning and memory.
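To make "log-linear decline" concrete: accuracy falls roughly linearly in the logarithm of the number of updates per key. The sketch below fits such a trend on synthetic numbers (not our measured results).

```python
import numpy as np

# Synthetic illustration of a log-linear trend (not measured results):
# accuracy falls roughly linearly in log2(update count).
rng = np.random.default_rng(0)
update_counts = np.array([2, 4, 8, 16, 32, 64, 128, 256])
accuracy = np.clip(100 - 12 * np.log2(update_counts) + rng.normal(0, 2, update_counts.size), 0, 100)

slope, intercept = np.polyfit(np.log2(update_counts), accuracy, deg=1)
print(f"accuracy ≈ {intercept:.1f} {slope:+.1f} * log2(updates)")
```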
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
Cognitive Science Foundation
Our test adopts the classic and widely used proactive interference (PI) paradigm from cognitive science—a foundational method for studying human working memory, where outdated but similar information disrupts the encoding of new content (see review). By bringing this well-established approach to large language models (LLMs), we can directly measure how interference between similar items—not just context length—limits their memory and retrieval capabilities.
In a typical PI experiment, participants are presented with a sequence of repeating cues—such as a list of phone numbers, words, or categories—where each cue is updated several times with new information. As the sequence progresses, earlier (now outdated) associations pile up, making it increasingly challenging to remember only the most recent value for each cue. After the updates, participants are asked to report the latest value for each cue, requiring them to ignore or "unbind" the outdated ones. While some confusion from prior information is expected, humans generally show remarkable resilience: they can flexibly suppress irrelevant details and maintain high accuracy, even as interference grows. This ability reflects a core strength of human working memory, distinguishing it from simpler storage or recall systems.
Core Challenge
Interference-based retrieval failures occur when LLMs struggle to isolate target information from similar competing contexts. This fundamental limitation manifests across tasks from simple key-value tracking to complex multi-round coreference resolution (MRCR).
Our Proactive Interference (PI) Test
We isolate the core mechanism by testing how well models track the last value of multiple keys under systematically increasing interference:
Controlled Variables
Systematic interference increase
Isolated failure modes
Search difficulty removed
Measurement
Direct accuracy quantification
Benchmarks across 35+ LLMs
Statistical significance testing with bootstrap confidence intervals (a minimal sketch follows below)
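A minimal sketch of the bootstrap 95% confidence interval quoted in the plots, assuming per-trial correctness is recorded as 0/1 outcomes (the data here are placeholders):

```python
import numpy as np

# Minimal bootstrap 95% CI over per-trial retrieval outcomes (placeholder data;
# 1 = correct final value retrieved, 0 = an overwritten value was returned).
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=200)

boot_means = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {outcomes.mean():.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```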
Connection to Multi-Round Coreference Resolution
MRCR benchmarks like DeepMind's Michelangelo and OpenAI's MRCR dataset embed highly similar co-references within massive contexts, combining long input length with high interference to create a compound retrieval challenge.
Key Insight: Our test isolates and quantifies interference, revealing it as the primary cause of model failures on MRCR tasks. We show that interference is independent of input context length, challenging the prevailing assumption—shaped by recent long-context benchmarks—that retrieval failures are mainly due to context window limitations.
This suggests that improving interference resistance—rather than just scaling context windows—is crucial for advancing long-context reasoning capabilities.
Research Impact: This work provides actionable insights for improving LLM attention mechanisms and designing better context management strategies for long-context applications.