Tracking the latest value is a commonplace operation across many domains:
Legal: Latest case status, document versions
Finance: Current account balances, stock prices
Medical: Recent vital signs, lab results
Physics: Current measurements, sensor readings
Note: In the real tests we use English words for keys and values, which is closer to a generic setting.
ICML 2025 Workshop Accepted
1️⃣ Grouped Updates
🔍 LLM Retrieval Task — Tracking the Most Recent Value for Each Key
Tracking the latest value for each key is a fundamental operation in applications ranging from financial systems to AI agents and long-context models, where reliable performance depends on continually updating and maintaining evolving information.
However, LLMs consistently fail, exhibiting a clear and predictable decline in accuracy as the number of key-value updates increases. Unlike humans, current LLMs cannot reliably read out the last value for each key.
💡 Hit "Mix" to start the interference challenge and see the LLM's performance.
2️⃣ Mixed Updates
LLM Retrieval Accuracy (toy demo)
Human Performance: high (~100%)
Try it yourself and see whether you can read out the last value for each key.
We found that all state-of-the-art LLMs struggle on this straightforward retrieval task, while humans maintain high accuracy.
As the number of tracked keys or updates per key increases, interference from semantically similar items rises sharply, leading to a consistent, log-linear decline in LLM retrieval accuracy.
See actual test accuracy for leading LLMs →
Accuracy vs Update Count
46 tracked keys (fixed), varying the update count per key • Bootstrap 95% CI • Log-scaled x-axis
🖱️Interactive Legend & Data Points
Click model names in the chart legend to toggle individual models on/off
Click any data point to view detailed accuracy breakdown for that update count
Hover over lines to see tooltips with model name and accuracy values
👥 About the Authors
*Authors contributed equally to this work. Listing order is random.
We are an interdisciplinary group interested in probing the boundaries between human and machine intelligence.
Chupei Wang*
Bachelor's degree in Physics, University of Virginia.
With a foundation in physics and philosophy—including a year at the University of Chicago Divinity School—Chupei explores where logic and mind meet their limits, probing how the edges of science and the humanities intersect. He is driven by a curiosity about where cognitive architectures—biological and artificial—break down, and what these failures teach us about intelligence itself. After graduating, he gained startup experience in China. He is currently seeking lab and research opportunities.
📫 cw4bb@virginia.edu
Jiaqiu Vince Sun*
PhD Candidate, NYU Center for Neuroscience
A former professional architect turned neuroscientist, Jiaqiu draws on his background in spatial design, cognitive neuroscience, and philosophy of mind to investigate how memory emerges and diverges in brains and artificial systems. His primary focus lies in the higher-level functions of the brain, such as self-monitoring and control.
📫 vince.sun@nyu.edu
📋 JSON File Format Requirements
📊 Data Structure Overview
Your JSON file must contain LLM accuracy data across 16 different update counts representing increasing context interference levels.
Each value is the number of updates per key that the LLM must track:
• 2-4: Low interference (easy tasks)
• 6-24: Medium interference (moderate difficulty)
• 34-400: High interference (challenging attention tasks)
metadata:
• tracked_keys: Number of key-value pairs tracked (usually 46)
• description: Brief description of your benchmark
• n_tracked_updates: Must be exactly these 16 values
• total_models: Number of models in your dataset
models array:
• name: Unique identifier for each LLM model
• accuracies: Array of exactly 16 accuracy percentages (0-100)
→ Each accuracy corresponds to the same index in n_tracked_updates
→ Index 0: accuracy at 2 updates, Index 15: accuracy at 400 updates
⚠️ Important Notes
Exact length: Each model must have exactly 16 accuracy values
Order matters: Accuracies must match the n_tracked_updates order
Percentage format: Use values 0-100 (not 0-1 decimals)
JSON validity: Ensure proper JSON syntax (use jsonlint.com to validate)
No null values: All 16 accuracy values must be numbers
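Below is a minimal sketch of such a file, assembled and validated in Python. The field names follow the structure above; the shortened lists, the extra accuracy values, and the output file name are placeholders, with only the gpt-4.1 accuracies at 2, 12, and 400 updates taken from the interpretation that follows.

```python
import json

# Sketch of the expected upload file, using the field names listed above.
# The real file needs exactly 16 update counts and 16 accuracies per model;
# the shortened lists below are placeholders, and only the gpt-4.1 values at
# 2, 12, and 400 updates come from the example interpretation below.
data = {
    "metadata": {
        "tracked_keys": 46,
        "description": "PI-LLM accuracy vs. update count",
        "n_tracked_updates": [2, 4, 12, 400],  # placeholder; 16 values in practice
        "total_models": 1,
    },
    "models": [
        {
            "name": "gpt-4.1",
            "accuracies": [100, 95, 82, 25],   # one 0-100 value per update count, same order
        }
    ],
}

# Sanity checks mirroring the "Important Notes" above.
for model in data["models"]:
    assert len(model["accuracies"]) == len(data["metadata"]["n_tracked_updates"])
    assert all(isinstance(a, (int, float)) and 0 <= a <= 100 for a in model["accuracies"])

with open("pi_llm_results.json", "w") as f:    # file name is illustrative
    json.dump(data, f, indent=2)
```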
🎯 Example Interpretation
For gpt-4.1 in the example above:
• At 2 updates: 100% accuracy (perfect performance)
• At 12 updates: 82% accuracy (slight interference)
• At 400 updates: 25% accuracy (high interference, significant degradation)
🧠 PI-LLM Attention Benchmark
🔬 Research Context
Large language models struggle to retrieve co-referenced information from long contexts—a core challenge in modern AI evaluation. Our proactive interference (PI) test isolates this very limitation: it directly reveals that interference between similar pieces of information, not just raw context length, is the main bottleneck underlying failures in long-context coreference resolution.
Our Test: Tracking Last Values
A common challenge in daily tasks and long reasoning scenarios
At the core of MRCR: shows that LLMs do not need long contexts to fail at retrieval
Directly demonstrates interference resistance as a key performance factor
Scientific visualization: log-scale accuracy plots with confidence intervals
📊 How It Works
Organized Phase: Present key-value pairs grouped by key
Interference Phase: Mix pairs randomly to create context noise
Retrieval Task: Test the model's ability to report the final value for each key
Analysis: Measure accuracy degradation as interference increases
Research Impact: This tool helps researchers understand attention span limitations and design better context management strategies for long-context LLM applications.
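For concreteness, here is a minimal Python sketch of the four phases above. The key and value vocabularies and the prompt wording are illustrative placeholders, not the exact PI-LLM prompts.

```python
import random

# Minimal sketch of the paradigm above (illustrative vocabulary and wording,
# not the exact PI-LLM prompts).
KEYS = ["visa_status", "blood_pressure", "stock_price"]   # hypothetical keys
VALUES = ["alpha", "bravo", "charlie", "delta", "echo"]   # hypothetical values

def make_updates(keys, n_updates_per_key, mixed):
    """Organized phase: updates grouped by key. Interference phase: shuffled."""
    updates = [(k, random.choice(VALUES)) for k in keys for _ in range(n_updates_per_key)]
    if mixed:
        random.shuffle(updates)   # interleave updates across keys to create interference
    return updates

def build_prompt(updates):
    """Retrieval task: stream all updates, then ask only for the final values."""
    lines = [f"{key} = {value}" for key, value in updates]
    lines.append("Question: report the CURRENT (most recent) value of every key.")
    return "\n".join(lines)

def ground_truth(updates):
    """Analysis: the correct answer is the last value written for each key."""
    final = {}
    for key, value in updates:
        final[key] = value
    return final

updates = make_updates(KEYS, n_updates_per_key=4, mixed=True)
print(build_prompt(updates))
print(ground_truth(updates))
```

Comparing a model's answer against `ground_truth(updates)` gives the per-trial accuracy that is aggregated in the analysis phase.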
Classic evaluation of large language models (LLMs) often focuses on their ability to retrieve a single piece of information—a "needle"—from a massive "haystack" of context (as in the Needle-in-a-Haystack test). Later benchmarks, DeepMind's MRCR and OpenAI's MRCR, increased the challenge by filling the haystack with many similar needles, testing the model's ability to distinguish between closely related items.
Our test takes this one step further:
We show that the "haystack" itself isn't even necessary to reveal fundamental retrieval failures. By isolating and systematically controlling the number of similar "needles" (semantically similar text) in context, our paradigm directly measures how interference between similar items limits retrieval accuracy in LLMs. We find a precise log-linear decline in retrieval accuracy as the amount of interference grows—a pattern observed in all major transformer-based models.
In summary: Our work reveals a core working memory bottleneck in LLMs that arises from interference, not just context length. This new approach enables researchers to quantify and compare how different models handle proactive interference—a key ingredient in real-world reasoning and memory.
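To make "log-linear decline" concrete: accuracy falls roughly linearly in the logarithm of the number of updates per key. The sketch below fits such a trend on synthetic numbers (not our measured results).

```python
import numpy as np

# Synthetic illustration of a log-linear trend (not measured results):
# accuracy falls roughly linearly in log2(update count).
rng = np.random.default_rng(0)
update_counts = np.array([2, 4, 8, 16, 32, 64, 128, 256])
accuracy = np.clip(100 - 12 * np.log2(update_counts) + rng.normal(0, 2, update_counts.size), 0, 100)

slope, intercept = np.polyfit(np.log2(update_counts), accuracy, deg=1)
print(f"accuracy ≈ {intercept:.1f} {slope:+.1f} * log2(updates)")
```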
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
Cognitive Science Foundation
Our test adopts the classic and widely used proactive interference (PI) paradigm from cognitive science—a foundational method for studying human working memory, where outdated but similar information disrupts the encoding of new content (see review). By bringing this well-established approach to large language models (LLMs), we can directly measure how interference between similar items—not just context length—limits their memory and retrieval capabilities.
In a typical PI experiment, participants are presented with a sequence of repeating cues—such as a list of phone numbers, words, or categories—where each cue is updated several times with new information. As the sequence progresses, earlier (now outdated) associations pile up, making it increasingly challenging to remember only the most recent value for each cue. After the updates, participants are asked to report the latest value for each cue, requiring them to ignore or "unbind" the outdated ones. While some confusion from prior information is expected, humans generally show remarkable resilience: they can flexibly suppress irrelevant details and maintain high accuracy, even as interference grows. This ability reflects a core strength of human working memory, distinguishing it from simpler storage or recall systems.
Core Challenge
Interference-based retrieval failures occur when LLMs struggle to isolate target information from similar competing contexts. This fundamental limitation manifests across tasks from simple key-value tracking to complex multi-round coreference resolution (MRCR).
Our Proactive Interference (PI) Test
We isolate the core mechanism by testing how well models track the last value of multiple keys under systematically increasing interference:
Controlled Variables
Systematic interference increase
Isolated failure modes
Search difficulty removed
Measurement
Direct accuracy quantification
Benchmarks across 35+ LLMs
Statistical significance testing with bootstrap confidence intervals (a minimal sketch follows below)
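A minimal sketch of the bootstrap 95% confidence interval quoted in the plots, assuming per-trial correctness is recorded as 0/1 outcomes (the data here are placeholders):

```python
import numpy as np

# Minimal bootstrap 95% CI over per-trial retrieval outcomes (placeholder data;
# 1 = correct final value retrieved, 0 = an overwritten value was returned).
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=200)

boot_means = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"accuracy = {outcomes.mean():.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```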
Connection to Multi-Round Coreference Resolution
MRCR benchmarks like DeepMind's Michelangelo and OpenAI's MRCR dataset embed highly similar co-references within massive contexts, combining long input length with high interference to create a compound retrieval challenge.
Key Insight: Our test isolates and quantifies interference, revealing it as the primary cause of model failures on MRCR tasks. We show that interference is independent of input context length, challenging the prevailing assumption—shaped by recent long-context benchmarks—that retrieval failures are mainly due to context window limitations.
This suggests that improving interference resistance—rather than just scaling context windows—is crucial for advancing long-context reasoning capabilities.
Research Impact: This work provides actionable insights for improving LLM attention mechanisms and designing better context management strategies for long-context applications.