
Random Domain & Data Generation

Tracking the latest value is a common task across many domains:

  • Legal: Latest case status, document versions
  • Finance: Current account balances, stock prices
  • Medical: Recent vital signs, lab results
  • Physics: Current measurements, sensor readings

Note: The actual tests use plain English words for keys and values, which is closer to a generic setting.
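
A minimal sketch of how such update data might be generated, assuming small hypothetical vocabularies per domain. The word lists, the `generate_updates` name, and its parameters are illustrative assumptions, not the ones used in the actual tests:

```python
import random

# Hypothetical vocabularies for illustration only; the actual tests use
# generic English words for keys and values.
DOMAINS = {
    "legal": {"keys": ["case_status", "document_version"],
              "values": ["open", "closed", "draft", "final"]},
    "finance": {"keys": ["account_balance", "stock_price"],
                "values": ["low", "medium", "high", "volatile"]},
    "medical": {"keys": ["vital_signs", "lab_result"],
                "values": ["normal", "elevated", "pending", "critical"]},
    "physics": {"keys": ["measurement", "sensor_reading"],
                "values": ["stable", "rising", "falling", "noisy"]},
}

def generate_updates(domain, updates_per_key, seed=0):
    """Return a shuffled list of (key, value) updates for one domain."""
    rng = random.Random(seed)  # seeded so the stream is reproducible
    vocab = DOMAINS[domain]
    updates = [(key, rng.choice(vocab["values"]))
               for key in vocab["keys"]
               for _ in range(updates_per_key)]
    rng.shuffle(updates)  # interleave updates across keys
    return updates
```

Calling `generate_updates("finance", 3)` yields six updates, three per key, in a reproducible random order.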

ICML 2025 Workshop Accepted
1️⃣ Grouped Updates
🔍 LLM Retrieval Task — Tracking the Most Recent Value for Each Key
Tracking the latest value for each key is a fundamental operation in applications ranging from financial systems to AI agents and long-context models, where reliable performance depends on continually updating and maintaining evolving information.
However, LLMs consistently fail at this task and exhibit a clear, predictable decline in accuracy as the number of key-value updates increases.
No current LLM reads out the last value as reliably as a human can.
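
The task can be sketched in a few lines, assuming the update stream is a list of `(key, value)` pairs and a model's answer is a dict of key to value. The function names and sample data here are hypothetical, chosen only to show how an outdated answer is scored:

```python
def last_values(updates):
    """Ground truth: the most recent value seen for each key."""
    truth = {}
    for key, value in updates:  # later updates overwrite earlier ones
        truth[key] = value
    return truth

def retrieval_accuracy(predicted, truth):
    """Fraction of keys whose predicted value equals the true last value."""
    if not truth:
        return 0.0
    correct = sum(predicted.get(k) == v for k, v in truth.items())
    return correct / len(truth)

# Illustrative stream: each key is updated twice.
updates = [("balance", "low"), ("price", "up"),
           ("balance", "high"), ("price", "down")]
truth = last_values(updates)  # {"balance": "high", "price": "down"}

# A hypothetical answer that returns an outdated value for one key.
stale = {"balance": "low", "price": "down"}
print(retrieval_accuracy(stale, truth))  # 0.5
```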
💡 Hit "Mix" to START the interference challenge and see the LLM's performance.
🔀
2️⃣ Mixed Updates

LLM Retrieval Accuracy (toy)
Human Performance: HIGH (~100%)
Try it yourself to see if you can read out the last value.
We found that all state-of-the-art LLMs struggle on this straightforward retrieval task, while humans maintain high accuracy.
As the number of tracked keys or updates per key increases, semantically similar interference rises sharply, leading to a consistent, log-linear decline in LLM retrieval accuracy.
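
One way to read "log-linear" is that accuracy falls roughly along a straight line when plotted against the logarithm of the update count, i.e. accuracy ≈ a + b·ln(n) with b < 0. The sketch below fits such a line by ordinary least squares. The data points are synthetic, generated from a known line purely to exercise the code; they are not measured LLM results, and `fit_log_linear` is a hypothetical helper name:

```python
import math

def fit_log_linear(update_counts, accuracies):
    """Least-squares fit of accuracy = a + b * ln(update_count)."""
    xs = [math.log(n) for n in update_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx            # slope per unit of ln(update_count)
    a = mean_y - b * mean_x  # intercept at update_count = 1
    return a, b

# Synthetic points from a known line (a = 1.0, b = -0.1); NOT real results.
counts = [2, 4, 8, 16, 32]
synthetic = [1.0 - 0.1 * math.log(c) for c in counts]
a, b = fit_log_linear(counts, synthetic)
```

A negative fitted slope `b` corresponds to the decline described above: each doubling of the update count lowers accuracy by about `|b| * ln(2)`.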
See actual test accuracy for leading LLMs →
Accuracy vs Update Count
Fixed at 46 tracked keys, varying the update count per key • Bootstrap 95% CI • Log-scaled x-axis
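
For readers unfamiliar with bootstrap intervals, here is a minimal sketch of a percentile bootstrap 95% CI for mean accuracy, assuming each tracked key contributes one 0/1 outcome (1 = last value retrieved correctly). The exact resampling procedure behind the chart is not stated here, so the function name, resample count, and example outcomes are all assumptions:

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical outcomes for 46 tracked keys: 30 correct, 16 incorrect.
outcomes = [1] * 30 + [0] * 16
low, high = bootstrap_ci(outcomes)
```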
🖱️ Interactive Legend & Data Points
  • Click model names in the chart legend to toggle individual models on/off
  • Click any data point to view detailed accuracy breakdown for that update count
  • Hover over lines to see tooltips with model name and accuracy values