Ditto

DittoBench · June 2026

Clocking tools, memory & speed.

Read the full writeup →
The benchmark: 10 models · 177 tool cases · 150 memory Q
Best tool-caller: Claude 4.6 Sonnet (0.855)
Best at memory: Kimi K2.7 Code (0.727)
Fastest (composite): GPT-5.4 nano (100)
Memory QA overall: 0.649 (10 models · 150 Q)
Tool calling · 177 cases

Did it pick the right tool?

  1. Claude 4.6 Sonnet 0.855
  2. Claude Opus 4.8 0.823
  3. GPT-5.5 0.822
  4. Gemini 3.1 Pro 0.821
  5. Gemini Flash-Lite 0.808
  6. Grok 4.3 0.799
  7. Gemini 3.5 Flash 0.790
  8. Kimi K2.7 Code 0.787
  9. GPT-5.4 mini 0.786
  10. GPT-5.4 nano 0.784
LongMemEval · 150 questions

Did it answer from memory?

  1. Kimi K2.7 Code 0.727
  2. Gemini 3.5 Flash 0.720
  3. Gemini 3.1 Pro 0.693
  4. GPT-5.5 0.687
  5. Claude Opus 4.8 0.673
  6. Claude 4.6 Sonnet 0.653
  7. Grok 4.3 0.633
  8. GPT-5.4 mini 0.627
  9. Gemini Flash-Lite 0.567
  10. GPT-5.4 nano 0.507
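The headline "Memory QA overall 0.649" is consistent with a plain unweighted average of the ten per-model scores in this list. The sketch below checks that arithmetic. Treating the overall figure as a simple mean is an assumption here, not something the page states; because every model answers the same 150 questions, a simple mean and a pooled accuracy would coincide.

```python
from statistics import mean

# Per-model LongMemEval scores, copied from the leaderboard.
memory_scores = {
    "Kimi K2.7 Code": 0.727,
    "Gemini 3.5 Flash": 0.720,
    "Gemini 3.1 Pro": 0.693,
    "GPT-5.5": 0.687,
    "Claude Opus 4.8": 0.673,
    "Claude 4.6 Sonnet": 0.653,
    "Grok 4.3": 0.633,
    "GPT-5.4 mini": 0.627,
    "Gemini Flash-Lite": 0.567,
    "GPT-5.4 nano": 0.507,
}

# Assumption: "overall" means the unweighted mean across models.
overall = mean(memory_scores.values())
print(f"Memory QA overall: {overall:.3f}")  # prints 0.649
```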
Full methodology, charts, and the model-by-model breakdown are in the DittoBench writeup.
Tool score = tool accuracy + judged quality · Memory = LLM-judged QA over full-history retrieval · Speed = fleet-relative, fastest = 100
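The page does not spell out how "fleet-relative, fastest = 100" is computed, so the following is only an illustrative sketch of one common normalization: divide the fastest latency in the fleet by each model's latency and scale to 100. The function name `fleet_relative_speed`, the model names, and the 6.0 s value are hypothetical; 2.4 s and 13 s are the tool-calling latency extremes quoted in the chart description. The real composite blends two axes and may use a different formula.

```python
def fleet_relative_speed(latencies_s: dict[str, float]) -> dict[str, float]:
    """Scale each latency against the fleet's fastest, so fastest == 100.

    Assumption: score = 100 * fastest_latency / latency. This is a guess at
    the shape of the metric, not the writeup's confirmed formula.
    """
    fastest = min(latencies_s.values())
    return {name: fastest / t * 100.0 for name, t in latencies_s.items()}


# Hypothetical fleet: placeholder names, illustrative latencies in seconds.
example_latencies = {"model_a": 2.4, "model_b": 6.0, "model_c": 13.0}
speed_scores = fleet_relative_speed(example_latencies)

for name, score in speed_scores.items():
    print(f"{name}: {score:.1f}")
```

Under this assumed formula, a model that is about 5× slower than the fastest lands near 20, which is the same order as the 15–26 range reported for the frontier models.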

The full picture

Every model, every axis.

Figure: scatter plot of tool-calling score vs. latency. Scores cluster tightly while latency ranges from 2.4 s to 13 s; the small models (green) sit fast and high.
Tool calling is a near-flat quality race (0.784 to 0.855) but a roughly 5× speed race. The cheap models are fast and nearly as accurate.
Figure: horizontal bar chart of LongMemEval QA accuracy by model, from Kimi K2.7 Code at 0.727 down to GPT-5.4 nano at 0.507.
Memory QA spreads the field wide (0.51–0.73). Reasoning over long retrieved context is the real separator.
Figure: horizontal bar chart of composite speed score by model, from GPT-5.4 nano at 100 down to the frontier models at 15–26.
Composite speed blends both axes. Small models dominate; frontier models trade speed for depth.
Figure: horizontal bar chart of QA accuracy by question type, from single-session-assistant at 0.93 down to multi-session at 0.25.
Difficulty by question type: single-session is near-solved (0.93); multi-session reasoning (0.25) is the open frontier.

Fusion · OpenRouter

Does fusing models help?

The surprise: three of the cheapest models (nano + mini + Flash-Lite) fused into one panel score 0.809 on tool calling, out-calling Grok 4.3 (0.799), Gemini 3.5 Flash (0.790), and even the fused panel of Opus 4.8 + GPT-5.5 (0.772).

Yes: fusion is how small models punch into frontier territory at a fraction of the cost. No: none of the three panels we've tested so far tops the board, and fusing two frontier models drops their tool-calling score below either model's solo score (0.772 vs. 0.823 and 0.822). But that's three pairings out of a huge space. Finding a fusion combo that beats every solo model is exactly what we'll explore next (more models, bigger panels). For now: fuse small to save money, pick one big model to win.

  • nano + mini + Flash-Lite · Tools 0.809 · Memory 0.660
  • Gemini 3.5 Flash + Kimi K2.7 · Tools 0.801 · Memory 0.669
  • Opus 4.8 + GPT-5.5 · Tools 0.772 · Memory 0.676

Bars are drawn relative to the best single model: tools 0.855 (Claude 4.6 Sonnet) · memory 0.727 (Kimi K2.7 Code). A full bar would mean matching the best solo model.
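To make the bar lengths concrete, the sketch below divides each fused panel's score by the best solo score on the same axis, using only the numbers reported on this page. The variable names are illustrative, and reading "relative" as a simple ratio is an assumption about how the bars are drawn.

```python
# Best solo scores, as stated above.
best_solo = {"tools": 0.855, "memory": 0.727}

# Fused panel scores from the list above.
panels = {
    "nano + mini + Flash-Lite": {"tools": 0.809, "memory": 0.660},
    "Gemini 3.5 Flash + Kimi K2.7": {"tools": 0.801, "memory": 0.669},
    "Opus 4.8 + GPT-5.5": {"tools": 0.772, "memory": 0.676},
}

# Assumption: bar length = panel score / best solo score on that axis.
relative = {
    name: {axis: scores[axis] / best_solo[axis] for axis in best_solo}
    for name, scores in panels.items()
}

for name, rel in relative.items():
    print(f"{name}: tools {rel['tools']:.1%}, memory {rel['memory']:.1%}")
# nano + mini + Flash-Lite: tools 94.6%, memory 90.8%
# Gemini 3.5 Flash + Kimi K2.7: tools 93.7%, memory 92.0%
# Opus 4.8 + GPT-5.5: tools 90.3%, memory 93.0%
```

No panel reaches 100% on either axis, which matches the statement that none of the tested panels tops the board.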

Want the methodology, the harness internals, and the model-by-model tables? Read the DittoBench writeup →