DittoBench · June 2026

Clocking tools, memory & speed.

The benchmark: 10 models · 177 tool cases · 150 memory questions

Best tool-caller: Claude 4.6 Sonnet (0.855)

Best at memory: Kimi K2.7 Code (0.727)

Fastest (composite): GPT-5.4 nano (100)

Memory QA overall: 0.649 (10 models · 150 questions)

Tool calling · 177 cases

Did it pick the right tool?

1 Claude 4.6 Sonnet 0.855
2 Claude Opus 4.8 0.823
3 GPT-5.5 0.822
4 Gemini 3.1 Pro 0.821
5 Gemini Flash-Lite 0.808
6 Grok 4.3 0.799
7 Gemini 3.5 Flash 0.790
8 Kimi K2.7 Code 0.787
9 GPT-5.4 mini 0.786
10 GPT-5.4 nano 0.784

LongMemEval · 150 questions

Did it answer from memory?

1 Kimi K2.7 Code 0.727
2 Gemini 3.5 Flash 0.720
3 Gemini 3.1 Pro 0.693
4 GPT-5.5 0.687
5 Claude Opus 4.8 0.673
6 Claude 4.6 Sonnet 0.653
7 Grok 4.3 0.633
8 GPT-5.4 mini 0.627
9 Gemini Flash-Lite 0.567
10 GPT-5.4 nano 0.507
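
The overall memory QA figure of 0.649 is consistent with an unweighted mean of the ten per-model accuracies listed above. A minimal Python sketch of that arithmetic follows. The averaging method is an assumption inferred from the published numbers, not something this page states, and the variable names are illustrative.

```python
from statistics import mean

# Per-model LongMemEval accuracies, copied from the leaderboard above.
memory_scores = {
    "Kimi K2.7 Code": 0.727,
    "Gemini 3.5 Flash": 0.720,
    "Gemini 3.1 Pro": 0.693,
    "GPT-5.5": 0.687,
    "Claude Opus 4.8": 0.673,
    "Claude 4.6 Sonnet": 0.653,
    "Grok 4.3": 0.633,
    "GPT-5.4 mini": 0.627,
    "Gemini Flash-Lite": 0.567,
    "GPT-5.4 nano": 0.507,
}

# Assumption: "overall" is the plain average across models.
overall = mean(memory_scores.values())

# Gap between the best and worst model on this axis.
spread = max(memory_scores.values()) - min(memory_scores.values())

print(f"overall memory QA: {overall:.3f}")
print(f"best-to-worst spread: {spread:.3f}")
```

If every model answered the same 150 questions, this unweighted mean equals the pooled accuracy across all answers, so the two readings of "overall" coincide.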

Charts & full detail

The full picture

Every model, every axis.

Chart: scatter plot of tool-calling score vs. latency. Scores cluster tightly (0.784 to 0.855) while latency ranges from 2.4 s to 13 s; the small models (green) sit fast and high. Tool calling is a near-flat quality race but a roughly 5× speed race: the cheap models are fast and nearly as accurate.

Chart: horizontal bars of LongMemEval QA accuracy by model, from Kimi K2.7 Code at 0.727 down to GPT-5.4 nano at 0.507. Memory QA spreads the field wide (0.51 to 0.73). Reasoning over long retrieved context is the real separator.

Chart: horizontal bars of composite speed score by model, from GPT-5.4 nano at 100 down to the frontier models at 15 to 26. Composite speed blends both axes. Small models dominate; frontier models trade speed for depth.

Chart: horizontal bars of QA accuracy by question type, from single-session-assistant at 0.93 down to multi-session at 0.25. Single-session questions are near-solved; multi-session reasoning is the open frontier.

Fusion · OpenRouter

Does fusing models help?

The surprise: three of the cheapest models (nano + mini + Flash-Lite), fused into one panel, score 0.809 on tool calling. That out-calls Grok 4.3 (0.799), Gemini 3.5 Flash (0.790), and even the fused panel of Opus 4.8 + GPT-5.5 (0.772). So yes: fusion is how small models punch into frontier territory at a fraction of the cost. And no: none of the three panels we've tested so far tops the board, and on tool calling, fusing two frontier models lands below either model's solo score (0.772 vs. 0.823 and 0.822). But that's three panels out of a huge space. Finding a fusion combo that beats every solo model is exactly what we'll explore next, with more models and bigger panels. For now: fuse small to save money, pick one big model to win.

nano + mini + Flash-Lite: Tools 0.809 · Memory 0.660
Gemini 3.5 Flash + Kimi K2.7: Tools 0.801 · Memory 0.669
Opus 4.8 + GPT-5.5: Tools 0.772 · Memory 0.676

Bars relative to the best single model: tools 0.855 (Claude 4.6 Sonnet) · memory 0.727 (Kimi K2.7 Code). A full bar would mean matching the best solo model.
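
To make the bar scaling concrete, here is a minimal Python sketch that expresses each fusion panel's score as a fraction of the best solo score on the same axis. The simple ratio (and the cap at a full bar) is an assumption about how the bars are drawn; the page only says that a full bar means matching the best solo model. The function and variable names are hypothetical.

```python
# Best single-model scores, as stated in the note above.
BEST_SOLO = {"tools": 0.855, "memory": 0.727}

# Fusion panel scores, copied from the panel list above.
PANELS = {
    "nano + mini + Flash-Lite": {"tools": 0.809, "memory": 0.660},
    "Gemini 3.5 Flash + Kimi K2.7": {"tools": 0.801, "memory": 0.669},
    "Opus 4.8 + GPT-5.5": {"tools": 0.772, "memory": 0.676},
}


def bar_fill(score: float, best: float) -> float:
    """Fraction of a full bar, assuming a plain ratio capped at 1.0."""
    return min(score / best, 1.0)


# Compute the fill for every panel on both axes.
fills = {
    name: {axis: bar_fill(scores[axis], BEST_SOLO[axis]) for axis in BEST_SOLO}
    for name, scores in PANELS.items()
}

for name, fill in fills.items():
    print(f"{name}: tools {fill['tools']:.1%}, memory {fill['memory']:.1%}")
```

Under this reading, no panel reaches a full bar on either axis, which matches the observation that none of the tested panels tops the board.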

Want the methodology, the harness internals, and the model-by-model tables? Read the DittoBench writeup →