Modulum × BABILong — Full Benchmark Report
DRAFT · 2026-05-18 · canonical evaluations across 7 models · v4 (multi-variable + Codex-audited; 1M crash claim retracted as HTTP 429 spending-cap)
Independent replication · partner-evaluation grade

Hypernym Modulum vs 5 current-generation frontier inference stacks, evaluated on BABILong long-context retrieval and multi-hop reasoning.

We replicate Hypernym's published "Modulum Attention — First, Not Lost" BABILong recipe and extend it to head-to-head comparison with current 2026-vintage frontier models (GPT-5.5, Claude Opus 4.6 / 4.7, Gemini 3.1 Pro, Grok 4.3), a clean mask-ablation control using the same Gemma-4-31B-Q4 weights without Modulum's platform components, and extended-context probing at 256k / 512k / 1M tokens. This report covers 4,849 canonical model-prompt evaluations, deduped from 5,556 raw rows across 25 SQLite databases via deterministic phase-priority rules (audit trail in exports/dedupe_log.csv, per-cell sources in exports/manifest.json).

Where Modulum lands among current-generation frontier stacks at 128k context.

Stack | qa1 128k | qa2 128k | qa3 128k | avg 128k | Type
Claude Opus 4.6 | 96% | 90% | 80% | 88.7% | Closed-weight, hosted
GPT-5.5 | 96% | 92% | 64% | 84.0% | Closed-weight, hosted
Gemini 3.1 Pro | 84% | 72% | 40.8% | 65.6% | Closed-weight, hosted
Claude Opus 4.7 | 92% | 66% | 38% | 65.3% | Closed-weight, hosted
Modulum (Gemma-4-31B-Q4 + Hypernym platform) | 71.5% (N=200) | 39.5% (N=200) | 27.0% (N=500) | 46.0% | Open-weight base, workstation deployable
Vanilla Gemma-4-31B-Q4 (no Modulum) | 62% (N=50) | 30% (N=50) | (not run) | 46% (qa1+qa2 only) | Mask-ablation control
Grok 4.3 | 30% (N=50) | 18% (N=50, 1 err) | 15.4% (N=26, 3 err) | 21.1% (low-N qa3) | Closed-weight, hosted
Per-cell N shown inline. Modulum cells aggregate phase-1/3/5/10 (N=100→500) via dedupe; frontier cells are batch runs at N=50 except Gemini qa3 128k (N=500 from phase-9+10). Grok 4.3 qa3 128k completed only 26 of 50 rows due to backend errors. Wilson 95% CIs: N=50 ≈ ±13pp, N=200 ≈ ±7pp, N=500 ≈ ±4.5pp. All accuracy values use case-insensitive substring match against ground-truth target token.
Modulum 128k average
46.0%
vs frontier average of 75.9% (Opus 4.6 + GPT-5.5 + Opus 4.7 + Gemini 3.1 Pro, excluding Grok outlier). Gap of 29.9pp on absolute capability — measured at workstation-scale base vs hyperscaler-served frontier.
Platform contribution vs vanilla
+12.15pp avg
Modulum beats bare Gemma-4-31B-Q4 by 6–28pp on 8 of 9 apples-to-apples cells (idx 0..49, N=50 per side; vanilla qa3 128k is partial). One cell (qa3 32k) shows −6pp; both that regression and the +6pp qa3 64k lift are within sampling noise at this N. 3 of the 8 cells with full N=50 data show a lift that is significant at p<0.05.
Deployable footprint
workstation·single GPU
Modulum's stated deployment profile (per Hypernym) is single-GPU workstation-class. Frontier serving footprint is hyperscaler-scale, but exact memory figures were not measured in this bench.
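The headline averages and the gap above are simple unweighted means of the leaderboard cells. A minimal sketch of that arithmetic, assuming the per-cell percentages are averaged without weighting by N (which is what reproduces the reported 46.0%); the variable names are illustrative only:

```python
# 128k averages copied from the leaderboard table (percent).
# Grok 4.3 is left out, matching the report's "excluding Grok outlier" rule.
frontier_avg_128k = {
    "Claude Opus 4.6": 88.7,
    "GPT-5.5": 84.0,
    "Gemini 3.1 Pro": 65.6,
    "Claude Opus 4.7": 65.3,
}
modulum_cells_128k = {"qa1": 71.5, "qa2": 39.5, "qa3": 27.0}


def mean(values):
    """Unweighted arithmetic mean (assumption: cells are not weighted by N)."""
    values = list(values)
    return sum(values) / len(values)


modulum_avg = mean(modulum_cells_128k.values())
frontier_avg = mean(frontier_avg_128k.values())
gap_pp = frontier_avg - modulum_avg

print(round(modulum_avg, 1), round(frontier_avg, 1), round(gap_pp, 1))
# prints: 46.0 75.9 29.9
```

Because the Modulum cells have different N (200 / 200 / 500), an N-weighted mean would give a different figure; the unweighted form is shown only because it matches the reported numbers.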

BABILong: long-context retrieval and multi-hop reasoning across 32k–1M tokens.

Tasks

qa1
Single-fact retrieval. Find the most recent location of a named entity. Example: "John walked to the bedroom… Where is John?"
qa2
2-fact reasoning. Track an object through a person's location changes. Example: "John picked up the milk… John walked to the kitchen… Where is the milk?"
qa3
3-fact temporal reasoning. Reason over a history of object locations. Example: "Where was the apple before the bathroom?"

Random-guess floor for these tasks: ~17% (6 location candidates).

Context lengths tested

32k tokens
Baseline retrieval task — most models handle natively
64k tokens
Beginning of "long context" decay zone
128k tokens
Headline cell — Modulum's published claim region
256k tokens
Beyond Modulum's current API limit; Gemini-only
512k tokens
Extreme-context probe; Gemini-only
1M tokens
Gemini 3.1 Pro nominally supports; tested below

Seven distinct inference stacks compared.

Stack | Base model | Quantization | Hosting | N per cell | API surface
Modulum (Hypernym) | Gemma-4-31B-it | Q4_K_M | gemma4.hypernym.ai/v1 | 100–500 | OpenAI-compatible chat/completions
Vanilla Gemma-4 (mask ablation) | Gemma-4-31B-it (same) | Q4_K_M (same) | Hypernym vanilla mirror | 50 | llama.cpp /completion
GPT-5.5 | OpenAI proprietary | (presumed FP16) | OpenAI API | 50 | Batch API
Claude Opus 4.6 | Anthropic proprietary | (presumed FP16) | Anthropic API | 50 | Message Batches
Claude Opus 4.7 | Anthropic proprietary | (presumed FP16) | Anthropic API | 50 | Message Batches (default temp)
Gemini 3.1 Pro | Google proprietary MoE | (presumed mixed) | Google Gemini API | 50–500 | generateContent
Grok 4.3 | xAI proprietary | (presumed FP16) | xAI API | 50 | chat/completions
All cells use the HuggingFace RMT-team/babilong-1k-samples dataset (verified against run_meta.json across all 25 runs). Same scoring: case-insensitive substring match against ground-truth target token.
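The scoring rule can be sketched in a few lines. This is an illustrative reading of "case-insensitive substring match against ground-truth target token", not the harness actually used; the function names and the whitespace stripping of the target are assumptions.

```python
def is_correct(model_output: str, target: str) -> bool:
    """True if the ground-truth token appears anywhere in the output, ignoring case."""
    return target.strip().lower() in model_output.lower()


def accuracy(outputs, targets):
    """Fraction of rows whose output contains the target token."""
    hits = sum(is_correct(o, t) for o, t in zip(outputs, targets))
    return hits / len(targets)


outputs = ["John is in the Bedroom.", "The milk is in the garden."]
targets = ["bedroom", "kitchen"]
print(accuracy(outputs, targets))  # prints: 0.5
```

A substring rule credits any output that mentions the target, including outputs that list several locations, so it is a lenient scorer by construction.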

Same Gemma-4-31B-Q4 weights, with and without Hypernym's platform stack.

This is the cleanest possible isolation of platform contribution. Hypernym provisioned a vanilla mirror endpoint serving the identical Gemma-4-31B-Q4 weights via raw llama.cpp without Modulum's components. Both sides receive the same 50 BABILong prompts (idx 0–49) at temperature 0.

Cell (idx 0..49) | Vanilla Gemma-4 | Modulum (platform) | Δ Platform | Wald p-value
qa1 32k | 78.0% | 90.0% | +12.0pp | p=0.097 (ns)
qa1 64k | 66.0% | 84.0% | +18.0pp | p=0.034 *
qa1 128k | 62.0% | 72.0% | +10.0pp | p=0.285 (ns)
qa2 32k | 28.0% | 56.0% | +28.0pp | p=0.003 **
qa2 64k | 44.0% | 52.0% | +8.0pp | p=0.422 (ns)
qa2 128k | 30.0% | 50.0% | +20.0pp | p=0.037 *
qa3 32k | 28.0% | 22.0% | −6.0pp | p=0.487 (ns)
qa3 64k | 32.0% | 38.0% | +6.0pp | p=0.528 (ns)
qa3 128k | 16.7% (1/6, partial) | 30.0% | +13.3pp | N too small
Modulum platform contribution averages +12.15pp across 9 mask cells: 8 complete cells at N=50 per side, plus qa3 128k, where vanilla is still partial (6 of 50 rows, N too small to test). Range: −6pp (qa3 32k regression, not significant) to +28pp (qa2 32k peak, significant). Complete cells compare the same idx 0..49 on both sides. p-values via two-sample Wald test for difference of proportions. Significance markers: * p<0.05, ** p<0.01, *** p<0.001, (ns) not significant at α=0.05.

Interpretation — statistical honesty

At N=50 per side, only 3 of 8 mask cells reach conventional statistical significance (qa1 64k, qa2 32k, qa2 128k). The qa2 32k cell at +28pp is the strongest single piece of evidence that the platform does real architectural work (p=0.003). Four other cells trend positive but sit below the noise floor at N=50, including the qa1 128k cell, which v1 reported as +14.6pp and which is now corrected to +10pp (still not significant). The qa3 32k regression (−6pp) is also not significant at this sample size; it could be sample noise rather than a real platform side-effect. To convert these into formally publishable claims, the next experiment cycle must extend vanilla N to 200+ per cell.
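For readers who want to check the p-values, here is a minimal sketch of a two-sided, two-sample Wald test for a difference of proportions. It assumes the unpooled standard error; that choice is an inference from the fact that it reproduces the table values, not something the report states, and the function name is illustrative.

```python
import math


def wald_two_proportion(p1, n1, p2, n2):
    """Two-sided Wald test for p2 - p1 using the unpooled standard error."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p2 - p1) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# qa2 32k: vanilla 28.0% vs Modulum 56.0%, N=50 per side.
z, p = wald_two_proportion(0.28, 50, 0.56, 50)
print(round(z, 2), round(p, 3))  # prints: 2.96 0.003
```

The Wald approximation is weakest for small N and proportions near 0 or 1, which is one more reason the partial qa3 128k cell (N=6) is left untested.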

Accuracy as a function of context length, by task.

Figure 1 · qa1 (single-fact retrieval) — accuracy vs context length
Modulum holds 89→77→71.5% on a 31B-Q4 workstation model (N=100/100/200); Opus 4.6 holds 96→100→96% on hyperscaler infra (N=50/50/50).
[Figure 1: line chart. x-axis: context length (32k, 64k, 128k, 256k); y-axis: qa1 accuracy %, 0–100. Series: Modulum (Gemma-4-31B-Q4, N=100–200), Claude Opus 4.6, GPT-5.5, Gemini 3.1 Pro, Vanilla Gemma-4 (no Modulum), Grok 4.3.]
Figure 2 · qa3 (3-fact temporal reasoning) — the hardest task
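Modulum's flat decay holds on qa3 (32→33→27, −2.5 pp per doubling); Opus 4.6 is slightly steeper (88→96→80, −4 pp per doubling) but at far higher absolute accuracy; Gemini and Grok decay sharply.
[Figure 2: line chart. x-axis: context length (32k, 64k, 128k); y-axis: qa3 accuracy %, 0–100. Series: Modulum, Opus 4.6, GPT-5.5, Gemini 3.1 Pro, Opus 4.7 (sampled; note regression vs 4.6), Grok 4.3.]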

How does frontier accuracy hold up at 256k / 512k / 1M context?

Only Gemini 3.1 Pro natively supports context above 200k. Modulum and most other frontier APIs cap at 128k–200k. This data is Gemini-only and informs the hypothesis: does frontier decay accelerate at extreme contexts?

Gemini 3.1 Pro · context | qa1 | qa2 | qa3 | Notes
32k | 94% | 84% | 56% | 0 errors
64k | 88% | 78% | 48% | 0 errors
128k | 84% | 72% | 40.8% (N=500) | Headline cell; 0 errors
256k | 84% (50/50 success) | (0/16 success) | not run | qa2 256k: all 16 attempts HTTP 429; monthly spending cap hit before completion, not model failure
512k | 66% (33/50 raw, 33/44 of completed = 75%) | not run | not run | 6 of 50 hit HTTP 429 spending cap; 44 completed; degradation is real on those 44
1M | n/a (0/50 completed) | n/a | n/a | 50/50 HTTP 429; monthly spending cap exceeded for ALL 1M attempts. No data on Gemini 1M model capability; cannot infer crash vs success.
Critical disclosure: the 1M qa1 cell and the qa2 256k cell are budget-limit failures (HTTP 429), not model-capability failures. Sample error message from SQLite: "Your project has exceeded its monthly spending cap". The "Gemini cannot answer at 1M" claim from v1/v2 is RETRACTED — we cannot conclude that from the available data.

Honest implication (revised)

Gemini 3.1 Pro on qa1 retrieval holds at 128k (84%) and 256k (84%, 0 errors) — no decay measured from 128k → 256k. At 512k, accuracy drops to 66% across 50 attempts (12% of which hit the spending cap; the 44 that completed averaged 75%). At 1M we have no measurement — every attempt was rejected at the API budget layer. The "frontier decay accelerates at extreme contexts" claim is supported only by the 512k cell, not by 1M. Modulum cannot be tested in this range either because the Modulum API is capped at 128k by Hypernym. The cleanest publishable framing: long-context frontier comparison stops at 128k for both stacks; the 256k/512k extended probe is suggestive, not conclusive.
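The difference between the raw and completed-only figures for the 512k cell comes down to which denominator is used. A small sketch of the two views, with a hypothetical helper name, using the counts reported above:

```python
def accuracy_views(correct, attempted, errored):
    """Return (raw, completed_only) accuracy as fractions.

    raw counts API-level failures as misses; completed_only drops them
    from the denominator. completed_only is None when nothing completed.
    """
    completed = attempted - errored
    raw = correct / attempted
    completed_only = correct / completed if completed else None
    return raw, completed_only


print(accuracy_views(33, 50, 6))   # qa1 512k -> (0.66, 0.75)
print(accuracy_views(0, 50, 50))   # qa1 1M   -> (0.0, None)
```

Returning `None` rather than 0 for the 1M cell keeps a budget failure from being read as a model score.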

How fast each stack loses ground per doubling of context.

Linear fit of accuracy against log₂(context tokens) across the 32k → 128k window (the Gemini qa1 fit also uses its extended-context cells). Reported as pp per doubling. Lower magnitude = flatter decay = stronger long-context preservation.

Model | qa1 slope | qa2 slope | qa3 slope | Long-context profile
Claude Opus 4.6 | −0.0 pp | −0.0 pp | −4.0 pp | Flat across all 3 tasks; best-in-class decay profile at the cost of hyperscaler compute.
GPT-5.5 | −2.0 pp | +0.0 pp | −9.0 pp | Near-flat on retrieval; steepest qa3 slope of the high-accuracy cluster.
Claude Opus 4.7 | −2.0 pp | −2.0 pp | +2.0 pp | Flat slope but starts low on qa3 (~30%); regression vs 4.6 surfaces as low intercept, not slope.
Modulum (Gemma-4-31B-Q4) | −8.75 pp | −6.75 pp | −2.5 pp | Among the flattest qa3 slopes (multi-fact temporal). Steeper than Opus 4.6 on qa1/qa2 but well clear of Grok decay.
Vanilla Gemma-4-31B-Q4 | −8.0 pp | +1.0 pp | −2.36 pp | Mask-ablation control. qa2 positive slope likely N=50 noise.
Gemini 3.1 Pro | −15.3 pp | −6.0 pp | −7.6 pp | The qa1 value matches a fit that counts the retracted 1M cell (0/50, HTTP 429 spending cap) as 0%; the five measured cells (32k–512k, raw 512k accuracy) give about −6.0 pp.
Grok 4.3 | −25.0 pp | −20.0 pp | −8.3 pp | Steepest qa1/qa2 decay measured. Likely retrieves only the top of context.
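Slopes computed by OLS over cells with N ≥ 20 and context ≥ 8k. All fits except Gemini qa1 use 32k–128k (3 points). The Gemini qa1 fit extends past 128k; its 1M cell had 0 of 50 requests complete (HTTP 429), so only the five cells from 32k to 512k carry model signal.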
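The slope metric is an ordinary least-squares fit of accuracy on log₂(context tokens). A minimal pure-Python sketch, assuming "32k" means 32,000 tokens (the slope is unchanged if it means 32,768, since only ratios matter); the function name is illustrative:

```python
import math


def slope_pp_per_doubling(points):
    """OLS slope of accuracy (percent) on log2(context tokens).

    points: iterable of (context_tokens, accuracy_percent) pairs.
    """
    xs = [math.log2(tokens) for tokens, _ in points]
    ys = [acc for _, acc in points]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


modulum_qa1 = [(32_000, 89.0), (64_000, 77.0), (128_000, 71.5)]
modulum_qa3 = [(32_000, 32.0), (64_000, 33.0), (128_000, 27.0)]
# Gemini qa1 over its five measured cells, using the raw 512k accuracy.
gemini_qa1 = [(32_000, 94.0), (64_000, 88.0), (128_000, 84.0),
              (256_000, 84.0), (512_000, 66.0)]

for name, pts in [("Modulum qa1", modulum_qa1),
                  ("Modulum qa3", modulum_qa3),
                  ("Gemini qa1 32k-512k", gemini_qa1)]:
    print(name, round(slope_pp_per_doubling(pts), 2))
```

With three equally spaced points the fit reduces to (last − first) / 2, so each 3-point slope depends only on the 32k and 128k cells.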

The First, Not Lost claim — what the slope data actually says

Hypernym's published thesis is that Modulum preserves earlier-context facts when frontier models lose them at length. The qa3 slope (−2.5 pp per doubling, flatter than every frontier stack tested except Opus 4.7) is the strongest evidence for this. The claim does NOT hold on qa1 retrieval: Opus 4.6 (0pp), GPT-5.5 (−2pp), and Opus 4.7 (−2pp) all have flatter qa1 slopes than Modulum (−8.75pp). The honest framing: Modulum preserves multi-fact reasoning state better than retrieval at this scale. On qa3 in particular, Modulum competes with hyperscaler-scale stacks on slope while targeting a single-GPU workstation footprint (per Hypernym's stated profile; memory was not measured in this bench).

Variables a publishable benchmark must report.

The leaderboard reports accuracy. A formally publishable benchmark must also report prefill rate, decode rate, error patterns, and within-run drift. The Modulum and Vanilla SQLite databases capture the timing variables natively via the llama.cpp timings block; frontier APIs do not expose them.

Decode speed — tokens / sec (median per cell)

Context | Modulum qa1 | Vanilla qa1 | Modulum qa2 | Vanilla qa2 | Modulum qa3 | Vanilla qa3
32k | 35.1 | 50.4 | 39.5 | 40.4 | 49.5 | 40.7
64k | 33.6 | 41.5 | 35.1 | 37.6 | 45.9 | 35.7
128k | 37.1 | 35.9 | 32.7 | 34.9 | 40.2 | (not reported)
Modulum decode is 20–30 % slower than vanilla at 32k–64k qa1 (platform overhead) but ~25 % faster on qa3 mid-context — likely because attention conditioning shortens output. Convergence at 128k. Frontier APIs do not expose decode timings; tokens-per-second comparison is platform-only.

Prefill rate — context tokens / sec (median)

Context | Modulum (median) | Vanilla (median) | Δ Platform
32k | ~700 t/s | ~880 t/s | −20%
64k | ~620 t/s | ~775 t/s | −20%
128k | ~593 t/s | ~615 t/s | −4%
Prefill medians averaged across the 3 tasks per cell. Modulum prefill is about 20% slower at 32k–64k (platform overhead) but converges to vanilla by 128k. This is the cost basis of the accuracy lift; partner deployments must weigh the +12.15pp average accuracy lift against ~20% prefill overhead at short context.

Error & failure-mode disclosure

Model · cell | Errors / N | Failure mode
Gemini 3.1 Pro · qa1 1M | 50 / 50 | HTTP 429 spending cap: all 50 requests blocked by Google API budget layer. NOT model failure. No data on actual 1M capability.
Gemini 3.1 Pro · qa2 256k | 16 / 16 | HTTP 429 spending cap. NOT model failure. Cell not measurable.
Gemini 3.1 Pro · qa1 512k | 6 / 50 | 12% HTTP 429 rate; remaining 44 completed at 75% accuracy. Real degradation exists on the 44 that completed.
Grok 4.3 · qa3 128k | 3 / 26 | Run incomplete: backend errors, N=26 instead of N=50.
Grok 4.3 · qa3 64k | 3 / 50 | 6 % backend error rate.
Modulum · qa1 128k | 3 / 200 | HTTP 503 in-flight collisions during phase-1; recovered to 200/200 via retry+backoff.
All other cells | 0 | Clean; no API errors.
All counts canonical from summary.csv (http_status ≠ 200). Per-row error text retained in all_results.csv for traceability.

Within-run drift — accuracy by tercile

Split each cell's sample order into 3 equal slices (early / mid / late) and measure accuracy per slice. Indicates whether a model degrades over sustained operation.

Cell | Early | Mid | Late | Drift (late − early) | Read
Modulum qa1 64k (N=100) | 87.9 % | 78.8 % | 64.7 % | −23.2 pp | Strong monotonic decay; possibly KV-cache or attention-state accumulation (undiagnosed).
Modulum qa3 128k (N=500) | 32.5 % | 26.5 % | 22.0 % | −10.5 pp | Sustained-run drift on the hardest cell.
Modulum qa1 128k (N=200) | 69.7 % | 68.2 % | 76.5 % | +6.8 pp | Phase-5 extension samples (idx 100–199) easier on average; phase mix dominates drift signal here.
Gemini 3.1 Pro qa1 512k (N=50) | 93.8 % | 62.5 % | 44.4 % | −49.3 pp | Extreme-context drift: accuracy collapses within a 50-sample run. The tercile values are consistent with raw scoring (15/16, 10/16, 8/18 = 33/50), so the 6 HTTP 429 rows count as misses and may inflate the drift.
Opus 4.6 qa3 32k (N=50) | 100 % | 87.5 % | 77.8 % | −22.2 pp | Surprising drift on an easy cell; investigate Anthropic batch state.
Opus 4.6 qa1 64k (N=50) | 100 % | 100 % | 100 % | 0.0 pp | Clean; control case for "no drift" baseline.
Selected cells from full_audit.json. The full per-cell tercile table is in the JSON export; this table shows the largest drifts. The Modulum qa1 64k −23 pp end-to-end drift is a production-blocking signal that has to be diagnosed before partner deployment.
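The tercile split can be sketched as follows. The slicing rule (two slices of floor(N/3) rows, with the last slice taking the remainder) is an assumption; it was chosen because it reproduces the reported percentages, for example 16/16/18 rows at N=50. The function name is illustrative.

```python
def tercile_drift(correct_flags):
    """Split rows (in sample order) into early/mid/late and return
    ([early, mid, late] accuracy in percent, late - early drift in pp)."""
    n = len(correct_flags)
    if n < 3:
        raise ValueError("need at least 3 rows to form terciles")
    size = n // 3  # assumption: the last slice absorbs the remainder
    slices = [
        correct_flags[:size],
        correct_flags[size:2 * size],
        correct_flags[2 * size:],
    ]
    accs = [100.0 * sum(s) / len(s) for s in slices]
    return accs, accs[2] - accs[0]


# Synthetic row order matching the Opus 4.6 qa3 32k counts (16/16, 14/16, 14/18).
flags = [1] * 16 + [1] * 14 + [0] * 2 + [1] * 14 + [0] * 4
accs, drift = tercile_drift(flags)
print([round(a, 1) for a in accs], round(drift, 1))
# prints: [100.0, 87.5, 77.8] -22.2
```

Each tercile holds only 16 to 18 rows at N=50, so a single row moves a slice by roughly 6 pp; drift figures at that N are indicative rather than precise.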

How the bench was run.

Parameters

Dataset
RMT-team/babilong-1k-samples (verified for all cells via run_meta.json)
Tasks
qa1, qa2, qa3
Lengths
32k, 64k, 128k (full matrix); 256k/512k/1M (Gemini-only extended)
Sample idx
0..N-1 from dataset, deterministic
Scoring
Case-insensitive substring match (Hypernym's published rule)
Temperature
0 where supported (rejected by GPT-5.5 + Opus 4.7)
Max tokens
256 (Modulum), 4096 (frontier — to accommodate thinking-mode budgets)
Total observations
4,849 canonical evaluations (deduped from 5,556 raw rows across 25 SQLite databases). Phase-priority dedupe rules + per-cell source manifest in exports/manifest.json.
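The report does not spell out the phase-priority rules, so the following is only a hypothetical sketch of what a deterministic dedupe of this kind can look like. The row keys (`model`, `task`, `length`, `idx`, `phase`) and the priority order are assumptions made for illustration, not the schema or rules behind exports/manifest.json.

```python
def dedupe_rows(rows, phase_priority):
    """Keep one row per (model, task, length, idx).

    phase_priority lists phases from most to least preferred (hypothetical).
    On a tie within the same phase, the first row encountered is kept,
    so the result is deterministic for a fixed input order.
    """
    rank = {phase: i for i, phase in enumerate(phase_priority)}
    best = {}
    for row in rows:
        key = (row["model"], row["task"], row["length"], row["idx"])
        kept = best.get(key)
        if kept is None or rank[row["phase"]] < rank[kept["phase"]]:
            best[key] = row
    return list(best.values())


raw_rows = [
    {"model": "modulum", "task": "qa1", "length": "128k", "idx": 7, "phase": "1", "correct": 0},
    {"model": "modulum", "task": "qa1", "length": "128k", "idx": 7, "phase": "1b", "correct": 1},
    {"model": "modulum", "task": "qa1", "length": "128k", "idx": 8, "phase": "1", "correct": 1},
]
# Hypothetical rule: a phase-1b retry supersedes the failed phase-1 row.
canonical = dedupe_rows(raw_rows, phase_priority=["1b", "1"])
print(len(raw_rows), "->", len(canonical))  # prints: 3 -> 2
```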

Statistical handling

Per-cell CI
Wilson 95% interval. Worst-case half-widths: N=26 ±17.9pp, N=39 ±15.0pp, N=50 ±13.4pp, N=100 ±9.6pp, N=200 ±6.9pp, N=500 ±4.4pp.
Significance test
Two-sample Wald test for difference of proportions (two-sided). Pre-computed for every mask cell + load-bearing pairs in exports/full_audit.json.
Decay slope
OLS fit of accuracy on log₂(context tokens). Cells with N<20 or context<8k excluded from the fit.
Within-run drift
Per-cell accuracy split into 3 equal terciles of sample order. Reported as late − early pp delta.
Re-scoring
4 scoring rules applied (substring, exact-last, unique-loc, phrase); results agree within ±1 row per cell.
Verification
Grok 4.3 file-read audit (v1→v2): closed 9 numerical discrepancies. Codex file-read audit (v2→v3→v4): identified Gemini 1M/256k as HTTP 429 spending-cap (not model-failure), stale vanilla qa3 cells, p-value rounding errors. All v4 corrections applied. Every claim in this report traces to exports/full_audit.json + canonical SQLite.
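The per-cell intervals can be reproduced with the standard Wilson score formula. A minimal sketch, assuming z = 1.96 for the 95% level; the worst-case half-width is the value at p = 0.5, and the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion, as (low, high) fractions."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Worst-case half-widths (p = 0.5) for the N values used in this report.
for n in (26, 50, 100, 200, 500):
    low, high = wilson_interval(n // 2, n)
    print(n, round(100 * (high - low) / 2, 1))
# prints: 26 17.9 / 50 13.4 / 100 9.6 / 200 6.9 / 500 4.4 (one pair per line)
```

Unlike the Wald interval, the Wilson interval is not centred on the observed proportion, which matters most for cells near 0% or 100% such as the Opus 4.6 qa1 results.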

What we have NOT yet ruled out.

Listed in full because transparency about limitations is the credibility move for partner-side validation.

How the data was generated, in order.

# | Phase | N | What | Status
1 | qa1 baseline | 100 | Modulum qa1 × 32k/64k/128k; 78 503-errors at 128k | DONE
1b | Resume failed | 78 | Re-ran failed qa1 128k rows w/ retry+backoff; recovered 75/78 | DONE
2 | Parallel qa2+qa3 32k | 100 | Killed; discovered single-slot backend through 503-storm | KILLED
3 | Full Modulum matrix | 100 | qa3+qa2 × 32k/64k/128k sequential, 0 errors | DONE
4 | PPL capture | 20 | Logprob capture across all cells; revealed model is overconfident | DONE
5 | N=200 extension | +100 | qa1 128k + qa2 64k/128k + qa3 128k to N=200 | DONE
8 | Phase-8 frontier baseline | 50 | Gemini 3.1 Pro + Grok 4.3; 2026-vintage current frontier | DONE (Gemini), Grok ~95% done
9 | Gemini qa3 128k extension | +150 | To N=200 on qa3 128k headline cell | DONE
10 | Both sides to N=500 on qa3 128k | +300 | Modulum + Gemini qa3 128k to N=500 each; gap +13.8pp p<0.0001 | DONE
11 | GPT-5.5 + GPT-5.3-codex batch | 50 | OpenAI Batch API. GPT-5.3-codex rejected by batch API (no batch support) | GPT-5.5 DONE
12 | Opus 4.7 batch (sampled) | 50 | Anthropic Message Batches API; default temp due to deprecated temperature | DONE
13 | Vanilla Gemma-4 mask ablation | 50 | Same Gemma-4-31B-Q4 weights without Modulum platform components. qa1+qa2 complete (300 rows); qa3 32k complete; qa3 64k partial (39/50); qa3 128k not run. | DONE (qa3 128k pending)
14 | Opus 4.7 rerun temp=0 | 50 | FAILED: temperature deprecated for Opus 4.7. No cost. (Earlier batch without temp param is canonical.) | FAILED
15 | Gemini extended context | 50 | 256k/512k/1M on Gemini 3.1 Pro: qa1 256k 84% (0 err), qa1 512k 66% raw (6 HTTP-429 spending-cap errors; 75% on completed), qa1 1M (50/50 HTTP-429 spending-cap, no model data). qa2 256k (16/16 HTTP-429 spending-cap, no data). qa3 256k+ not run. | DONE (partial; budget capped 1M)
16 | Opus 4.6 batch | 50 | Best frontier performer: 88.7% 128k avg | DONE
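Phase 1b recovered most of the HTTP 503 rows with "retry+backoff". The report gives no further detail, so the following is a hypothetical sketch of one common form, exponential backoff around a single request. The `send` callable, the attempt limit, and the delay schedule are all assumptions, not the harness that produced these results.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send: zero-argument callable returning (http_status, body); hypothetical.
    Waits base_delay * 2**attempt seconds between attempts.
    Returns the last (http_status, body) seen.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body


# Demonstration with a fake backend: two 503s, then success.
responses = iter([(503, ""), (503, ""), (200, "bedroom")])
waits = []  # record delays instead of actually sleeping
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)  # prints: (200, 'bedroom') [1.0, 2.0]
```

Against a single-slot backend of the kind phase 2 uncovered, requests would also need to be issued sequentially; backoff alone does not prevent collisions between parallel workers.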

What would elevate this from MODERATE to STRONG evidence.

  1. Modulum context-window extension to 256k+

    Request to Hypernym pending. Critical for testing the central hypothesis that Modulum's platform contribution amplifies at extreme contexts where frontier decays (Gemini qa1 falls to 66% raw at 512k; the 1M cell returned no model data because every request hit HTTP 429).

  2. qa3 32k regression investigation

    Only cell where Modulum platform underperforms vanilla base (−6pp). Is this real (platform over-correction on short-context multi-hop) or sample-specific noise? Worth running at higher N to clarify.

  3. Production-batched Modulum throughput

    Single-slot demo backend doesn't reflect production-scale serving economics. Real cost/throughput numbers require Hypernym to expose batched inference behavior.

  4. Modulum + orchestration vs frontier + orchestration

    Current comparison is bare-Modulum vs full-frontier. A fairer comparison would add equivalent orchestration (RAG, chain-of-thought, caching) to Modulum and measure the closure rate.

  5. Apply Modulum platform to a larger base model

    If Hypernym's platform contribution holds (+12.15pp avg over bare model), applying it to a 70B+ or 200B+ base could close more of the frontier gap. Currently hypothetical until partner-deployed.

What this data supports vs what it does not.

✓ DEFENSIBLE

  • Modulum's platform contributes +12.15pp average lift over the same Gemma-4-31B-Q4 base across 9 mask cells (qa3 128k partial). Significant at p<0.05 on 3 cells (qa1 64k, qa2 32k, qa2 128k); trending positive on 4 others (not powered at N=50).
  • Modulum on a 31B-Q4 workstation model achieves 46.0 % 128k average accuracy (qa1 71.5 % / qa2 39.5 % / qa3 27.0 % at N=200 / 200 / 500). Wilson 95 % CIs: ±6.2 pp / ±6.7 pp / ±3.9 pp respectively.
  • Near-flat qa3 decay slope: −2.5 pp per doubling of context, flatter than GPT-5.5 (−9), Grok 4.3 (−8.3), Gemini 3.1 Pro (−7.6), and Opus 4.6 (−4). Among frontier stacks, only Opus 4.7 has a flatter qa3 slope, and at much lower absolute accuracy. This is the strongest evidence for the "First, Not Lost" multi-fact preservation thesis.
  • On qa3 128k at N=500 per side, Modulum is significantly below Gemini 3.1 Pro by 13.8 pp (z=−4.66, p<0.001). The result is defensible because the comparison is adequately powered; the gap is not sampling noise.
  • Modulum is open-weight base (Google Gemma-4) — self-hostable, sovereignty-compliant.
  • Modulum's published target footprint is workstation-scale (single GPU); the exact GPU memory figure (e.g. 16 GB) is from Hypernym's stated deployment profile, not measured in this bench.

✗ NOT DEFENSIBLE

  • "Modulum is competitive with current frontier on absolute accuracy" — trails Opus 4.6 by 42.7 pp, GPT-5.5 by 38.0 pp at 128k average.
  • "Modulum has uniquely flat decay across all tasks" — only true on qa3. On qa1, Opus 4.6 (−0 pp/2×) and GPT-5.5 (−2 pp/2×) decay flatter than Modulum (−8.75 pp/2×).
  • "Modulum beats Gemini 3.1 Pro on qa3 128k": N=500 shows Gemini wins by 13.8 pp (p<0.001).
  • "Platform contribution is significant on every cell" — only 3 of 8 mask cells reach p<0.05 at N=50.
  • "qa3 32k regression is a real platform side-effect" — −6 pp not significant at N=50 (p=0.49); may be sample noise.
  • "Modulum holds at 256k–1M context" — Modulum endpoint capped at 128k, cannot test.
  • "Gemini collapses at 1M context" — RETRACTED. The 50/50 failures at 1M are HTTP 429 spending-cap, not model-context-failure. No Gemini 1M capability evidence in this dataset.
  • "Sustained-run accuracy is stable" — Modulum qa1 64k drifts −23 pp end-to-end within a 100-sample run (production-blocking signal pending diagnosis).

Single defensible 1-sentence pitch

"Hypernym's Modulum inference platform measurably improves long-context performance over the same Gemma-4-31B-Q4 base by +12.15 pp on average across 9 BABILong cells (3 of 8 reach p<0.05 at N=50), achieves a flatter qa3 multi-fact reasoning decay slope (−2.5 pp per doubling of context) than every frontier stack tested except Opus 4.7, and runs on a workstation-scale single-GPU deployment (Hypernym profile); it currently trails current-generation closed-weight frontier products (Opus 4.6, GPT-5.5) on absolute 128k accuracy by 38.0–42.7 pp, so the load-bearing value proposition is multi-fact context preservation at workstation scale rather than absolute-capability parity with hyperscaler-served frontier."