We replicate Hypernym's published "Modulum Attention — First, Not Lost" BABILong recipe and extend it to head-to-head comparison with current 2026-vintage frontier models (GPT-5.5, Claude Opus 4.6 / 4.7, Gemini 3.1 Pro, Grok 4.3), a clean mask-ablation control using the same Gemma-4-31B-Q4 weights without Modulum's platform components, and extended-context probing at 256k / 512k / 1M tokens. This report covers 4,849 canonical model-prompt evaluations, deduped from 5,556 raw rows across 25 SQLite databases via deterministic phase-priority rules (audit trail in exports/dedupe_log.csv, per-cell sources in exports/manifest.json).
| Stack | qa1 128k | qa2 128k | qa3 128k | avg 128k | Type |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 96% | 90% | 80% | 88.7% | Closed-weight, hosted |
| GPT-5.5 | 96% | 92% | 64% | 84.0% | Closed-weight, hosted |
| Gemini 3.1 Pro | 84% | 72% | 40.8% | 65.6% | Closed-weight, hosted |
| Claude Opus 4.7 | 92% | 66% | 38% | 65.3% | Closed-weight, hosted |
| Modulum (Gemma-4-31B-Q4 + Hypernym platform) | 71.5% (N=200) | 39.5% (N=200) | 27.0% (N=500) | 46.0% | Open-weight base, workstation deployable |
| Vanilla Gemma-4-31B-Q4 (no Modulum) | 62% (N=50) | 30% (N=50) | — (not run) | 46% (qa1+qa2 only) | Mask-ablation control |
| Grok 4.3 | 30% (N=50) | 18% (N=50, 1 err) | 15.4% (N=26, 3 err) | 21.1% (low-N qa3) | Closed-weight, hosted |
Random-guess floor for these tasks: ~17% (6 location candidates).
| Stack | Base model | Quantization | Hosting | N per cell | API surface |
|---|---|---|---|---|---|
| Modulum (Hypernym) | Gemma-4-31B-it | Q4_K_M | gemma4.hypernym.ai/v1 | 100–500 | OpenAI-compatible chat/completions |
| Vanilla Gemma-4 (mask ablation) | Gemma-4-31B-it (same) | Q4_K_M (same) | Hypernym vanilla mirror | 50 | llama.cpp /completion |
| GPT-5.5 | OpenAI proprietary | (presumed FP16) | OpenAI API | 50 | Batch API |
| Claude Opus 4.6 | Anthropic proprietary | (presumed FP16) | Anthropic API | 50 | Message Batches |
| Claude Opus 4.7 | Anthropic proprietary | (presumed FP16) | Anthropic API | 50 | Message Batches (default temp) |
| Gemini 3.1 Pro | Google proprietary MoE | (presumed mixed) | Google Gemini API | 50–500 | generateContent |
| Grok 4.3 | xAI proprietary | (presumed FP16) | xAI API | 50 | chat/completions |
RMT-team/babilong-1k-samples dataset (verified against run_meta.json across all 25 runs). Same scoring: case-insensitive substring match against ground-truth target token.This is the cleanest possible isolation of platform contribution. Hypernym provisioned a vanilla mirror endpoint serving the identical Gemma-4-31B-Q4 weights via raw llama.cpp without Modulum's components. Both sides receive the same 50 BABILong prompts (idx 0–49) at temperature 0.
| Cell (idx 0..49) | Vanilla Gemma-4 | Modulum (platform) | Δ Platform | Wald p-value |
|---|---|---|---|---|
| qa1 32k | 78.0% | 90.0% | +12.0pp | p=0.097 (ns) |
| qa1 64k | 66.0% | 84.0% | +18.0pp | p=0.034 * |
| qa1 128k | 62.0% | 72.0% | +10.0pp | p=0.285 (ns) |
| qa2 32k | 28.0% | 56.0% | +28.0pp | p=0.003 ** |
| qa2 64k | 44.0% | 52.0% | +8.0pp | p=0.422 (ns) |
| qa2 128k | 30.0% | 50.0% | +20.0pp | p=0.037 * |
| qa3 32k | 28.0% | 22.0% | −6.0pp | p=0.487 (ns) |
| qa3 64k | 32.0% | 38.0% | +6.0pp | p=0.528 (ns) |
| qa3 128k | 16.7% (1/6, partial) | 30.0% | +13.3pp | N too small |
At N=50 per side, only 3 of 8 mask cells reach conventional statistical significance (qa1 64k, qa2 32k, qa2 128k). The qa2 32k cell at +28pp is the strongest single piece of evidence the platform does real architectural work (p=0.003). The other 5 lifts trend positive in the right direction but are below the noise floor at N=50 — including the qa1 128k cell that v1 reports as +14.6pp / now corrected to +10pp / still not significant. The qa3 32k regression (−6pp) is also not significant at this sample size; it could be sample noise, not a real platform side-effect. To convert these into formally publishable claims, the next experiment cycle must extend vanilla N to 200+ per cell.
Only Gemini 3.1 Pro natively supports context above 200k. Modulum and most other frontier APIs cap at 128k–200k. This data is Gemini-only and informs the hypothesis: does frontier decay accelerate at extreme contexts?
| Gemini 3.1 Pro · context | qa1 | qa2 | qa3 | Notes |
|---|---|---|---|---|
| 32k | 94% | 84% | 56% | 0 errors |
| 64k | 88% | 78% | 48% | 0 errors |
| 128k | 84% | 72% | 40.8% (N=500) | Headline cell; 0 errors |
| 256k | 84% (50/50 success) | — (0/16 success) | not run | qa2 256k: all 16 attempts HTTP 429 — monthly spending cap hit before completion, not model failure |
| 512k | 66% (33/50 raw, 33/44 of completed = 75%) | not run | not run | 6 of 50 hit HTTP 429 spending cap; 44 completed — degradation is real on those 44 |
| 1M | n/a (0/50 completed) | n/a | n/a | 50/50 HTTP 429 — monthly spending cap exceeded for ALL 1M attempts. No data on Gemini 1M model capability; cannot infer crash vs success. |
Gemini 3.1 Pro on qa1 retrieval holds at 128k (84%) and 256k (84%, 0 errors) — no decay measured from 128k → 256k. At 512k, accuracy drops to 66% across 50 attempts (12% of which hit the spending cap; the 44 that completed averaged 75%). At 1M we have no measurement — every attempt was rejected at the API budget layer. The "frontier decay accelerates at extreme contexts" claim is supported only by the 512k cell, not by 1M. Modulum cannot be tested in this range either because the Modulum API is capped at 128k by Hypernym. The cleanest publishable framing: long-context frontier comparison stops at 128k for both stacks; the 256k/512k extended probe is suggestive, not conclusive.
Linear fit of accuracy against log₂(context tokens) across the 32k → 128k window (Gemini extended to 1M). Reported as pp per doubling. Lower magnitude = flatter decay = stronger long-context preservation.
| Model | qa1 slope | qa2 slope | qa3 slope | Long-context profile |
|---|---|---|---|---|
| Claude Opus 4.6 | −0.0 pp | −0.0 pp | −4.0 pp | Flat across all 3 tasks — best-in-class decay profile at the cost of hyperscaler compute. |
| GPT-5.5 | −2.0 pp | +0.0 pp | −9.0 pp | Near-flat on retrieval; steepest qa3 slope of the high-accuracy cluster. |
| Claude Opus 4.7 | −2.0 pp | −2.0 pp | +2.0 pp | Flat slope but starts low on qa3 (~30%); regression vs 4.6 surfaces as low intercept, not slope. |
| Modulum (Gemma-4-31B-Q4) | −8.75 pp | −6.75 pp | −2.5 pp | Best-in-class qa3 slope (multi-fact temporal). Steeper than Opus 4.6 on qa1/qa2 but well clear of Gemini / Grok decay. |
| Vanilla Gemma-4-31B-Q4 | −8.0 pp | +1.0 pp | −2.36 pp | Mask-ablation control. qa2 positive slope likely N=50 noise. |
| Gemini 3.1 Pro | −15.3 pp | −6.0 pp | −7.6 pp | Steep qa1 decay; the 1M crash is the right tail of this slope. |
| Grok 4.3 | −25.0 pp | −20.0 pp | −8.3 pp | Steepest qa1/qa2 decay measured. Likely retrieves only the top of context. |
Hypernym's published thesis is that Modulum preserves earlier-context facts when frontier models lose them at length. The qa3 slope (−2.5 pp / doubling — better than every other tested stack except Opus 4.7) is the strongest evidence for this. The claim does NOT hold on qa1 retrieval — Opus 4.6 (0pp), GPT-5.5 (−2pp), and Opus 4.7 (−2pp) all have flatter qa1 slopes than Modulum (−8.75pp). The honest framing: Modulum preserves multi-fact reasoning state better than retrieval at this scale. On qa3 in particular, Modulum competes with hyperscaler-scale stacks on slope while running on 16 GB.
The leaderboard reports accuracy. A formally publishable benchmark must also report: prefill rate, decode rate, error patterns, and within-run drift. The Modulum + Vanilla SQLite captures these natively via the llama.cpp timings block; frontier APIs do not.
| Context | Modulum qa1 | Vanilla qa1 | Modulum qa2 | Vanilla qa2 | Modulum qa3 | Vanilla qa3 |
|---|---|---|---|---|---|---|
| 32k | 35.1 | 50.4 | 39.5 | 40.4 | 49.5 | 40.7 |
| 64k | 33.6 | 41.5 | 35.1 | 37.6 | 45.9 | 35.7 |
| 128k | 37.1 | 35.9 | 32.7 | 34.9 | 40.2 | — |
| Context | Modulum (median) | Vanilla (median) | Δ Platform |
|---|---|---|---|
| 32k | ~700 t/s | ~880 t/s | −20% |
| 64k | ~620 t/s | ~775 t/s | −20% |
| 128k | ~593 t/s | ~615 t/s | −4% |
| Model · cell | Errors / N | Failure mode |
|---|---|---|
| Gemini 3.1 Pro · qa1 1M | 50 / 50 | HTTP 429 spending cap — all 50 requests blocked by Google API budget layer. NOT model failure. No data on actual 1M capability. |
| Gemini 3.1 Pro · qa2 256k | 16 / 16 | HTTP 429 spending cap. NOT model failure. Cell not measurable. |
| Gemini 3.1 Pro · qa1 512k | 6 / 50 | 12% HTTP 429 rate; remaining 44 completed at 75% accuracy. Real degradation exists on the 44 that completed. |
| Grok 4.3 · qa3 128k | 3 / 26 | Run incomplete — backend errors, N=26 instead of N=50. |
| Grok 4.3 · qa3 64k | 3 / 50 | 6 % backend error rate. |
| Modulum · qa1 128k | 3 / 200 | 503 in-flight collisions during phase-1; recovered to 200/200 via retry+backoff. |
| All other cells | 0 | Clean — no API errors. |
summary.csv (http_status ≠ 200). Per-row error text retained in all_results.csv for traceability.Split each cell's sample order into 3 equal slices (early / mid / late) and measure accuracy per slice. Indicates whether a model degrades over sustained operation.
| Cell | Early | Mid | Late | Drift (late − early) | Read |
|---|---|---|---|---|---|
| Modulum qa1 64k (N=100) | 87.9 % | 78.8 % | 64.7 % | −23.2 pp | Strong monotonic decay — KV-cache or attention-state accumulation. |
| Modulum qa3 128k (N=500) | 32.5 % | 26.5 % | 22.0 % | −10.5 pp | Sustained-run drift on the hardest cell. |
| Modulum qa1 128k (N=200) | 69.7 % | 68.2 % | 76.5 % | +6.8 pp | Phase-5 extension samples (idx 100–199) easier on average — phase mix dominates drift signal here. |
| Gemini 3.1 Pro qa1 512k (N=50) | 93.8 % | 62.5 % | 44.4 % | −49.3 pp | Extreme-context drift — accuracy collapses within a 50-sample run. |
| Opus 4.6 qa3 32k (N=50) | 100 % | 87.5 % | 77.8 % | −22.2 pp | Surprising drift on an easy cell — investigate Anthropic batch state. |
| Opus 4.6 qa1 64k (N=50) | 100 % | 100 % | 100 % | 0.0 pp | Clean — control case for "no drift" baseline. |
full_audit.json. Full per-cell tercile table is in the JSON export; this shows the largest drifts. The Modulum qa1 64k −23 pp end-to-end drift is a production-blocking signal — has to be diagnosed before partner deployment.RMT-team/babilong-1k-samples (verified for all cells via run_meta.json)exports/manifest.json.exports/full_audit.json.exports/full_audit.json + canonical SQLite.Listed in full because transparency about limitations is the credibility move for partner-side validation.
Modulum is a proprietary inference platform on a 31B-Q4 open-weight model with minimal orchestration. Frontier comparators (GPT-5.5, Opus 4.6/4.7, Gemini 3.1 Pro, Grok 4.3) are products with internal orchestration we cannot disable: context caching, sparse attention, internal RAG (possibly), thinking-mode reasoning (confirmed for GPT-5.5 and Gemini), tool use. Compute footprint per call is materially larger on frontier than Modulum (orders of magnitude depending on model and runtime, not measured in this bench). Modulum's gap should be read in that light.
Both models reject temperature=0 in their API (deprecated / unsupported). Default temperature ≈ 1.0 means outputs are sampled, not deterministic. Per-cell accuracy could shift ±3–5pp on rerun. The 13.8pp gap between Modulum and Gemini at qa3 128k N=500 is way above this noise level.
Vanilla Gemma-4 was run at N=50 per cell vs Modulum at N=100–500. Apples-to-apples comparisons (same idx 0..49 both sides) shown in section 04. Modulum's full N=200–500 numbers are statistically more confident than the vanilla N=50 baseline.
Modulum endpoint enforces 1 in-flight request at a time (gateway-level). Frontier APIs run continuous batching with concurrent users. Per-call latency on Modulum is therefore higher than it would be at production scale; cannot be cleanly compared to frontier API latency.
Modulum and vanilla Gemma-4 mirror endpoints are documented by Hypernym as capped at 128k context. We did not run probe requests above 128k against Modulum to obtain a direct 4xx confirmation in our SQLite. Important correction: the prior "Gemini 1M = 0%" claim is RETRACTED — that cell was 50/50 HTTP 429 spending-cap failures (Google API budget layer), not Gemini-model-context failures. We have no published evidence either way about Gemini 1M actual capability. The clean comparison ceiling for both stacks today is 128k.
qa1 has 6 possible location outputs; substring matching is lenient. Re-scoring with exact-last and unique-loc rules within ±1 row per cell — empirical scoring-bias concern not validated. Still, all reported numbers are substring-rule based.
Opus 4.6 (88.7% 128k avg) substantially outperforms Opus 4.7 (65.3% 128k avg) on this benchmark. Could be: real Anthropic regression 4.6→4.7 on long-context reasoning, OR 4.7's default sampling parameters differ from 4.6's (we couldn't set temperature on 4.7). Reported both transparently. Opus 4.6 is the stronger comparator.
| # | Phase | N | What | Status |
|---|---|---|---|---|
| 1 | qa1 baseline | 100 | Modulum qa1 × 32k/64k/128k; 78 503-errors at 128k | DONE |
| 1b | Resume failed | 78 | Re-ran failed qa1 128k rows w/ retry+backoff; recovered 75/78 | DONE |
| 2 | Parallel qa2+qa3 32k | 100 | Killed — discovered single-slot backend through 503-storm | KILLED |
| 3 | Full Modulum matrix | 100 | qa3+qa2 × 32k/64k/128k sequential, 0 errors | DONE |
| 4 | PPL capture | 20 | Logprob capture across all cells; revealed model is overconfident | DONE |
| 5 | N=200 extension | +100 | qa1 128k + qa2 64k/128k + qa3 128k to N=200 | DONE |
| 8 | Phase-8 frontier baseline | 50 | Gemini 3.1 Pro + Grok 4.3 — 2026-vintage current frontier | DONE (Gemini), Grok ~95% done |
| 9 | Gemini qa3 128k extension | +150 | To N=200 on qa3 128k headline cell | DONE |
| 10 | Both sides to N=500 on qa3 128k | +300 | Modulum + Gemini qa3 128k to N=500 each; gap +13.8pp p<0.0001 | DONE |
| 11 | GPT-5.5 + GPT-5.3-codex batch | 50 | OpenAI Batch API. GPT-5.3-codex rejected by batch API (no batch support) | GPT-5.5 DONE |
| 12 | Opus 4.7 batch (sampled) | 50 | Anthropic Message Batches API; default temp due to deprecated temperature | DONE |
| 13 | Vanilla Gemma-4 mask ablation | 50 | Same Gemma-4-31B-Q4 weights without Modulum platform components. qa1+qa2 complete (300 rows); qa3 32k complete; qa3 64k partial (39/50); qa3 128k not run. | DONE (qa3 128k pending) |
| 14 | Opus 4.7 rerun temp=0 | 50 | FAILED — temperature deprecated for Opus 4.7. No cost. (Earlier batch without temp param is canonical.) | FAILED |
| 15 | Gemini extended context | 50 | 256k/512k/1M on Gemini 3.1 Pro — qa1 256k 84% (0 err), qa1 512k 66% raw (6 HTTP-429 spending-cap errors; 75% on completed), qa1 1M (50/50 HTTP-429 spending-cap — no model data). qa2 256k (16/16 HTTP-429 spending-cap, no data). qa3 256k+ not run. | DONE (partial — budget capped 1M) |
| 16 | Opus 4.6 batch | 50 | Best frontier performer — 88.7% 128k avg | DONE |
Request to Hypernym pending. Critical for testing the central hypothesis that Modulum's platform contribution amplifies at extreme contexts where frontier decays (Gemini at 1M = 0% in our test).
Only cell where Modulum platform underperforms vanilla base (−6pp). Is this real (platform over-correction on short-context multi-hop) or sample-specific noise? Worth running at higher N to clarify.
Single-slot demo backend doesn't reflect production-scale serving economics. Real cost/throughput numbers require Hypernym to expose batched inference behavior.
Current comparison is bare-Modulum vs full-frontier. A fairer comparison would add equivalent orchestration (RAG, chain-of-thought, caching) to Modulum and measure the closure rate.
If Hypernym's platform contribution holds (+13pp avg over bare model), applying it to a 70B+ or 200B+ base could close more of the frontier gap. Currently hypothetical until partner-deployed.
"Hypernym's Modulum inference platform measurably improves long-context performance over the same Gemma-4-31B-Q4 base by +12.15 pp on average across 9 BABILong cells (3 of 8 reach p<0.05 at N=50), achieves the flattest qa3 multi-fact reasoning decay slope of any tested stack at −2.5 pp per doubling of context, and runs on a workstation-scale single-GPU deployment (Hypernym profile) — though it currently trails current-generation closed-weight frontier products (Opus 4.6, GPT-5.5) on absolute 128k accuracy by 38.0–42.7 pp, making the load-bearing value proposition multi-fact context preservation at workstation scale rather than absolute-capability parity with hyperscaler-served frontier."