Modulum × BABILong — Full Benchmark Report

DRAFT · 2026-05-18 · canonical evaluations across 7 models · v5 (paired-retention statistical foundation added; slope claim corrected per Codex + Grok audit)

Independent replication · partner-evaluation grade

Hypernym Modulum vs 5 current-generation frontier inference stacks, evaluated on BABILong long-context retrieval and multi-hop reasoning.

We replicate Hypernym's published "Modulum Attention — First, Not Lost" BABILong recipe and extend it to head-to-head comparison with current 2026-vintage frontier models (GPT-5.5, Claude Opus 4.6 / 4.7, Gemini 3.1 Pro, Grok 4.3), a clean mask-ablation control using the same Gemma-4-31B-Q4 weights without Modulum's platform components, and extended-context probing at 256k / 512k / 1M tokens. This report covers 4,917 canonical model-prompt evaluations, deduped from 5,556 raw rows across 25 SQLite databases via deterministic phase-priority rules (audit trail in exports/dedupe_log.csv, per-cell sources in exports/manifest.json).

01 · Headline leaderboard

Where Modulum lands among current-generation frontier stacks at 128k context.

Stack	qa1 128k	qa2 128k	qa3 128k	avg 128k	Type
Claude Opus 4.6	96%	90%	80%	88.7%	Closed-weight, hosted
GPT-5.5	96%	92%	64%	84.0%	Closed-weight, hosted
Gemini 3.1 Pro	84%	72%	40.8%	65.6%	Closed-weight, hosted
Claude Opus 4.7	92%	66%	38%	65.3%	Closed-weight, hosted
Modulum (Gemma-4-31B-Q4 + Hypernym platform)	71.5% (N=200)	39.5% (N=200)	27.0% (N=500)	46.0%	Open-weight base, workstation deployable
Vanilla Gemma-4-31B-Q4 (no Modulum)	62% (N=50)	30% (N=50)	20.0% (N=50)	37.3%	Mask-ablation control
Grok 4.3	30% (N=50)	18% (N=50, 1 err)	15.4% (N=26, 3 err)	21.1% (low-N qa3)	Closed-weight, hosted

Per-cell N shown inline. Modulum cells aggregate phase-1/3/5/10 (N=100→500) via dedupe; frontier cells are batch runs at N=50 except Gemini qa3 128k (N=500 from phase-9+10). Grok 4.3 qa3 128k completed only 26 of 50 rows due to backend errors. Worst-case Wilson 95% CI half-widths: N=50 ≈ ±13pp, N=200 ≈ ±7pp, N=500 ≈ ±4.4pp. All accuracy values use case-insensitive substring match against the ground-truth target token.

Modulum 128k average

46.0%

vs frontier average of 75.9% (Opus 4.6 + GPT-5.5 + Opus 4.7 + Gemini 3.1 Pro, excluding Grok outlier). Gap of 29.9pp on absolute capability — measured at workstation-scale base vs hyperscaler-served frontier.

Platform contribution vs vanilla

+11.78pp avg

Modulum beats bare Gemma-4-31B-Q4 by 6–28pp on 8 of 9 apples-to-apples cells (idx 0..49, N=50 per side). One cell (qa3 32k) shows −6pp; both that regression and the +6pp qa3 64k lift are within sampling noise at this N. 3 of the 8 lifts are significant at p<0.05.

Deployable footprint

workstation·single GPU

Modulum's stated deployment profile (per Hypernym) is single-GPU workstation-class. Frontier serving footprint is hyperscaler-scale, but exact memory figures were not measured in this bench.

02 · What is being measured

BABILong: long-context retrieval and multi-hop reasoning across 32k–1M tokens.

Tasks

qa1: Single-fact retrieval. Find the most recent location of a named entity. Example: "John walked to the bedroom… Where is John?"
qa2: 2-fact reasoning. Track an object through a person's location changes. Example: "John picked up the milk… John walked to the kitchen… Where is the milk?"
qa3: 3-fact temporal reasoning. Reason over a history of object locations. Example: "Where was the apple before the bathroom?"

Random-guess floor for these tasks: ~17% (6 location candidates).

Context lengths tested

32k tokens: Baseline retrieval task — most models handle natively
64k tokens: Beginning of "long context" decay zone
128k tokens: Headline cell — Modulum's published claim region
256k tokens: Beyond Modulum's current API limit; Gemini-only
512k tokens: Extreme-context probe; Gemini-only
1M tokens: Gemini 3.1 Pro nominally supports; tested below

03 · Models tested

Seven distinct inference stacks compared.

Stack	Base model	Quantization	Hosting	N per cell	API surface
Modulum (Hypernym)	Gemma-4-31B-it	Q4_K_M	gemma4.hypernym.ai/v1	100–500	OpenAI-compatible chat/completions
Vanilla Gemma-4 (mask ablation)	Gemma-4-31B-it (same)	Q4_K_M (same)	Hypernym vanilla mirror	50	llama.cpp /completion
GPT-5.5	OpenAI proprietary	(presumed FP16)	OpenAI API	50	Batch API
Claude Opus 4.6	Anthropic proprietary	(presumed FP16)	Anthropic API	50	Message Batches
Claude Opus 4.7	Anthropic proprietary	(presumed FP16)	Anthropic API	50	Message Batches (default temp)
Gemini 3.1 Pro	Google proprietary MoE	(presumed mixed)	Google Gemini API	50–500	generateContent
Grok 4.3	xAI proprietary	(presumed FP16)	xAI API	50	chat/completions

All cells use the HuggingFace RMT-team/babilong-1k-samples dataset (verified against run_meta.json across all 25 runs). Same scoring: case-insensitive substring match against ground-truth target token.
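The scoring rule above can be sketched in a few lines. This is a minimal illustration, assuming each result row carries the raw model output and the ground-truth target string; the function names `is_correct` and `accuracy` are hypothetical and are not taken from the bench code.

```python
def is_correct(model_output: str, target: str) -> bool:
    """Case-insensitive substring match against the ground-truth target token."""
    needle = target.strip().lower()
    if not needle:
        # An empty target would match every output, so reject it explicitly.
        raise ValueError("target must be a non-empty string")
    return needle in model_output.lower()


def accuracy(rows) -> float:
    """rows: iterable of (model_output, target) pairs. Returns percent correct."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    hits = sum(is_correct(output, target) for output, target in rows)
    return 100.0 * hits / len(rows)
```

Note that an output such as "Not the bedroom; the kitchen." still counts as a hit for the target "bedroom" under this rule, which is the leniency that confound C-06 refers to.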

04 · Mask ablation — what Modulum's platform actually contributes

Same Gemma-4-31B-Q4 weights, with and without Hypernym's platform stack.

This is the cleanest possible isolation of platform contribution. Hypernym provisioned a vanilla mirror endpoint serving the identical Gemma-4-31B-Q4 weights via raw llama.cpp without Modulum's components. Both sides receive the same 50 BABILong prompts (idx 0–49) at temperature 0.

Cell (idx 0..49)	Vanilla Gemma-4	Modulum (platform)	Δ Platform	Wald p-value
qa1 32k	78.0%	90.0%	+12.0pp	p=0.097 (ns)
qa1 64k	66.0%	84.0%	+18.0pp	p=0.034 *
qa1 128k	62.0%	72.0%	+10.0pp	p=0.285 (ns)
qa2 32k	28.0%	56.0%	+28.0pp	p=0.003 **
qa2 64k	44.0%	52.0%	+8.0pp	p=0.422 (ns)
qa2 128k	30.0%	50.0%	+20.0pp	p=0.037 *
qa3 32k	28.0%	22.0%	−6.0pp	p=0.487 (ns)
qa3 64k	32.0%	38.0%	+6.0pp	p=0.528 (ns)
qa3 128k	20.0% (10/50)	30.0%	+10.0pp	p=0.245 (ns)

Modulum platform contribution averages +11.78pp across the 9 completed N=50 mask cells. Range: −6pp (qa3 32k regression, not significant) to +28pp (qa2 32k peak, significant). All cells compare the same idx 0..49 on both sides at N=50 per side. p-values are from a two-sample Wald test for a difference of proportions. qa3 128k vanilla is now complete (10/50 correct). 3 of 9 cells are significant at p<0.05. Significance markers: * p<0.05, ** p<0.01, *** p<0.001, (ns) not significant at α=0.05.
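The p-values in the table are consistent with an unpooled two-sample Wald test, which can be sketched as follows. This is an illustrative reimplementation under that assumption, not the audit script itself; the function name `wald_two_proportion` is hypothetical.

```python
import math


def wald_two_proportion(x1: int, n1: int, x2: int, n2: int):
    """Two-sided Wald test for a difference of proportions (unpooled variance).

    x1/n1 is the baseline (vanilla) hit count, x2/n2 the treatment (Modulum).
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p2 - p1) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For the qa2 32k cell (14/50 vanilla vs 28/50 Modulum) this gives z ≈ 2.96 and p ≈ 0.003, matching the table row.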

Interpretation — statistical honesty

At N=50 per side, only 3 of 9 mask cells reach conventional statistical significance (qa1 64k, qa2 32k, qa2 128k). The qa2 32k cell at +28pp is the strongest single piece of evidence that the platform does real architectural work (p=0.003). The other 5 lifts trend positive but are below the noise floor at N=50, including the qa1 128k cell that v1 reported as +14.6pp (now corrected to +10pp and still not significant). The qa3 32k regression (−6pp) is also not significant at this sample size; it could be sample noise rather than a real platform side-effect. To convert these into formally publishable claims, the next experiment cycle must extend vanilla N to 200+ per cell.

05 · Long-context decay curves

Accuracy as a function of context length, by task.

Figure 1 · qa1 (single-fact retrieval) — accuracy vs context length

Modulum holds 89→77→71.5% on a 31B-Q4 workstation model (N=100/100/200); Opus 4.6 holds 96→100→96% on hyperscaler infra (N=50/50/50).

Legend: Modulum (Gemma-4-31B-Q4, N=100–200) · Claude Opus 4.6 · GPT-5.5 · Gemini 3.1 Pro · Vanilla Gemma-4 (no Modulum) · Grok 4.3

Figure 2 · qa3 (3-fact temporal reasoning) — the hardest task

Modulum declines only modestly on qa3 (32→33→27, −2.5 pp per doubling); Opus 4.6 sits far higher but declines slightly faster (88→96→80, −4.0 pp per doubling); Gemini and Grok decay more sharply.

Legend: Modulum · Opus 4.6 · GPT-5.5 · Gemini 3.1 Pro · Opus 4.7 (sampled; note regression vs 4.6) · Grok 4.3

06 · Extended context probe — beyond 128k

How does frontier accuracy hold up at 256k / 512k / 1M context?

Only Gemini 3.1 Pro natively supports context above 200k. Modulum and most other frontier APIs cap at 128k–200k. This data is Gemini-only and informs the hypothesis: does frontier decay accelerate at extreme contexts?

Gemini 3.1 Pro · context	qa1	qa2	qa3	Notes
32k	94%	84%	56%	0 errors
64k	88%	78%	48%	0 errors
128k	84%	72%	40.8% (N=500)	Headline cell; 0 errors
256k	84% (50/50 success)	— (0/16 success)	not run	qa2 256k: all 16 attempts HTTP 429 — monthly spending cap hit before completion, not model failure
512k	66% (33/50 raw, 33/44 of completed = 75%)	not run	not run	6 of 50 hit HTTP 429 spending cap; 44 completed — degradation is real on those 44
1M	n/a (0/50 completed)	n/a	n/a	50/50 HTTP 429 — monthly spending cap exceeded for ALL 1M attempts. No data on Gemini 1M model capability; cannot infer crash vs success.

Critical disclosure: the 1M qa1 cell and the qa2 256k cell are budget-limit failures (HTTP 429), not model-capability failures. Sample error message from SQLite: "Your project has exceeded its monthly spending cap". The "Gemini cannot answer at 1M" claim from v1/v2 is RETRACTED — we cannot conclude that from the available data.

Honest implication (revised)

Gemini 3.1 Pro on qa1 retrieval holds at 128k (84%) and 256k (84%, 0 errors) — no decay measured from 128k → 256k. At 512k, accuracy drops to 66% across 50 attempts (12% of which hit the spending cap; the 44 that completed averaged 75%). At 1M we have no measurement — every attempt was rejected at the API budget layer. The "frontier decay accelerates at extreme contexts" claim is supported only by the 512k cell, not by 1M. Modulum cannot be tested in this range either because the Modulum API is capped at 128k by Hypernym. The cleanest publishable framing: long-context frontier comparison stops at 128k for both stacks; the 256k/512k extended probe is suggestive, not conclusive.
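The distinction between raw accuracy and accuracy on completed requests can be made explicit with a small helper. This is a sketch; `accuracy_views` is a hypothetical name, and it assumes every API error is counted as an incorrect row in the raw figure, which is how the 512k cell is reported above.

```python
def accuracy_views(correct: int, attempted: int, errors: int):
    """Return (raw_pct, completed_pct, error_rate_pct) for one cell.

    raw_pct counts API errors as wrong answers; completed_pct excludes them.
    completed_pct is None when no request completed, i.e. there is no
    measurement of model capability at all.
    """
    if attempted <= 0 or not 0 <= errors <= attempted:
        raise ValueError("need attempted > 0 and 0 <= errors <= attempted")
    completed = attempted - errors
    raw_pct = 100.0 * correct / attempted
    completed_pct = 100.0 * correct / completed if completed else None
    return raw_pct, completed_pct, 100.0 * errors / attempted
```

With the 512k cell, `accuracy_views(33, 50, 6)` returns 66.0, 75.0 and 12.0. With the 1M cell, `accuracy_views(0, 50, 50)` returns a completed-only value of `None`, which is why that cell supports no capability claim.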

06b · Decay slopes — accuracy vs log₂(context)

How fast each stack loses ground per doubling of context.

Linear fit of accuracy against log₂(context tokens) across the 32k → 128k window (Gemini extended to 1M). Reported as pp per doubling. Lower magnitude = flatter decay = stronger long-context preservation.

Model	qa1 slope	qa2 slope	qa3 slope	Long-context profile
Claude Opus 4.6	−0.0 pp	−0.0 pp	−4.0 pp	Flat across all 3 tasks — best-in-class decay profile at the cost of hyperscaler compute.
GPT-5.5	−2.0 pp	+0.0 pp	−9.0 pp	Near-flat on retrieval; steepest qa3 slope of the high-accuracy cluster.
Claude Opus 4.7	−2.0 pp	−2.0 pp	+2.0 pp	Flat slope but starts low on qa3 (~30%); regression vs 4.6 surfaces as low intercept, not slope.
Modulum (Gemma-4-31B-Q4)	−8.75 pp	−6.75 pp	−2.5 pp	Flattest qa3 slope (multi-fact temporal) apart from Opus 4.7 and the vanilla control. Steeper than Opus 4.6 on qa1/qa2 and slightly steeper than Gemini on qa2 (−6.75 vs −6.0 pp); well clear of Grok.
Vanilla Gemma-4-31B-Q4	−8.0 pp	+1.0 pp	−2.36 pp	Mask-ablation control. qa2 positive slope likely N=50 noise.
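Gemini 3.1 Pro	−15.3 pp	−6.0 pp	−7.6 pp	qa1 slope is not reliable: the −15.3 pp figure matches a fit that enters the 1M cell as 0%, but that cell was 50/50 HTTP 429 spending-cap failures (section 06). The 32k–128k cells alone (94→88→84) give −5.0 pp.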
Grok 4.3	−25.0 pp	−20.0 pp	−8.3 pp	Steepest qa1/qa2 decay measured. Likely retrieves only the top of context.

Slopes computed by OLS over cells with N ≥ 20 and context ≥ 8k. Gemini qa1 fit extends to 1M (5 points); all others 32k–128k (3 points).
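The slope calculation can be sketched as an ordinary least-squares fit on log₂(context). This is a minimal illustration, not the audit code: the function name is hypothetical, and "32k" is taken as 32,000 tokens, which does not affect the slope as long as each step is a doubling.

```python
import math


def slope_pp_per_doubling(points):
    """OLS slope of accuracy (percent) against log2(context tokens).

    points: list of (context_tokens, accuracy_percent) pairs.
    Returns percentage points gained (or lost, if negative) per doubling.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to fit a slope")
    xs = [math.log2(tokens) for tokens, _ in points]
    ys = [acc for _, acc in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all context lengths are identical")
    return sxy / sxx
```

For Modulum qa1 (89 → 77 → 71.5 at 32k / 64k / 128k) this returns −8.75, and for Modulum qa3 (32 → 33 → 27) it returns −2.5. With three evenly spaced points the OLS slope reduces to half the difference between the last and first values, so the middle cell does not move the slope.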

The First, Not Lost claim — what the slope data actually says

Hypernym's published thesis is that Modulum preserves earlier-context facts when frontier models lose them at length. The qa3 slope (−2.5 pp per doubling, flatter than every other tested stack except Opus 4.7 at +2.0 pp and the vanilla control at −2.36 pp) is the strongest evidence for this. The claim does NOT hold on qa1 retrieval: Opus 4.6 (0pp), GPT-5.5 (−2pp), and Opus 4.7 (−2pp) all have flatter qa1 slopes than Modulum (−8.75pp). The honest framing: Modulum preserves multi-fact reasoning state better than retrieval at this scale. On qa3 in particular, Modulum competes with hyperscaler-scale stacks on slope while targeting a single-GPU workstation footprint (Hypernym's stated profile; memory was not measured in this bench).

06c · Beyond accuracy — speed, errors, drift

Variables a publishable benchmark must report.

The leaderboard reports accuracy. A formally publishable benchmark must also report: prefill rate, decode rate, error patterns, and within-run drift. The Modulum + Vanilla SQLite captures these natively via the llama.cpp timings block; frontier APIs do not.

Decode speed — tokens / sec (median per cell)

Context	Modulum qa1	Vanilla qa1	Modulum qa2	Vanilla qa2	Modulum qa3	Vanilla qa3
32k	35.1	50.4	39.5	40.4	49.5	40.7
64k	33.6	41.5	35.1	37.6	45.9	35.7
128k	37.1	35.9	32.7	34.9	40.2	—

Modulum decode is 20–30 % slower than vanilla at 32k–64k qa1 (platform overhead) but ~25 % faster on qa3 mid-context — likely because attention conditioning shortens output. Convergence at 128k. Frontier APIs do not expose decode timings; tokens-per-second comparison is platform-only.

Prefill rate — context tokens / sec (median)

Context	Modulum (median)	Vanilla (median)	Δ Platform
32k	~700 t/s	~880 t/s	−20%
64k	~620 t/s	~775 t/s	−20%
128k	~593 t/s	~615 t/s	−4%

Prefill medians averaged across the 3 tasks per cell. Modulum prefill is about 20% slower at 32k–64k (platform overhead) but converges to vanilla by 128k (−4%). This is the cost basis of the accuracy lift; partner deployments must weigh the +11.78pp average accuracy lift (section 04) against ~20% prefill overhead at short context.

Error & failure-mode disclosure

Model · cell	Errors / N	Failure mode
Gemini 3.1 Pro · qa1 1M	50 / 50	HTTP 429 spending cap — all 50 requests blocked by Google API budget layer. NOT model failure. No data on actual 1M capability.
Gemini 3.1 Pro · qa2 256k	16 / 16	HTTP 429 spending cap. NOT model failure. Cell not measurable.
Gemini 3.1 Pro · qa1 512k	6 / 50	12% HTTP 429 rate; remaining 44 completed at 75% accuracy. Real degradation exists on the 44 that completed.
Grok 4.3 · qa3 128k	3 / 26	Run incomplete — backend errors, N=26 instead of N=50.
Grok 4.3 · qa3 64k	3 / 50	6 % backend error rate.
Modulum · qa1 128k	3 / 200	503 in-flight collisions during phase-1; recovered to 200/200 via retry+backoff.
All other cells	0	Clean — no API errors.

All counts canonical from summary.csv (http_status ≠ 200). Per-row error text retained in all_results.csv for traceability.

Within-run drift — accuracy by tercile

Split each cell's sample order into 3 equal slices (early / mid / late) and measure accuracy per slice. Indicates whether a model degrades over sustained operation.

Cell	Early	Mid	Late	Drift (late − early)	Read
Modulum qa1 64k (N=100)	87.9 %	78.8 %	64.7 %	−23.2 pp	Strong monotonic decay; KV-cache or attention-state accumulation is one hypothesis, not yet diagnosed.
Modulum qa3 128k (N=500)	32.5 %	26.5 %	22.0 %	−10.5 pp	Sustained-run drift on the hardest cell.
Modulum qa1 128k (N=200)	69.7 %	68.2 %	76.5 %	+6.8 pp	Phase-5 extension samples (idx 100–199) easier on average — phase mix dominates drift signal here.
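Gemini 3.1 Pro qa1 512k (N=50)	93.8 %	62.5 %	44.4 %	−49.3 pp	Extreme-context drift within a 50-sample run; the 6 HTTP 429 rows are scored as incorrect here, so part of the drop may be budget-cap failures rather than model drift.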
Opus 4.6 qa3 32k (N=50)	100 %	87.5 %	77.8 %	−22.2 pp	Surprising drift on an easy cell — investigate Anthropic batch state.
Opus 4.6 qa1 64k (N=50)	100 %	100 %	100 %	0.0 pp	Clean — control case for "no drift" baseline.

Selected cells from full_audit.json. Full per-cell tercile table is in the JSON export; this shows the largest drifts. The Modulum qa1 64k −23 pp end-to-end drift is a production-blocking signal — has to be diagnosed before partner deployment.
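The tercile split can be sketched as below. This is an illustration only: the function name is hypothetical, and the slice boundaries (floor(N/3), floor(N/3), remainder) are an assumption chosen because they are consistent with the percentages reported in the table, not a rule confirmed by the export.

```python
def tercile_drift(outcomes):
    """Split per-sample outcomes (in run order) into early / mid / late slices.

    outcomes: list of booleans, True when the sample was scored correct.
    Returns ([early_pct, mid_pct, late_pct], late_minus_early_pp).
    """
    n = len(outcomes)
    if n < 3:
        raise ValueError("need at least three samples for a tercile split")
    size = n // 3  # the late slice absorbs any remainder
    slices = [outcomes[:size], outcomes[size:2 * size], outcomes[2 * size:]]
    pcts = [100.0 * sum(s) / len(s) for s in slices]
    return pcts, pcts[2] - pcts[0]
```

As a check on the arithmetic, a synthetic N=50 ordering with 16/16, 14/16 and 14/18 correct yields 100 %, 87.5 % and 77.8 %, a drift of −22.2 pp. Those are the same percentages as the Opus 4.6 qa3 32k row, but the ordering here is constructed for illustration and is not the real per-row data.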

07 · Methodology

How the bench was run.

Parameters

Dataset: RMT-team/babilong-1k-samples (verified for all cells via run_meta.json)
Tasks: qa1, qa2, qa3
Lengths: 32k, 64k, 128k (full matrix); 256k/512k/1M (Gemini-only extended)
Sample idx: 0..N-1 from dataset, deterministic
Scoring: Case-insensitive substring match (Hypernym's published rule)
Temperature: 0 where supported (rejected by GPT-5.5 + Opus 4.7)
Max tokens: 256 (Modulum), 4096 (frontier — to accommodate thinking-mode budgets)
Total observations: 4,917 canonical evaluations (deduped from 5,556 raw rows across 25 SQLite databases). Phase-priority dedupe rules + per-cell source manifest in exports/manifest.json.
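One way a deterministic phase-priority dedupe could work is sketched below. Everything here is an assumption for illustration: the actual rules live in exports/dedupe_log.csv and exports/manifest.json, and the row fields (`model`, `task`, `length`, `idx`, `phase`), the phase labels, and the function name are hypothetical.

```python
def dedupe_by_phase_priority(rows, phase_priority):
    """Keep one row per (model, task, length, idx), preferring earlier entries
    in phase_priority. Ties within a phase keep the first row seen, so the
    result is deterministic for a fixed input order.

    rows: iterable of dicts with keys model, task, length, idx, phase.
    phase_priority: list of phase labels, highest priority first.
    A phase that is missing from phase_priority raises KeyError.
    """
    rank = {phase: position for position, phase in enumerate(phase_priority)}
    best = {}
    for row in rows:
        key = (row["model"], row["task"], row["length"], row["idx"])
        row_rank = rank[row["phase"]]
        if key not in best or row_rank < best[key][0]:
            best[key] = (row_rank, row)
    return [row for _, row in best.values()]
```

Under this sketch, a retried row from a later phase would replace the failed original only if the retry phase is listed ahead of the original phase in `phase_priority`.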

Statistical handling

Per-cell CI: Wilson 95% interval. Worst-case half-widths: N=26 ±17.9pp, N=39 ±15.0pp, N=50 ±13.4pp, N=100 ±9.6pp, N=200 ±6.9pp, N=500 ±4.4pp.
Significance test: Two-sample Wald test for difference of proportions (two-sided). Pre-computed for every mask cell + load-bearing pairs in exports/full_audit.json.
Decay slope: OLS fit of accuracy on log₂(context tokens). Cells with N<20 or context<8k excluded from the fit.
Within-run drift: Per-cell accuracy split into 3 equal terciles of sample order. Reported as late − early pp delta.
Re-scoring: 4 scoring rules applied (substring, exact-last, unique-loc, phrase) — within ±1 row per cell.
Verification: Grok 4.3 file-read audit (v1→v2): closed 9 numerical discrepancies. Codex file-read audit (v2→v3→v4): identified Gemini 1M/256k as HTTP 429 spending-cap (not model-failure), stale vanilla qa3 cells, p-value rounding errors. All v4 corrections applied. Every claim in this report traces to exports/full_audit.json + canonical SQLite.
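The Wilson half-widths quoted above can be reproduced with the standard Wilson score formula. The sketch below is a plain reimplementation with z fixed at 1.96 for a 95% interval; the function names are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion. Returns (low, high)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def half_width_pp(successes: int, n: int) -> float:
    """Half-width of the Wilson 95% interval, in percentage points."""
    low, high = wilson_interval(successes, n)
    return 100.0 * (high - low) / 2
```

The worst case is at 50% accuracy: `half_width_pp(25, 50)` is about 13.4 and `half_width_pp(250, 500)` is about 4.4, in line with the list above. For the Modulum qa1 128k cell (143 of 200 correct) the half-width is about 6.2 pp.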

08 · Known confounds — full disclosure

What we have NOT yet ruled out.

Listed in full because transparency about limitations is the credibility move for partner-side validation.

C-01

Modulum vs frontier is NOT bare-model vs bare-model

Modulum is a proprietary inference platform on a 31B-Q4 open-weight model with minimal orchestration. Frontier comparators (GPT-5.5, Opus 4.6/4.7, Gemini 3.1 Pro, Grok 4.3) are products with internal orchestration we cannot disable: context caching, sparse attention, internal RAG (possibly), thinking-mode reasoning (confirmed for GPT-5.5 and Gemini), tool use. Compute footprint per call is materially larger on frontier than Modulum (orders of magnitude depending on model and runtime, not measured in this bench). Modulum's gap should be read in that light.
C-02

Sampling noise on Opus 4.7 and GPT-5.5

Both models reject temperature=0 in their API (deprecated / unsupported). Default temperature ≈ 1.0 means outputs are sampled, not deterministic. Per-cell accuracy could shift ±3–5pp on rerun. The 13.8pp gap between Modulum and Gemini at qa3 128k (N=500) is well above this noise level.
C-03

Mask ablation only covers idx 0..49 (N=50)

Vanilla Gemma-4 was run at N=50 per cell vs Modulum at N=100–500. Apples-to-apples comparisons (same idx 0..49 both sides) shown in section 04. Modulum's full N=200–500 numbers are statistically more confident than the vanilla N=50 baseline.
C-04

Single-slot serial Modulum vs production-batched frontier

Modulum endpoint enforces 1 in-flight request at a time (gateway-level). Frontier APIs run continuous batching with concurrent users. Per-call latency on Modulum is therefore higher than it would be at production scale; cannot be cleanly compared to frontier API latency.
C-05

Modulum context-window cap at 128k blocks extreme-context comparison

Modulum and vanilla Gemma-4 mirror endpoints are documented by Hypernym as capped at 128k context. We did not run probe requests above 128k against Modulum to obtain a direct 4xx confirmation in our SQLite. Important correction: the prior "Gemini 1M = 0%" claim is RETRACTED — that cell was 50/50 HTTP 429 spending-cap failures (Google API budget layer), not Gemini-model-context failures. We have no published evidence either way about Gemini 1M actual capability. The clean comparison ceiling for both stacks today is 128k.
C-06

Substring scoring inflates retrieval (qa1) more than reasoning (qa2/qa3)

qa1 has 6 possible location outputs, so substring matching is lenient. Re-scoring with the exact-last and unique-loc rules changed results by at most ±1 row per cell, so the scoring-bias concern is not borne out empirically. All reported numbers nonetheless use the substring rule.
C-07

Anthropic version anomaly

Opus 4.6 (88.7% 128k avg) substantially outperforms Opus 4.7 (65.3% 128k avg) on this benchmark. This could be a real Anthropic regression from 4.6 to 4.7 on long-context reasoning, or a difference in default sampling parameters (we could not set temperature on 4.7). Both results are reported. Opus 4.6 is the stronger comparator.

09 · Phase history

How the data was generated, in order.

#	Phase	N	What	Status
1	qa1 baseline	100	Modulum qa1 × 32k/64k/128k; 78 503-errors at 128k	DONE
1b	Resume failed	78	Re-ran failed qa1 128k rows w/ retry+backoff; recovered 75/78	DONE
2	Parallel qa2+qa3 32k	100	Killed — discovered single-slot backend through 503-storm	KILLED
3	Full Modulum matrix	100	qa3+qa2 × 32k/64k/128k sequential, 0 errors	DONE
4	PPL capture	20	Logprob capture across all cells; revealed model is overconfident	DONE
5	N=200 extension	+100	qa1 128k + qa2 64k/128k + qa3 128k to N=200	DONE
8	Phase-8 frontier baseline	50	Gemini 3.1 Pro + Grok 4.3 — 2026-vintage current frontier	DONE (Gemini), Grok ~95% done
9	Gemini qa3 128k extension	+150	To N=200 on qa3 128k headline cell	DONE
10	Both sides to N=500 on qa3 128k	+300	Modulum + Gemini qa3 128k to N=500 each; Gemini ahead by 13.8pp, p<0.0001	DONE
11	GPT-5.5 + GPT-5.3-codex batch	50	OpenAI Batch API. GPT-5.3-codex rejected by batch API (no batch support)	GPT-5.5 DONE
12	Opus 4.7 batch (sampled)	50	Anthropic Message Batches API; default temp due to deprecated temperature	DONE
13	Vanilla Gemma-4 mask ablation	50	Same Gemma-4-31B-Q4 weights without Modulum platform components. qa1+qa2 complete (300 rows); qa3 32k complete; qa3 64k partial (39/50); qa3 128k not run.	DONE (qa3 128k pending)
14	Opus 4.7 rerun temp=0	50	FAILED — temperature deprecated for Opus 4.7. No cost. (Earlier batch without temp param is canonical.)	FAILED
15	Gemini extended context	50	256k/512k/1M on Gemini 3.1 Pro — qa1 256k 84% (0 err), qa1 512k 66% raw (6 HTTP-429 spending-cap errors; 75% on completed), qa1 1M (50/50 HTTP-429 spending-cap — no model data). qa2 256k (16/16 HTTP-429 spending-cap, no data). qa3 256k+ not run.	DONE (partial — budget capped 1M)
16	Opus 4.6 batch	50	Best frontier performer — 88.7% 128k avg	DONE

10 · Open questions / next steps

What would elevate this from MODERATE to STRONG evidence.

▸ 01

Modulum context-window extension to 256k+

Request to Hypernym pending. Critical for testing the central hypothesis that Modulum's platform contribution amplifies at extreme contexts where frontier accuracy decays. The only frontier evidence so far is Gemini's qa1 drop to 66% raw at 512k; the 1M cell returned no data (50/50 HTTP 429).
▸ 02

qa3 32k regression investigation

Only cell where Modulum platform underperforms vanilla base (−6pp). Is this real (platform over-correction on short-context multi-hop) or sample-specific noise? Worth running at higher N to clarify.
▸ 03

Production-batched Modulum throughput

Single-slot demo backend doesn't reflect production-scale serving economics. Real cost/throughput numbers require Hypernym to expose batched inference behavior.
▸ 04

Modulum + orchestration vs frontier + orchestration

Current comparison is bare-Modulum vs full-frontier. A fairer comparison would add equivalent orchestration (RAG, chain-of-thought, caching) to Modulum and measure the closure rate.
▸ 05

Apply Modulum platform to a larger base model

If Hypernym's platform contribution holds (+11.78pp avg over the bare model), applying it to a 70B+ or 200B+ base could close more of the frontier gap. Currently hypothetical until partner-deployed.

11 · Defensible pitch claims

What this data supports vs what it does not.

✓ DEFENSIBLE

Modulum's platform contributes +11.78pp average lift over the same Gemma-4-31B-Q4 base across 9 mask cells. Significant at p<0.05 on 3 cells (qa1 64k, qa2 32k, qa2 128k); trending positive on 5 others (not powered at N=50); one cell (qa3 32k) regresses by 6pp, not significant.
Modulum on a 31B-Q4 workstation model achieves 46.0 % 128k average accuracy (qa1 71.5 % / qa2 39.5 % / qa3 27.0 % at N=200 / 200 / 500). Wilson 95 % CIs: ±6.2 pp / ±6.7 pp / ±3.9 pp respectively.
Near-flattest qa3 decay slope: −2.5 pp per doubling of context, flatter than GPT-5.5 (−9), Grok 4.3 (−8.3), Gemini 3.1 Pro (−7.6), and Opus 4.6 (−4). Only Opus 4.7 (+2.0) and the vanilla control (−2.36) measure flatter. This is the strongest evidence for the "First, Not Lost" multi-fact preservation thesis.
On qa3 128k at N=500 per side, Modulum is significantly below Gemini 3.1 Pro by 13.8 pp (z=−4.66, p<0.001). The comparison is adequately powered, so this gap is a real result rather than sampling noise.
Modulum is open-weight base (Google Gemma-4) — self-hostable, sovereignty-compliant.
Modulum's published target footprint is workstation-scale (single GPU); the exact GPU memory figure (e.g. 16 GB) is from Hypernym's stated deployment profile, not measured in this bench.

✗ NOT DEFENSIBLE

"Modulum is competitive with current frontier on absolute accuracy" — trails Opus 4.6 by 42.7 pp, GPT-5.5 by 38.0 pp at 128k average.
"Modulum has uniquely flat decay across all tasks" — only true on qa3. On qa1, Opus 4.6 (−0 pp/2×) and GPT-5.5 (−2 pp/2×) decay flatter than Modulum (−8.75 pp/2×).
"Modulum beats Gemini 3.1 Pro on qa3 128k" — N=500 shows Gemini wins by 13.8 pp (p<0.001).
"Platform contribution is significant on every cell" — only 3 of 9 mask cells reach p<0.05 at N=50.
"qa3 32k regression is a real platform side-effect" — −6 pp not significant at N=50 (p=0.49); may be sample noise.
"Modulum holds at 256k–1M context" — Modulum endpoint capped at 128k, cannot test.
"Gemini collapses at 1M context" — RETRACTED. The 50/50 failures at 1M are HTTP 429 spending-cap, not model-context-failure. No Gemini 1M capability evidence in this dataset.
"Sustained-run accuracy is stable" — Modulum qa1 64k drifts −23 pp end-to-end within a 100-sample run (production-blocking signal pending diagnosis).

Single defensible 1-sentence pitch

"Hypernym's Modulum inference platform measurably improves long-context performance over the same Gemma-4-31B-Q4 base by +11.78 pp on average across 9 N=50 BABILong cells (3 of 9 reach p<0.05 at N=50), achieves one of the flattest qa3 multi-fact reasoning decay slopes of the tested stacks at −2.5 pp per doubling of context, and runs on a workstation-scale single-GPU deployment (Hypernym profile), though it currently trails current-generation closed-weight frontier products (Opus 4.6, GPT-5.5) on absolute 128k accuracy by 38.0–42.7 pp, making the load-bearing value proposition multi-fact context preservation at workstation scale rather than absolute-capability parity with hyperscaler-served frontier."
