Hypernym is the reality substrate for the post-Bitter-Lesson era.
Five reasoning models, four ideation rounds, ~20 outputs. What the panel converged on: the unit of product is the Persistent Domain Schema; the empirical proof that 75% of attention is structural noise demands a new architecture; and Hypercore + Modulum together compose the first commercial Grounded State Compiler.
An LLM that routes attention through a PDS is a different kind of system than an LLM that searches a token field. Five models with very different priors arrived at the same architecture under five different names: Reality Substrate (Claude), Grounded World Kernel (Codex), Verifiable Causal Engine (Grok), Bicameral Architecture (Gemma), Hypernym World Model (Gemini). One concept, five mouths. That convergence is the actual market signal pushing through.
Five compounding zero-to-ones.
Each step exists today; each enables the next. The stack is what makes Hypernym non-cloneable: every layer below has to exist before the layer above is even legible.
Most attention is noise.
Across four architectures (Llama 3.1 8B, MiniMax M2.5 228B, plus two others), approximately 75% of attention heads contribute structural noise while 25% carry signal. The implication is architectural, not an engineering detail, and it demands publication-grade replication.
Persistent Domain Schema.
The unit of product. Entities, facts, confidence, provenance, embeddings, vocab window. Compiled from a corpus, mounted as substrate at inference time. Convergent across all four rounds and 5 of 5 models. Karpathy markdown vaults, Mem0, OpenHarness MEMORY.md are downstream of this. Hypernym's wedge is grounded PDS: mechanical confidence with structural provenance.
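The fields named above suggest one possible shape for a PDS record. A minimal sketch follows; the type names (`PdsFact`, `Provenance`, `isGrounded`) and field layout are illustrative assumptions, not the shipped schema.

```typescript
// Hypothetical sketch of the Persistent Domain Schema unit of product.
// Field names are assumptions drawn from the prose, not a published spec.
interface Provenance {
  sourceDoc: string;      // where the fact was extracted from
  extractionRun: string;  // e.g. an Omnifact trial-batch id
}

interface PdsFact {
  entity: string;
  predicate: string;
  value: string;
  confidence: number;     // mechanical confidence in [0, 1]
  provenance: Provenance; // structural provenance, never optional
}

interface PersistentDomainSchema {
  entities: string[];
  facts: PdsFact[];
  vocabWindow: string[];             // canonical token vocabulary for the domain
  embeddings: Map<string, number[]>; // entity -> vector
}

// A grounded PDS admits only facts with provenance and in-range confidence.
function isGrounded(pds: PersistentDomainSchema): boolean {
  return pds.facts.every(
    (f) => f.confidence >= 0 && f.confidence <= 1 && f.provenance.sourceDoc.length > 0
  );
}
```

The point of the sketch is the wedge itself: confidence and provenance are structural fields on every fact, not optional metadata.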
Grounded State Compiler.
Takes Hypercore truth (Omnifact 60-trial extraction, HyperRemember reranking, Compressed Repo Analyze 87%) and outputs a Modulum-loadable state. Already the beachhead in Forge: dispatch-core/src/hypernym.ts ships S37/S38 cache-first FORGE.md compression in production. The concept compiles outward to every domain.
Reality Substrate.
The agent's attention routes through the PDS, not a token field. Every claim has provenance and confidence. Five models named the same architecture five different ways: Reality Substrate, Grounded World Kernel, Verifiable Causal Engine, Bicameral Architecture, Hypernym World Model. The convergence is the signal.
Bicameral Architecture.
Substrate and LLM as two cooperating minds. The substrate guarantees mechanical truth (PDS, confidence, provenance); the LLM does inference. Reference implementation: Modulum-7B-Native, continued pretrain of Qwen 3.5-7B or Llama 3.1-8B with attention-modification objectives. Four of five panel models picked this option. Once it exists, the PDS spec becomes load-bearing infrastructure.
Concepts that ≥3 of 5 models named.
Different words, same building blocks. Not products: building blocks. The primitives compose; the products are downstream compositions of them.
- Substrate Router compress → ground → CBAS → route on every dispatch. The trunk line: until it exists, every other Hypernym primitive is advisory.
- CBAS gate The executor refuses high-cost tools (rm -rf, deploy_to_prod) if confidence falls below threshold. Per-track τ via replay calibration.
- Memory Router remember, recall, reconcile, replay. Confidence-gated writes. Vendor-neutral. Closes the "memory stores but does not surface" failure.
- Substrate diffing A disputes edge between dissenting reviewers turns round-2 disambiguation from human re-reading into machine-readable diff.
- Substrate-aware tooling Bash during BUILD sees Grind-mode rules; the same call during PLAN sees Pivot-mode rules. PDS conditions tool selection.

The big matrix.
All ~60 ideas. X axis: distance to ship (now to 2-year lab). Y axis: research depth (engineering to frontier). Color: surface (Hypercore-now / Hypercore-future / Modulum-now / Modulum-future / Lab / Outlier). Size: panel convergence (1 to 5 of 5 models). This is a matrix to process, not a pre-filtered top five.
Where each idea lives.
Hypercore surfaces convergent product-grade work. Modulum surfaces the architecture that proves the thesis. The Lab is where world-model and simulation experiments stress-test the substrate.
- Hypernym Vault Grounded memory for Goose / OpenHarness / Cline. 5/5 panel · the cash and distribution flywheel
- SectorPack API Sector-tuned Omnifact (legal, medical, finance, academic). 5/5 panel · monetizes a latent sector param
- GroundedNotes / Receipt Vault Obsidian-first, B2C prosumer. 5/5 panel · Karpathy-vault but grounded
- DomainForge CLI domainforge compile --source ./docs --output my.pds --sector legal. 3/5 panel · the "git" for knowledge compilation
- Omnifact-arXiv Public fact-graph + SPARQL. CC-BY 4.0. 5/5 panel · NeurIPS / ICLR Verifiability Reports
- CBAS Runtime Confidence-bound action gating before tool execution.
- Magic plugin v1 Already shipping; extends to substrate-diff in CR panels.
- Compressed Repo Analyze 87% compression. Bootstraps any repo into a PDS.
- Goose Expert Pack Exchange Marketplace. Stripe pack, Terraform pack, FDA pack. Codex outlier · network effect
- Sector PDS Hub Vertical SKUs. Legal, medical, finance, energy, research.
- PDS-to-LoRA pipeline Customer uploads PDS; service generates synthetic Q&A; LoRA. 3/5 panel
- Grounded Expertise Cartridge H2 outlier. Pre-built per-domain bundles.
- Memetic PDS GitHub for fact-graphs. Anime canon, D&D, Wiki-style.
- arXiv Citation Standard "Verifiability Report per accepted paper." Standard capture.
- LongMemEval-Grounded Cert + HLI Category capture. Pairs with EU AI Act Art. 52.
- Hyper-Synthetix PDS-driven synthetic-data factory for vertical model trainers.
- Living Corpus / EchoStream Streaming PDS with freshness half-lives.
- Magic plugin v2 Cross-doc contradiction surfacing in real time.
- PDS.md spec DESIGN.md analog. Format-as-product.
- RepoTwin Compressed Repo Analyze + Dreamer 4 to pre-flight agent commits.
- 75%-Attention-Is-Noise Empirical study across 4 architectures. The structural finding the rest stands on.
- 3.04× Decode Speedup Demonstrated on Modulum substrate, no weights modified.
- Infinite Context via Attention Modification Software-only; ships through inference-time substrate-loading.
- Cache-First FORGE.md Compression Forge S37/S38, already production. dispatch-core/src/hypernym.ts
- HyperRemember Embeddings + fact-based reranking. Substrate query layer when Modulum runtime is mocked.
- Omnifact 60-trial Quorum Stochastic semantic-fact extraction with confidence. A Semantic-Quorum primitive in production.
- Substrate-Mounted Inference (SW) Pure software composition (HyperRemember + clever prompting). Reproduces Modulum substrate behavior on any base. Crafter MVP uses this.
- Modulum-7B-Native Continued pretrain of Qwen 3.5-7B / Llama 3.1-8B with attention-modification objectives. 4/5 panel · 12 weeks · category-defining
- Substrate-Native Harness LoRA + pre-attention substrate-loading hooks via SGLang RadixAttention or vLLM. OSS Python package.
- Substrate-1 (from-scratch) 1B–4B with 25%-signal-heads-only attention. The "we know the architecture is wasteful, build the right one" play.
- Modulum Runtime (hosted) 3-of-5 R7.7 outlier convergence: skip-training-build-runtime. modulum serve <any-hf-model>.
- 25%-Signal-Heads Attention First architectural implementation of the noise-pruned head class.
- PDS-as-Shareable-Prefix Mass-shareable SGLang prefix cache. 40-60% token-cost reduction vs RAG.
- KV-as-Substrate Treat KV cache as queryable substrate, not ephemeral compression.
- In-Place TTT × PDS Fast-weights as memory; PDS as the structured TTT input.
- Distill-Frontier-into-Substrate-Native Gemma outlier. Claude / GPT-4o teacher into Phi-4 student via SFT. $30–50K pre-flight.
- Modulum chip narrative Hardware ceiling rising (Flash Attention 4 at 1605 TFLOPs/s). Modulum kernels ship via vLLM.
- Crafter v1 substrate-mounting MVP 5/5 panel pick. 21 days · ~$40K · publication-grade falsifier
- SWE-Bench Verified v2 5/5 outlier convergence. 6–8 weeks after Crafter. Real-codebase substrate test.
- Dreamer 4 grounded variant Diamonds-in-Minecraft from offline data. PDS-grounded equivalent loads compiled domain facts.
- Genie 3 textual analog DeepMind ships visual world models; Hypernym's lane is textual / factual.
- NetHack lifetime-learning + PDS Persistent expertise across deaths.
- Brax / MuJoCo physics-grounded Physics-domain PDS: every fact has a physical referent.
- AgentGym-RL RL-train agents with PDS for compounding session expertise.
- Gym-Anything (200 apps) Software-as-environment paradigm. Hypercore facts pre-populate every env.
- WebArena substrate Browser-task agent benchmark; PDS for "browser tasks at site X."
- ml-intern HF post-training Substrate-mounted research-loop agent.
- MiroShark cross-pollination Swarm-intelligence sim engine; agent influence leaderboard scored by Omnifact.
- TimesFM time-series PDS Specialized PDS for time-series forecasting; loadable into Modulum.
- Heaviside-style domain foundation models EM 800,000× faster than commercial sim. Direct PDS-for-physics analogy.
- CORAL multi-agent discovery Self-evolving agents need state continuity; the Grounded State Compiler is the missing piece.
- Counterfactual Futures Market Branching PDS forks. Foundation for HLI.
- Hypernym Court Multi-agent arbitration mechanism.
- Contradiction Atlas Catalogues active disputes across docs, models, time.
- VeriBrand Algorithmically verifiable marketing: brand uploads spec PDS, content auto-checked.
- ProvenanceShield / ShieldRuntime Anti-hallucination streaming proxy; SSE-token-retract via MCP.
- APN Receipts Cryptographically signed reviewer verdicts.
- Knowability Replay Time-indexed reconstruction of what was knowable when.
- Self-Evolving Harness Hypercore profiles harness logs and proposes patches.
The cleanest place to falsify the thesis.
Schmidhuber's Neural World Model Boom essay positions world models as the substrate competitor to next-token LLMs. Hypernym's lane is the textual / factual world model, the one with provenance. Five models on the panel, asked to design a 3-week MVP, picked the same environment.
Crafter v1 substrate-mounting MVP UNANIMOUS
Open-world Minecraft-like with 22 achievements as built-in benchmarks. Symbolic state (grid, inventory, vitals) maps 1:1 to PDS entities. Published baseline (Hafner 2022 geomean; DreamerV3 ~14%, SPRING/GPT-4 ~27%). Clean A/B isolates the substrate layer.
- Target. 1.5× sample efficiency over 100 train eps; ≥+8 absolute Crafter points; 95% bootstrap CI excludes zero.
- Hardest falsifier (Codex). A baseline given a hand-written 20-fact static checklist reproduces the gain — meaning the PDS does no runtime work. Pre-registered.
- Vocab window. ~90–120 canonical tokens (17 actions + 22 achievements + ~20 objects + ~12 predicates).
- Fact tiers. Repo-rule (Omnifact bootstrap, conf 0.90–0.97); transition (env-observed, 0.95); strategy (post-episode distillation, 0.65–0.90).
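The fact tiers above imply a confidence-band policy per tier. A minimal sketch, assuming the bands listed in the spec; the type and function names are illustrative, not Hypercore's API.

```typescript
// Hypothetical encoding of the three Crafter fact tiers and the
// confidence bands from the MVP spec. Names are illustrative.
type FactTier = "repo-rule" | "transition" | "strategy";

const TIER_BANDS: Record<FactTier, [number, number]> = {
  "repo-rule": [0.90, 0.97], // Omnifact bootstrap
  transition: [0.95, 0.95],  // environment-observed, fixed
  strategy: [0.65, 0.90],    // post-episode distillation
};

// Clamp a raw extraction score into its tier's admissible band.
function tierConfidence(tier: FactTier, raw: number): number {
  const [lo, hi] = TIER_BANDS[tier];
  return Math.min(hi, Math.max(lo, raw));
}
```

The band ceiling matters: a strategy fact can never claim repo-rule certainty, no matter how confident the extraction run is.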
SWE-Bench Verified UNANIMOUS
Every model on the panel was asked for a "wildly different" outlier MVP. All five picked SWE-Bench. The reason is structural: it is the highest-blast-radius LLM-agent benchmark, and Compressed Repo Analyze finally earns its keep on real code (Crafter does not really need it).
- Per-repo PDS. Entities = files, functions, tests, issues. Facts = call graphs, deps, type sigs, tracebacks.
- Target. +8–15% absolute pass rate over baseline (SOTA is 3–6%, so +8% is publication-worthy).
- Sequence. Crafter v1 (3wk); if positive, fund SWE-Bench v2 (6–8wk).
Environment evaluation
Why Crafter wins.
| Environment | Hypercore-ingestable | Clean A/B | Failure decomposable | Panel verdict |
|---|---|---|---|---|
| Crafter | Yes, symbolic state | Yes, published baseline | Yes, per-achievement | 5 / 5 — PICK |
| SWE-Bench Verified | Yes, repo PDS | Yes, pass rate | Yes, per-test | 5 / 5 — v2 outlier |
| AgentGym-RL | Multi-ontology friction | RL confounds | Mixed | Rejected, confound risk |
| NetHack LE | Extreme partial obs. | Strong | Hard | v3+ candidate |
| Brax / MuJoCo | Continuous physics | Strong | Numeric | Wrong domain story |
| Gym-Anything (200 apps) | Yes, software state | Per-app | Strong | Too broad for v1 |
| WebArena | DOM-shaped | Strong | Per-task | v3+ candidate |
| SCADA / clinical sim | Domain-shaped | No published baseline | Hard | Customer story, not v1 |
Surrounding research the substrate work composes with
What we lift, what we extend, what we co-position with.
Schmidhuber — Neural World Model Boom
Authoritative survey from the field's founder. Positions WMs as the substrate competitor to next-token LLMs. Modulum's PDS-into-attention is structurally the same bet, but with the grounded provenance Schmidhuber does not address.
Dreamer 4
Diamonds in Minecraft purely from offline data (Hafner). PDS-grounded equivalent: load a domain's Hypercore-compiled facts as the offline corpus, then deterministic agent expertise.
Project Genie 3
DeepMind's interactive 3D world model. Visual WM lane is taken; textual / factual WM lane is open. Co-positioning, not collision.
PufferLib 2.0
RL at 1M steps/s with vectorized environments. PDS-into-Modulum becomes the "expertise prior" PufferLib environments load.
In-Place Test-Time Training (ICLR 2026 oral)
Continual learning via projection-matrix fast weights. PDS-into-fast-weights is a plausible product wedge: structured TTT input that is also human-auditable.
CALM (Continuous Autoregressive)
Replaces next-token with vector-prediction at K-token chunks. Hypernym's attention-DB-query frame aligns: the structured chunk is a PDS-block.
SGLang RadixAttention
25K stars, 400K GPUs deployed. Prefix-cache and KV-share ecosystem; PDS as mass-shareable prefix. The cache plane Modulum substrate slots into.
TurboQuant / RocketKV / ChunkKV / Expected Attention
Training-free KV compression at production scale. ChunkKV's "semantic chunks" maps cleanly to "fact units" in Hypercore. Expected Attention's future-query awareness implies PDS knows the future-query distribution per domain.
MiroShark
Swarm-intelligence simulation engine; daily autonomous influence leaderboard already scored. Hypernym Omnifact extracts claim-grade facts; Modulum-as-sim-substrate plugs straight in.
TimesFM
Google open-source time-series foundation model, 100B data points zero-shot. Specialized FMs for narrow modalities. PDS analog: time-series persistent expertise loadable into Modulum.
Heaviside
Foundation model for electromagnetism, 800,000× faster than commercial sim. Direct analogy: PDS for physics-domain.
CORAL
Autonomous multi-agent open-ended scientific discovery. Self-evolving agents need state continuity, and the Grounded State Compiler is the missing piece they hand-wave.
Concept-level zero-to-ones.
The list overlaps the §02 primitives but is read at the engineering-architecture grain. Names you could put on a whiteboard and build to.
- Substrate Router compress → ground → CBAS → route as a single chokepoint. Promotes Forge's grounding-gate.ts from observability v1 to enforcement v2.
- Memory Router remember, recall, reconcile, replay. Confidence on writes; substrate-diff on disagreement; lineage walk on replay.
- Disputes edge A disputes edge between dissenting reviewers. The open-disagreements query becomes the seed for the next sprint.

What we lift, what we extend.
110 OSS items mapped from a 286-item bookmark sweep plus targeted X-trends search. Filtered for relevance to Hypercore (comprehension and grounding) and Modulum (PDS-into-attention).
World Models & Simulation
- PufferLib 2.0 RL at 1M steps/s. PDS-into-Modulum as the expertise prior PufferLib envs load.
- Crafter 22-achievement open-world. 5/5 panel pick for substrate-mounting MVP.
- NetHack Learning Env Lifetime-learning. Persistent expertise across deaths via PDS.
- Brax / MuJoCo Playground Physics-grounded. PDS for physics-domain; every fact has a physical referent.
- AgentGym-RL Long-horizon multi-turn LLM-agent RL. RL-train agents with PDS for compounding session expertise.
- Gym-Anything (CMU) 200-app software-as-environment. Hypercore facts pre-populate every Gym-Anything env.
- Dreamer 4 (Hafner) Diamonds in Minecraft from offline data. PDS-grounded equivalent: load Hypercore-compiled facts as the offline corpus.
- Genie 3 (DeepMind) Interactive 3D world model. Visual WM lane occupied; textual / factual lane open.
- Heaviside EM foundation model, 800,000× faster than sim. Direct PDS-for-physics analogy; specialized FM template.
- TimesFM Google OSS time-series foundation model. Time-series persistent expertise loadable into Modulum.
- CORAL Multi-agent autonomous discovery. Self-evolving agents need state continuity; GSC is the missing piece.
- MiroShark Swarm-intelligence sim engine + influence leaderboard. Modulum-as-sim-substrate plugs straight in.
- Schmidhuber WM Boom essay Authoritative WM survey. Direct positioning opportunity; Hypernym = grounded WM.
- Hitchhiker's Guide to World Models Field-consolidating survey. Right moment for a "grounded WM" thesis paper.
Agent Harnesses & Runtime
- Goose (Block) 35K stars, most-loved OSS harness. Day-1 distribution channel for Hypernym Vault.
- OpenHarness (HKUDS) 9.1K stars, MEMORY.md, MCP, ReactTUI. PDS is the missing "compiled domain" layer.
- Pi-mono Simplest harness, highest cache hit, lowest tokens. Reference impl for thin harness; Hypernym = fat-skill source.
- Hermes Agent v0.12 (Nous) Multi-agent Kanban. Each kanban task carries a PDS preload.
- Hermes self-evolution $2 to rewrite own brain. Auto-evolving agents need durable state ground-truth = Hypernym.
- Flue First headless TS agent harness. Magic-plugin-style PDS injection is the complementary layer.
- AutoAgent #1 SpreadsheetBench, top GPT-5 TerminalBench. Meta-agent rewrites harness overnight; PDS shouldn't be re-derived.
- Anthropic Managed Agents Hosted harness, "Dreaming" feature. Anthropic ships sessions as durable state; Hypernym's grounded dream is the differentiator.
- OpenClaw + Knowledge System Knowledge templates. Lacks grounded provenance; Omnifact adds it.
- Claude Artifacts (open-sourced) Sandboxed iframe. "Show me the PDS" B2C surface.
- Adaptive Passport Agent acquires its own API keys. Now agents acquire their own PDS bundles.
- Browser Harness Self-healing, edits helpers.py on the fly. PDS for "browser tasks at site X" directly applies.
Attention / KV-cache / Inference
- SGLang 25K stars, 400K GPUs deployed (RadixAttention). Prefix-cache + KV-share; PDS = mass-shareable prefix.
- vLLM Day-0 MiniMax M2.7 support, multi-agent orchestration. Modulum kernels could ship via vLLM.
- CALM (Tencent + Tsinghua) Continuous Autoregressive Language Models — vector-prediction at K-token chunks. Structured chunk = PDS-block; attention-DB-query alignment.
- TurboQuant (ICLR 2026) 6× memory, 8× faster on H100, training-free. Hypernym should ship a PDS-aware variant.
- RocketKV 400× compression, 32.6% peak memory reduction. Aggressive eviction + sparse attention combo.
- ChunkKV Semantic-chunk compression, +26.5% throughput. "Semantic chunks" maps cleanly to "fact units" in Hypercore.
- Expected Attention KV compression by estimating future-query attention. PDS knows the future-query distribution by domain.
- Self-Indexing KVCache 1-bit vector quantization unifies compression and retrieval. Compression-as-index = PDS-as-query-target.
- SALS (Sparse Attention in Latent Space) 6.4× compression, 5.7× speedup. PDS-into-attention is structurally compatible.
- Multi-head Latent Attention (MLA) Winning at scale. Latent attention is the substrate trend; PDS lives natively in latent.
- Flash Attention 4 1605 TFLOPs/s on Blackwell. Hardware ceiling rising; Modulum chip narrative benefits.
- OpenMythos Recurrent-depth transformer reverse-engineering of Mythos. Looped transformer = compute-adaptive depth; PDS routes depth selection.
- Atomic Chat TurboQuant Gemma 4 at 25 tok/s on 16GB MacBook Air. Same wedge Modulum cuts.
Memory / Persistent State
- Karpathy markdown-vault thesis "AI files itself." Most-cited memory-pattern of Q2 2026; Hypernym is its grounded backend.
- claude-mem ~50K stars, persistent context across sessions. Memory plugin gold rush; Hypernym = grounded version.
- Mem0 84.23% LongMemEval. Direct competitor; Hypernym differentiates on grounding.
- Byterover 92% accuracy claim, Git-like hierarchy, 50–70% token savings. Reproducibility is the bar.
- Zep · Letta · Cognee · Honcho Top-6 agent-memory frameworks 2026. Hypernym benchmarks against all on LongMemEval.
- Graphify 71.5× fewer tokens per query, no vector DB. LLM-knowledge-graph trend; Hypercore is the grounded version.
- llm-wiki (nvk) Persistent personal KB. Wiki pattern + Hypernym substrate.
- Single Brain Vector DB ingests Slack/CRM/calls every 15min. "Company brain" pattern; PDS is the unit.
- MemMachine Ground-truth-preserving memory. Direct conceptual sibling to Hypernym ground-truth.
- Persistent Identity in AI Agents Multi-anchor identity. PDS = identity-of-domain.
- Icarus inside Obsidian Hermes memory as readable notes + graph. Mechanical-confidence overlays directly.
Standards & Protocols
- DESIGN.md (Stitch by Google) Apache 2.0 open spec, 5.2K stars in 72hr. Reference for "PDS.md" specification proposal.
- MCP 97M downloads, Anthropic + OpenAI + Google + Microsoft adoption. Tool protocol settled; PDS-as-MCP-resource is a path.
- UCP (Tobi/Shopify) Universal Commerce Protocol. PDS could be the equivalent for domain expertise.
- C2PA 6,000 members, AI provenance global ref. Mechanical-confidence is the LLM-content peer.
- Pricing.md trend Auth0, Resend, WorkOS. .md-as-API standardization expanding.
- OWASP Top 10 Agentic 2026 Agent security framework. PDS provenance maps to ASI controls.
- Agentic Risk Standard DeepMind + MS + Columbia + Virtuals. Risk-rating standard.
- Resolver.md (Garry Tan) Routing-table-in-markdown, 200 lines replaces 20K. PDS-routing primitive.
- x402 / ERC-8004 / ERC-8183 stack On-chain agentic-economy standards. Modulum-as-service inference micropayments path.
Post-training / Fine-tuning
- LlamaFactory Unified efficient fine-tuning of 100+ LLMs/VLMs. Distribution surface for PDS-to-LoRA pipeline.
- ml-intern Automates HuggingFace post-training team. End-to-end research-loop agent; composes with GSC.
- In-Place Test-Time Training (ICLR 2026) Fast-weights as memory. PDS-into-fast-weights as a product wedge.
- SHL0MS Autoreason Agent-debate reasoning method. Consistency arbitration is exactly Hypercore's mechanic.
- Trinity (Arcee) Agent-coherence-tuned model. PDS-tuned models is the dual.
- Carnice-9b Qwen3.5-9b harness-tuned. Harness-specific fine-tunes; PDS-conditioned variants are the next step.
- Budgeted LoRA Distillation as structured compute allocation. PDS-conditioned LoRAs as a product line.
- SLAD Shared LoRA adapters for task-specific distillation. One PDS, many LoRA.
Compression / Encoding
- Compressed Repo Analyze (Hypernym) 87% compression. Already shipping in Forge S37/S38.
- Hypernym Omnifact 60-trial fact extraction. Already shipping.
- Bonsai 8B (PrismML) 1-bit intelligence-density paradigm (1.06/GB vs 0.10/GB). Framing wedge for Modulum's compression story.
- Google KV-cache 6× compression Same quality. Direct competitor in compression; differentiate on persistent expertise.
- Anthropic compaction 4 levels, disk-backed task list, CLAUDE.md memory. Compaction-as-product; Hypernym Magic plugin is the production version.
B2C / Dev-tool Surface
- CodeWiki (Google) Paste GitHub repo, get interactive guide. Compressed Repo Analyze is the substrate.
- Karpathy second-brain Claude Skill B2C distribution channel.
- GBrain v0.10 RESOLVER.md + SOUL.md + ACLs, 24 fat skills. Personal-OS reference impl.
- Garry Tan fat-skills/fat-code/thin-harness Dominant agent-engineering thesis. Hypernym is the source of fat-skill content.
- Glass by Ramp Every-employee AI with one-click setup. Enterprise-rollout pattern needing pre-built domain PDS.
- Claude Doctor Reads ~/.claude, writes CLAUDE.md rules. Self-healing config; Magic plugin can do the same with grounded facts.
- SKILLIFY pattern (Garry Tan) Skill-creation loop. PDS-ify is the structural cousin.
- Bud (AI Human Emulator) Full computer + comms. Human-emulator agents demand durable identity = PDS.
Eval / Benchmark
- LongMemEval Memory-vendor benchmark. 87%+ target for Hypernym Vault publication.
- Crafter benchmark (Hafner) 22-achievement geomean. 5/5 panel MVP target.
- SWE-Bench Verified Real-codebase agent benchmark. Unanimous 5/5 v2 outlier.
- LegalBench / FinanceBench / MedQA SectorPack falsifiers. +15 F1 / +12 F1 deltas.
- LongBench-v2 / RULER-128K / NIAH-extreme Long-context substrate benchmarks. Modulum-7B-Native targets.
- NeurIPS / ICLR Verifiability Reports Standard-capture mechanism for arXiv service.
- TerminalBench (Stanford + Laude) 89 tasks, harness+model pair eval. PDS is harness-orthogonal IP.
Research-loop / Self-Improvement
- Karpathy Autoresearch Self-improving research loops. Verify-run-loop is PDS-validation-loop.
- Meta-Harness (Stanford) Karpathy Autoresearch on steroids. Reference impl.
- Agentic Harness Engineering paper Automated harness evolution. PDS as a learnable component.
- DeepAgents Harness Profiles Model-harness-task fit. Versioning the PDS around the harness.
- Tinker for autoresearch Autoresearch tooling category.
Forge × Hypernym, file-cited.
Where the Hypernym primitives compound into Forge's research infrastructure. Each upgrade is implementable today on top of the existing dispatch-core/src/hypernym.ts beachhead. Effort: S (1–2 days), M (~1 week), L (2–4 weeks).
F1 · Substrate Router. The trunk line. dispatchThrough(substrateRouter, envelope) chains compress → groundClaims → CBAS gate → routeProviderChain. Every other Hypernym primitive remains advisory until this lands. Promotes grounding-gate.ts from observability v1 to enforcement v2.
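A minimal sketch of the chokepoint, assuming simple stage signatures; the envelope shape, stage implementations, and the 0.85 gate value here are illustrative placeholders, not the dispatch-core interfaces.

```typescript
// Illustrative sketch of the F1 chokepoint: every dispatch flows
// compress -> groundClaims -> CBAS gate -> routeProviderChain.
// Stage signatures are assumptions, not the dispatch-core API.
interface Envelope {
  payload: string;
  confidence?: number;
  trace: string[]; // which stages touched this envelope, in order
}

type Stage = (e: Envelope) => Envelope;

function dispatchThrough(stages: Stage[], envelope: Envelope): Envelope {
  return stages.reduce((e, stage) => stage(e), envelope);
}

const compress: Stage = (e) => ({ ...e, trace: [...e.trace, "compress"] });
const groundClaims: Stage = (e) => ({ ...e, confidence: 0.9, trace: [...e.trace, "ground"] });
const cbasGate: Stage = (e) => {
  if ((e.confidence ?? 0) < 0.85) throw new Error("CBAS: confidence below gate");
  return { ...e, trace: [...e.trace, "cbas"] };
};
const routeProviderChain: Stage = (e) => ({ ...e, trace: [...e.trace, "route"] });

const substrateRouter: Stage[] = [compress, groundClaims, cbasGate, routeProviderChain];
```

The enforcement property is the order itself: routing cannot run on claims that were never grounded and gated.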
F2 · Substrate-mounted evals. The measurement flywheel. Adds a seventh suite running the same OutcomeEvalCase set twice: once with the existing dispatcher, once with a substrate-mounted dispatcher. Quantifies the substrate's value-add. Without it, the rest of the backlog becomes a belief system.
F3 · CBAS gate. Smallest effort, largest behavioral change. The executor refuses high-cost tools (rm -rf, deploy_to_prod) if Omnifact confidence falls below τ (default 0.85). Per-track τ via replay calibration. Kills the "burn budget on low-confidence retries" failure mode mechanically.
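The gate logic fits in a few lines. A sketch under the stated defaults; the policy shape, `allowAction` name, and track names are assumptions, only the tool names and default τ come from the text above.

```typescript
// Sketch of the F3 confidence-bound action gate. Tool names and the
// default tau come from the prose; the API shape is an assumption.
const HIGH_COST_TOOLS = new Set(["rm -rf", "deploy_to_prod"]);

interface CbasPolicy {
  defaultTau: number;                  // 0.85 per the spec above
  perTrackTau: Record<string, number>; // calibrated via replay
}

function allowAction(
  policy: CbasPolicy,
  track: string,
  tool: string,
  confidence: number
): boolean {
  if (!HIGH_COST_TOOLS.has(tool)) return true; // only gate high-cost tools
  const tau = policy.perTrackTau[track] ?? policy.defaultTau;
  return confidence >= tau;
}
```

Per-track τ is the interesting part: a track whose replays show expensive false positives can be tightened without touching the global default.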
F4 · Memory Router. Four primitives alongside store/search/get: remember(scope,key,val,{confidence,provenance}), recall(query,{minConfidence}), reconcile(key), replay(entryId). Vendor-neutral; every provider gets an adapter. Closes the "memory stores but does not surface" failure.
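A minimal in-memory sketch of three of the four primitives (replay, which walks lineage, is omitted). The class and entry shapes are assumptions about the adapter surface, not a vendor API.

```typescript
// Hypothetical in-memory sketch of the F4 primitives.
interface MemoryEntry {
  key: string;
  value: string;
  confidence: number;
  provenance: string;
}

class MemoryRouter {
  private store = new Map<string, MemoryEntry[]>();

  remember(scope: string, entry: MemoryEntry): void {
    const entries = this.store.get(scope) ?? [];
    entries.push(entry);
    this.store.set(scope, entries);
  }

  // Only entries at or above minConfidence surface: the mechanical fix
  // for the "memory stores but does not surface" failure.
  recall(scope: string, minConfidence: number): MemoryEntry[] {
    return (this.store.get(scope) ?? []).filter((e) => e.confidence >= minConfidence);
  }

  // reconcile: keep the highest-confidence entry for a key.
  reconcile(scope: string, key: string): MemoryEntry | undefined {
    return this.recall(scope, 0)
      .filter((e) => e.key === key)
      .sort((a, b) => b.confidence - a.confidence)[0];
  }
}
```

The design choice worth noting: confidence is a write-time field, so recall can filter without re-scoring anything.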
F5 · Substrate diffing. Adds a disputes edge relation. New endpoint GET /api/v1/cxdb/disputes/:entryId. New CLI forge convergence diff <sha> treats per-reviewer findings as N substrates and emits exactly which semantic facts differ. Round-2 disambiguation becomes machine-readable.
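The core of the diff is a keyed comparison over fact assertions. A sketch assuming facts encode as key/value pairs; the `Substrate` encoding and function name are illustrative.

```typescript
// Sketch of F5: treat per-reviewer findings as substrates and emit
// exactly which semantic facts differ. The key/value encoding is an
// illustrative assumption, not the CXDB representation.
type Substrate = Map<string, string>; // factKey -> asserted value

function substrateDiff(a: Substrate, b: Substrate): string[] {
  const disputed: string[] = [];
  for (const [key, val] of a) {
    // Only facts both reviewers asserted, with conflicting values, are disputes.
    if (b.has(key) && b.get(key) !== val) disputed.push(key);
  }
  return disputed.sort();
}
```

Facts asserted by only one reviewer are coverage gaps, not disputes; the sketch deliberately excludes them.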
F6 · APN receipts. Each reviewer verdict and findingsDigest is signed with reviewer-keyed material via Keychain (the existing credential broker). Adds signatures[].receipt: APNReceipt. The audit log becomes non-repudiable, a precondition for ever exposing review verdicts off-host.
F7 · Checkpoint provenance / knowability replay. CheckpointPreview gains {confidence, provenance: {transitionEventId, attestationPath, dispatchId, instructionsHash}}. forge replay <sha> reconstructs "what was epistemically available when this transition fired." Post-mortem becomes a one-liner.
F8 · CXDB confidence attestations. Adds confidence REAL + quorum_trials INT. autoAttest calls runSemanticQuorum(entry, model), which submits the entry text through Hypernym N times. Stable attractors score high; drifting attractors score low. Drives F4's reconcile and recall.
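One plausible stability metric behind such a quorum is modal frequency: run N trials, then score how strongly outputs cluster on a single attractor. This is a sketch under that assumption; the real Omnifact scoring is not specified in this document.

```typescript
// Hypothetical stability score for a semantic quorum: submit the same
// entry N times and measure how often the modal output recurs.
// Modal frequency as the metric is an assumption.
function attractorStability(trials: string[]): number {
  const counts = new Map<string, number>();
  for (const t of trials) counts.set(t, (counts.get(t) ?? 0) + 1);
  const modal = Math.max(...counts.values());
  return modal / trials.length; // 1.0 = perfectly stable attractor
}
```

A confidence column populated this way is cheap to recompute and directly comparable across entries, which is what F4's recall filter needs.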
F9 · Hypothesis quorum. Insert quorumCheck(proposal) before autoApprove(). Submit each proposal's hypothesis text through Hypernym N=60 times. Require attractor stability ≥ τ before the council vote. Catches hypotheses that are linguistically smooth but semantically incoherent.
F10 · Reality Bus events. ForgeEvent gains confidence?: number, groundedBy?: string[], semanticHash?: string. POST /api/v1/events becomes a fact-ingest endpoint. External agents subscribe to facts, not just events. No new infra; the pipes already exist.
F11 · PDS for FORGE.md. Promote FORGE.md to a PDS manifest. Each section gets a factId, confidence, and provenance pointer. The envelope ships only the sections relevant to the current FSM node, picked by the existing dispatch metadata. Constitutional Clauses carry nonDeletable: true.
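The section-selection rule can be sketched directly. The `ManifestSection` shape and `relevantNodes` field are assumptions; only factId, confidence, provenance, and nonDeletable come from the text above.

```typescript
// Sketch of F11: FORGE.md sections as PDS manifest entries, filtered
// per FSM node, with nonDeletable Constitutional Clauses always kept.
interface ManifestSection {
  factId: string;
  confidence: number;
  provenance: string;
  relevantNodes: string[]; // FSM nodes this section applies to (assumed field)
  nonDeletable?: boolean;  // Constitutional Clauses
}

function sectionsForNode(sections: ManifestSection[], node: string): ManifestSection[] {
  // Constitutional Clauses survive every filter; everything else is
  // shipped only when the current FSM node needs it.
  return sections.filter((s) => s.nonDeletable === true || s.relevantNodes.includes(node));
}
```

The nonDeletable flag is what makes the compression safe: no envelope, however aggressively trimmed, can drop a Constitutional Clause.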
F12 · Self-evolving harness. A new script reads events.jsonl + per-iter retros + dispatch-metrics.jsonl, compresses via Hypernym, and ranks "bottleneck facts" (e.g. "Codex round 2 finds bypass-class bugs that round-1 prompts miss"). Output goes to .forge/artifacts/harness-patch-proposals/{date}.md; the next sprint SEED reads it.
F13 · CBAS-weighted cost. Cost reports gain a confidence field. New budget rule: "do not spend more than $X on dispatches where substrate confidence is below τ." The pre-dispatch query returns 402 when violated. Three weeks of memory feedback say we burn budget on low-confidence retries; this kills it mechanically.
F14 · Disputes edge. Adds disputes to the edge relations. When the outliers sidecar records a dissentFromConsensus, ingest creates a disputes edge. Combined with semantic-quorum confidence (F8), the AUDIT phase ranks the "highest-stakes open disagreement" automatically. A seed for the next sprint, surfaced for free.
F15 · Substrate-aware ToolSearch. Each classified intent carries substrateProfile: 'pivot' | 'grind' | 'neutral', derived from the FSM node. Tool calls inside a dispatch get the substrate annotation: a Bash call during BUILD sees Grind-mode rules; the same call during PLAN sees Pivot-mode rules. This is the dynamic injection the telemetry note describes.
Compose order · foundations first, then the substrate plane, then close the loop
- Foundations (S, parallelizable). F7 checkpoint provenance, F10 Reality Bus events, F13 CBAS-weighted cost, F14 disputes edge. All additive schema plus one handler each.
- Confidence math (M, sequential). F8 CXDB attestations gain confidence; F4 Memory Router primitives consume it.
- Substrate plane (L, builds on 1+2). F1 Substrate Router, F11 PDS for FORGE.md, F15 substrate-aware ToolSearch.
- Loop closure (M, builds on plane). F9 hypothesis quorum, F5 substrate diffing, F12 self-evolving harness, F7 knowability retros, F2 substrate-mounted evals, F6 APN receipts.
High-variance R&D bets.
Single-model proposals or 3-of-5 outlier convergence. Outliers reset categories. The 3-of-5 "skip-training-build-runtime" convergence in R7.7 is the strongest non-recommended signal in the entire panel.
Modulum Runtime (hosted). Three of five R7.7 models independently raised this as their outlier. modulum serve <any-hf-model> applies substrate-mounting at inference time without changing weights. Sized at $200K parallel runtime + $550K B-track for a $750K total inside the $1M envelope. The strongest signal not voted as primary, and the cleanest hedge against B-track failure.
HLI. Quarterly Moody's-style rating of the top-40 AI products by per-domain fabrication rate. Pairs with EU AI Act Art. 52 (Aug 2026). Pulls demand for every Hypernym SKU. The category-defining move: convert Hypernym from "another memory vendor" into the scorekeeper.
LongMemEval-Grounded Cert. Free benchmark + paid certification. Hypernym sets the scoreboard for the memory-vendor category. Open-source-from-day-one as the methodology defense. Highest-leverage outlier in the round; closely related to HLI; together they reframe procurement.
Counterfactual Futures Market. Branching PDS forks for "what if F had been different?" Substrate diffing applied to alternative timelines. Enterprise strategists, regulators. Speculative; long-horizon; the foundation for HLI's underlying mechanism.
Hypernym Court. When two agents disagree, who decides? Mechanical fact-graph diff and contradiction resolution. Creates a market category that does not yet exist. Multi-agent ops at scale will need it within 18 months.
Distill-Frontier-into-Substrate-Native. Use Claude or GPT-4o as teacher; distill PDS reasoning into Phi-4 or a 1B small model via SFT. $30–50K, 2 weeks. A pre-test for the B-track: validate the substrate thesis before committing $550K. Possibly the single highest-ROI experiment in the deck.
Hypernym contributes PDS dataset, IP, and $100–200K; partner provides compute and co-authors a paper. De-risks B-track at 30–50% cost share. Different IP terms but cleaner distribution if it works.
Memetic PDS. Anime canon, D&D rules, Wiki-style domains. Confidence-as-leaderboard social mechanic. Plausibly viral; lower direct revenue but a unique enterprise funnel. The B2C Hypernym-as-platform play.
Financially-incentivized adversarial fact-disputing platform. Users stake on truth claims; the byproduct is an invaluable training corpus. Wild but plausible: data asset over revenue.
VeriBrand. Brands upload a product spec PDS; all marketing content is auto-checked. FTC and EU compliance. Clean B2B revenue line; underdeveloped in the round but an obvious surface for HLI customers.
RepoTwin. Compressed Repo Analyze + Dreamer 4 to simulate agent commits before they land. Aligns with the branchable-counterfactual theme. Long-game; high-value enterprise infrastructure once the Modulum runtime exists.
Hyper-Synthetix. PDS-driven synthetic-data generation for vertical model trainers. Vertical SLMs need 3K–30K training examples; PDS shortcuts the example-generation phase. Composes with the PDS-to-LoRA pipeline.
Agent due-diligence service. Upload traces, receive a competence envelope. Cloud-uptime / security-attestation analog for the agent era. Long-game strategic positioning.
If C-track (from-scratch substrate-native) is the right answer at some point, you would need 2 architecture-and-alignment researchers to spec it. Worth keeping a year-out option open even without funding it now.
What R7 did not resolve.
Each item needs another round, a customer conversation, a partnership conversation, or a focused experiment. Listed here so the team picks them up, not so they sit unowned.