Why Ergo
Why a judgment layer beats a bigger vector store — with the numbers.
The problem: agents don't just forget — they remember wrong
A vector store gives an agent recall. It does not give it consistency.
When a policy, path, or config changes, the old claim stays in the store — still embedded, still retrievable, still confident. The next session recalls it and acts on it. The failure mode isn't amnesia; it's stale memory poisoning new decisions, silently, with no signal that anything is off.
Retrieval alone can't fix this. On a realistic contradiction benchmark, an embedder pulls a stored claim's contradiction into the top‑10 neighborhood 87% of the time (57% at top‑1). To a vector store, a contradiction is just a very similar document — the retriever finds it; nothing judges it.
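The toy illustration below makes that last point concrete, under loud assumptions. A bag-of-words cosine similarity stands in for a real embedder, and the three stored snippets are hypothetical. A claim and its negation share almost every token, so the contradiction ranks first. The score says nothing about whether the two claims agree.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector: a crude, hypothetical stand-in for a real embedder."""
    return Counter(text.lower().replace(".", "").replace(",", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[tok] for tok, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical store contents.
STORE = {
    "release-convention": "Image releases must go through the release script.",
    "retention": "Chat history is kept for thirty days.",
    "runbooks": "Runbooks live in the docs folder of the repo.",
}


def top_k(query: str, k: int = 1) -> list[tuple[float, str]]:
    """Rank stored snippets by cosine similarity to the query, highest first."""
    q = bow(query)
    scored = sorted(((cosine(q, bow(text)), key) for key, text in STORE.items()), reverse=True)
    return scored[:k]


new_claim = "Image releases must not go through the release script."
for score, key in top_k(new_claim, k=3):
    print(f"{score:.2f}  {key}")
```

The two claims differ by a single word, and they score about 0.94 against each other. Retrieval surfaces the conflict, but no stage in it ever calls it one.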
What Ergo does differently: verify at write time
Ergo is the judge bolted onto the retriever. Every guarded write (`remember`, `learn`, `supersede`) runs a four-stage check against the active claims in scope before storing:
normalize → structural compare → gated NLI → tier decision
A hard conflict comes back as HTTP 409 with the prior claim and its reason. The caller must resolve it: cancel, supersede with a new reason, or force an exception on the record. Softer conflicts are stored with a warning. Every claim can carry a reason; `why` refuses to answer from reason-less facts, and supersede chains preserve why the decision changed.
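The four stages can be sketched as follows. Everything here is a hypothetical stand-in, not Ergo's implementation:

- The normalizer is a regex tokenizer that tracks negation.
- The structural compare only catches exact negations.
- `toy_nli` replaces a real NLI model.
- The gate (token Jaccard overlap) and the tier thresholds are made-up values.
- The 201/200 status codes for "stored" and "stored with warning" are assumptions. Only the 409 for hard conflicts comes from the text above.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

NEGATIONS = {"not", "never", "no", "dont", "cannot"}


@dataclass(frozen=True)
class Norm:
    positive: bool     # False when the claim carries an odd number of negations
    tokens: frozenset  # content words with negation words removed


@dataclass
class Verdict:
    status: int                  # 409 = hard conflict; 201/200 are assumed codes
    tier: str                    # "ok", "soft", or "hard"
    prior: Optional[str] = None  # the active claim that triggered the verdict
    reason: str = ""


def normalize(text: str) -> Norm:
    """Stage 1: lowercase, drop apostrophes and punctuation, track negation."""
    words = re.findall(r"[a-z0-9]+", text.lower().replace("'", ""))
    negated = sum(w in NEGATIONS for w in words) % 2 == 1
    return Norm(not negated, frozenset(w for w in words if w not in NEGATIONS))


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def toy_nli(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: returns a contradiction score."""
    return 0.95 if normalize(premise).positive != normalize(hypothesis).positive else 0.1


def check_write(new: str, active: list[str], nli: Callable[[str, str], float],
                gate: float = 0.5, hard: float = 0.9, soft: float = 0.6) -> Verdict:
    n = normalize(new)
    verdict = Verdict(201, "ok")
    for prior in active:
        p = normalize(prior)
        # Stage 2: structural compare. Same content with flipped polarity is a direct contradiction.
        if p.tokens == n.tokens:
            if p.positive != n.positive:
                return Verdict(409, "hard", prior, "structural: negates an active claim")
            continue  # a restatement, not a conflict
        # Stage 3: gated NLI. Only pairs that overlap enough ever reach the model.
        if jaccard(p.tokens, n.tokens) < gate:
            continue
        score = nli(prior, new)
        # Stage 4: tier decision.
        if score >= hard:
            return Verdict(409, "hard", prior, f"nli: contradiction score {score:.2f}")
        if score >= soft and verdict.tier == "ok":
            verdict = Verdict(200, "soft", prior, f"nli: possible conflict {score:.2f}")
    return verdict


active = ["Image releases must go through the release script."]
v = check_write("Image releases must not go through the release script.", active, toy_nli)
print(v.status, v.tier, v.reason)  # 409 hard structural: negates an active claim
```

In this sketch, the gate keeps the NLI stage narrow. Pairs with little token overlap never reach the model, so it only judges claims that are plausibly about the same thing.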
The write that comes back 409 is the product. It converts "silently overwrite a settled decision" into "confront the prior decision, on the record."
Measured, not asserted
On 105 labeled real-prose pairs, the judge runs at roughly P ≈ 0.80–0.86 / R ≈ 0.81 with zero false hard-blocks — and zero false blocks across every deploy smoke test since. The full engine ships with 292 offline tests, and the Docker build fails if the image touches the network at boot (models are baked in).
| config | precision | recall | F1 | false hard-blocks |
|---|---|---|---|---|
| structural-only | 0.93 | 0.47 | 0.62 | 0 |
| NLI ungated | 0.42 | 0.70 | 0.53 | 0 |
| NLI + gates (Ergo) | 0.96 | 0.73 | 0.83 | 0 |
Normalizing first, comparing structure, and only then running NLI on the fuzzy residual behind an overlap gate beats any single stage alone. F1 rises to 0.83, against 0.62 for structural-only and 0.53 for ungated NLI.
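The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). This short check recomputes it from the table's own P and R values. All three rows agree with the reported F1 to within rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, reported F1) copied from the table above.
TABLE = {
    "structural-only": (0.93, 0.47, 0.62),
    "NLI ungated": (0.42, 0.70, 0.53),
    "NLI + gates (Ergo)": (0.96, 0.73, 0.83),
}

for config, (p, r, reported) in TABLE.items():
    print(f"{config:<20} computed F1 = {f1(p, r):.3f}   reported = {reported:.2f}")
```

The harmonic mean punishes imbalance. That is why structural-only's 0.93 precision still yields only 0.62 F1 at 0.47 recall.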
A real catch (anonymized)
A team recorded a convention: "image releases must go through the release script, not a hand-run sequence of docker commands." Two days later the script moved to a different directory. When someone recorded the new convention, the write-time gate matched it against the stale claim. As a result, the new convention did not become a second, contradicting "truth" sitting next to the first. It resolved as a supersede, with the reason for the move recorded on the new claim.
The avoided mistake: a future session recalling "how do I release," getting the dead path, or concluding the convention was contested and hand-running the exact docker commands the claim existed to prevent.
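A supersede chain can be modeled as a linked list of claims, each pointing back at the one it replaced. The sketch below is a hypothetical in-memory model; the `Ledger` class, its method names, and the refusal rules are assumptions, not Ergo's storage schema. In it, `supersede` demands a reason, and `why` walks the chain newest-first. `why` refuses to answer when any link has no recorded reason.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    id: int
    text: str
    reason: Optional[str]
    supersedes: Optional[int] = None      # id of the claim this one replaced
    superseded_by: Optional[int] = None   # set once a newer claim replaces this one


class Ledger:
    """Hypothetical in-memory claim store with supersede chains."""

    def __init__(self) -> None:
        self.claims: dict[int, Claim] = {}
        self._next_id = 1

    def remember(self, text: str, reason: Optional[str] = None) -> Claim:
        claim = Claim(self._next_id, text, reason)
        self.claims[claim.id] = claim
        self._next_id += 1
        return claim

    def supersede(self, old_id: int, text: str, reason: str) -> Claim:
        if not reason:
            raise ValueError("a supersede must record why the decision changed")
        old = self.claims[old_id]
        if old.superseded_by is not None:
            raise ValueError(f"claim {old_id} is already superseded")
        new = self.remember(text, reason)
        new.supersedes = old.id
        old.superseded_by = new.id
        return new

    def active(self) -> list[Claim]:
        return [c for c in self.claims.values() if c.superseded_by is None]

    def why(self, claim_id: int) -> list[str]:
        """Explain a claim newest-first by walking its supersede chain."""
        history = []
        cur: Optional[Claim] = self.claims[claim_id]
        while cur is not None:
            if not cur.reason:
                raise LookupError(f"claim {cur.id} has no recorded reason")
            history.append(f"#{cur.id}: {cur.text} (because: {cur.reason})")
            cur = self.claims[cur.supersedes] if cur.supersedes is not None else None
        return history


# Usage, loosely following the anonymized example; texts and reasons are illustrative.
ledger = Ledger()
old = ledger.remember(
    "Image releases must go through the release script, not hand-run docker commands.",
    reason="agreed release convention",
)
new = ledger.supersede(
    old.id,
    "Image releases must go through the release script at its new location.",
    reason="the script moved to a different directory",
)
for line in ledger.why(new.id):
    print(line)
```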
Contradictions are rare events with outsized cost — that's the point. A guard that fires seldom but truthfully beats a feed of noisy similarity alerts.
Why not just X?
| Alternative | What it does NOT do (that Ergo does) |
|---|---|
| ChromaDB (+ embedder) | Retrieval only. No write-time judgment — a contradiction is stored as one more similar document (and it will co-retrieve, unjudged). No reasons, no supersede semantics, no "why did this change" history. |
| Pinecone | Same judgment gap, plus it's a managed cloud service — a non-starter for air-gapped deployments. |
| pgvector | Brings a Postgres server dependency for the same unjudged-similarity semantics. Ergo's deploy artifact is one container + one SQLite file; restore = copy one file. |
| Plain SQLite + embedder | This is Ergo's storage layer — minus the entire point: the normalizer, structural comparator, gated NLI judge, 409 contract, reasoned why, supersede chains, and the eval suites that calibrate the judge. The store is not the moat. |
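Because the whole store is one SQLite file, a backup needs nothing beyond the standard library. One caveat: copying the file while a writer has it open can capture an inconsistent state. The sketch below therefore uses SQLite's online backup API, exposed in Python as `sqlite3.Connection.backup`. This is a generic SQLite technique, not Ergo's own tooling, and the file names and the `claims` table are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def snapshot(src: Path, dst: Path) -> None:
    """Copy a SQLite database into dst via the online backup API, so the
    copy is consistent even if another connection is writing to src."""
    source = sqlite3.connect(src)
    target = sqlite3.connect(dst)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


# Usage with a throwaway store; "claims" is a hypothetical table, not Ergo's schema.
with tempfile.TemporaryDirectory() as d:
    live, copy = Path(d) / "store.db", Path(d) / "store-backup.db"
    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, text TEXT, reason TEXT)")
    conn.execute("INSERT INTO claims (text, reason) VALUES (?, ?)", ("example claim", "example reason"))
    conn.commit()
    conn.close()
    snapshot(live, copy)
    check = sqlite3.connect(copy)
    print(check.execute("SELECT COUNT(*) FROM claims").fetchone()[0])  # 1
    check.close()
```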
Where Ergo is overkill — use commodity RAG instead
- Reference retrieval. Docs, runbooks, notes: anything you want found, not defended. Ergo routes this through `ingest`, which deliberately skips the gate and is ~30× faster.
- High-volume, low-stakes memory. Chat history, scratch notes, transient status. A guarded write costs one normalizer call; paying that tax on content nobody will contradict is waste.
- High-throughput multi-writer workloads. Ergo is single-writer SQLite by design.
- Claims that churn constantly. If "the truth" changes hourly, supersede chains become noise. Ergo is for settled decisions with a shelf life.
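The checklist above can be written as a routing rule. This is a hypothetical sketch: the `Workload` flags and `Route` names are invented for illustration, and real decisions will be fuzzier than booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    COMMODITY_RAG = "commodity RAG store"
    ERGO_INGEST = "Ergo ingest (ungated)"
    ERGO_GUARDED = "Ergo guarded write"


@dataclass
class Workload:
    # Hypothetical flags mirroring the checklist above.
    reference_only: bool = False                # docs, runbooks, notes: found, not defended
    low_stakes_high_volume: bool = False        # chat history, scratch notes, transient status
    multi_writer_high_throughput: bool = False  # Ergo is single-writer SQLite by design
    churns_constantly: bool = False             # "the truth" changes hourly


def route(w: Workload) -> Route:
    # Workloads Ergo is not built for go to a commodity store first.
    if w.multi_writer_high_throughput or w.low_stakes_high_volume or w.churns_constantly:
        return Route.COMMODITY_RAG
    # Reference material can live in Ergo, but through the ungated fast path.
    if w.reference_only:
        return Route.ERGO_INGEST
    # Settled, high-stakes decisions get the full write-time judgment.
    return Route.ERGO_GUARDED


print(route(Workload()).value)
print(route(Workload(reference_only=True)).value)
print(route(Workload(churns_constantly=True)).value)
```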
The honest positioning
Ergo is not a universal memory firewall. It is selective safety rails for high-stakes, slow-changing decision domains — conventions, configs, policies, architecture choices, ops-runbook invariants — where a stale claim silently reversing a settled decision costs real money or real outages.
The moat is not the vector store (that part genuinely is a few lines). The moat is the judgment layer, the labeled evals that keep it honest, and single-file air-gap deployability.