Semantic WAL: Write-Ahead Logs for LLM Systems
2026-05-24
A serious LLM workflow needs more than request logs. It needs a semantic write-ahead log that records prompts, evidence, model contracts, parsed decisions, reducers, and replay boundaries.
A normal write-ahead log protects a database from losing committed state. LLM systems need a similar idea, but for meaning: what evidence was shown, which prompt contract was used, what the model returned, what the parser accepted, and which deterministic reducer changed product state.
The mistake is to log only the final answer. The final answer is the least useful artifact when something goes wrong. You need the chain of semantic inputs that made the answer possible.
A semantic WAL entry is not a debug string. It is an append-only fact with enough structure to replay or invalidate a decision later.
CREATE TABLE semantic_wal (
    id              BIGSERIAL PRIMARY KEY,
    aggregate_id    TEXT NOT NULL,
    sequence_no     BIGINT NOT NULL,
    event_type      TEXT NOT NULL,
    evidence_digest TEXT NOT NULL,
    prompt_contract TEXT NOT NULL,
    model           TEXT NOT NULL,
    model_output    TEXT NOT NULL,
    parsed_output   JSONB,
    reducer_version TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL,
    UNIQUE (aggregate_id, sequence_no)
);
The important field is not model_output. It is the pair prompt_contract and reducer_version. A prompt can change the shape of uncertainty. A reducer decides what uncertainty is allowed to do to state. If either changes, old decisions may need re-evaluation.
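As a rough illustration of how a row for this table might be assembled, here is a minimal Python sketch. The helper names evidence_digest and build_entry are hypothetical, and so is the choice of canonical JSON plus SHA-256 for the digest. It also assumes the order of evidence items does not matter, which may not hold for every system.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def evidence_digest(evidence: list[dict]) -> str:
    """Hash an evidence set so that equal evidence yields an equal digest.

    Assumption: evidence order is irrelevant, so items are sorted after
    each one is serialized with sorted keys.
    """
    canonical = sorted(
        json.dumps(item, sort_keys=True, separators=(",", ":")) for item in evidence
    )
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def build_entry(
    aggregate_id: str,
    sequence_no: int,
    event_type: str,
    evidence: list[dict],
    prompt_contract: str,
    model: str,
    model_output: str,
    parsed_output: Optional[dict],
    reducer_version: str,
) -> dict:
    """Build one append-only row; keys mirror the semantic_wal columns."""
    return {
        "aggregate_id": aggregate_id,
        "sequence_no": sequence_no,
        "event_type": event_type,
        "evidence_digest": evidence_digest(evidence),
        "prompt_contract": prompt_contract,
        "model": model,
        "model_output": model_output,
        "parsed_output": parsed_output,  # None records a parse failure
        "reducer_version": reducer_version,
        "created_at": datetime.now(timezone.utc),
    }
```

The raw model_output is kept even when parsed_output is None, so a parse failure stays visible in the log instead of disappearing.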
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SemanticEvent:
    # One accepted entry from the WAL, reduced to what a reducer needs.
    event_type: str
    parsed_output: dict
    evidence_digest: str
    reducer_version: str


class Reducer(Protocol):
    version: str

    # Must be deterministic: same state and event, same result.
    def apply(self, state: dict, event: SemanticEvent) -> dict:
        ...
This design lets the LLM remain non-deterministic while the product state transition stays deterministic. The model generates an event candidate. The parser turns it into structured data. The reducer applies it or refuses it.
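The candidate, parser, and reducer steps described above might look like the following sketch. Everything specific in it is an assumption made for illustration: the label/confidence output shape, the parse_candidate helper, the LabelReducer class, and the 0.8 threshold.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SemanticEvent:
    event_type: str
    parsed_output: dict
    evidence_digest: str
    reducer_version: str


def parse_candidate(raw: str) -> Optional[dict]:
    """Turn raw model text into structured data, or None if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str):
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    return {"label": label, "confidence": float(confidence)}


class LabelReducer:
    """Hypothetical reducer: applies a label only above a confidence threshold."""

    version = "label-reducer-v1"
    min_confidence = 0.8

    def apply(self, state: dict, event: SemanticEvent) -> dict:
        if event.parsed_output["confidence"] < self.min_confidence:
            return state  # refuse: uncertainty is not allowed to change state
        return {**state, "label": event.parsed_output["label"]}
```

The model call itself is left out on purpose. Only the parser and reducer need to be deterministic, so they are the parts worth testing like ordinary code.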
The replay boundary is where this becomes useful. You should be able to replay reducers over accepted semantic events without calling the model again. You should also be able to mark some events stale when the evidence set or prompt contract changes.
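def replay(initial_state: dict, events: list[SemanticEvent], reducer: Reducer) -> dict:
    """Rebuild state from accepted events without calling the model."""
    state = initial_state
    for event in events:
        # Events recorded under another reducer version are not replayed silently.
        if event.reducer_version != reducer.version:
            raise RuntimeError("reducer version mismatch")
        state = reducer.apply(state, event)
    return state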
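Marking events stale could be sketched as a simple partition over logged entries. The LoggedDecision type, the split_stale helper, and the idea of keeping one current evidence digest per aggregate are all hypothetical. A real system may track evidence at a finer grain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedDecision:
    """Hypothetical view of a WAL row, limited to the fields staleness depends on."""

    aggregate_id: str
    sequence_no: int
    evidence_digest: str
    prompt_contract: str


def split_stale(
    entries: list[LoggedDecision],
    current_digests: dict[str, str],
    current_contract: str,
) -> tuple[list[LoggedDecision], list[LoggedDecision]]:
    """Partition entries into (fresh, stale).

    An entry is stale when its prompt contract is no longer current, or when
    the evidence digest recorded for its aggregate has changed or is unknown.
    """
    fresh: list[LoggedDecision] = []
    stale: list[LoggedDecision] = []
    for entry in entries:
        same_evidence = current_digests.get(entry.aggregate_id) == entry.evidence_digest
        same_contract = entry.prompt_contract == current_contract
        (fresh if same_evidence and same_contract else stale).append(entry)
    return fresh, stale
```

Fresh entries can be replayed as they are. Stale ones are candidates for a new model call, which appends new entries rather than rewriting old ones.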
Evaluation becomes less mystical too. Instead of asking whether a generated answer “looks good,” compare reducer outputs across prompt contracts. If the prompt changes but the reducer still produces the same state for a golden event set, the product impact of that prompt change is bounded.
SELECT prompt_contract,
       reducer_version,
       count(*) FILTER (WHERE parsed_output IS NULL) AS parse_failures,
       count(*) AS total
FROM semantic_wal
GROUP BY prompt_contract, reducer_version;
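The comparison of reducer outputs across prompt contracts might be sketched like this. CountingReducer, states_by_contract, and contracts_agree are hypothetical names, and the golden events are assumed to have been generated and parsed ahead of time, so no model is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticEvent:
    event_type: str
    parsed_output: dict
    evidence_digest: str
    reducer_version: str


class CountingReducer:
    """Hypothetical reducer: counts labels and ignores every other field."""

    version = "count-v1"

    def apply(self, state: dict, event: SemanticEvent) -> dict:
        label = event.parsed_output["label"]
        return {**state, label: state.get(label, 0) + 1}


def replay(initial_state: dict, events: list[SemanticEvent], reducer) -> dict:
    state = initial_state
    for event in events:
        if event.reducer_version != reducer.version:
            raise RuntimeError("reducer version mismatch")
        state = reducer.apply(state, event)
    return state


def states_by_contract(
    events_by_contract: dict[str, list[SemanticEvent]], reducer, initial_state: dict
) -> dict[str, dict]:
    """Replay each contract's golden events and collect the final states."""
    return {
        contract: replay(initial_state, events, reducer)
        for contract, events in events_by_contract.items()
    }


def contracts_agree(states: dict[str, dict]) -> bool:
    """True when every prompt contract leads to the same final state."""
    values = list(states.values())
    return all(state == values[0] for state in values[1:])
```

Two contracts can produce different wording or extra fields and still agree here, because only the resulting state is compared.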
There is a subtle benefit: the WAL separates epistemology from mutation. Evidence and model output describe what the system believed. Reducers describe what the system did. Mixing those two is why many AI products become impossible to audit.
This is not about storing more logs. It is about making model-mediated state changes replayable. If an AI feature cannot explain, replay, or invalidate its own decisions, it is not an intelligent system. It is a very expensive side effect.