Does redaction make the model worse?

The obvious objection to the gateway is: if you swap the names and numbers for placeholders, do you get a worse answer? It deserves a number, not a hand-wave — so I measured it.

Method

Three real Swiss business documents × three models (Gemini 3.5 Flash, Claude Sonnet 4.6, DeepSeek V4 Pro). Each task was run two ways — on the raw document and on the sanitized-then-restored one (exactly what the gateway delivers) — and a vendor-diverse judge panel compared the two answers blind and order-randomized, calling each pair raw better, shielded better, or tie.

Result

Privacy: total, and free. No raw personal data reached a model, and the token↔value round-trip was flawless — the restored answer was correct every time.
Utility: a small, task-dependent cost. The panel mildly preferred the raw-context answer in about two-thirds of pairs, but usually by thin margins that flipped by judge. Near-neutral on extraction and summarisation; a mild style tax on open-ended customer copy. Correctness — names, amounts, numbers — was intact in both arms.

Document	Model	Verdict	Votes (raw / shielded / tie)
Support email	Gemini 3.5 Flash	raw	3 / 0 / 0
Support email	Claude Sonnet 4.6	shielded	1 / 2 / 0
Support email	DeepSeek V4 Pro	raw	2 / 0 / 1
Insurance claim	Gemini 3.5 Flash	raw	2 / 1 / 0
Insurance claim	Claude Sonnet 4.6	raw	2 / 0 / 1
Insurance claim	DeepSeek V4 Pro	shielded	1 / 2 / 0
HR onboarding	Gemini 3.5 Flash	raw	3 / 0 / 0
HR onboarding	Claude Sonnet 4.6	tie	1 / 1 / 1
HR onboarding	DeepSeek V4 Pro	raw	3 / 0 / 0

The honest headline is not “it's free.” It is near-total privacy for a small, measurable, mitigable utility cost — and for most teams that is an easy trade for a provable residency boundary. The cost is smallest on the structured tasks that make up the bulk of enterprise LLM traffic, and it can be reduced further by tuning the gateway prompt or tokenizing only the most sensitive fields.

Caveats

This is a prototype, not a study: n = 9, a single document set, one run, and an LLM-judge panel rather than human raters (with the judge noise the thin margins reflect). A full evaluation would scale documents and tasks, run multiple trials for confidence intervals, add human evaluation, and sweep redaction granularity. It is the seed, not the last word.

← How it works · Try the live gateway →