Does redaction make the model worse?
The obvious objection to the gateway: if you swap the names and numbers for placeholders, do you get a worse answer? That deserves a number, not a hand-wave, so I measured it.
Method
Three real Swiss business documents × three models (Gemini 3.5 Flash, Claude Sonnet 4.6, DeepSeek V4 Pro), giving nine document-model pairs. Each task was run two ways: once on the raw document, and once through the sanitize-then-restore path (exactly what the gateway delivers). A vendor-diverse panel of three judges then compared the two answers blind and in randomized order, calling each pair raw better, shielded better, or tie.
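The blind, order-randomized comparison can be sketched as follows. This is a minimal illustration, not the harness used for the numbers below: the function names, the `judge(text_a, text_b)` callable signature, and the plurality rule for the panel verdict are all assumptions (the rule does happen to match the verdicts in the results table).

```python
import random
from collections import Counter

def blind_pair(raw_answer, shielded_answer, rng):
    """Shuffle the two answers so a judge cannot tell which arm is which."""
    pair = [("raw", raw_answer), ("shielded", shielded_answer)]
    rng.shuffle(pair)
    labels = {"A": pair[0][0], "B": pair[1][0]}  # hidden from the judge
    texts = {"A": pair[0][1], "B": pair[1][1]}   # what the judge sees
    return labels, texts

def unblind(choice, labels):
    """Map a judge's 'A' / 'B' / 'tie' back to 'raw' / 'shielded' / 'tie'."""
    return "tie" if choice == "tie" else labels[choice]

def panel_verdict(votes):
    """Plurality over unblinded votes; no clear winner counts as a tie (assumed rule)."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]

def judge_panel(raw_answer, shielded_answer, judges, seed=0):
    """Ask each judge (a hypothetical callable) with a freshly shuffled order."""
    rng = random.Random(seed)
    votes = []
    for judge in judges:
        labels, texts = blind_pair(raw_answer, shielded_answer, rng)
        votes.append(unblind(judge(texts["A"], texts["B"]), labels))
    return panel_verdict(votes), votes
```

Each judge gets its own shuffle, so a judge with a position bias (say, always favouring the first answer) does not systematically favour one arm.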
Result
- Privacy: total in this run, at no cost to correctness. No raw personal data reached a model, and the token↔value round-trip was flawless: the restored answer was correct every time.
- Utility: a small, task-dependent cost. The panel preferred the raw-context answer in six of nine pairs (about two-thirds), but only three of those verdicts were unanimous; in the other three, one judge voted shielded or tie. The cost was near-neutral on extraction and summarisation and showed up as a mild style tax on open-ended customer copy. Correctness (names, amounts, numbers) was intact in both arms.
| Document | Model | Verdict | Votes (raw / shielded / tie) |
|---|---|---|---|
| Support email | Gemini 3.5 Flash | raw | 3 / 0 / 0 |
| Support email | Claude Sonnet 4.6 | shielded | 1 / 2 / 0 |
| Support email | DeepSeek V4 Pro | raw | 2 / 0 / 1 |
| Insurance claim | Gemini 3.5 Flash | raw | 2 / 1 / 0 |
| Insurance claim | Claude Sonnet 4.6 | raw | 2 / 0 / 1 |
| Insurance claim | DeepSeek V4 Pro | shielded | 1 / 2 / 0 |
| HR onboarding | Gemini 3.5 Flash | raw | 3 / 0 / 0 |
| HR onboarding | Claude Sonnet 4.6 | tie | 1 / 1 / 1 |
| HR onboarding | DeepSeek V4 Pro | raw | 3 / 0 / 0 |
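The headline figures can be re-derived from the table. The sketch below copies the vote counts verbatim and tallies them; the variable names and the plurality rule in `verdict` are illustrative assumptions, not the original analysis script.

```python
from collections import Counter

# (document, model, (raw, shielded, tie)) copied from the table above
rows = [
    ("Support email",   "Gemini 3.5 Flash",  (3, 0, 0)),
    ("Support email",   "Claude Sonnet 4.6", (1, 2, 0)),
    ("Support email",   "DeepSeek V4 Pro",   (2, 0, 1)),
    ("Insurance claim", "Gemini 3.5 Flash",  (2, 1, 0)),
    ("Insurance claim", "Claude Sonnet 4.6", (2, 0, 1)),
    ("Insurance claim", "DeepSeek V4 Pro",   (1, 2, 0)),
    ("HR onboarding",   "Gemini 3.5 Flash",  (3, 0, 0)),
    ("HR onboarding",   "Claude Sonnet 4.6", (1, 1, 1)),
    ("HR onboarding",   "DeepSeek V4 Pro",   (3, 0, 0)),
]

def verdict(raw, shielded, tie):
    """Plurality of the three votes; anything without a clear winner is a tie."""
    if raw > shielded and raw > tie:
        return "raw"
    if shielded > raw and shielded > tie:
        return "shielded"
    return "tie"

verdicts = Counter(verdict(*votes) for _, _, votes in rows)
vote_totals = tuple(sum(votes[i] for _, _, votes in rows) for i in range(3))
unanimous = sum(1 for _, _, votes in rows if max(votes) == sum(votes))

print(dict(verdicts))  # {'raw': 6, 'shielded': 2, 'tie': 1}
print(vote_totals)     # (18, 6, 3) as raw / shielded / tie
print(unanimous)       # 3
```

Of the 27 individual votes, 18 went to raw, 6 to shielded, and 3 were ties; only three of the nine pairs were decided unanimously.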
The honest headline is not “it's free.” It is near-total privacy for a small, measurable utility cost that can be mitigated, and for most teams that is an easy trade for a provable residency boundary. The cost is smallest on the structured tasks that make up the bulk of enterprise LLM traffic, and it can be reduced further by tuning the gateway prompt or tokenizing only the most sensitive fields.
Caveats
This is a prototype, not a study: n = 9 document-model pairs from a single document set, one run each, and an LLM-judge panel rather than human raters, so the thin margins partly reflect judge noise. A full evaluation would use more documents and tasks, run multiple trials to get confidence intervals, add human evaluation, and sweep redaction granularity. It is the seed, not the last word.
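To make the sample-size caveat concrete, here is a rough Wilson score interval for the "raw preferred" rate of 6 out of 9. It is a sketch under loud assumptions: it treats the nine pairs as independent trials (they share documents and models, so they are not), counts ties as "not raw", and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# 6 of 9 pairs had a "raw" panel verdict
low, high = wilson_interval(6, 9)
print(f"{low:.2f} to {high:.2f}")  # 0.35 to 0.88
```

An interval that wide is compatible with anything from a clear preference for shielded answers to a strong preference for raw ones, which is why the multiple trials mentioned above matter.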