Does redaction make the model worse?

The obvious objection to the gateway: if you swap the names and numbers for placeholders, do you get a worse answer? That deserves a number, not a hand-wave, so I measured it.

Method

Three real Swiss business documents × three models (Gemini 3.5 Flash, Claude Sonnet 4.6, DeepSeek V4 Pro). Each task was run two ways: on the raw document, and on the sanitized-then-restored one (exactly what the gateway delivers). A vendor-diverse judge panel then compared the two answers blind and in randomized order, calling each pair raw better, shielded better, or tie.
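A minimal sketch of that blind, order-randomized comparison, for readers who want to reproduce the setup. Everything named here (`Judge`, `compare_blind`, `verdict`) is hypothetical and not the gateway's actual code. The plurality rule in `verdict` is an assumption, chosen because it matches how verdicts line up with votes in the results table.

```python
import random
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: sees the task and two anonymous answers,
# returns "A", "B", or "tie". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def compare_blind(task: str, raw_answer: str, shielded_answer: str,
                  judges: Sequence[Judge], rng: random.Random) -> Counter:
    """Collect one vote per judge, hiding which answer is which."""
    votes = Counter({"raw": 0, "shielded": 0, "tie": 0})
    for judge in judges:
        # Randomize presentation order per judge to cancel position bias.
        raw_first = rng.random() < 0.5
        a, b = ((raw_answer, shielded_answer) if raw_first
                else (shielded_answer, raw_answer))
        pick = judge(task, a, b)
        if pick == "tie":
            votes["tie"] += 1
        elif (pick == "A") == raw_first:
            # Picked slot A while raw was in A, or slot B while raw was in B.
            votes["raw"] += 1
        else:
            votes["shielded"] += 1
    return votes


def verdict(votes: Counter) -> str:
    """Assumed rule: plurality wins; no unique leader counts as a tie."""
    top = max(votes.values())
    leaders = [label for label, count in votes.items() if count == top]
    return leaders[0] if len(leaders) == 1 else "tie"
```

Because the order is drawn per judge, a judge that always favors the first answer contributes noise rather than a systematic tilt toward raw or shielded.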

Result

Document         Model              Verdict    Votes (raw / shielded / tie)
Support email    Gemini 3.5 Flash   raw        3 / 0 / 0
Support email    Claude Sonnet 4.6  shielded   1 / 2 / 0
Support email    DeepSeek V4 Pro    raw        2 / 0 / 1
Insurance claim  Gemini 3.5 Flash   raw        2 / 1 / 0
Insurance claim  Claude Sonnet 4.6  raw        2 / 0 / 1
Insurance claim  DeepSeek V4 Pro    shielded   1 / 2 / 0
HR onboarding    Gemini 3.5 Flash   raw        3 / 0 / 0
HR onboarding    Claude Sonnet 4.6  tie        1 / 1 / 1
HR onboarding    DeepSeek V4 Pro    raw        3 / 0 / 0
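
The table above can be tallied in a few lines. This is only a convenience sketch: the rows are copied from the table, and the variable names are mine.

```python
from collections import Counter

# (document, model, verdict, (raw, shielded, tie)) copied from the results table.
RESULTS = [
    ("Support email",   "Gemini 3.5 Flash",  "raw",      (3, 0, 0)),
    ("Support email",   "Claude Sonnet 4.6", "shielded", (1, 2, 0)),
    ("Support email",   "DeepSeek V4 Pro",   "raw",      (2, 0, 1)),
    ("Insurance claim", "Gemini 3.5 Flash",  "raw",      (2, 1, 0)),
    ("Insurance claim", "Claude Sonnet 4.6", "raw",      (2, 0, 1)),
    ("Insurance claim", "DeepSeek V4 Pro",   "shielded", (1, 2, 0)),
    ("HR onboarding",   "Gemini 3.5 Flash",  "raw",      (3, 0, 0)),
    ("HR onboarding",   "Claude Sonnet 4.6", "tie",      (1, 1, 1)),
    ("HR onboarding",   "DeepSeek V4 Pro",   "raw",      (3, 0, 0)),
]

# Count verdicts per pairing, then sum the individual judge votes column-wise.
verdict_counts = Counter(verdict for _, _, verdict, _ in RESULTS)
raw_votes, shielded_votes, tie_votes = (
    sum(column) for column in zip(*(votes for *_, votes in RESULTS))
)

print(dict(verdict_counts))                  # {'raw': 6, 'shielded': 2, 'tie': 1}
print(raw_votes, shielded_votes, tie_votes)  # 18 6 3
```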

The raw answer won six of the nine pairings, the shielded answer won two, and one was a tie. So the honest headline is not “it's free.” It is near-total privacy for a small, measurable, mitigable utility cost, and for most teams that is an easy trade for a provable residency boundary. The cost is smallest on the structured tasks that make up the bulk of enterprise LLM traffic, and it can be reduced further by tuning the gateway prompt or tokenizing only the most sensitive fields.

Caveats

This is a prototype, not a study: n = 9, a single document set, one run, and an LLM-judge panel rather than human raters, so thin margins such as 2 / 1 / 0 may reflect judge noise as much as real differences. A full evaluation would scale documents and tasks, run multiple trials for confidence intervals, add human evaluation, and sweep redaction granularity. It is the seed, not the last word.
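
To put a rough number on how thin n = 9 is, here is a back-of-the-envelope exact sign test on the verdicts (6 raw wins, 2 shielded wins, the tie dropped). It is my own illustration, not part of the measurement, and it assumes the nine pairings are independent, which they are not quite, since they share documents and models.

```python
from math import comb


def sign_test_two_sided(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test against a 50/50 null, ties excluded."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


p = sign_test_two_sided(6, 2)
print(round(p, 3))  # 0.289
```

A p-value near 0.29 means a 6-to-2 split is unremarkable under pure chance, which is why the direction of the result is suggestive but the size of the cost needs the larger evaluation described above.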
