24 APRIL 2026 | 5 MIN READ

AI Corrupts Your Documents When You Delegate Tasks to It

AI Research Risk

A new Microsoft Research paper quantifies what many have suspected but few have measured.

Microsoft Research published a paper in April 2026 that deserves more attention than it has received. Three researchers, Philippe Laban, Tobias Schnabel, and Jennifer Neville, set out to test a simple question: when you hand a task to an AI and let it edit your documents over a long workflow, how faithful is it to your original content?

The answer, across 19 models and 52 professional domains, is not reassuring.

Contents

Summary
What the Study Actually Tested
The Core Finding
Agentic Tool Use Makes Things Worse
What Makes Degradation Worse
Where AI Is Actually Reliable
The Practical Implication

Summary

The study found that even the best frontier models corrupt an average of 25% of document content over long workflows, and that the average reconstruction score across all 19 models falls to around 50% after 20 interactions. The errors are silent: a document can look intact while its content has changed. Giving AI agents more tools makes the problem worse, not better, and degradation shows no plateau even at 100 interactions.

What the Study Actually Tested

The researchers built a benchmark called DELEGATE 52. It simulates what they call delegated work, the interaction model where a user assigns a task to an AI and expects it to execute without introducing errors. Think of editing a report across multiple sessions, refactoring a codebase iteratively, or managing structured records through an automated workflow.

The methodology was straightforward. Models were given a document and asked to make a series of edits, then reverse them. If the AI is reliable, you end up back where you started. If it isn't, the document drifts from the original. They ran this across 52 domains including coding, accounting, music notation, crystallography, and recipe management, using models from OpenAI, Anthropic, Google, Mistral, and others.
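To make the round-trip idea concrete, here is a minimal Python sketch. It is not the paper's code: the scoring function is a plain token-overlap ratio from the standard library, standing in for whatever semantic measure the researchers used, and the edit and reverse functions are hypothetical stand-ins for model calls.

```python
import difflib


def reconstruction_score(original: str, round_tripped: str) -> float:
    """Token-level similarity in [0, 1]; an assumed stand-in for the paper's metric."""
    matcher = difflib.SequenceMatcher(None, original.split(), round_tripped.split())
    return matcher.ratio()


def round_trip(document: str, edit, reverse) -> str:
    """Apply an edit and then its reversal; a faithful delegate returns the input."""
    return reverse(edit(document))


# Hypothetical "model" behaviours, written as plain string functions.
def expand_quarter(doc: str) -> str:
    return doc.replace("Q3", "third quarter")


def faithful_reverse(doc: str) -> str:
    return doc.replace("third quarter", "Q3")


def lossy_reverse(doc: str) -> str:
    # Reverses the edit but also silently alters a value along the way.
    return faithful_reverse(doc).replace("4.2", "4.5")


original = "Revenue in Q3 grew 4.2 percent over Q2."

faithful_score = reconstruction_score(
    original, round_trip(original, expand_quarter, faithful_reverse)
)
lossy_score = reconstruction_score(
    original, round_trip(original, expand_quarter, lossy_reverse)
)

print(faithful_score)  # 1.0
print(lossy_score)     # 0.875
```

One altered token out of eight drops the score to 0.875. Chain many such round trips together and small losses like this have room to accumulate.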

The Core Finding

Even the best available frontier models, including GPT 5.4, Gemini 3.1 Pro, and Claude 4.6 Opus, corrupt an average of 25% of document content by the end of long workflows. Weaker models fail more severely. Across all 19 models tested, the average reconstruction score after 20 interactions sits at around 50%, meaning half of the original semantic content is lost or altered.

The errors are not gradual or obvious. The paper describes them as sparse but severe. A document can look structurally intact while containing meaningful changes to values, classifications, or logical relationships that a quick review would miss. The researchers call this silent corruption, and it is an accurate description.
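A small illustration of why this kind of error slips past a quick review. The text and the change below are invented for the example, and the similarity measure is a generic character-level ratio from Python's standard library, not anything taken from the paper.

```python
import difflib

before = (
    "The sample was classified as benign. "
    "Dosage: 20 mg twice daily. "
    "Follow-up in 6 weeks."
)
# A single extra character changes the dose by a factor of ten.
after = before.replace("20 mg", "200 mg")

# Surface similarity between the two versions (character level).
similarity = difflib.SequenceMatcher(None, before, after, autojunk=False).ratio()

# Word-by-word comparison; works here because the word count is unchanged.
changed_words = [
    (old, new) for old, new in zip(before.split(), after.split()) if old != new
]

print(f"surface similarity: {similarity:.3f}")
print(changed_words)  # [('20', '200')]
```

The two versions are almost identical by any surface measure, yet the one difference is the part that matters.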

Agentic Tool Use Makes Things Worse

The intuitive response to unreliable AI outputs is to give models more tools: file access, code execution, search and replace. This is the premise behind most agentic AI products being built and sold right now.

The paper tested this directly. Agentic tool use does not improve performance on DELEGATE 52. Models given file and code tool access frequently performed worse than models working with text directly, likely because of token processing overhead and an inability to use code execution effectively for non-trivial edits.

This is worth sitting with. The architecture that the industry is building toward, autonomous agents with broad tool access, appears to compound the reliability problem rather than solve it.

What Makes Degradation Worse

The researchers identified several factors that multiply corruption severity.

Longer documents degrade faster, with each additional thousand tokens amplifying the effect over successive interactions. Longer workflows keep degrading, with no plateau even at 100 interactions. Distractor files in the working environment increase errors, and the effect grows with interaction length. Task diversity also matters: varied editing tasks accumulate errors significantly faster than repeated tasks of a single type.

These are not edge-case conditions. They describe normal professional work.

Where AI Is Actually Reliable

The paper is not uniformly negative. Python code manipulation is the one domain where most models perform reliably across long workflows, with reconstruction scores above 98%. Highly structured domains with constrained syntax, such as chemical notation and chess, also perform better.

The pattern is consistent. AI handles formal, constrained content well. It struggles with open-ended, semantically rich documents where meaning is carried in ways that are harder to preserve mechanically.

The Practical Implication

If you are using AI to edit, restructure, or manage professional documents across multiple interactions, the research suggests you cannot assume the output is faithful to the original. The document will likely look fine. The AI will not flag what it changed. But something may be wrong.

This does not mean AI tools are without value in document workflows. It means that unsupervised delegation, handing a task over and trusting the output without verification, carries a quantified risk that most current workflows do not account for.
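What verification looks like depends on the document, but for structured records it can be quite mechanical. The sketch below is one possible approach, not something the paper prescribes: it assumes records are keyed dictionaries, and the function name `unexpected_changes` and the invoice data are made up for illustration. The idea is to state up front which fields the task was allowed to touch, then flag everything else that moved.

```python
from typing import Any

Record = dict[str, Any]


def unexpected_changes(
    before: dict[str, Record],
    after: dict[str, Record],
    allowed_fields: set[str],
) -> list[str]:
    """List every difference the delegated task was not supposed to make."""
    problems: list[str] = []
    for key in sorted(before.keys() | after.keys()):
        if key not in after:
            problems.append(f"{key}: record removed")
            continue
        if key not in before:
            problems.append(f"{key}: record added")
            continue
        old, new = before[key], after[key]
        for field in sorted(old.keys() | new.keys()):
            if field in allowed_fields:
                continue  # the task was permitted to change this field
            if old.get(field) != new.get(field):
                problems.append(
                    f"{key}.{field}: {old.get(field)!r} -> {new.get(field)!r}"
                )
    return problems


# Hypothetical task: "mark every invoice as paid". Only `status` may change.
before = {
    "INV-001": {"customer": "Acme", "amount": 1200, "status": "open"},
    "INV-002": {"customer": "Globex", "amount": 860, "status": "open"},
}
after = {
    "INV-001": {"customer": "Acme", "amount": 1200, "status": "paid"},
    "INV-002": {"customer": "Globex", "amount": 680, "status": "paid"},
}

for problem in unexpected_changes(before, after, allowed_fields={"status"}):
    print(problem)  # INV-002.amount: 860 -> 680
```

A check like this does not make the AI more reliable. It only makes the silent part of the corruption visible, which is the part the research suggests you should not leave to a quick read-through.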

The researchers are direct about their conclusion: current LLMs are not ready to be trusted as delegates across the broad range of professional document work.

That finding comes from Microsoft Research, from people building these systems. It is worth taking seriously.

Source: Laban, P., Schnabel, T., and Neville, J. (2026). LLMs Corrupt Your Documents When You Delegate. arXiv:2604.15597. Microsoft Research.

All articles on this site are written by me. I use AI to assist with final formatting and editing before publication.