I Was Wrong About AI Readiness
Thinking about unstructured data governance
When I was asked to bring governance to our company’s unstructured data for AI agents, I thought I understood the problem. Clean up the wikis, deduplicate the SharePoint sites, build taxonomies! In short, get everything organized so our AI tools could find the right information.
I was approaching this the wrong way.
The problem wasn’t that I misunderstood the goal. The problem was that I was solving for the wrong type of failure mode. I was thinking about how humans experience messy documentation when I should have been thinking about how machines experience it.
The consumer of unstructured data is shifting from humans to machines, and they fail differently.
When a human reads two contradictory internal documents, they use judgment. They know the 2019 policy doc is probably stale. They remember that Ben’s team does things differently than Susan’s. They can tell when something doesn’t add up. The cost of messy documentation in a human world is inefficiency. People waste their time finding the right source.
When an AI agent encounters the same contradiction, it treats both documents as equally valid. It doesn’t know which is current, has no institutional memory about team differences, and can’t detect inconsistencies the way humans can. The cost shifts from inefficiency to confident automation of errors. The agent doesn’t get confused; it gets wrong.
This matters because agents are being deployed into environments where unstructured data has never been properly organized. Most enterprises spent decades accumulating documents, wikis, email threads, Slack chats, meeting recordings, and PDFs with minimal governance. The mess was historically annoying, but the ROI for organizing that content never penciled out. Why would it now?
Why I stopped planning document cleanup
Content curation means reorganizing documents: deduplicating, building taxonomies, standardizing formats, migrating to unified platforms. It’s what most “AI readiness” initiatives are doing. It’s what I was planning to do.
But the problems kept stacking up in my planning:
It’s expensive. It requires dedicated headcount, consultant engagements, and content migration projects. You’re paying people to reorganize documents that are probably fine where they are.
It’s fragile. Tools change constantly. SharePoint becomes Confluence becomes Notion. File formats evolve. Each platform shift breaks the curation work. The half-life of organizational content infrastructure is… maybe 3-5 years?
It will be too slow to help AI. RAG architectures, context windows, and retrieval strategies are evolving fast. Whatever you optimize for today’s chunking logic may be irrelevant in six months when context windows hit 10M tokens.
Nobody does it even when they should. Most organizations had decades to organize their content before AI agents arrived. They didn’t, because the cost-benefit wasn’t there. The existence of document management tools doesn’t mean organizations use them well.
What I suggest instead
Once I understood the new failure mode for agents, I stopped trying to reorganize content. Epistemic governance accepts that organizational content will stay messy. It adds a governance layer on top without reorganizing the underlying documents.
The analogy is DNS, the Domain Name System, which powers the web. DNS doesn’t reorganize the internet or deduplicate websites. It simply provides a lightweight registry that maps domain names to IP addresses and includes mechanisms for resolving conflicts (TTL expiration, authoritative nameservers).
Epistemic governance for unstructured data works the same way. It has three components that work together.
First is the authority registry. This records who knows what. For each knowledge domain and subdomain, you capture who the recognized expert is, what the canonical source document is, what documents it supersedes, and when it was last validated. This is a small, maintained data structure, not a reorganization of documents themselves. The documents stay where they are. You’re just making the informal authority network explicit and machine-queryable.
An example entry: “For wholesale vehicle price adjustments, Jane Smith (Senior Pricing Analyst) is the authority. The canonical source is wholesale-adj-v3.docx on SharePoint, last validated Feb 2025. It supersedes the 2022 Confluence page and the 2023 draft that was never finalized.”
Don’t worry, the registry doesn’t need to cover everything. Start with 5-10 subdomains in a single knowledge area. Each entry comes from a 90-minute session with the domain expert where you identify the canonical sources, document what they supersede, and set a validation cadence.
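A registry entry is small enough to sketch directly. Here is one minimal shape it could take, assuming a flat record per subdomain; the field names (`subdomain`, `expert`, `canonical_source`, `supersedes`, `last_validated`) and the 12-month staleness threshold are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class AuthorityEntry:
    """One row in the authority registry: who owns a knowledge subdomain."""
    subdomain: str
    expert: str
    canonical_source: str              # URL or path; only this changes on platform migrations
    last_validated: date               # set by expert review, not file modification time
    supersedes: list[str] = field(default_factory=list)

    def is_stale(self, today: date, max_age_days: int = 365) -> bool:
        # Matches the resolution rules below: over 12 months means re-validate.
        return (today - self.last_validated).days > max_age_days

entry = AuthorityEntry(
    subdomain="wholesale vehicle price adjustments",
    expert="Jane Smith (Senior Pricing Analyst)",
    canonical_source="sharepoint://pricing/wholesale-adj-v3.docx",
    last_validated=date(2025, 2, 1),
    supersedes=["confluence://pricing-2022", "draft-2023 (never finalized)"],
)
```

Note that `last_validated` is a first-class field rather than something derived from the file system: the point is that an expert vouched for the document on that date.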
Second is the contradiction log. This tracks known conflicts between sources. When two sources are known to conflict, you record what specifically contradicts (precise claims, not vague descriptions), who reviewed it, and what the resolution is.
Contradictions come in types:
Resolved conflicts have a clear answer: one source is right, the other is wrong. The log records “Use exponential depreciation curve from Source B. Source A’s linear model was replaced in Q3 2024.”
Accepted conflicts, however, are intentional. Both sources are correct for different contexts. Wholesale and retail pricing models legitimately use different methods. The log records “Present both approaches and ask user which context applies.”
Under review means identified but not yet adjudicated. The log captures that the conflict exists so agents can caveat their responses until it’s resolved.
Third are the resolution rules. These are the policies agents follow when encountering ambiguity. Start with five simple rules that handle most scenarios:
If the contradiction log has an explicit resolution for this conflict, follow it. This overrides everything else.
When sources conflict and one is registered as canonical, defer to it. Exception: if validation is stale (over 12 months), flag for re-validation.
If a question falls outside registered subdomains, answer with available information but caveat that the topic isn’t covered by the governance system. Log the gap.
If canonical source validation is stale, use it anyway (stale governance beats none) but caveat the staleness and flag for review.
When a contradiction is marked as accepted (intentional), present both approaches and ask the user which context applies.
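The five rules above are a priority-ordered policy, which makes them easy to express as a single function. The following is a minimal sketch, not a real agent API: the dict shapes for `registry` and `log`, the `conflict_key` lookup, and the returned action strings are all assumptions made for illustration.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=365)  # "over 12 months" from the rules above

def resolve(conflict_key: str, subdomain: str,
            registry: dict, log: dict, today: date) -> dict:
    """Apply the five resolution rules in priority order.

    registry maps subdomain -> {"canonical_source": str, "last_validated": date}
    log maps conflict_key -> {"status": "resolved"|"accepted"|"under_review",
                              "resolution": str}
    """
    # Rule 1: an explicit logged resolution overrides everything else.
    entry = log.get(conflict_key)
    if entry and entry["status"] == "resolved":
        return {"action": "follow_resolution", "detail": entry["resolution"]}

    # Rule 5: accepted (intentional) conflicts -> present both, ask which context.
    if entry and entry["status"] == "accepted":
        return {"action": "present_both_ask_context"}

    # Rule 3: outside registered subdomains -> answer with a caveat, log the gap.
    reg = registry.get(subdomain)
    if reg is None:
        return {"action": "answer_with_caveat", "flag": "log_gap"}

    # Rules 2 and 4: defer to the canonical source; if validation is stale,
    # use it anyway (stale governance beats none) but flag for re-validation.
    stale = today - reg["last_validated"] > STALE_AFTER
    return {
        "action": "defer_to_canonical",
        "source": reg["canonical_source"],
        "flag": "revalidate" if stale else None,
    }

registry = {"wholesale pricing": {"canonical_source": "wholesale-adj-v3.docx",
                                  "last_validated": date(2025, 2, 1)}}
log = {"wholesale-vs-retail": {"status": "accepted",
                               "resolution": "context-dependent"}}
```

The ordering is the whole design: explicit human adjudication (rules 1 and 5) always beats the registry default, and the registry default always beats answering from ungoverned content.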
Why this compounds
The authority registry and contradiction log become more valuable over time.
Each new domain entry makes the system more useful through network effects. An agent that can handle wholesale pricing conflicts becomes more capable when retail pricing gets added.
Once you have a registry, agents encountering unregistered subdomains generate demand signals for where governance should expand next. Your gaps become visible through conflict detection.
The last_validated date creates pressure to keep expertise fresh. When a canonical source goes stale, the system flags it. This validation forcing function is better than relying on document modification timestamps because currency is about expert validation, not file system events.
The registry survives platform migrations. When your organization inevitably moves from Confluence to Notion (Or back. Again.), the authority records stay valid. You just update the URLs in the canonical_source field. Compare this to soul-crushing curation work that has to be redone with each tool change. The governance layer is portable across tools.
Let’s compare
The two approaches aren’t mutually exclusive. You can do both. But they have different cost profiles, sustainability characteristics, and failure modes. Curation is expensive and fragile but visible (executives can see reorganized folders). Epistemic governance is cheaper and more durable but requires accepting that metadata about content matters more than reorganization of content.
Wait, what about usage signals?
I hear some of you already asking why we aren’t just crowdsourcing the right data, as many tools today recommend. Usage data is valuable, but not as governance input; it’s valuable as triage input. Which documents agents actually retrieve, how often, and in what contexts all matter, but retrieval isn’t trust.
High retrieval frequency on an unregistered subdomain tells you where to register authority next. Conflicts discovered through agent use tell you what to add to the contradiction log. Usage patterns are diagnostic: they tell you where governance attention is needed, not which sources are authoritative.
This distinction matters. If you let usage signals directly determine authority (the most-retrieved document is the best), you’ve removed the expert from the loop. That collapses the system back to the same problem it was designed to solve: machines deciding what machines should trust.
Usage data tells you where to apply expert judgment. It doesn’t replace expert judgment.
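In practice the triage step can be as simple as counting retrievals that fall outside the registry. A minimal sketch, assuming retrieval events are already tagged with a subdomain (the tagging itself is the hard part and is not shown here):

```python
from collections import Counter

def triage(retrievals: list[str], registered: set[str], top_n: int = 3) -> list[str]:
    """Rank unregistered subdomains by retrieval frequency.

    Usage decides where experts should look next; it never decides
    which source is authoritative.
    """
    counts = Counter(s for s in retrievals if s not in registered)
    return [subdomain for subdomain, _ in counts.most_common(top_n)]

registered = {"wholesale pricing"}
hits = ["retail pricing", "retail pricing", "wholesale pricing", "trade-in valuation"]
print(triage(hits, registered))  # ['retail pricing', 'trade-in valuation']
```

The output is a work queue for the 90-minute expert sessions, not a ranking of canonical sources.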
What changed my thinking
Three realizations shifted my approach from content curation to epistemic governance.
First was understanding that agents experience document chaos differently than humans do. Humans get frustrated and inefficient. Agents get confidently wrong. This changes the economics of governance because the cost of ungoverned data shifted from friction to automation risk. Once I saw this, I couldn’t unsee it.
Second was the DNS analogy. I was thinking about knowledge governance as content reorganization because that’s how most data governance literature frames it: schemas, taxonomies, data lakes. But DNS doesn’t reorganize the internet. It provides a lightweight registry layer. Dave McComb’s Software Wasteland touches on this with knowledge graphs as governance infrastructure, but the analogy to DNS as a minimal viable registry clarified something for me. You can govern without reorganizing.
Third was recognizing that usage data tells you where to look, not what to trust. I initially thought about letting usage signals become authority decisions. Most-retrieved document becomes canonical, right? But that conflates popularity with correctness (wasn’t true in high school, isn’t true today). That works in consumer web contexts (PageRank, collaborative filtering) but breaks in enterprise contexts where popularity and correctness diverge, especially when incentives favor document creation over document maintenance. Usage data is triage input. It tells you where to apply expert judgment. It doesn’t replace expert judgment.
I was wrong about AI readiness because I was solving for retrieval when I should have been solving for trust. The documents don’t need to be findable in a cleaner folder structure. They need to carry signals about authority, currency, and conflict resolution that agents can act on.
The work isn’t reorganizing content. It’s making the informal authority networks in your organization explicit and machine-queryable. That’s a different kind of governance, with different economics and different durability characteristics. It compounds instead of decaying.
Further reading
Martin Kleppmann, Designing Data-Intensive Applications: Chapters 5 and 9 on eventual consistency and conflict resolution. The concepts translate directly from distributed systems to unstructured data governance.
Dave McComb, Software Wasteland: Makes the case that we over-invest in organizing data and under-invest in making meaning explicit.
Andrew Ng, Data-Centric AI Resource Hub: The case for investing in data quality with human-in-the-loop processes rather than relying on model improvement alone.