Ask any data engineer about their organization's classification systems and you'll hear the same story:
"We have equipment types in the maintenance system, but they don't match the equipment types in the asset register. Finance uses different product categories than Operations. The taxonomy exists, sort of, but nobody knows the authoritative version. We think Sarah in Engineering has the latest Excel file, but she's been here 20 years and it's all in her head anyway."
These informal codesets work well enough for human understanding. People learn the terminology, figure out the exceptions, and develop institutional knowledge about what classifications really mean.
Then you try to build AI systems that need to understand this classification chaos - and everything breaks.
The core problem: Enterprise AI requires machine-readable taxonomies with formal specifications, URIs, versioning, and governance. Most organizations have none of these things. The gap between "informal codeset that people understand" and "formal taxonomy that systems can process" is where enterprise AI projects go to die.
What Makes a Codeset "Informal"?
Informal codesets share common characteristics that make them unsuitable for enterprise AI:
No Unique Identifiers
Values are represented as human-readable strings without stable identifiers:
- "Type-A Equipment" in the maintenance system
- "Type A Equipment" in the asset register (note the space)
- "Equipment Type A" in the financial system
- "A-Type" in historical documents
These are all meant to refer to the same thing, but systems can't know that. String matching fails. Different variations create duplicate classifications. AI systems treat them as four different equipment types.
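A small sketch makes the failure concrete. It compares the four strings above exactly, then after a simple normalization heuristic (lowercase, strip punctuation, sort tokens). The `normalize` helper is a hypothetical illustration, not a recommended matching strategy.

```python
import re

def normalize(label: str) -> str:
    """Lowercase, split on non-alphanumeric characters, and sort the tokens."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    return " ".join(sorted(tokens))

variants = [
    "Type-A Equipment",   # maintenance system
    "Type A Equipment",   # asset register
    "Equipment Type A",   # financial system
    "A-Type",             # historical documents
]

# Exact string comparison sees every spelling as its own classification.
exact_groups = set(variants)

# The heuristic merges the first three spellings but still misses "A-Type".
normalized_groups = {normalize(v) for v in variants}

print(len(exact_groups))
print(sorted(normalized_groups))
```

Even with normalization, "A-Type" stays separate, so heuristics narrow the problem without solving it. Only an explicit, maintained alias list can state that all four strings mean the same thing.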
No Version Control
Classifications evolve organically over time:
- 2018: "Type-A" meant one thing
- 2020: Definition changed but old documents weren't updated
- 2023: "Type-A" split into "Type-A1" and "Type-A2"
- 2024: "Type-A1" was deprecated in favor of "Type-B"
Nobody documented these changes. Historical data uses old classifications. Current systems use new ones. No specification explains how to reconcile them. AI systems have no idea which version of "Type-A" a document is referencing.
No Formal Definitions
Classifications are defined implicitly through usage, not explicitly through specifications:
- "Everyone knows what Type-A means" (but they don't - ask three people, get three answers)
- "Just look at the examples" (but examples show edge cases and exceptions, not core definitions)
- "It's in the training materials somewhere" (but which version? From which year?)
Humans can tolerate this ambiguity. AI systems cannot.
No Governance
Different departments maintain their own classifications independently:
- Engineering has equipment taxonomies
- Finance has asset categories
- Operations has process classifications
- Procurement has supplier types
These taxonomies overlap but aren't reconciled. Nobody has authority to standardize across departments. Cross-references don't exist. Integration requires manual mapping that breaks when any taxonomy changes.
Why This Breaks Enterprise AI
Modern AI systems - particularly RAG, knowledge graphs, and semantic search - require taxonomies that informal codesets can't provide:
RAG Retrieval Accuracy Depends on Semantic Coherence
When your documents use inconsistent terminology, RAG systems can't retrieve accurately:
- User searches for "Type-A equipment"
- System retrieves documents mentioning "Type-A"
- Misses relevant documents using "Type A", "Equipment Type A", "A-Type", or "Category-1" (if that's what Engineering calls it)
- Retrieval accuracy: 40-60% instead of 85-95%
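General-purpose embeddings capture broad semantic similarity, but they have no way to know that organization-specific codes such as "Type-A" and "Category-1" name the same class. Without formal taxonomy mappings, your RAG system treats synonymous terms as different concepts.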
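One way to close this gap is to expand the query with the taxonomy's aliases before retrieval. The sketch below uses plain keyword matching as a stand-in for the retrieval step. The alias group, the `expand_query` and `retrieve` helpers, and the sample documents are all hypothetical, and "Category-1" is included only because the example above supposes Engineering might use it.

```python
# One alias group per concept; in practice this would come from the taxonomy.
ALIAS_GROUPS = [
    {"Type-A", "Type A", "Equipment Type A", "A-Type", "Category-1"},
]

def expand_query(term: str) -> set:
    """Return every known spelling of the concept that `term` belongs to."""
    for group in ALIAS_GROUPS:
        if term in group:
            return set(group)
    return {term}

def retrieve(terms, documents):
    """Case-insensitive keyword match: keep documents containing any term."""
    lowered = [t.lower() for t in terms]
    return [d for d in documents if any(t in d.lower() for t in lowered)]

documents = [
    "Type-A pump overhaul procedure",
    "Type A vibration limits",
    "Spare parts list for Equipment Type A",
    "A-Type commissioning report",
    "Category-1 inspection checklist",
    "Type-C seal replacement guide",
]

naive_hits = retrieve({"Type-A"}, documents)
expanded_hits = retrieve(expand_query("Type-A"), documents)

print(len(naive_hits), len(expanded_hits))
```

In this toy corpus the unexpanded query finds one of the five relevant documents, while the expanded query finds all five and still excludes the Type-C guide. A production system would apply the same idea to metadata filters or hybrid search rather than substring matching.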
Knowledge Graphs Need Stable Identifiers
Knowledge graphs connect entities through relationships. But if entity identifiers aren't stable, the graph breaks:
- Document from 2018 references "Type-A" (meaning the old definition)
- Document from 2024 references "Type-A" (meaning the new definition)
- Knowledge graph creates one node for "Type-A"
- Combines incompatible information from different time periods
- Relationships become meaningless
Without version control and stable identifiers, you can't build reliable knowledge graphs.
Analytics Requires Cross-System Reconciliation
Enterprise analytics combines data from multiple systems. If classifications don't map cleanly, analysis fails:
- Finance system: "Category-A assets" cost £X
- Maintenance system: "Type-A equipment" generated Y work orders
- Are these the same thing? Different systems, different naming
- Without formal cross-reference, you can't calculate cost per work order
- Analytics requires manual reconciliation - which breaks when taxonomies evolve
Machine Learning Needs Labeled Training Data
ML models require consistent labels. Informal taxonomies create label noise:
- Training data spans 10 years
- Classification definitions changed 3 times during that period
- Same label means different things at different times
- Model learns contradictory patterns
- Accuracy plateaus at 60-70% regardless of model sophistication
Better models can't fix inconsistent training labels caused by informal taxonomies.
What Formal Taxonomies Look Like
Formal taxonomies have specific characteristics that enable enterprise AI:
1. Unique, Stable Identifiers (URIs)
Every classification value has a unique identifier that never changes:
taxonomy:equipment/type-a-v2
  label: "Type-A Equipment"
  aliases: ["Type A", "Equipment Type A", "A-Type"]
  definition: "Rotating equipment with specified characteristics..."
  validFrom: 2023-01-01
  supersedes: taxonomy:equipment/type-a-v1
Now systems can reliably identify what you're referring to regardless of which string variation someone uses.
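As a rough sketch of how a system might consume such a record, the following loads the entry into a small data structure and builds a case-insensitive index from every label and alias to its identifier. The `Concept` class and `build_alias_index` function are illustrative names, not part of any particular taxonomy platform.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

@dataclass(frozen=True)
class Concept:
    uri: str
    label: str
    aliases: Tuple[str, ...] = ()
    definition: str = ""
    valid_from: Optional[date] = None
    supersedes: Optional[str] = None

def build_alias_index(concepts):
    """Map every label and alias (case-insensitively) to its concept URI."""
    index = {}
    for concept in concepts:
        for name in (concept.label, *concept.aliases):
            key = name.casefold()
            # An alias claimed by two concepts is a taxonomy error, so fail loudly.
            if key in index and index[key] != concept.uri:
                raise ValueError(f"alias {name!r} maps to two concepts")
            index[key] = concept.uri
    return index

type_a = Concept(
    uri="taxonomy:equipment/type-a-v2",
    label="Type-A Equipment",
    aliases=("Type A", "Equipment Type A", "A-Type"),
    definition="Rotating equipment with specified characteristics...",
    valid_from=date(2023, 1, 1),
    supersedes="taxonomy:equipment/type-a-v1",
)

index = build_alias_index([type_a])
print(index["A-Type".casefold()])
```

Rejecting an alias that points at two concepts is a deliberate choice here: an ambiguous alias is better caught when the taxonomy is loaded than discovered later as a silent misclassification.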
2. Version Control
Taxonomies evolve, but changes are tracked explicitly:
taxonomy:equipment/type-a-v1 (deprecated 2023-01-01)
  replacedBy: [
    taxonomy:equipment/type-a1-v1,
    taxonomy:equipment/type-a2-v1
  ]
taxonomy:equipment/type-a1-v1 (deprecated 2024-06-01)
  replacedBy: taxonomy:equipment/type-b-v1
Now when an AI system encounters "Type-A" in a 2018 document, it can determine which version was valid then and how that maps to current classifications.
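The lookup described here can be sketched in a few lines. The deprecation dates and replacement links come from the example above. The `valid_from` dates are assumptions added for illustration (the text only places the original definition in 2018), as are the `Version`, `valid_on`, and `current_equivalents` names.

```python
from datetime import date
from typing import NamedTuple, Optional, Tuple

class Version(NamedTuple):
    valid_from: date               # assumed start dates, not given in the text
    deprecated_on: Optional[date]  # None means still current
    replaced_by: Tuple[str, ...]

A_V1 = "taxonomy:equipment/type-a-v1"
A1_V1 = "taxonomy:equipment/type-a1-v1"
A2_V1 = "taxonomy:equipment/type-a2-v1"
B_V1 = "taxonomy:equipment/type-b-v1"

VERSIONS = {
    A_V1: Version(date(2018, 1, 1), date(2023, 1, 1), (A1_V1, A2_V1)),
    A1_V1: Version(date(2023, 1, 1), date(2024, 6, 1), (B_V1,)),
    A2_V1: Version(date(2023, 1, 1), None, ()),
    B_V1: Version(date(2024, 6, 1), None, ()),
}

def valid_on(when: date) -> list:
    """URIs whose validity window contains the given document date."""
    return [
        uri for uri, v in VERSIONS.items()
        if v.valid_from <= when and (v.deprecated_on is None or when < v.deprecated_on)
    ]

def current_equivalents(uri: str) -> list:
    """Follow replacedBy links until only non-deprecated URIs remain."""
    version = VERSIONS[uri]
    if version.deprecated_on is None:
        return [uri]
    result = []
    for successor in version.replaced_by:
        for leaf in current_equivalents(successor):
            if leaf not in result:
                result.append(leaf)
    return result

print(valid_on(date(2018, 6, 1)))
print(current_equivalents(A_V1))
```

Because the original class was split, one historical identifier maps to two current ones. Choosing between them for a given record still needs the formal definitions or a human decision, but the system now knows exactly which candidates to consider.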
3. Formal Definitions
Classifications have explicit, machine-readable definitions:
taxonomy:equipment/type-a-v2
  definition: "Centrifugal pump with the following characteristics:
    - Flow rate: 100-500 GPM
    - Discharge pressure: 50-150 PSI
    - Motor power: 10-50 HP
    - Applications: Process fluid transfer in chemical plants"
  includes:
    - All equipment meeting above specifications
    - Regardless of manufacturer or specific model
  excludes:
    - Positive displacement pumps (see taxonomy:equipment/type-c)
    - Pumps <100 GPM (see taxonomy:equipment/type-a-small)
Now AI systems can determine whether new equipment should be classified as "Type-A" based on formal criteria rather than guessing from examples.
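To illustrate, the criteria above translate almost directly into a rule-based check. This is a minimal sketch under several assumptions: range bounds are treated as inclusive, the "see ..." pointers in the exclusions are read as the classification to assign, and the application criterion is ignored because it is not a measurable attribute. The `Pump` class and `classify` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Pump:
    pump_type: str        # "centrifugal" or "positive_displacement"
    flow_gpm: float
    discharge_psi: float
    motor_hp: float

def classify(pump: Pump) -> Optional[str]:
    """Return a taxonomy URI, or None when no formal definition matches."""
    # Exclusion: positive displacement pumps are covered by a different class.
    if pump.pump_type == "positive_displacement":
        return "taxonomy:equipment/type-c"
    if pump.pump_type != "centrifugal":
        return None
    # Exclusion: pumps below 100 GPM are covered by the small variant.
    if pump.flow_gpm < 100:
        return "taxonomy:equipment/type-a-small"
    in_range = (
        pump.flow_gpm <= 500
        and 50 <= pump.discharge_psi <= 150
        and 10 <= pump.motor_hp <= 50
    )
    return "taxonomy:equipment/type-a-v2" if in_range else None

print(classify(Pump("centrifugal", 250, 100, 25)))
print(classify(Pump("centrifugal", 600, 100, 25)))
```

Returning `None` for equipment that fits no definition matters as much as the positive matches: it exposes gaps in the taxonomy instead of forcing a guess.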
4. Cross-References
Formal mappings connect related classifications across systems:
taxonomy:engineering/type-a-v2
  sameAs: taxonomy:finance/category-a-v1
  sameAs: taxonomy:operations/process-equipment-1-v3
  relatedTo: taxonomy:maintenance/rotating-equipment-v1
  broaderThan: taxonomy:procurement/pump-category-v2
Now systems can reconcile across departments automatically instead of requiring manual mapping.
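A sketch of that reconciliation, using only the `sameAs` links (the `relatedTo` and `broaderThan` links are deliberately not treated as equivalence). It groups equivalent identifiers with a small union-find and then joins Finance costs to work-order counts. The cost and work-order figures are made up for illustration, and tagging work orders with the Engineering identifier is an assumption.

```python
SAME_AS = [
    ("taxonomy:engineering/type-a-v2", "taxonomy:finance/category-a-v1"),
    ("taxonomy:engineering/type-a-v2", "taxonomy:operations/process-equipment-1-v3"),
]

def build_canonical(pairs):
    """Map each URI to one representative of its sameAs equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller URI as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {uri: find(uri) for uri in parent}

def cost_per_work_order(cost_by_uri, orders_by_uri, canonical):
    """Join two systems' figures on the canonical identifier."""
    costs, orders = {}, {}
    for uri, cost in cost_by_uri.items():
        key = canonical.get(uri, uri)
        costs[key] = costs.get(key, 0.0) + cost
    for uri, count in orders_by_uri.items():
        key = canonical.get(uri, uri)
        orders[key] = orders.get(key, 0) + count
    return {k: costs[k] / orders[k] for k in costs if orders.get(k)}

canonical = build_canonical(SAME_AS)
finance_cost = {"taxonomy:finance/category-a-v1": 120_000.0}   # made-up figure
work_orders = {"taxonomy:engineering/type-a-v2": 48}           # made-up figure

print(cost_per_work_order(finance_cost, work_orders, canonical))
```

With these illustrative numbers the join yields 2,500 per work order, a figure neither system could produce alone.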
5. Governance Metadata
The taxonomy includes information about authority, ownership, and the change process:
taxonomy:equipment/type-a-v2
  maintainedBy: "Engineering Standards Committee"
  approvedBy: "Chief Engineer"
  approvalDate: 2023-01-01
  reviewSchedule: "Annual"
  changeProcess: "Requires committee approval + 30-day notice"
  contactEmail: "taxonomy@company.com"
Now there's clear authority for taxonomy decisions and a defined process for evolution.
The Transformation Path
Moving from informal codesets to formal taxonomies follows a predictable process:
Phase 1: Discovery and Documentation
Identify all existing classification systems:
- Interview domain experts who maintain informal taxonomies
- Extract classifications from operational systems
- Document current usage patterns and variations
- Map relationships between different departments' classifications
- Identify conflicts, ambiguities, and gaps
Timeline: 2-4 weeks per major codeset
Cost: £10,000-£15,000 per codeset
Phase 2: Formalization
Convert informal codesets to formal specifications:
- Assign URIs to all classification values
- Write formal definitions with inclusion/exclusion criteria
- Document historical versions and evolution
- Create cross-reference mappings
- Establish governance process
Timeline: 4-8 weeks per major codeset
Cost: £20,000-£40,000 per codeset
Phase 3: Implementation
Deploy formal taxonomies in operational systems:
- Load taxonomies into governance platform or triple store
- Provide APIs for system integration
- Build mapping layers for legacy systems
- Train users on new taxonomy structure
- Establish change management process
Timeline: 6-12 weeks
Cost: £30,000-£60,000 for foundational infrastructure
Phase 4: Continuous Governance
Maintain and evolve taxonomies over time:
- Regular review cycles (quarterly or annually)
- Change request process with impact analysis
- Version releases with migration guidance
- Usage monitoring and quality metrics
Ongoing cost: £10,000-£15,000 per quarter
The ROI of Formal Taxonomies
Organizations investing in formal taxonomies see returns across multiple dimensions:
AI Implementation Success
RAG retrieval accuracy improves from 40-60% to 85-95%. Knowledge graphs become reliable. ML models achieve 15-20 percentage point accuracy gains. AI projects that would have failed become successful.
Impact: Avoid £250,000+ in failed AI project costs, realize intended AI ROI
Cross-System Integration
Analytics work across systems because classifications map cleanly. Integration projects stop requiring endless manual reconciliation. M&A integrations happen in weeks instead of months.
Impact: 40-60% reduction in integration costs, faster time to value
Operational Efficiency
Consistent classifications enable automation. Reporting becomes accurate. Compliance is easier. Users spend less time clarifying what terminology means.
Impact: 10-20% efficiency gains in data-dependent operations
Organizational Agility
New AI initiatives start faster because data is already standardized. Technology changes don't require taxonomy rework. Business evolution proceeds without data obstacles.
Impact: Strategic capability that compounds over time
Typical ROI: Invest £80,000-£120,000 in taxonomy standardization, avoid £250,000+ in failed AI projects, realize £100,000+ annual operational efficiency gains. Payback period: 6-12 months.
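The payback figure can be sanity-checked with simple arithmetic. The sketch below assumes benefits accrue evenly through the year and counts only recurring annual gains. How much of the one-off avoided project cost to include is a judgment call left to the reader, and `payback_months` is a hypothetical helper.

```python
def payback_months(investment: float, annual_benefit: float) -> float:
    """Months until cumulative benefit equals the up-front investment."""
    if annual_benefit <= 0:
        raise ValueError("annual_benefit must be positive")
    return 12 * investment / annual_benefit

# Investment range from the text, against the £100,000 annual floor.
for investment in (80_000, 120_000):
    months = payback_months(investment, 100_000)
    print(f"£{investment:,} -> {months:.1f} months")
```

At exactly £100,000 a year this gives roughly 10 to 14 months. The 6-12 month range corresponds to annual gains above that floor, or to counting part of the avoided project cost as a benefit.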
The Alternative: Continuing with Informal Taxonomies
Some organizations choose to continue with informal codesets. Here's what that costs:
- Every AI project requires custom taxonomy work - reinventing the same wheel repeatedly
- RAG and semantic search remain unreliable - users don't trust the systems
- Analytics projects consume months of manual reconciliation
- M&A integrations take 12-24 months instead of 8-12 weeks
- Digital transformation initiatives stall waiting for data standardization
- Competitive advantage erodes as other organizations achieve operational AI
The question isn't whether to formalize taxonomies - eventually, it becomes unavoidable. The question is whether to do it proactively as strategic investment, or reactively after multiple expensive project failures.
Looking Forward
As AI adoption accelerates, the gap between informal codesets and formal taxonomies becomes a bottleneck. Organizations with mature taxonomy infrastructure will deploy AI systems in weeks that take competitors months. They'll achieve higher accuracy, better integration, and faster ROI.
The work of taxonomy formalization isn't glamorous. It doesn't make headlines. But it's the foundation that determines whether enterprise AI actually works in production or remains perpetually stuck in proof-of-concept.
"Informal codesets work for humans. Formal taxonomies work for machines. Enterprise AI needs both humans and machines - so you need formal taxonomies."