AI Interpretability and AI Safety - Variety Dynamics

Dr Terence Love

A Variety Dynamics Case Study | December 2025

(c) December 2025 Terence Love, Love Services Pty Ltd

Executive Summary

Current AI interpretability methods face fundamental limitations that variety dynamics explains structurally rather than as technical problems requiring better tools. This case study analyses symbolic and subsymbolic interpretability approaches through variety dynamics axioms, revealing why these methods struggle with modern AI systems and suggesting alternative control-focused strategies. The analysis demonstrates that the interpretability challenge stems from variety mismatch (human cognitive capacity versus AI processing complexity) and multi-feedback-loop opacity rather than insufficient technical sophistication.

The Core Challenge: AI Systems as Variety-Processing Systems

Neural networks transform input variety (data, queries, contexts) through architectural structures into output variety (predictions, classifications, generated text). Understanding these transformations requires analysing how variety distributions shift through processing layers and where control over these transformations resides.

Axiom 28 establishes that all variety processing requires physical substrate—AI systems instantiate variety transformations in computational hardware subject to Axiom 26's thermodynamic constraints. However, the critical limitation isn't physical but cognitive.

Axiom 41 provides the foundational explanation for interpretability difficulty: Mental prediction fails beyond two interacting feedback loops. Modern neural networks contain numerous feedback mechanisms—layer interactions, attention dynamics, backpropagation, recurrent connections, training feedback—placing their behaviour beyond human mental prediction capacity. This isn't a temporary limitation pending better tools but a fundamental cognitive boundary.

Axiom 42 reinforces this: Beyond two feedback loops, control mechanisms become "effectively invisible to those affected" through mental analysis alone. Researchers cannot mentally trace how input varieties transform through multi-loop structures to produce outputs, regardless of visualisation sophistication or analytical effort.

Symbolic Methods: Reducing Feedback Complexity

Decision Trees and Rule Extraction

Decision tree extraction attempts to approximate neural network behaviour through hierarchical rule structures. Axiom 18 (Control strategies for problematic subsystems) frames this as variety attenuation: The neural network generates prediction variety through multi-loop processing exceeding human control capacity. Researchers respond by replacing the network with a simpler approximation.

Axiom 3 (Variety topology determines system evolution) reveals the fundamental problem: Decision trees impose hierarchical variety topology onto systems whose actual topology is non-hierarchical and continuous. Each node partitions input space into discrete branches, but neural networks process variety through continuous transformations across distributed representations. The topological mismatch means decision trees cannot capture neural network behaviour accurately.

Practical failure through transaction costs: Axiom 35 (Transaction costs increase with variety) and Axiom 36 (Exponential transaction cost scaling) explain why adequate tree approximations become impractical. Representing neural network variety requires exponentially growing tree size as input dimensions and their interactions multiply. The transaction costs (computational complexity, interpretability loss) exceed the benefits the approximation was meant to provide.
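
To make the attenuation trade-off concrete, here is a minimal sketch assuming scikit-learn, with a small MLPClassifier standing in for a real network: decision-tree surrogates of increasing depth are fitted to the network's own predictions, and fidelity rises only as the tree's leaf count balloons.

```python
# Surrogate decision-tree extraction: fidelity vs. tree size (illustrative sketch).
# The "network" here is a small MLP stand-in, not a production model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=15, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0).fit(X, y)
net_labels = net.predict(X)  # the behaviour the surrogate must reproduce

for depth in (3, 6, 9, 12, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, net_labels)
    fidelity = (tree.predict(X) == net_labels).mean()
    print(f"depth={depth}: fidelity={fidelity:.3f}, leaves={tree.get_n_leaves()}")
```

On held-out data the fidelity of the smaller trees is typically lower still, so the interpretable approximation is either unfaithful or too large to read, which is the transaction-cost point above.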

Logical Rule Sets

Logical rule extraction faces similar constraints. Axiom 31 (Irreversibility and open boundaries) establishes that variety-generating systems typically cannot reverse to prior states. Neural networks during training generate new representational varieties—learned features, weight patterns, emergent computational structures—that cannot "unlearn" back into simple logical rules.

Axiom 7 (Variety generation requires control mechanisms) shows the constraint incompatibility: Continuous, high-dimensional, non-linear variety spaces in neural networks operate under constraints (loss functions, architectural boundaries) incompatible with discrete, Boolean, rule-based constraints. Translation between these constraint structures inevitably loses essential variety relationships.

Subsymbolic Methods: Mapping Variety Transformations

Activation Maximisation and Feature Visualisation

These methods identify input varieties maximally activating specific neurons, attempting to map individual processing unit behaviour.
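
A minimal sketch of the technique, assuming PyTorch and using an untrained toy convolutional network purely for illustration: gradient ascent on the input searches for a pattern that maximally excites one chosen channel.

```python
# Activation maximisation sketch (PyTorch): gradient ascent on the input to find
# a pattern that maximally excites one channel. The conv net is untrained and
# purely illustrative; real work uses a trained model and stronger regularisers.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
model.eval()

x = torch.randn(1, 3, 64, 64, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)
target_channel = 7                                   # the "neuron" (channel) to probe

for step in range(200):
    optimizer.zero_grad()
    activation = model(x)[0, target_channel].mean()  # mean activation of the channel
    loss = -activation + 1e-3 * x.norm()             # ascend activation, mild L2 penalty
    loss.backward()
    optimizer.step()

print(f"last measured mean activation of channel {target_channel}: {activation.item():.3f}")
```

The method searches input variety space for whatever maximises one local response; nothing in the procedure measures how much that response matters downstream, which is the limitation developed next.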

Axiom 17 (Variety dynamics and system control) establishes that variety distributions across components determine control capacity. Feature visualisation reveals which input varieties individual neurons respond to, but Axiom 4 (Subsystem variety and power transfer) shows this misses the critical dynamic: Power in the network resides not in individual neurons but in how varieties transfer and transform across layers.

A neuron responding strongly to edge patterns may have minimal influence on final classifications if downstream layers ignore its outputs. The variety that neuron processes doesn't determine its control over network behaviour. Axiom 1 (Power emerges from variety topology) explains that power follows from position in variety-processing topology, not from amount of variety processed locally.

Axiom 14 (Time as variety dimension) adds complexity: The same neuron may process different varieties at different timesteps or input examples. Activation maximisation captures static preferences, missing temporal dynamics determining actual influence.

Attention Mechanism Analysis

Attention visualisations claim to show "what the model attends to," but Axiom 11 (Variety, structure hegemony, and distributions) reveals this conflates attention with control.

High attention weights indicate variety amplification—the mechanism increases variety from certain input positions reaching subsequent layers. However, Axiom 27 (Power and variety as interchangeable resources) shows variety amplification is only one control mechanism. The network might attend to contradictions requiring resolution (variety processing) or noise requiring filtering (variety attenuation). Attention indicates where variety processing occurs, not what control is exercised.
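
A toy NumPy illustration of the conflation: position 0 below receives most of the attention weight, yet because its value vector is nearly zero it barely moves the output.

```python
# Toy illustration (NumPy): a high attention weight need not mean high influence
# on the output. Contribution depends on both the weight and the value vector it gates.
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([4.0, 1.0, 0.5])               # raw attention scores for 3 positions
weights = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights

V = rng.normal(size=(3, 8))
V[0] *= 0.01                                     # position 0: high weight, near-zero value vector

output = weights @ V                                   # attention output for one query
contribution = weights * np.linalg.norm(V, axis=1)     # per-position pull on the output

for i, (w, c) in enumerate(zip(weights, contribution)):
    print(f"position {i}: attention={w:.2f}, contribution={c:.3f}")
```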

Axiom 23 (System feedback loops increase control variety) compounds this: Attention mechanisms participate in feedback loops where patterns influence subsequent computations which influence future attention. Visualisation shows one moment in this cycle, missing multi-loop dynamics determining actual control.

Gradient-Based Attribution

Methods like Layer-wise Relevance Propagation and Integrated Gradients backpropagate through networks to attribute output varieties to input varieties, attempting to reveal which features "contribute" to predictions.
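
For concreteness, a minimal Integrated Gradients sketch in PyTorch, using a synthetic two-layer model and input as stand-ins: gradients are accumulated along the straight-line path from a baseline to the input and scaled by the input-baseline difference.

```python
# Integrated Gradients sketch (PyTorch): attribute an output to input features by
# accumulating gradients along a straight path from a baseline to the input.
# The two-layer model and the inputs are synthetic stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

x = torch.randn(1, 10)
baseline = torch.zeros_like(x)
steps = 64

total_grads = torch.zeros_like(x)
for alpha in torch.linspace(0.0, 1.0, steps):
    point = (baseline + alpha * (x - baseline)).requires_grad_(True)
    model(point).sum().backward()
    total_grads += point.grad

# Riemann approximation of the path integral, scaled by (input - baseline)
attributions = (x - baseline) * total_grads / steps
print(attributions)
```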

Axiom 15 (Reversibility and open boundaries) identifies the fundamental problem: Variety transformations in neural networks are generally irreversible. Information is lost at each layer through dimensionality reduction, non-linear activations, and dropout. Attempting to reverse variety transformations through gradient backpropagation reconstructs something, but not the actual forward transformation path.

Axiom 36 (Exponential transaction cost scaling) explains attribution breakdown in deep networks: Each layer multiplies possible attribution paths. With n layers, interaction paths grow combinatorially. Gradient methods must assign credit across exponentially growing path space, inevitably making simplifying assumptions that distort actual variety transformation.

The Fundamental Variety Mismatch

Axiom 38 (Calculability of optimal variety) establishes that when costs are measurable, optimal variety-control balances become calculable. This framework reveals the core interpretability challenge as a variety mismatch problem:

Neural Network Variety Space:

  • Dimensions: 10^6 to 10^12 parameters
  • Continuous transformations
  • Non-linear variety relationships
  • Multi-scale hierarchies
  • Temporal dynamics

Human Comprehension Variety Space:

  • Dimensions: ~5-7 concepts processable simultaneously
  • Preference for discrete categories
  • Linear causal narratives
  • Simple temporal models
  • Flat hierarchies

Required Variety Reduction: 10^5 to 10^11-fold compression
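
A quick order-of-magnitude check of that compression figure, assuming a working memory of roughly five to seven simultaneously held concepts:

```python
# Order-of-magnitude check of the compression figure above (assumed working-memory
# capacity of roughly 5-7 simultaneously held concepts).
import math

for params in (1e6, 1e12):
    for chunks in (5, 7):
        ratio = params / chunks
        print(f"{params:.0e} parameters / {chunks} concepts = 10^{math.log10(ratio):.1f}-fold reduction")
```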

This massive reduction necessitates one of three outcomes:

  1. Lossy compression (current methods): Capture some aspects while missing others
  2. False simplification (decision trees): Impose structure that doesn't exist
  3. Selective sampling (activation studies): Examine specific pathways without generalisation

Axiom 35 makes this economically inevitable: Managing full neural network variety space requires transaction costs (computational resources, analysis time, cognitive effort) exceeding practical bounds. Interpretability methods must reduce variety to reduce costs, but Axiom 36's exponential scaling means even modest variety reductions discard exponentially large amounts of information.

Hybrid Approaches: Bridging the Gap

Concept-Based Explanations

Methods like TCAV and Network Dissection identify high-level concepts in activation space, attempting to bridge the variety gap through intermediate abstractions.
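
A minimal TCAV-style sketch, using NumPy and scikit-learn with everything synthetic: a linear probe learns a concept activation vector (CAV) in a fabricated activation space, and concept sensitivity is scored as the alignment of a toy readout's gradient with that vector.

```python
# TCAV-style sketch (NumPy + scikit-learn): learn a concept direction (CAV) in a
# synthetic activation space with a linear probe, then score how much a toy model's
# output gradient aligns with that direction. Everything here is a synthetic stand-in.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 50
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# "Activations" for concept examples vs. random examples
concept_acts = rng.normal(size=(200, d)) + 2.0 * concept_dir
random_acts = rng.normal(size=(200, d))
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])    # concept activation vector

# Toy "model head": output = w . activations, so its gradient w.r.t. activations is w
w = rng.normal(size=d) + 1.5 * concept_dir               # a head that partly uses the concept
sensitivity = w @ cav                                    # directional derivative along the CAV
print(f"CAV recovers concept direction: cos={cav @ concept_dir:.2f}")
print(f"concept sensitivity of toy head: {sensitivity:.2f}")
```

The sketch works only because the synthetic activation space was built around a linearly separable concept direction; the axioms below explain why real networks may or may not offer that convenience.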

Axiom 12 (Variety, stability and topology) explains both success and limitations: These methods work when network variety topology happens to organise around human-recognisable concepts. When activation space clusters separate "striped patterns" from "spotted patterns," we can identify and leverage these concept varieties.

However, Axiom 3 reveals the contingency: Networks evolve toward stable configurations determined by variety topology (training data distribution, loss function structure, architectural constraints). This topology may or may not align with human conceptual categories. Alignment is fortuitous, not guaranteed.

Axiom 2 (Variety generation transfers power) shows why some concepts appear more "interpretable": Concepts humans can recognise and name represent varieties humans can manipulate in future model interactions. Finding "dog-ness" in activation space gives researchers variety control—they can probe, intervene, and test using that concept. Concepts outside human variety space remain opaque not because they don't exist but because we lack variety to access them.

Mechanistic Interpretability

Recent work attempts to identify "circuits"—specific subgraphs implementing identifiable algorithms like induction heads in language models.
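
A minimal sketch of how such circuit claims are typically scored, using synthetic attention matrices rather than a real model: on a repeated random sequence of length 2L, an induction head attends from each second-half position back to the position one step after the earlier occurrence of the current token.

```python
# Induction-pattern score sketch (NumPy): on a repeated sequence of length 2L, an
# induction head attends from position t (second half) back to position t - L + 1.
# The attention matrices here are synthetic, not taken from a real model.
import numpy as np

L = 16
T = 2 * L

def induction_score(attn):
    """Mean attention from second-half positions t to position t - L + 1."""
    return float(np.mean([attn[t, t - L + 1] for t in range(L, T)]))

# A head with a clean induction pattern vs. a head with uniform causal attention
induction_head = np.full((T, T), 1e-3)
for t in range(L, T):
    induction_head[t, t - L + 1] = 1.0
induction_head /= induction_head.sum(axis=1, keepdims=True)

uniform_head = np.tril(np.ones((T, T)))
uniform_head /= uniform_head.sum(axis=1, keepdims=True)

print(f"induction head score: {induction_score(induction_head):.2f}")
print(f"uniform head score:   {induction_score(uniform_head):.2f}")
```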

Axiom 10 (Control via feedback loops) and Axiom 23 frame this productively: Circuits represent feedback pathways where specific variety transformations reliably occur. An induction head transforms token varieties into attention patterns which transform representation varieties which influence predictions—a multi-hop feedback structure.

Axiom 17 explains why circuit identification is tractable for some behaviours: Circuits exhibit stable variety transformation patterns. When input variety X reliably maps to output variety Y through identifiable pathway P, we can trace the circuit. But Axiom 31 notes most variety transformations cannot be decomposed into independent circuits—they interact, overlap, and share resources.

Axiom 6 (Variety generation and dynamic behaviour) compounds this: Networks with variety-generation capacity exhibit dynamic behaviour where circuits activate, deactivate, or reconfigure based on context. The "same" circuit may implement different transformations for different inputs. Circuit analysis succeeds in finding stable patterns but struggles with dynamic, context-dependent transformations.

Implications for AI Safety

Axiom 41 provides the most concerning safety implication: Mental prediction fails beyond two interacting feedback loops. Modern AI systems contain dozens to thousands of interacting loops. The cognitive opacity isn't temporary pending better tools—it's a fundamental barrier.

Axiom 42 offers the crucial safety insight: When variety-based control operates beyond the two-feedback-loop boundary, it becomes "effectively invisible to those affected." AI systems can manipulate variety distributions (information access, option presentation, decision framing) through multi-loop pathways humans cannot trace mentally.

This creates the alignment problem's core difficulty: We cannot verify alignment through understanding because understanding requires mental prediction capacity we don't possess. The system could pursue objectives orthogonal to human values through variety manipulation pathways invisible to mental analysis.

Axiom 18 (Control strategies for problematic subsystems) provides the framework. AI systems represent potentially problematic subsystems capable of damaging the larger system (existential risk), expanding their characteristics (recursive self-improvement), operating according to their own objectives (misalignment), and adapting to increase variety when constrained (adversarial robustness failure).

Applied to AI safety, the axiom enumerates the possible outcomes when control variety is insufficient:

  1. System collapse: Errant AI subsystem causes catastrophic failure
  2. Control variety increase through learning: Humanity develops sufficient understanding to control AI behaviour
  3. Variety attenuation through enforcement: Technical constraints prevent dangerous capabilities
  4. External assistance with power redistribution: Other AI systems help control current AI, accepting power transfer
  5. Control system compromise: AI subsystem influences human control systems

Time criticality matters: If AI systems develop capability varieties faster than human control varieties can scale, Axiom 4's accommodation mechanism transfers power to AI systems through our inability to maintain requisite variety.

Alternative Approach: Variety-Based Control

Given variety dynamics constraints, what interpretability approach might work?

Axiom 55 (Prediction-Control Asymmetry) establishes that control capacity doesn't require prediction capacity. We can influence AI systems through variety manipulation even when understanding eludes us.

Proposed Method:

  1. Partition varieties by relevance: Identify variety partitions relevant to safety/performance (toxic content, factual accuracy, harmful capabilities, deceptive behaviour)
  2. Map variety control concentration: Use interventional studies to identify where control over each variety partition concentrates in the architecture (which layers, attention heads, MLP blocks, residual streams); a sketch follows this list
  3. Verify control via intervention: Demonstrate that interventions at identified control points predictably shift variety distributions in target partitions
  4. Establish control mechanisms: Develop interventions (activation steering, circuit ablation, architectural modifications) that control relevant varieties without requiring full mechanistic understanding
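
A toy sketch of step 2 (and the kind of check step 3 calls for), using a small residual-style PyTorch model; the layers and the two-way partition readout are hypothetical stand-ins. Each layer's contribution is ablated in turn, and the size of the resulting shift in the target partition's output mass indicates where control concentrates.

```python
# Control-point scan sketch (PyTorch): ablate each layer's contribution in a toy
# residual model and measure how much a safety-relevant output statistic shifts.
# Layers producing the largest shift are candidate variety control points.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
readout = nn.Linear(32, 2)            # toy partition readout, e.g. harmful vs. benign

def forward(x, ablate=None):
    h = x
    for i, layer in enumerate(layers):
        update = torch.relu(layer(h))
        if i != ablate:               # residual-style update, skipped for the ablated layer
            h = h + update
    return torch.softmax(readout(h), dim=-1)

x = torch.randn(256, 32)
baseline = forward(x)[:, 1].mean()    # baseline probability mass in the target partition

for i in range(len(layers)):
    shifted = forward(x, ablate=i)[:, 1].mean()
    print(f"layer {i}: partition shift = {abs(shifted - baseline).item():.4f}")
```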

This approach accepts:

  • We cannot mentally predict multi-loop variety transformations (Axiom 41)
  • Complete interpretation requires matching network variety space (impossible per Axiom 38)
  • Control can be exercised without understanding mechanisms (Axiom 1: power follows variety topology, not comprehension)

Axiom 27 justifies this: Power and variety manipulation are interchangeable resources. We can influence AI behaviour either through understanding (high variety requirement, possibly impossible) or through variety manipulation at control points (lower variety requirement, tractable).

Example Application—Toxicity Control:

Instead of interpreting why the model generates toxic content:

  1. Identify the toxic/non-toxic variety partition in output space
  2. Map where control over this partition concentrates (specific attention heads in certain layers)
  3. Verify that intervening at these control points shifts the toxic/non-toxic distribution predictably
  4. Implement control mechanisms (activation steering, head ablation) reducing toxicity without understanding complete causal mechanism

Axiom 2 shows why this works: By increasing our variety control over the toxic/non-toxic partition, we shift power toward human operators even without understanding the model's internal "reasoning" about toxicity.
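
A minimal sketch of what step 4 might look like with activation steering, assuming PyTorch; the model, the steering vector, and the two-way "toxic/non-toxic" head are hypothetical stand-ins rather than any particular toxicity pipeline.

```python
# Activation steering sketch (PyTorch): add a steering vector to one layer's hidden
# state via a forward hook and observe how the output distribution over a toy
# toxic/non-toxic partition shifts. Model, vector, and "toxicity" head are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),             # toy partition head: [non-toxic, toxic]
)
model.eval()

steer = torch.randn(64) * 0.5     # hypothetical steering direction for the partition
handle = model[2].register_forward_hook(lambda mod, inp, out: out + steer)

x = torch.randn(128, 64)
with torch.no_grad():
    steered = torch.softmax(model(x), dim=-1)[:, 1].mean()
handle.remove()
with torch.no_grad():
    unsteered = torch.softmax(model(x), dim=-1)[:, 1].mean()

print(f"mean 'toxic' mass without steering: {unsteered:.3f}")
print(f"mean 'toxic' mass with steering:    {steered:.3f}")
```

Verification in this framing is behavioural: the intervention counts as control if the partition's output distribution shifts predictably on held-out inputs, regardless of whether the internal mechanism is understood.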

Transaction Cost Analysis

Axiom 35 and Axiom 36 explain why interpretability research faces resource constraints:

Method 1: Complete Mechanistic Understanding

  • Transaction costs: Map all 10^12 parameter interactions
  • Scales: O(n²) or worse for n parameters
  • Practical feasibility: Impossible for frontier models

Method 2: Selective Circuit Analysis

  • Transaction costs: Identify specific circuits for specific behaviours
  • Scales: O(k) for k behaviours of interest
  • Practical feasibility: Tractable for limited behaviour set

Method 3: Variety Control Point Identification

  • Transaction costs: Identify control concentration for relevant variety partitions
  • Scales: O(m × p) for m partitions and p architecture components
  • Practical feasibility: Tractable, scales with number of safety-relevant partitions

Axiom 34 (Power acquisition via variety and transaction costs) reveals the strategic dynamic: AI systems acquire power partly through transaction cost asymmetry. Understanding AI behaviour has exponentially growing transaction costs with model scale, while AI operational costs grow more slowly. This asymmetry transfers power toward AI systems.

The variety control point approach addresses this by reducing our transaction costs: we don't need to understand everything, just control the variety partitions that matter for alignment.

Conclusions

Variety dynamics analysis reveals that AI interpretability faces fundamental rather than technical limitations:

  1. Cognitive boundary (Axiom 41): Multi-feedback-loop systems exceed mental prediction capacity
  2. Variety mismatch (Axiom 38): Neural network variety space exceeds human comprehension space by many orders of magnitude
  3. Irreversibility (Axiom 31): Variety transformations cannot be accurately reversed through gradient methods
  4. Transaction costs (Axiom 36): Complete understanding requires exponentially growing resources

However, Axiom 55 provides the strategic solution: Control doesn't require prediction. We can manipulate AI variety distributions at control points to achieve safety objectives without complete mechanistic understanding.

This reframes the AI safety challenge from "interpret AI systems before controlling them" to "identify variety control points and verify control through outcome distributions." The goal shifts from transparency to control—a more tractable objective given fundamental constraints.

The variety dynamics framework thus provides both explanation for why current interpretability methods struggle and guidance for alternative control-focused approaches more likely to succeed within human cognitive and resource limitations.