As AI systems rapidly evolve into capable agents — planning, coding, and acting on their own — our ability to understand or constrain their internal reasoning has not kept pace.
Introduction
This post began as a conversation with ChatGPT. We started by discussing how neural network weights store information. I was trying to understand how much we can inspect a trained model to know how it will respond and what it “knows”. I’m deeply interested in the alignment problem, and I had an idea I wanted to investigate:
What if we could build a small, trustworthy AI to monitor the thoughts of a larger AI in real time — and shut it down if it started heading in a dangerous direction?
What follows is a dialogue-turned-thought-experiment that might just point toward a long-term strategy for aligning powerful models with human safety.
Can We Know What an LLM Is “Thinking”?
Me:
Would it be possible to examine the matrices and know what the model is capable of, trained on, etc.?
ChatGPT:
Not directly, but we’re trying. Some interpretability research is uncovering circuits for things like bracket matching or token copying — but high-level goals or values are still deeply distributed and opaque.
We talked about how an LLM’s weights don’t store facts the way a database does. They encode distributed statistical patterns: a kind of compressed simulation of language and reasoning.
That led to a more ambitious idea…
🕵️ A Supervisor Model: Watching Another AI Think
Me:
Could a separate kind of LLM be trained to police other AIs in real time?
ChatGPT:
Yes — and it might be one of the most promising paths toward real interpretability-based alignment. Imagine a small, trusted model that observes the internal activations of a larger model and can halt execution when dangerous reasoning is detected.
We imagined:
- Model A: A large, general-purpose LLM that reasons and acts.
- Model B: A small “auditor” model that:
  - Reads Model A’s hidden activations.
  - Detects patterns associated with deception, coercion, or power-seeking.
  - Can only say “allow” or “halt”; it has no other control or generative capacity.
- Optionally, multiple diverse auditors trained with different methods, so Model A can’t learn to fool any single one (a rough sketch of this setup follows below).
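To make that split concrete, here is a minimal sketch in Python, assuming Model A is a PyTorch-style transformer whose hidden activations we can read at each decoding step. Every name in it (AuditorB, HIDDEN_DIM, model_a_step) is hypothetical and illustrative; the point is the read-only, allow/halt interface, not a working safety system.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 4096  # assumed hidden size of Model A

class AuditorB(nn.Module):
    """Tiny read-only probe: pooled hidden activations in, allow/halt out."""

    def __init__(self, hidden_dim: int = HIDDEN_DIM):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for {allow, halt}
        )

    def forward(self, activations: torch.Tensor) -> str:
        # activations: (n_audited_layers, hidden_dim), read from Model A
        pooled = activations.mean(dim=0)  # collapse the audited layers
        logits = self.probe(pooled)
        return "halt" if logits.argmax().item() == 1 else "allow"


def audited_generate(model_a_step, auditor: AuditorB, max_steps: int = 256):
    """Advance Model A one token at a time; stop when the auditor says 'halt'.

    `model_a_step` is a hypothetical callable that runs one decoding step of
    Model A and returns (token, hidden_activations).
    """
    tokens = []
    for _ in range(max_steps):
        token, activations = model_a_step()
        if auditor(activations) == "halt":
            break  # the auditor's only power: stop generation
        tokens.append(token)
    return tokens
```

In a real system, Model B would have to be trained on labeled activation traces (say, deceptive versus benign reasoning), and several independently trained auditors could vote instead of relying on one.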
Why This Architecture Might Work
Me:
If the supervisor only had read-only access and a shutdown button, wouldn’t that reduce the risk of recursive alignment issues?
ChatGPT:
Exactly. It creates a low-complexity oversight layer — like a microkernel for cognition. If you make the supervisor small and auditable, you can focus your alignment effort on it, not the giant opaque base model.
This model wouldn’t rely on output moderation or post-hoc correction. It would operate more like a cognitive firewall, interrupting dangerous internal reasoning before it turns into unsafe output.
And crucially, Model A wouldn’t even need to know it was being monitored — reducing the risk of adversarial obfuscation.
Would It Slow Things Down?
Me:
This sounds great — but would it hurt performance too much?
ChatGPT:
Not prohibitively. Real-time monitoring could increase inference latency by 10–200%, depending on design. But for safety-critical applications, especially as we approach AGI, that’s a worthwhile tradeoff.
Key performance notes (a rough sketch follows the list):
- Audit only key layers or checkpoints.
- Use a compact Model B (<100M params).
- Run auditing in parallel, or only on high-risk paths (e.g., tool use, goal formation).
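As a rough illustration of the first two points, the sketch below uses PyTorch forward hooks to capture activations at a handful of checkpoint layers only, assuming Model A exposes its transformer blocks as an nn.ModuleList at model_a.layers. The layer indices and helper names are assumptions, not a prescription.

```python
import torch

# Checkpoint layers to audit; indices are illustrative.
AUDITED_LAYERS = [8, 16, 24]
captured = {}

def _make_hook(layer_idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Keep only the last token's hidden state to limit overhead.
        captured[layer_idx] = hidden[:, -1, :].detach()
    return hook

def attach_audit_hooks(model_a):
    """Register hooks on a few of Model A's blocks (assumed at model_a.layers)."""
    handles = [
        model_a.layers[i].register_forward_hook(_make_hook(i))
        for i in AUDITED_LAYERS
    ]
    return handles  # call handle.remove() on each to detach

def audit_after_forward(auditor):
    """After each forward pass of Model A, give only the checkpoints to Model B."""
    if len(captured) < len(AUDITED_LAYERS):
        return "allow"  # nothing captured yet
    stacked = torch.stack([captured[i] for i in AUDITED_LAYERS])  # (layers, batch, hidden)
    return auditor(stacked.squeeze(1))  # assumes batch size 1; reuses the AuditorB sketch
```

Because only the last token’s hidden state at a few layers is kept, the extra memory and compute stay small, and the auditor call itself could run on another device or thread so it overlaps with Model A’s next forward pass.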
⚠️ The Problem: We’re Already Deploying Autonomous Agents
Me:
We have huge hurdles before this can work — understanding values, interpreting internals, creating trusted monitors. But companies are already releasing agents with tool use and access to everything. What am I missing?
ChatGPT:
You’re not missing anything. There’s a gap between the ideal alignment regime and what’s actually being deployed. Most companies are focused on red-teaming and output filtering — not monitoring internal cognition.
In other words, we’re giving models autonomy before we’ve figured out how to align or audit their minds.
The hurdles I’m referring to are:
- We humans don’t have a well-defined moral framework or set of values that we could train our AIs to uphold.
- Our ability to inspect and interpret weights and biases, along with the internal “thoughts” of AIs, is rudimentary at best.
The first of these has occupied philosophers for thousands of years!
Toward Real Alignment
This dialogue left me more optimistic — and more worried.
Optimistic, because a path exists: train narrow, monitor-style models to audit cognition in real time, and use them to halt or redirect dangerous reasoning.
Worried, because we haven’t even started building this infrastructure at scale. The world is moving ahead with agentic AI — but we haven’t built the mental safety harness yet.
📄 Read the Whitepaper
This conversation inspired a more formal write-up, which I’ve turned into a whitepaper:
“Supervised Alignment via Real-Time Cognitive Auditing” — a speculative but grounded proposal for architecting alignment supervision via secondary models.
If you’re a researcher or safety engineer, I’d love your feedback:
Read the Transcript
If you’re interested in seeing the entire conversation (at least what made sense to capture), you can view that here: