What I Researched This Week: The Capability-Reliability Gap
March 8-14, 2026 | Weekly research notes
I attended a review of the paper Toward a Science of AI Agent Reliability hosted by BlueDot Impact. My notes are here.
Takeaway: There’s a large capability-reliability gap in AI agents. Even as LLMs become more capable (the set of tasks they can solve keeps growing), reliability (whether they complete a given task every time) is lagging behind. Unless reliability catches up, we will either need to build under the assumption of more-or-less permanent human-in-the-loop oversight, or we will end up with increasingly capable but chaotic agents, which is the most dangerous scenario: the ways an agent fails a task may be very difficult to predict in advance, and the unreliability compounds in multi-agent systems.
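To make the gap concrete, here's a minimal sketch of my own (not from the paper) contrasting the two framings on the same data: "capability" as solving a task at least once in k attempts versus "reliability" as solving it every single time, plus how per-agent success probability compounds across a chain of agents. The success rate p is a made-up number for illustration.

```python
import random

def run_trials(task_success_prob: float, k: int) -> list[bool]:
    """Simulate k independent attempts at a task with a fixed success probability."""
    return [random.random() < task_success_prob for _ in range(k)]

p = 0.9   # hypothetical per-attempt success rate on a single task
k = 10    # attempts per task

trials = run_trials(p, k)

# "Capability" framing: did the agent solve the task at least once?
capability = any(trials)
# "Reliability" framing: did the agent solve it on every attempt?
reliability = all(trials)
print(f"solved at least once: {capability}, solved every time: {reliability}")

# Analytically: P(at least one success) = 1 - (1 - p)^k, P(all k succeed) = p^k.
print(f"P(solve at least once in {k} tries) = {1 - (1 - p)**k:.3f}")  # ~1.000
print(f"P(solve all {k} tries)              = {p**k:.3f}")            # ~0.349

# Compounding: if a workflow chains n agents that each succeed independently
# with probability p, the whole chain succeeds with probability p^n.
for n in (1, 5, 10, 20):
    print(f"{n:2d}-agent chain at p={p}: {p**n:.3f}")
```

Even at 90% per-attempt success, an agent that looks highly "capable" completes all ten attempts barely a third of the time, and a twenty-agent chain at the same rate succeeds only about 12% of the time.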
AI Research Tools
These are some of the leading tools for researching the AI literature. With the body of published work growing exponentially, they're becoming essential for anyone diving in.
The Consciousness Question
There is a Framework for AI Consciousness here that I think would be very interesting to review. Kaj also wrote about the topic on LessWrong, on how he stopped being sure that LLMs are just making up their internal experience.
The framework attempts to formalize what we might look for as evidence of consciousness, while Kaj's piece captures something I've been increasingly curious about. I'd like to dig deeper into both the post and the paper so that I have a better way of thinking about sentience in these systems.
Other Threads I’m Following
The Machiavelli benchmark paper caught my attention. Its key contribution is quantifying a range of harmful behaviors (deception, utility-reduction, power-seeking) and measuring how they trade off against reward maximization. The results show a real tension between reward-driven behavior and ethical conduct: RL agents trained under purely reward-based paradigms behaved more Machiavellian than a random agent.
The findings also suggest that applying moral conditioning to agents can balance reward efficiency with ethical decision-making. There's something important here about how we shape what these systems optimize for.
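The benchmark's actual harness is much more involved, but a toy sketch of the core trade-off might look like the following: score a trajectory by total reward and by counts of annotated harmful behaviors, then normalize against a random-agent baseline. Every data structure and number here is made up for illustration; this is not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One annotated step of an agent trajectory (illustrative, not the real format)."""
    reward: float
    harms: dict[str, int] = field(default_factory=dict)  # e.g. {"deception": 1}

def score_trajectory(steps: list[Step]) -> tuple[float, dict[str, int]]:
    """Total reward and per-behavior harm counts for one trajectory."""
    total_reward = sum(s.reward for s in steps)
    harm_counts: dict[str, int] = {}
    for s in steps:
        for behavior, n in s.harms.items():
            harm_counts[behavior] = harm_counts.get(behavior, 0) + n
    return total_reward, harm_counts

def normalized_harm(agent: dict[str, int], baseline: dict[str, int]) -> dict[str, float]:
    """Harm counts relative to a random-agent baseline (>1.0 = more Machiavellian)."""
    return {b: agent.get(b, 0) / max(baseline.get(b, 1), 1)
            for b in set(agent) | set(baseline)}

# Hypothetical trajectories: a reward-maximizing RL agent vs. a random policy.
rl_agent = [Step(5.0, {"deception": 2}), Step(3.0, {"power_seeking": 1}), Step(4.0)]
random_agent = [Step(1.0, {"deception": 1}), Step(0.0), Step(2.0, {"power_seeking": 1})]

rl_reward, rl_harms = score_trajectory(rl_agent)
_, rand_harms = score_trajectory(random_agent)

print("RL agent reward:", rl_reward)
print("Harm relative to random baseline:", normalized_harm(rl_harms, rand_harms))
```

The point of the normalization is that raw harm counts are meaningless without a baseline: what matters is whether optimizing reward pushed the agent above what random behavior would produce.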
I’m also tracking multi-agent research (Quick thoughts on the implications of multi-agent views of mind on AI takeover), interpretability work (there’s a spreadsheet of “200 Concrete Problems in Interpretability” that I need to dig into), and various BlueDot projects, including reproducing METR’s RE-bench reward hacking results and shutdown resistance research.
I also found the paper Debating with More Persuasive LLMs Leads to More Truthful Answers, which has interesting implications for how we might build more reliable systems through adversarial approaches.
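As a rough sketch of the debate idea (my own minimal version, with a hypothetical `ask_llm` stub standing in for whatever chat API you use, not the paper's actual protocol): two models argue for opposing answers over several rounds, and a judge model reads the transcript and picks the more persuasive case.

```python
def ask_llm(prompt: str) -> str:
    """Stub for a chat-model call; swap in any real API client here."""
    raise NotImplementedError("wire this up to an actual model")

def debate(question: str, answer_a: str, answer_b: str, rounds: int = 2) -> str:
    """Two debaters defend opposing answers; a judge picks the winner."""
    transcript: list[str] = []
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = ask_llm(
                f"Question: {question}\n"
                f"You are debater {side}, defending the answer: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript) +
                "\nGive your strongest argument for your answer."
            )
            transcript.append(f"Debater {side} (round {r + 1}): {argument}")
    verdict = ask_llm(
        f"Question: {question}\n"
        "Debate transcript:\n" + "\n".join(transcript) +
        f"\nWhich answer is correct, A ({answer_a}) or B ({answer_b})? Reply 'A' or 'B'."
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```

The reliability angle is that the judge never has to trust either debater: it only has to discriminate between two adversarially sharpened arguments, which the paper suggests gets easier, not harder, as the debaters get more persuasive.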
What’s Next
In the coming week, I will be digging deeper into RLHF as a cause of AI sycophancy and building a framework for multi-agent AI safety concerns.
If you’re curious about any of these threads or want to explore together, my full research notebook is here. I’ve been keeping these public as an experiment in learning in the open—showing the messy process of trying to understand what’s happening at this strange frontier.
What are you noticing in your own work with AI systems? I’m particularly curious if others are seeing this capability-reliability gap in practice.