Your AI agents are failing because you rely on vibes. Discover the proven 5-step structural framework to fix broken workflows and unlock true scale today.
You have a solid thesis here. We have built a solution to these problems that is open source and portable, and it cuts token spend by creating deterministic workflows and APIs for pennies while saving that precious reasoning compute for runtime execution.
Check out what we are up to!
https://valkyrlabs.com
I usually really appreciate your perspective, but I think this article misses the mark, IMHO. I hope you’ll see why after reviewing our architecture here: https://instantaiguru.com/architecture#jsfe
"PLEASE DO NOT HALLUCINATE" in a system prompt is the AI equivalent of writing "PLEASE DO NOT CRASH" on the dashboard of a car. It's so perfectly absurd, and yet I guarantee half the people reading this have something like it in their codebase right now.
The deeper problem you've identified here is cultural, not just technical. The entire AI industry has an anthropomorphisation habit that is the root cause of most reliability failures. Teams "negotiate" with models. They write system prompts that read like performance reviews. They say things like "the model is being stubborn today" as though it has moods. And once you start thinking about the model as a person, you naturally reach for person-shaped solutions: ask it more clearly, give it better instructions, be more specific about what you want. All of which is vibes.
The latent versus deterministic distinction cuts through all of that because it forces you to stop asking "how do I get the model to do this correctly" and start asking "should the model be doing this at all." The proration example is perfect. The model doesn't need to be good at maths. The model needs to be good at knowing when to call a function that is good at maths. That's a completely different capability, and honestly it's one that current models are quite reliable at once you give them the tools.
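A minimal sketch of that split, assuming a hypothetical `prorate` tool that the model calls instead of doing the arithmetic itself. The function name, the day-based proration rule, and the half-up rounding are illustrative assumptions, not details taken from the article.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


def prorate(price: str, period_start: date, period_end: date, change_date: date) -> dict:
    """Charge for the days from change_date up to period_end (end exclusive).

    Hypothetical tool: the model supplies the arguments, this code does the maths.
    """
    if not (period_start <= change_date <= period_end):
        raise ValueError("change_date must fall inside the billing period")
    total_days = (period_end - period_start).days
    if total_days <= 0:
        raise ValueError("billing period must be at least one day long")
    remaining_days = (period_end - change_date).days
    # Decimal avoids binary floating-point drift on money values.
    amount = (Decimal(price) * remaining_days / total_days).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    # Structured result that goes back to the model as the tool output.
    return {
        "amount": str(amount),
        "remaining_days": remaining_days,
        "total_days": total_days,
    }


result = prorate("30.00", date(2024, 4, 1), date(2024, 5, 1), date(2024, 4, 16))
print(result)
```

The model's only job here is to recognise a proration question and fill in the four arguments; the number in the reply always comes from the function.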
I spent months watching a team debug a scheduling agent that kept getting timezone conversions wrong. They tried everything: chain-of-thought prompting, few-shot examples, explicit timezone tables in the system prompt. Nothing worked reliably. The fix took an afternoon: a 30-line function that handled every timezone conversion deterministically and returned the result to the model as structured data. Three months of prompt archaeology replaced by an afternoon of actual software engineering.
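Something in the spirit of that function might look like the sketch below. It is not the team's actual code: the name `convert_time` and the shape of the returned dictionary are assumptions, and it relies on Python's standard `zoneinfo` module having access to the IANA timezone database on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def convert_time(local_iso: str, from_tz: str, to_tz: str) -> dict:
    """Convert a wall-clock time between two IANA zones (hypothetical tool)."""
    naive = datetime.fromisoformat(local_iso)
    if naive.tzinfo is not None:
        raise ValueError("pass a wall-clock time without a UTC offset")
    # Attach the source zone; zoneinfo applies the DST rules for that date.
    source = naive.replace(tzinfo=ZoneInfo(from_tz))
    target = source.astimezone(ZoneInfo(to_tz))
    # Structured data for the model, so it never has to compute an offset.
    return {
        "source": source.isoformat(),
        "target": target.isoformat(),
        "utc": source.astimezone(timezone.utc).isoformat(),
    }


# Mid-March 2024: New York is already on daylight time, London is not yet.
print(convert_time("2024-03-15T09:00:00", "America/New_York", "Europe/London"))
```

The example date is chosen on purpose: for a few weeks each spring the New York to London gap is four hours rather than five, which is exactly the kind of case a timezone table pasted into a prompt tends to miss.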
The "generate-validate-fix" loop is where this framework really earns its keep, though. The principle underneath it is ancient in regulated industries. Banking has had separation of duties since before computers existed. The person who initiates a transaction cannot be the person who approves it. The system that generates a trade cannot be the system that validates it. The AI industry spent three years rediscovering a governance principle that auditors have enforced for decades, and this is one of the clearest explanations I've seen of how to actually implement it.
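As a rough sketch of how that separation can look in code, assuming an invoice-shaped output: `fake_model` below is a stand-in for a real model call, and `validate_invoice`, the field names, and the retry limit are all hypothetical. The point is only that the validator is ordinary deterministic code that shares nothing with the generator.

```python
from typing import Callable


def validate_invoice(draft: dict) -> list[str]:
    """Deterministic checks, written independently of whatever produced the draft."""
    errors = []
    items = draft.get("line_items", [])
    if not items:
        errors.append("line_items must not be empty")
    if draft.get("total_cents") != sum(items):
        errors.append(f"total_cents must equal {sum(items)}")
    return errors


def generate_validate_fix(
    generate: Callable[[list[str]], dict],
    validate: Callable[[dict], list[str]],
    max_attempts: int = 3,
) -> dict:
    """Ask the generator for a draft, validate it, feed errors back, repeat."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = generate(feedback)
        feedback = validate(draft)
        if not feedback:
            return draft
    # Bounded retries: fail loudly instead of shipping an unvalidated draft.
    raise RuntimeError(f"still invalid after {max_attempts} attempts: {feedback}")


def fake_model(feedback: list[str]) -> dict:
    """Stand-in for a model call: wrong on the first try, fixed after feedback."""
    total = 300 if feedback else 250
    return {"line_items": [100, 200], "total_cents": total}


print(generate_validate_fix(fake_model, validate_invoice))
```

The generator never gets to mark its own work: a draft only leaves the loop once the separate validator returns no errors.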
Brilliant piece. Bookmarking this for every team I work with that still has "be careful and accurate" somewhere in their prompts.
The latent versus deterministic distinction is the single most important mental model in AI engineering right now, and most teams are still building without it.
I watched a fintech company spend three months prompt-engineering their way around a currency conversion bug. They kept adding instructions like "always use the latest exchange rate" and "double check your calculations." The model kept hallucinating rates that were close enough to look plausible but wrong enough to cost real money. The fix was a 12-line function that pinged an API. It took twenty minutes to write. Three months of prompt negotiation replaced by twenty minutes of actual engineering.
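A sketch of what such a function could look like, not the company's actual code. The rate lookup is passed in as `fetch_rate` so that any rates API can sit behind it; `stub_rates` and its made-up rate stand in for the real HTTP call, and rounding to two decimal places is an assumption that would not suit every currency.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable


def convert_currency(
    amount: str,
    base: str,
    quote: str,
    fetch_rate: Callable[[str, str], Decimal],
) -> dict:
    """Convert amount from base to quote using a rate looked up at call time."""
    rate = fetch_rate(base, quote)
    if rate <= 0:
        raise ValueError(f"implausible rate for {base}/{quote}: {rate}")
    converted = (Decimal(amount) * rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    # The rate is returned too, so the answer can be audited later.
    return {"amount": str(converted), "currency": quote, "rate": str(rate)}


def stub_rates(base: str, quote: str) -> Decimal:
    """Placeholder for the real API call; the rate here is invented."""
    table = {("USD", "EUR"): Decimal("0.90")}
    return table[(base, quote)]


print(convert_currency("100.00", "USD", "EUR", stub_rates))
```

With this in place the model never supplies a rate at all, so there is nothing left for it to hallucinate.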
The context-as-L1-cache framing is the other insight that deserves wider adoption. Most developers treat the context window like a filing cabinet: stuff everything in and hope the model finds what it needs. Treating it as cache, where you control exactly what loads and when, changes the reliability profile of every downstream operation. The model stops drowning in irrelevant context and starts operating on precisely scoped data.
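One way to read "control exactly what loads and when" is an explicit map from task to the context it needs, plus a hard size budget. The sketch below is only an illustration of that reading: the task names, the `CONTEXT_NEEDS` table, and the character budget are all hypothetical.

```python
# Hypothetical map of task type to the only context blocks it is allowed to load.
CONTEXT_NEEDS = {
    "refund": ["refund_policy", "order"],
    "shipping": ["order", "carrier_status"],
}


def build_context(task: str, store: dict[str, str], max_chars: int = 2000) -> str:
    """Assemble a prompt context from the blocks this task needs, nothing else."""
    parts = []
    used = 0
    for key in CONTEXT_NEEDS[task]:
        block = f"## {key}\n{store[key]}"
        # Enforce the budget up front rather than letting the window overflow.
        if used + len(block) > max_chars:
            raise ValueError(f"context budget exceeded while loading {key!r}")
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


store = {
    "refund_policy": "Refunds are allowed within 30 days of delivery.",
    "order": "Order 1001: delivered 2024-04-02, total 42.00 USD.",
    "carrier_status": "Carrier scan history for order 1001.",
}
print(build_context("refund", store))
```

A refund question loads the policy and the order and nothing about carriers, which is the filing-cabinet-versus-cache difference in miniature.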
The "generate-validate-fix" loop connects to something the financial industry learned decades ago with trading systems. You never let the same system that generates the trade also validate the trade. Separation of generation and validation is an audit principle that has been standard in banking since at least the 1990s. The AI industry is slowly rediscovering principles that regulated industries figured out decades ago, and this framework is one of the clearest articulations of how to apply them.