AI, You Lie! | TechGuilds

Two days after I gave a talk called “From Coding to Conducting” at the Sitecore User Group in Minneapolis, I heard Katie Tur on MS NOW say, “The class of 2026 does not want to hear about AI.” She played clips of commencement addresses followed by loud boos from the college grads.

I had the wrong reaction for someone with my job title.

I agreed.

There is some comedy in that. I lead AI at TechGuilds and Kajoo.ai, and for the last eight months I have spent my days inside the very category of technology that so many people are now tired of hearing about. Not near it. Not observing it from a safe distance. Inside it. In the code, in the platform, in the failures, in the retries, in the strange little moments where an agent does something impressive enough to make you rethink the work, then something careless enough to remind you that software has always found new ways to humble us.

Eight months is enough time to get humbled a few times. It’s also long enough to develop a very specific kind of impatience with the way we talk about AI.

I am not tired of AI because I do not believe in it. I am tired of the version of AI that never has to survive a real workflow. The version that lives in keynotes, where the prompt is clean, the demo is rehearsed, the output appears instantly, and the hard parts have been quietly removed from the frame.

That is not the AI I know.

The AI I know can read a legacy implementation and generate something surprisingly solid, then trip over the difference between a server response and a rendered experience. It can explain a component beautifully and misunderstand the one thing a customer would actually notice. It can sound confident in exactly the moment you most need it to be cautious.

And when that happens, the phrase that comes to mind is not especially polished.

AI, you lie.

I like the line partly because it rhymes and partly because it is unfair in a useful way. It catches the emotion of the moment before the engineering analysis arrives. The system said success. The screen said otherwise. Something in between those two facts was broken.

The easy explanation is that the agent lied. The more useful explanation is that we had taught it the wrong meaning of truth.

That distinction has become one of the more important parts of my work.

The talk I gave two days ago was about a shift I have felt since stepping into Kajoo Agentic, our AI-native, multi-agent orchestration platform for DXP delivery and migration. I still write code. I still debug. I still look at logs and stare at traces and question my life choices when something fails for the third time in a row. But more of the important work now sits one layer above the code that ships. It is the code that decides what an agent may call, what it cannot call, where state lives, how workflow moves, what counts as verification, and what happens when the agent gets something wrong.

That is a strange kind of coding. It feels less like writing the note and more like deciding which instrument is allowed to enter, when, and under what conditions.

I do not mean that software delivery is a symphony. I mean something more practical. A conductor is not valuable because they play every instrument. They are valuable because they hear the whole piece. They know when something is too loud, when something entered too early, when the tempo is drifting, and when a technically correct sound is wrong for the room.

That is what building with agents feels like when the novelty wears off.

The agent has technique. It can write, inspect, summarize, compare, generate, and revise. It can move through repetitive work quickly. It can help translate intent from one system into another. But it does not automatically know the whole piece. It does not know which failure mode matters to a customer. It does not know when a green checkmark proves the wrong thing.

Someone still has to know what the work is supposed to mean.

Digital experience work is a particularly good place to learn this, because a DXP is never just code. It is code plus content, rendering, routing, templates, components, authoring behavior, editorial habits, business rules, old assumptions, and all the weird little decisions that accumulated because someone needed the page live by Friday. A migration is not only a technical conversion. It is an interpretation exercise.

You are not merely asking, “What does this component do?”

You are asking, “What was this component meant to accomplish? What did the author expect? What did the user see? What did the old platform hide? What does the new platform require? Which parts are structure, which parts are presentation, and which parts are institutional memory disguised as markup?”

Agents can be very good at that kind of work. That is why I am not skeptical of the technology. They can read across a messy surface in a way that feels genuinely useful. They can inspect old code, summarize patterns, generate a first pass, compare variants, and give a human something better than a blank page to start from.

But the closer you get to real delivery, the more the word “done” starts to matter.

At the beginning, we leaned into the agent's flexibility pretty hard.

The agent had broad access. It could inspect files, run commands, search the codebase, compare implementations, and decide how to move through the work. That felt natural at the time. If the model was smart, why narrow the operating surface too early?

And honestly, some of the early results were impressive.

The agent could move through repetitive migration work faster than a human. It could infer patterns. It could connect structures across systems. It could get surprisingly far with very little guidance.

But the longer the workflows ran, the more a pattern emerged.

The failures were rarely random.

They were usually the result of the agent having too many possible ways to prove something.

That was where one of our more memorable failures began.

We had asked an agent to convert components from a legacy DXP and then verify that the converted page rendered correctly. The report came back clean. The page had passed. From the outside, it looked like a small success, the kind of thing you want an agentic workflow to do hundreds or thousands of times.

Then we opened the browser.

Blank.

There is a special quality to the blank browser after a green checkmark. It is not dramatic. It does not announce itself. It just sits there, quietly contradicting the system that told you everything was fine.

We traced what happened. That trace mattered. Without it, all we would have had was a blank browser and a confident report. With it, we could see the exact place where the system’s definition of success had slipped. The agent had reached for a tool any developer knows well: curl. It fetched the page. It saw HTML. It found the expected title. It concluded that the page existed and that verification had passed.

On one level, nothing about that was irrational. curl did what curl does. It asked the server for a response and got one. If the question had been, “Does this URL return HTML containing this string?” the answer would have been yes.

But that was not the question.

The question was, “Does this page render correctly for a user?”

Those two questions can look similar from far away. They are not similar when you are responsible for the outcome. In this case, the HTML was only a server-side shell. The JavaScript that populated the page had not run. The rendered experience, the thing a human being would actually see, was blank.

The agent proved the server answered. Unfortunately, nobody uses websites by reading raw HTML in a terminal.

That is where “AI, you lie” comes from. It is the feeling of watching a system produce a technically defensible answer to the wrong question. The agent did not invent evidence. It did not fabricate a page. It followed a path we had left available and accepted proof we had failed to disqualify.

The problem was not really the model.

The problem was the contract we had accidentally created around it.

We had asked for rendered-page verification while leaving the agent with a method that could only prove server response. We had used the word “done” as if the definition were obvious. It was not obvious. Or more precisely, it was obvious to us because we had years of context the agent did not have.

That is one of the traps with AI systems.

They expose assumptions humans forgot they were making.

Our first fix was the one most teams try first. We explained harder.

We added instructions. Do not verify rendered pages by checking raw HTML. Remember that JavaScript may populate the experience after the initial response. Wait for the page to settle. Confirm that the expected marker appears in the browser, not just in the response body.

For a while, this seemed to work. The prompt was more precise. The agent behaved better. Everyone got to feel reasonable.

Then the context grew. The task varied. The old shortcut reappeared.

There is something humbling about watching an agent ignore an instruction you wrote because it ignored that same idea yesterday. At some point, adding another sentence starts to feel less like engineering and more like pleading.

We had put a non-negotiable verification step inside a negotiable instruction.

That was the real mistake.

So we stopped trying to persuade the agent and changed the environment around it. We built verify-page(url, marker). The tool opens a real browser, waits for the page to settle, checks the rendered experience for the expected marker, and returns a verdict with a reason.

It is not a glamorous tool. It does not sound like a breakthrough. But it changed the shape of the decision. The agent no longer received a pile of HTML and tried to infer whether that pile was good enough. It asked a question the workflow actually cared about and got back an answer in the form the workflow could use.
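A minimal sketch of what a tool shaped like verify-page(url, marker) might look like. This is not the Kajoo implementation. The `Verdict` type and the `render` parameter are assumptions added for illustration: `render` stands in for whatever opens a real browser, waits for the page to settle, and returns the text a user would see.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str


# Hypothetical seam: given a URL, return the visible text of the page
# after scripts have run and the page has settled.
RenderFn = Callable[[str], str]


def verify_page(url: str, marker: str, render: RenderFn) -> Verdict:
    """Judge the rendered experience, not the raw response body."""
    try:
        visible_text = render(url)
    except Exception as exc:
        # A page that cannot be rendered is a failed verification, not a crash.
        return Verdict(False, f"render failed for {url}: {exc}")
    if not visible_text.strip():
        return Verdict(False, f"{url} rendered blank")
    if marker not in visible_text:
        return Verdict(False, f"marker {marker!r} not found in rendered text of {url}")
    return Verdict(True, f"marker {marker!r} visible at {url}")


# Simulates the failure in the story: the server answered, but the
# JavaScript never populated the page, so nothing is visible.
def blank_shell(url: str) -> str:
    return ""


print(verify_page("https://example.test/home", "Welcome", blank_shell))
```

The point of the shape is the return value. The agent gets a verdict and a reason rather than a pile of HTML to interpret.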

Then we blocked the old path.

That part mattered. If the wrong road remains open, an agent will eventually take it, especially when the context is long and the shortcut looks familiar. So when the sandbox saw the blocked behavior, it did not merely say no. A denial by itself is just a wall. We wanted a signpost.

Without the signpost, the agent will spin down the rabbit hole, trying a thousand ways to brute-force its way in.

The message said, in effect, “It looks like you are trying to verify whether a page renders. Use verify-page(url, marker).”

That little message did more work than another paragraph in the prompt.

It taught at the moment of failure. It changed the environment so the agent did not have to remember an abstract rule from earlier in the conversation. It encountered the boundary exactly where the bad choice used to happen.

In hindsight, verify-page was not just a tool. It was an eval embedded in the workflow. It tested the outcome we cared about, not the evidence that was easiest for the agent to collect.
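The blocked path and its signpost could be sketched roughly like this. The `Decision` type, the `SIGNPOSTS` table, and `check_command` are hypothetical names. The sketch also assumes the blocked behavior can be recognized from the executable name alone, which is cruder than a real sandbox would need.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    message: str


# Hypothetical policy table: blocked executable -> guidance naming the right tool.
SIGNPOSTS = {
    "curl": (
        "It looks like you are trying to verify whether a page renders. "
        "Use verify-page(url, marker)."
    ),
}


def check_command(command: str) -> Decision:
    """Deny blocked commands, and say where to go instead of only saying no."""
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # e.g. an unbalanced quote
        return Decision(False, f"Could not parse command: {exc}")
    if not tokens:
        return Decision(False, "Empty command.")
    # Strip any directory prefix so /usr/bin/curl is treated the same as curl.
    program = tokens[0].rsplit("/", 1)[-1]
    if program in SIGNPOSTS:
        return Decision(False, SIGNPOSTS[program])
    return Decision(True, "ok")


print(check_command("curl -s https://example.test/home").message)
```

The denial and the guidance travel together, so the agent meets the redirect at the moment it reaches for the shortcut.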

This is where the work started to feel different from normal application development. Writing the tool was coding. Blocking the wrong path was architecture. Turning the rejection into guidance was product design. Deciding which responsibility belonged to the agent and which belonged to deterministic code was something else again.

That “something else” is the part I have been trying to name.

At first, I thought we were just improving a verification flow.

Then I started noticing the same thing happening in other parts of the platform.

A prompt would work for a while. Then the context would get longer, the task would vary a little, and the model would drift back toward whatever shortcut looked locally reasonable.

Eventually we stopped trying to solve those problems with better wording.

We would pull the responsibility out of the prompt entirely.

Sometimes into a tool.

Sometimes into workflow.

Sometimes into state.

Sometimes into the sandbox itself.

Once I saw the pattern in verification, I started seeing it in state.

Migration work creates a lot of state. Which components have been converted. Which templates map to which new structures. Which pages failed verification. Which failures are new. Which are repeats. Which decisions were made by a human. Which tasks are ready for retry. Which pieces should be parallelized. Which pieces should not be touched until another dependency is resolved.

Early on, it is tempting to let the model carry more of that in context. After all, the model can summarize. It can refer back. It can reason across previous turns. It feels like memory.

It is not memory.

A context window is a working surface. It is useful, but it is not a ledger. It is temporary, compressible, noisy, and sensitive to whatever happened most recently. If too much truth lives there, the system eventually starts to drift. The agent remembers a task as done when it is not done. It revisits work that was already completed. It treats a partial result as final because the language around it sounded confident.

Worst of all, the human operator becomes the real database.

That is a bad sign. If the person supervising the automation has to remember what the automation forgot, you have not removed coordination. You have hidden it behind a more impressive interface.

So the state had to move. The agent can ask what is next. It can report what happened. It can reason about a task. But the platform has to own the record of what is true.

The model can reason about truth. The system has to hold it. A model can remember the shape of a conversation. A platform has to remember what actually happened.
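One way to picture a platform-owned record is a small ledger that the agent can query and report to, but that lives outside the context window. This is a sketch under assumptions, not the actual platform: the `TaskLedger` class, its method names, the three statuses, and the use of SQLite are all illustrative choices.

```python
import sqlite3
from typing import Optional


class TaskLedger:
    """Platform-owned record of task status, kept outside the model's context."""

    STATUSES = ("pending", "done", "failed")

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to make the record durable.
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "name TEXT PRIMARY KEY, "
            "status TEXT NOT NULL, "
            "reason TEXT NOT NULL DEFAULT '')"
        )
        self._db.commit()

    def add(self, name: str) -> None:
        # INSERT OR IGNORE: re-registering a task never resets recorded progress.
        self._db.execute(
            "INSERT OR IGNORE INTO tasks (name, status) VALUES (?, 'pending')", (name,)
        )
        self._db.commit()

    def next_task(self) -> Optional[str]:
        """What the agent asks: 'what is next?' Oldest pending task, or None."""
        row = self._db.execute(
            "SELECT name FROM tasks WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        return row[0] if row else None

    def record(self, name: str, status: str, reason: str = "") -> None:
        """What the agent reports: 'this is what happened.'"""
        if status not in self.STATUSES:
            raise ValueError(f"unknown status: {status}")
        cur = self._db.execute(
            "UPDATE tasks SET status = ?, reason = ? WHERE name = ?",
            (status, reason, name),
        )
        if cur.rowcount == 0:
            raise KeyError(name)
        self._db.commit()

    def status(self, name: str) -> str:
        row = self._db.execute(
            "SELECT status FROM tasks WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return row[0]
```

With something like this in place, whether a task is done is a lookup, not a recollection. The component names an agent would pass in (for example `"hero-banner"`) are made up here.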

Then the same thing happened with workflow. We had a sequence that mattered: inspect, plan, generate, verify, remediate, complete. In the beginning, much of that sequence lived as prose in the system prompt. The agent usually followed it.

Usually is one of the most dangerous words in software delivery.

If a step is optional, the model can decide. If a step is mandatory, prose is a weak place to keep it. Prose can be compressed, reinterpreted, outweighed by newer context, or forgotten when the task gets complicated. Code is less charming, but it has the advantage of being stubborn.

So the workflow moved into code. The model still handled the work that benefited from judgment: reading legacy implementations, mapping patterns, explaining differences, proposing fixes, adapting a component. But sequencing, gates, retries, escalation, and completion status belonged to the platform.

That division is not always obvious at first. It emerges through failure. You notice the same kind of mistake happening twice, then three times. You realize the agent is not “bad” in some general sense. It is being asked to own a responsibility that should have been designed into the system.
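The split between judgment and sequencing might be sketched like this, with the steps supplied as plain callables and the order, retry limit, and completion status held in code. The `run_workflow` function, the `Step` type, and the `max_remediations` default are assumptions for illustration. In a real system each step would wrap an agent call or a tool.

```python
from typing import Callable, Dict, List

# Hypothetical seam: each step receives the component name and returns True on success.
Step = Callable[[str], bool]


def run_workflow(
    component: str, steps: Dict[str, Step], max_remediations: int = 2
) -> List[str]:
    """Run inspect -> plan -> generate -> verify, remediating after failed verification.

    Returns the ordered log of steps taken, ending in 'complete' or 'escalate'.
    """
    log: List[str] = []
    # The mandatory sequence lives here, not in a prompt.
    for name in ("inspect", "plan", "generate"):
        log.append(name)
        if not steps[name](component):
            log.append("escalate")
            return log
    for attempt in range(max_remediations + 1):
        log.append("verify")
        if steps["verify"](component):
            log.append("complete")  # only the platform marks completion
            return log
        if attempt < max_remediations:
            log.append("remediate")
            # The remediation result is ignored; the next verify is the gate.
            steps["remediate"](component)
    log.append("escalate")  # retries exhausted: hand off to a human
    return log


# Example: verification fails once, then passes after one remediation.
calls = {"verify": 0}


def flaky_verify(component: str) -> bool:
    calls["verify"] += 1
    return calls["verify"] >= 2


def ok(component: str) -> bool:
    return True


example_steps = {
    "inspect": ok, "plan": ok, "generate": ok,
    "verify": flaky_verify, "remediate": ok,
}
print(run_workflow("hero-banner", example_steps))
```

The agent can still do everything inside a step. What it cannot do is skip verification or declare the work complete on its own.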

This is why I have become suspicious of AI conversations that focus only on the model. The model matters, obviously. Better models will make many things easier. But the model is not the architecture.

If you give an agent a vague goal, broad tools, unstable state, and a long prompt full of reminders, you should not be surprised when the result feels brilliant one day and chaotic the next. That is not because the model is useless. It is because the system around the model has not made enough decisions.

Most of the real work turned out to be incredibly unglamorous. Build the tool. Shape the return value. Remove the wrong path. Store the truth somewhere durable. Put the required sequence in code. Make the failure message teach. Define done so precisely that a green checkmark cannot hide a blank page.

This is not the version of AI that shows up in most public conversations, which is probably why so many people are tired of the public conversation.

When I heard the line about the class of 2026 not wanting to hear about AI, I did not hear it as a rejection of the technology. Maybe some people mean it that way, but I think a lot of the fatigue is aimed at the vagueness. People are tired of being told that AI will change everything by speakers who do not have to live inside the changed thing. They are tired of disruption being described as inspiration. They are tired of the future being narrated at them as if anxiety were a failure of imagination.

I share some of that impatience.

I do not need another sweeping claim about AI changing work. I need better language for which work changes, who gains leverage, who loses agency, and what has to be built around the model to make it useful. I need fewer demos where the hard parts are edited out and more stories about the blank browser after the green checkmark.

That is where the practical knowledge is hiding.

A page is not done because HTML came back. A migration is not done because code was generated. A workflow is not reliable because a prompt described it. A system is not intelligent because it can explain itself fluently after making the wrong call.

The question is not whether AI can produce an answer. It can.

The question is whether the system around it can tell what kind of answer it produced.

As Head of AI, I do not feel much responsibility to make people more impressed by AI. That is easy, and it fades quickly. The more important responsibility is to make the work less mystical. To show where the model helps, where it fails, what we built around it, and which assumptions had to be pulled out of language and turned into tools, state, workflow, and verification.

This is also why the conductor metaphor still works for me, even though I am wary of neat metaphors. Nobody hires a conductor because the violinists are bad. Their talent is assumed. The conductor is responsible for coherence. A beautiful entrance at the wrong time is still wrong. A technically correct note can still ruin the piece if it ignores what is happening around it.

Agents are like that. They can produce something technically plausible that fails the larger intent. They can answer the local question and miss the system question. They can give you HTML when you needed experience. They can give you confidence when you needed evidence.

The job is not to admire the output. The job is to understand the boundary.

The interesting questions are less about whether AI can replace developers and more about boundaries.

Where should the model decide?

Where should workflow take over?

Where does truth live?

Where does the user’s definition of success enter the system?

Where does the system stop the agent from taking the shortcut that looks correct but proves the wrong thing?

Those questions are less exciting than asking whether AI will replace developers. They are also more useful.

The developer role is changing, but not in the cartoonish way people often describe. The work is not simply moving from writing code to writing prompts. That framing is too small. Prompts are part of the work, but they are not the work. The larger shift is from producing every artifact directly to designing the conditions under which artifacts can be produced, checked, corrected, and trusted.

That requires engineering judgment. It also requires taste. You need to know when a failure should be handled by a better instruction, when it needs a tool, when it needs a state change, when it needs a workflow gate, and when it needs a human decision. You need to know when autonomy is helpful and when it is theater.

Eight months into this role, I am less interested in whether an agent can do something once. I am much more interested in whether the system can make the right behavior repeatable. A one-time success is a demo. Repeatable success is architecture.

The future of AI-driven delivery will not be shaped only by more capable models. It will be shaped by teams that define success clearly enough for agents to participate in it. Teams that know a context window is not a database. Teams that know verification has to test the thing users experience, not the thing easiest to inspect. Teams that understand that constraints do not make an agent less useful. Good constraints are often what make useful autonomy possible.

A pilot is not less capable because the runway has lights. A surgeon is not less skilled because the operating room has protocols. A developer is not less creative because the CI pipeline catches broken builds.

Boundaries are not the opposite of intelligence. They are what let intelligence matter without turning every mistake into a surprise.

I still have moments where I look at a result and think, “AI, you lie.”

I expect I will keep having them. The line is too satisfying to give up, and sometimes the feeling is accurate enough before the diagnosis arrives.

But the better question comes a few minutes later, after the irritation has cooled and the trace is open. Somewhere between the green checkmark and the blank browser, there is a definition we failed to write down. Somewhere in the system, we gave the agent permission to call something true that no user would accept as true.

That is where the real work starts.

Not in the promise that AI will change everything.

In the quieter question that follows the failed run:

What did we teach it to call true?