
When AI tricks you with swag

sistnt

On the particular danger of trusting a process that has been working

Everyone says to check the responses you get from AI: trust, but verify. Very few people are selfless enough to explain why. But when you goof up as hard as I did last week, the story is worth sharing.

I’ve been building sistnt’s apps with Anthropic’s Opus 4.5 and now 4.6 as my primary development partner. The workflow has been genuinely transformative. I prompt Claude with a feature to implement, review the proposed solutions, build and test, approve and ship. Work that would have taken me days of solo research gets done in hours. The results have been astonishing and reliable enough that I’ve started to trust the process the way you trust a dependable colleague.

That trust is exactly what got me into trouble - along with a healthy contribution of carelessness.

How I convinced myself it would all be fine

The track record of strong successes is what gave me the confidence to press my luck on my last major Enumerator release. I knew there were rough edges in the application, but I had been using the new features for a few days and had run all the regression tests. Based on everything I’d accomplished with my workflow, I was sure that nothing would go wrong, so I shipped the release.

I should have known better. I did know better.

But the reliability of the workflow had quietly recalibrated my risk tolerance in a way I didn’t even notice.

The two critical bugs that shipped, in optional cloud syncing and location tracking, couldn’t be fixed remotely. I had to fix them both, and fast. I knew that at least one user had lost all of their data to the bugs I’d introduced. Congrats. Trust lost forever.

Claude and I spent a day redesigning the geo-fencing logic. Claude built out a new solution, tested it, reasoned through the edge cases. Faster than I could have managed solo. But it didn’t work.

So we spent another half-day re-evaluating. And then Claude recommended we implement an API from iOS 5. We are currently on iOS 26. iOS 5 is from 2011.

The recommendation came with a full rationale: confident language, supporting detail, a clear explanation of why this approach made sense. It matter-of-factly asserted that the 15-year-old API had not been deprecated in iOS 26, even though the documentation it cited explicitly stated that it had been.

It was a facepalm moment equalled only by the one a couple of days earlier, when I decided to ship my poorly tested major release.

I only caught the judgment error because iOS 5 is so laughably out of support.
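In Swift, this particular class of mistake is one the toolchain can help catch, because API availability is part of the language. Below is a minimal sketch of the pattern, not sistnt’s actual code: `CLMonitor`, `CLCircularRegion`, and `startMonitoring(for:)` are real Core Location APIs, but the surrounding helper function and its names are illustrative assumptions.

```swift
import CoreLocation

// Illustrative geofencing helper (hypothetical; not the app's real code).
// The #available check lets the compiler, not the AI's prose, decide
// which API path is legitimate on a given OS version.
func startGeofence(center: CLLocationCoordinate2D,
                   radius: CLLocationDistance,
                   identifier: String,
                   manager: CLLocationManager) {
    if #available(iOS 17.0, *) {
        // Modern path: CLMonitor with a circular geographic condition.
        Task {
            let monitor = await CLMonitor("GeofenceMonitor")
            let condition = CLMonitor.CircularGeographicCondition(center: center,
                                                                  radius: radius)
            await monitor.add(condition, identifier: identifier)
        }
    } else {
        // Legacy path: region monitoring via CLLocationManager.
        let region = CLCircularRegion(center: center,
                                      radius: radius,
                                      identifier: identifier)
        region.notifyOnEntry = true
        region.notifyOnExit = true
        manager.startMonitoring(for: region)
    }
}
```

The specific APIs matter less than the habit: when a model claims an API is still supported, an `#available` guard and the compiler’s deprecation warnings turn that claim into something the build itself verifies.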

The Confidence Problem

The newest AI models expend real effort justifying their recommendations. They research, evaluate, and present conclusions with the kind of certainty that reads like informed judgment. Because the reasoning sounds solid (detailed, structured, coherent), it’s easy to assume that it is solid.

But confidence and accuracy are not the same thing.

What’s actually happening is statistical: the model produces outputs that are highly probable given the inputs. That process generates genuinely useful results most of the time. It can also produce a confident recommendation to use a deprecated API from 15 years ago, complete with documentation that contradicts the claim. AI mistakes can be dressed in exactly the same language as AI successes — and without domain knowledge and careful attention, they’re hard to tell apart.

What I’m taking from this

The bugs are fixed now; three days of diligent, genuinely constructive collaboration got us there. But the lesson isn’t really about debugging, or even about poor judgment (the AI’s and mine).

Coding with AI is not a “set it and forget it” situation. It requires sustained attention, domain knowledge, and a willingness to push back when something sounds wrong even when it sounds authoritative. The human still has to drive.

And if that’s true for software development — a domain with clear right answers and testable outcomes — it should give us serious pause about where else people are handing the wheel to AI without realizing it. Legal questions. Medical decisions. Personal relationship advice. These are domains where a confidently wrong answer doesn’t just waste a few hours. It can cause real harm. And these are domains where most people have far less ability to catch the error than a developer reviewing their own code.

AI is a remarkable tool. But even remarkable tools still require the person holding them to pay attention to how they are used.