Engineering · Apr 8, 2026

How we cluster AI agent failures without a single label

Maya Chen

When we started obsrv, we tried the obvious thing: build a taxonomy of agent failure modes — tool misuse, hallucination, goal drift, unsupported request — and ask customers to label their traces against it. It worked beautifully on a deck. It collapsed in week one of every real deployment.

The reason: your agent fails in ways that look nothing like the next team's agent. A taxonomy that fits a customer support copilot does not fit a robotics planner. By the time we'd agreed on a category list, the agent had shipped a new prompt and invented three new failure modes.

The pipeline we ship today

obsrv runs an embed → cluster → label pipeline, continuously, against every project's trace stream:

Embed. A worker extracts a structured text representation from each trace — user message, retrieved context, tool calls, final reply, error spans — and embeds it into a 1536-dimension vector. We store the vector in pgvector, alongside the trace ID and project.
Cluster. A small, pure-Go k-means runs on a moving window of recent traces with cosine distance. We use k-means specifically because it's deterministic given a fixed initialization, fast, and easy to operate at the volumes obsrv sees per project. We pick k via silhouette score.
Label. Each cluster is sent to Claude with a handful of representative behaviors and asked for a short, human-readable name. Labels are regenerated when membership shifts beyond a threshold.
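To make the cluster step concrete, here is a minimal sketch of k-means over cosine distance in Go. It is not obsrv's production code. The function names (unit, cosineDistance, KMeans), the float64 vectors, and the choice to seed centroids from the first k vectors are all assumptions made for illustration; the seeding choice is what makes this version repeatable for a given input order.

```go
package main

import "math"

// unit returns a copy of v scaled to length 1, so that the dot product of two
// results equals their cosine similarity. A zero vector stays zero.
func unit(v []float64) []float64 {
	out := make([]float64, len(v))
	var sq float64
	for _, x := range v {
		sq += x * x
	}
	if sq == 0 {
		return out
	}
	norm := math.Sqrt(sq)
	for i, x := range v {
		out[i] = x / norm
	}
	return out
}

// cosineDistance expects unit-length inputs and returns 1 - cos(a, b).
func cosineDistance(a, b []float64) float64 {
	var dot float64
	for i := range a {
		dot += a[i] * b[i]
	}
	return 1 - dot
}

// KMeans assigns each vector to one of k clusters by cosine distance.
// Assumption: centroids are seeded from the first k vectors, so the result is
// repeatable for a given input order. Requires len(vectors) >= k >= 1 and
// maxIter >= 1.
func KMeans(vectors [][]float64, k, maxIter int) []int {
	points := make([][]float64, len(vectors))
	for i, v := range vectors {
		points[i] = unit(v)
	}
	centroids := make([][]float64, k)
	for c := range centroids {
		centroids[c] = points[c]
	}
	assign := make([]int, len(points))
	for i := range assign {
		assign[i] = -1 // not yet assigned
	}
	for iter := 0; iter < maxIter; iter++ {
		changed := false
		for i, p := range points {
			best, bestDist := 0, math.Inf(1)
			for c, cen := range centroids {
				if d := cosineDistance(p, cen); d < bestDist {
					best, bestDist = c, d
				}
			}
			if assign[i] != best {
				assign[i], changed = best, true
			}
		}
		if !changed {
			break // converged: no point switched cluster
		}
		// Move each centroid to the normalized mean direction of its members.
		sums := make([][]float64, k)
		counts := make([]int, k)
		for c := range sums {
			sums[c] = make([]float64, len(points[0]))
		}
		for i, p := range points {
			counts[assign[i]]++
			for d, x := range p {
				sums[assign[i]][d] += x
			}
		}
		for c := range centroids {
			if counts[c] > 0 { // an empty cluster keeps its previous centroid
				centroids[c] = unit(sums[c])
			}
		}
	}
	return assign
}
```

Calling KMeans on the four 2-D vectors (1, 0), (0, 1), (0.9, 0.1), (0.1, 0.9) with k = 2 groups the first and third together and the second and fourth together. Choosing k by silhouette score would mean running this for several values of k and keeping the best-scoring one; that loop is left out here to keep the sketch short.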

Why this matters

Two things happen when failure modes surface from real behavior instead of a template:

You catch issues you didn't know to look for.
The names match how your team actually talks about them.

That second one is more important than it sounds. The whole point of a label is to be useful in a Slack channel at 11pm. "Tool selection failure (cluster 3)" is useless. "Cancels the wrong order when the user mentions a previous one" is something an on-call engineer can act on.

What didn't work

We tried HDBSCAN first. It produced beautiful, unstable clusters that shifted under the slightest input drift. We tried letting customers pre-define cluster prompts; nobody used them. We tried community detection on a co-failure graph; it was elegant and none of the clusters were actionable.

Boring k-means with auto-labels has held up.

What's next

Right now obsrv clusters once per project on a schedule. We're moving to incremental clustering so emerging modes show up minutes after they appear, not on the next batch run. We're also testing learned embeddings tuned specifically for agent traces — early results suggest about a 30% improvement in cluster purity over off-the-shelf models.

If you want to play with cluster discovery on your own traces, spin up a project. It's on every plan.

Ship the agent. obsrv will record everything.