How we cluster AI agent failures without a single label
When we started obsrv, we tried the obvious thing: build a taxonomy of agent failure modes — tool misuse, hallucination, goal drift, unsupported request — and ask customers to label their traces against it. It worked beautifully on a deck. It collapsed in week one of every real deployment.
The reason: your agent fails in ways that look nothing like the next team's agent. A taxonomy that fits a customer support copilot does not fit a robotics planner. By the time we'd agreed on a category list, the agent had shipped a new prompt and invented three new failure modes.
The pipeline we ship today
obsrv runs an embed → cluster → label pipeline, continuously, against every project's trace stream:
- Embed. A worker extracts a structured text representation from each trace — user message, retrieved context, tool calls, final reply, error spans — and embeds it into a 1536-dimension vector. We store the vector in pgvector, alongside the trace ID and project.
- Cluster. A small, pure-Go k-means runs on a moving window of recent traces with cosine distance. We use k-means specifically because it's deterministic given a fixed initialization, fast, and easy to operate at the volumes obsrv sees per project. We pick k via silhouette score.
- Label. Each cluster is sent to Claude with a handful of representative behaviors and asked for a short, human-readable name. Labels are regenerated when membership shifts beyond a threshold.
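To make the cluster step concrete, here is a minimal sketch in Go of k-means under cosine distance, with k chosen by the highest mean silhouette score. It is an illustration under stated assumptions, not obsrv's production code: the function names (cosDist, kmeans, silhouette, bestK), the evenly spaced seeding, the iteration cap of 50, and the toy 3-dimensional vectors standing in for 1536-dimension embeddings are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// cosDist returns 1 - cos(a, b); a zero vector is at distance 1 from anything.
func cosDist(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot, na, nb = dot+a[i]*b[i], na+a[i]*a[i], nb+b[i]*b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/math.Sqrt(na*nb)
}

// kmeans clusters pts into k groups under cosine distance. Seeds are evenly
// spaced input points (an assumption of this sketch), so runs are repeatable.
func kmeans(pts [][]float64, k, maxIter int) []int {
	cents := make([][]float64, k)
	for c := range cents {
		cents[c] = pts[c*len(pts)/k]
	}
	assign := make([]int, len(pts))
	for iter, changed := 0, true; changed && iter < maxIter; iter++ {
		changed = false
		for i, p := range pts {
			best := 0
			for c := range cents {
				if cosDist(p, cents[c]) < cosDist(p, cents[best]) {
					best = c
				}
			}
			if best != assign[i] {
				assign[i], changed = best, true
			}
		}
		// Centroid = sum of members (cosine ignores scale); empty clusters get zeros.
		sums := make([][]float64, k)
		for c := range sums {
			sums[c] = make([]float64, len(pts[0]))
		}
		for i, p := range pts {
			for d, x := range p {
				sums[assign[i]][d] += x
			}
		}
		cents = sums
	}
	return assign
}

// silhouette returns the mean silhouette coefficient of a clustering.
func silhouette(pts [][]float64, assign []int, k int) float64 {
	var total float64
	for i, p := range pts {
		sums, cnt := make([]float64, k), make([]int, k) // per-cluster distance totals and sizes, excluding p
		for j, q := range pts {
			if j != i {
				sums[assign[j]] += cosDist(p, q)
				cnt[assign[j]]++
			}
		}
		own := assign[i]
		a, b := sums[own]/float64(cnt[own]), math.Inf(1)
		for c := range sums {
			if c != own && cnt[c] > 0 {
				b = math.Min(b, sums[c]/float64(cnt[c]))
			}
		}
		// Singletons and degenerate cases contribute 0 by convention.
		if m := math.Max(a, b); cnt[own] > 0 && m > 0 && !math.IsInf(m, 1) {
			total += (b - a) / m
		}
	}
	return total / float64(len(pts))
}

// bestK returns the k in [2, maxK] with the highest silhouette score.
func bestK(pts [][]float64, maxK int) (int, []int) {
	best, bestScore, bestAssign := 0, math.Inf(-1), []int(nil)
	for k := 2; k <= maxK && k < len(pts); k++ {
		assign := kmeans(pts, k, 50)
		if s := silhouette(pts, assign, k); s > bestScore {
			best, bestScore, bestAssign = k, s, assign
		}
	}
	return best, bestAssign
}

func main() {
	// Toy 3-D stand-ins for trace embeddings: three tight pairs.
	pts := [][]float64{{1, 0.05, 0}, {1, 0, 0.05}, {0.05, 1, 0}, {0, 1, 0.05}, {0.05, 0, 1}, {0, 0.05, 1}}
	fmt.Println(bestK(pts, 4))
}
```

On this toy input the silhouette score peaks at k = 3, one cluster per pair. The fixed seeding is what makes the sketch repeatable from run to run; with random seeding you would need to pin the seed to get the same property.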
Why this matters
Two things happen when failure modes surface from real behavior instead of a template:
- You catch issues you didn't know to look for.
- The names match how your team actually talks about them.
That second one is more important than it sounds. The whole point of a label is to be useful in a Slack channel at 11pm. "Tool selection failure (cluster 3)" is useless. "Cancels the wrong order when the user mentions a previous one" is something an on-call engineer can act on.
What didn't work
We tried HDBSCAN first. It produced beautiful, unstable clusters that shifted under the slightest input drift. We tried letting customers pre-define cluster prompts; nobody used them. We tried community detection on a co-failure graph; it was elegant and none of the clusters were actionable.
Boring k-means with auto-labels has held up.
What's next
Right now obsrv clusters each project in scheduled batch runs. We're moving to incremental clustering so emerging modes show up minutes after they appear, not on the next batch run. We're also testing learned embeddings tuned specifically for agent traces; early results suggest about a 30% improvement in cluster purity over off-the-shelf models.
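One way to picture incremental clustering is a sequential variant of k-means, sketched below in Go. Each new embedding joins the nearest existing cluster, or opens a new cluster when nothing is close enough, which is how an emerging mode could surface without waiting for a batch run. This is only an illustration of the general idea, not the design we are building: the Online and Cluster types, the Add method, and the Threshold value are hypothetical, and the sketch assumes embeddings of roughly unit length.

```go
package main

import (
	"fmt"
	"math"
)

// Cluster keeps a running sum of its members; the centroid is the sum's direction.
type Cluster struct {
	Sum   []float64
	Count int
}

// Online assigns embeddings to clusters one at a time.
type Online struct {
	Clusters  []*Cluster
	Threshold float64 // hypothetical: max cosine distance for joining an existing cluster
}

// cosineDistance returns 1 - cos(a, b), or 1 if either vector is all zeros.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 1
	}
	return 1 - dot/math.Sqrt(na*nb)
}

// Add places v in the nearest cluster, or opens a new cluster when no
// centroid is within Threshold. It returns the cluster index and whether
// the cluster was newly created.
func (o *Online) Add(v []float64) (int, bool) {
	best, bestDist := -1, math.Inf(1)
	for i, c := range o.Clusters {
		if d := cosineDistance(v, c.Sum); d < bestDist {
			best, bestDist = i, d
		}
	}
	if best == -1 || bestDist > o.Threshold {
		// Copy v so later updates to the sum do not alias the caller's slice.
		o.Clusters = append(o.Clusters, &Cluster{Sum: append([]float64(nil), v...), Count: 1})
		return len(o.Clusters) - 1, true
	}
	c := o.Clusters[best]
	for i, x := range v {
		c.Sum[i] += x
	}
	c.Count++
	return best, false
}

func main() {
	o := &Online{Threshold: 0.3}
	// Toy 2-D stand-ins for trace embeddings arriving one at a time.
	for _, v := range [][]float64{{1, 0}, {0.9, 0.1}, {0, 1}, {0.1, 0.9}} {
		idx, isNew := o.Add(v)
		fmt.Println("cluster", idx, "new:", isNew)
	}
}
```

In this toy stream the third vector is far from the only existing centroid, so it opens a second cluster. In a real system that event would be the signal to request a label for the new mode.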
If you want to play with cluster discovery on your own traces, spin up a project. It's on every plan.