Engineering · Apr 8, 2026

How we cluster AI agent failures without a single label

Maya Chen

When we started obsrv, we tried the obvious thing: build a taxonomy of agent failure modes — tool misuse, hallucination, goal drift, unsupported request — and ask customers to label their traces against it. It worked beautifully on a deck. It collapsed in week one of every real deployment.

The reason: your agent fails in ways that look nothing like the next team's agent. A taxonomy that fits a customer support copilot does not fit a robotics planner. By the time we'd agreed on a category list, the agent had shipped a new prompt and invented three new failure modes.

The pipeline we ship today

obsrv runs an embed → cluster → label pipeline, continuously, against every project's trace stream:

  1. Embed. A worker extracts a structured text representation from each trace — user message, retrieved context, tool calls, final reply, error spans — and embeds it into a 1536-dimension vector. We store the vector in pgvector, alongside the trace ID and project.
  2. Cluster. A small, pure-Go k-means runs on a moving window of recent traces with cosine distance. We use k-means specifically because it's deterministic once the initialization is fixed, fast, and easy to operate at the volumes obsrv sees per project. We pick k via silhouette score.
  3. Label. Each cluster is sent to Claude with a handful of representative behaviors and asked for a short, human-readable name. Labels are regenerated when membership shifts beyond a threshold.
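
To make the Cluster step concrete, here is a minimal sketch of k-means under cosine distance with k chosen by silhouette score. It is written in Python for brevity and is not the production pure-Go code. The function names (`normalize`, `farthest_first`, `kmeans`, `silhouette`, `pick_k`) are hypothetical, and the seedless farthest-first initialization is an assumption made here to keep runs reproducible; the post does not say how centroids are initialized.

```python
import math


def normalize(v):
    """Scale a vector to unit length so cosine distance is 1 - dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine_distance(a, b):
    # Both inputs are assumed to be unit vectors already.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def farthest_first(points, k):
    """Assumed initialization: start at points[0], then repeatedly add the
    point that is farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(cosine_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iters=50):
    """Return one cluster index per point. Points must be unit vectors."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient under cosine distance (O(n^2) pairwise)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(cosine_distance(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(cosine_distance(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def pick_k(vectors, candidates):
    """Cluster once per candidate k and keep the highest silhouette score."""
    points = [normalize(v) for v in vectors]
    best = None
    for k in candidates:
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]
```

Calling `pick_k(vectors, range(2, 6))` on a window of embeddings returns the chosen k and one cluster index per trace. The pairwise silhouette here is quadratic in the window size, so at real volumes it would likely need sampling or a centroid-based approximation.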

Why this matters

Two things happen when failure modes surface from real behavior instead of a template:

  • You catch issues you didn't know to look for.
  • The names match how your team actually talks about them.

That second one is more important than it sounds. The whole point of a label is to be useful in a Slack channel at 11pm. "Tool selection failure (cluster 3)" is useless. "Cancels the wrong order when the user mentions a previous one" is something an on-call engineer can act on.

What didn't work

We tried HDBSCAN first. It produced beautiful, unstable clusters that shifted under the slightest input drift. We tried letting customers pre-define cluster prompts; nobody used them. We tried community detection on a co-failure graph; it was elegant and none of the clusters were actionable.

Boring k-means with auto-labels has held up.

What's next

Right now obsrv clusters once per project on a schedule. We're moving to incremental clustering so emerging modes show up minutes after they appear, not on the next batch run. We're also testing learned embeddings tuned specifically for agent traces — early results suggest about a 30% improvement in cluster purity over off-the-shelf models.
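
For readers unfamiliar with the metric, cluster purity can be sketched as follows. This assumes a small hand-labeled evaluation sample to score against, which is an assumption of this example: the post does not describe how purity is measured. The function name `cluster_purity` and the label strings in the usage note are made up for illustration.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of items whose cluster's majority label matches their own.

    cluster_ids: one cluster index per item (output of the clustering step).
    true_labels: one hand-assigned label per item (evaluation set only).
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty sample")

    # Count how often each true label appears inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1

    # Each cluster is credited with its most common label.
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)
```

For example, with clusters `[0, 0, 0, 1, 1, 1]` and labels `["wrong_order", "wrong_order", "timeout", "timeout", "timeout", "timeout"]`, the first cluster contributes 2 majority items and the second contributes 3, for a purity of 5/6. Purity rises trivially as k grows, so it is only a fair comparison between embedding models when k is held fixed.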

If you want to play with cluster discovery on your own traces, spin up a project. It's on every plan.
