Computer-use agents — the ones that drive a real browser, a desktop, or a phone — change the trace problem in two ways. First, the most useful evidence is visual: every step is a screenshot plus an action. Second, the volume goes up: an hour-long browser session emits hundreds of frames.
Tracing a computer-use agent the way you'd trace an LLM call is a recipe for a 200 MB JSON blob nobody can open.
What we capture
For every browser/desktop step, obsrv's SDK captures four things:
- The action the agent took (click, type, scroll, navigate, wait, ask_user, tool_result).
- Action metadata: selectors, coordinates, key sequences, URLs.
- A pre-action and post-action screenshot (deduped by perceptual hash to skip identical frames).
- The agent's rationale, if the model exposed one.
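To make the dedupe step concrete, here is a minimal sketch of skipping identical frames by perceptual hash. The post doesn't say which hash the SDK uses, so this assumes a simple average hash (aHash) over grayscale pixel grids; the function names and the `max_distance` threshold are illustrative, not obsrv's API.

```python
from typing import List, Optional, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 grayscale values


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Average hash: block-average down to size x size, then threshold at the mean.

    Assumes the frame is at least size x size pixels.
    """
    height, width = len(pixels), len(pixels[0])
    cells: List[float] = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if brighter than the frame's mean, else 0.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe_frames(frames: Sequence[Gray], max_distance: int = 0) -> List[int]:
    """Return indices of frames to keep, dropping any frame whose hash is
    within max_distance bits of the last kept frame (hypothetical policy)."""
    kept: List[int] = []
    last_hash: Optional[int] = None
    for index, frame in enumerate(frames):
        frame_hash = average_hash(frame)
        if last_hash is None or hamming(frame_hash, last_hash) > max_distance:
            kept.append(index)
            last_hash = frame_hash
    return kept
```

Comparing against the last kept frame, rather than every earlier frame, keeps the check cheap and preserves the timeline when the screen returns to an earlier state. A real pipeline would likely decode and resize images with an imaging library first.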
Screenshots go to object storage under the org/project/trace path. The trace JSON only carries pointers, not pixels.
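As a rough illustration of "pointers, not pixels", a step record might look like the sketch below. The field names, the `step-0047-pre.png` key layout, and the `build_step` helper are all assumptions made for this example; only the org/project/trace prefix and the four captured items come from the description above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


def screenshot_key(org: str, project: str, trace_id: str, step: int, phase: str) -> str:
    """Build an object-storage key under org/project/trace (file naming is assumed)."""
    return f"{org}/{project}/{trace_id}/step-{step:04d}-{phase}.png"


@dataclass
class Step:
    index: int
    action: str                                  # e.g. "click", "type", "navigate"
    metadata: Dict[str, Any] = field(default_factory=dict)
    pre_screenshot: Optional[str] = None         # object key, never image bytes
    post_screenshot: Optional[str] = None        # object key, never image bytes
    rationale: Optional[str] = None              # only if the model exposed one


def build_step(org: str, project: str, trace_id: str, index: int, action: str,
               metadata: Dict[str, Any], rationale: Optional[str] = None) -> Step:
    """Assemble a step whose screenshots are referenced by storage key."""
    return Step(
        index=index,
        action=action,
        metadata=metadata,
        pre_screenshot=screenshot_key(org, project, trace_id, index, "pre"),
        post_screenshot=screenshot_key(org, project, trace_id, index, "post"),
        rationale=rationale,
    )


def to_trace_json(step: Step) -> str:
    """Serialize the step; the payload stays small because it holds keys only."""
    return json.dumps(asdict(step))
```

Because each step serializes to a few hundred bytes regardless of screenshot size, the trace JSON grows with the number of steps rather than the number of pixels.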
How we render them
The replay viewer is a horizontal frame strip with a sliding action label, paired with a step rail on the left. You can scrub through screenshots like a video, or jump to any individual action and see the rationale, the selector, and the resulting DOM diff.
Cross-trace navigation matters here more than anywhere else: a failure in step 47 of one trace usually has cousins in dozens of other traces, and the only fast way to debug it is to put four replays next to each other.
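One way to picture finding those cousins is to group failed steps by a shared signature. The sketch below assumes a signature of (action, selector) and a flat tuple format for failed steps; both are hypothetical choices for illustration, not how the product matches traces.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (trace_id, step_index, action, selector): an assumed shape for a failed step.
FailedStep = Tuple[str, int, str, str]
Signature = Tuple[str, str]


def group_failures(failed_steps: Iterable[FailedStep]) -> Dict[Signature, List[Tuple[str, int]]]:
    """Group failed steps by (action, selector) so related failures across
    traces can be opened side by side."""
    groups: Dict[Signature, List[Tuple[str, int]]] = defaultdict(list)
    for trace_id, step_index, action, selector in failed_steps:
        groups[(action, selector)].append((trace_id, step_index))
    return dict(groups)
```

Each group then gives a list of (trace, step) pairs to load into adjacent replays.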
What we're working on
Replay parity for mobile and desktop OS automation is the next piece. It's a different action vocabulary and a different screenshot resolution policy, but the core trace contract holds.
We're also experimenting with a learned diff renderer that shows what changed between two screenshots without you having to spot it. Early demos are shockingly useful.
Read the docs to instrument a computer-use agent in about ten minutes.