// evaluation · non-deterministic work · long-life agent runs

The controller decides. The evaluator only scores.

Python · Claude Agent SDK (text judge) · Gemini (vision judge) · Playwright · runs over Z.ai at runtime
A long-running agent is only as trustworthy as the thing that decides whether its work is done.
1:0
deterministic controller owns the stop/pivot — the LLM never does

The core inversion: the evaluator never decides. It scores the artifact against the compiled contract and returns findings. A deterministic controller — ordinary code — reads that score and chooses refine, pivot, reset, accept, or human-review. Putting the stop decision in code, not in the judge prompt, is what makes a loop you can leave running unattended.

Everything else is in service of that split. The judge is calibrated, self-improving, and replaceable; the controller is stable, auditable, and never touched by calibration. You can sharpen the evaluator as much as you want and the safety boundary doesn't move.

// the loop — judge scores, controller decides, generator revises
goal                     # "write a report on whether an AI agency is viable"
 → contract compiler    # acceptance criteria, compiled to a rubric
 → generator            # produces the artifact
 → evidence collector  # screenshots, DOM, console health (web_app)
 → evaluator / critic   # scores against rubric — never decides
 → loop controller      # refine / pivot / reset / accept / human_review
 ↺ (until accept, budget, or human_review)

// structural decisions worth knowing

  • build vs runtime
    QLoop never edits ~/.claude/settings.json and never redirects your development session. Building the tool uses normal Anthropic Claude Code; running it loads .env keys and routes the judge over Z.ai. The Z.ai mapping is confined to one provider boundary file. Your dev environment is untouched — a discipline most agent tools violate.
  • judge the evidence, not the transcript
    This borrows /goal's loop skeleton but rejects its transcript-judging. The judge sees real artifact evidence — the rendered page, the report file, the collected screenshots — never the conversation that produced it. Judging the transcript rewards eloquence about the work; judging the evidence rewards the work.
  • calibration that can't corrupt the controller
    The judge improves across runs by calibrating against curated reference anchors (vision + text). A self-improvement loop distils concrete scoring lessons and keep-or-reverts each one only if a held-out anchor doesn't regress. Crucially: all calibration sharpens only the evaluator, never the deterministic controller. The safety boundary is a hard line.
  • discrimination, not just calibration
    A lesson can lower average drift by compressing scores toward the middle — improving agreement while collapsing the judge's ability to tell good from bad. The curator reverts any lesson that regresses an individual anchor's agreement, so no quality tier is sacrificed for another. Average drift is a lie without per-anchor discrimination.
  • contract negotiation
    Before building, a generator-voice proposes acceptance criteria and a harsh evaluator-voice pushes back until they agree; the negotiated criteria become the contract. A deterministic structural guard vetoes untestable criteria even when the critic is too agreeable. The contract isn't handed down — it's argued into existence, then guarded.
  • OK... go ahead and ask...
    Can I really get my agent to work on a task for days, or can only Boris do that?