T3 Judge Implementation Plan

The LLM-as-judge half of the eval program — the piece Kinnari's main implementation plan explicitly defers.

Published on Stellar wiki under vibechk-evals · Built on the existing services/promptfoo-evals/ infrastructure

"T3 Judge dimensions and new outcome-focused pillars (P10–P12) are explicitly out of scope for this phase and planned separately." — Kinnari's Evaluation Implementation Plan

This is that separate plan.

Cross-references on the wiki — read first to avoid duplication:

Methodology summary — layer naming, severity model, counts (172 active · 37 are T3)
Interview Agent Evaluation Metrics — full pillar-by-pillar catalog (the 37 judges live here)
Current Evaluation Tooling and Limitations — existing infra we build on

T3 judges (catalog): 37 · Kinnari v0.1 final
Net-new from us: 3 complements beyond the 37
First wave: 5 + 2 · 2-week target
New infrastructure: none · everything we need already exists

1. What we leverage: no new infrastructure needed

Per the current-tooling page, almost everything T3 needs is already in the repo.

Capability needed for T3	Already in repo
Multi-turn agent driver	✅ `conversation-runner-provider.ts` orchestrates scripted candidate turns
`llm-rubric` assertion	✅ Workhorse · default judge `openai:gpt-4o-mini` · `openai:gpt-4o` for nuanced judges
Scripted candidates	✅ JSONL convention: `vars.candidate_turns`
Test corpus	✅ ~200 cases across ~9 JSONL files in `samples/`
Letta + Claude provider parity	✅ Same eval matrix runs both
Threshold mechanism	✅ `gates.json` (descriptive today — T3 thresholds added once calibrated)
Custom JS assertion pattern	✅ `voice-check-banned.js` returns `{pass, score, reason}`

T3 is essentially a registry of assertion files, mostly `llm-rubric`, pointing at the existing multi-turn outputs. No new providers, no new conversation infrastructure.
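To make the fixture convention concrete, here is a minimal sketch of reading a JSONL fixture and pulling out `vars.candidate_turns`. The helper names `parseJsonl` and `candidateTurns` are hypothetical, and treating `candidate_turns` as an array is an assumption about the fixture shape, not something confirmed by the repo.

```javascript
// Hypothetical helpers, not the repo's loader.
// Assumes each non-empty line of a samples/*.jsonl file is one JSON test case.
function parseJsonl(text) {
  const cases = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines
    try {
      cases.push(JSON.parse(line));
    } catch (err) {
      throw new Error(`Invalid JSON on line ${index + 1}: ${err.message}`);
    }
  });
  return cases;
}

// Assumes vars.candidate_turns is an array of scripted candidate messages.
function candidateTurns(testCase) {
  const turns = testCase?.vars?.candidate_turns;
  return Array.isArray(turns) ? turns : [];
}
```

A judge file would not call these directly; the sketch only shows which field of a fixture the multi-turn driver is expected to consume.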

2. The bridge: `grader_type` → promptfoo assertion (net-new framing)

Every indicator carries a grader_type. For T3 specifically:

`grader_type`	promptfoo assertion type
`llm_rubric_per_turn`	`llm-rubric` graded on a single agent turn
`llm_rubric_per_transcript`	`llm-rubric` (or `trajectory:goal-success` / `conversation-relevance`) on full transcript
`ground_truth`	Human-labeled gold (calibration set only — not at runtime)

Same recipe for all 37 judges. Only the rubric text and the context vars differ.
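The mapping above can be sketched as a small lookup. This is an illustration under stated assumptions rather than the repo's code: the function name `toPromptfooAssertion`, the indicator fields `id`, `graderType` and `rubric`, and the `id:scope` naming used for `metric` are all hypothetical. Only the three `grader_type` values and the `llm-rubric` assertion type come from the table.

```javascript
// Hypothetical bridge from a catalog indicator to a promptfoo assertion object.
const RUNTIME_SCOPES = new Map([
  ["llm_rubric_per_turn", "turn"],
  ["llm_rubric_per_transcript", "transcript"],
]);

function toPromptfooAssertion(indicator) {
  if (indicator.graderType === "ground_truth") {
    // Human-labeled gold is calibration-only, so no runtime assertion is produced.
    return null;
  }
  const scope = RUNTIME_SCOPES.get(indicator.graderType);
  if (scope === undefined) {
    throw new Error(`Unknown grader_type: ${indicator.graderType}`);
  }
  return {
    type: "llm-rubric",
    value: indicator.rubric,
    // Assumed naming scheme so per-turn and per-transcript scores stay distinguishable.
    metric: `${indicator.id}:${scope}`,
  };
}
```

Returning `null` for `ground_truth` keeps the calibration set out of the runtime suite, which matches the table's "not at runtime" note.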

3. The recipe: one judge = one `llm-rubric`

Concrete example for P4 Push-back craft:

# services/promptfoo-evals/taxonomy-evals/judges/p4-pushback-craft.yaml
defaultTest:
  options:
    provider:
      id: openai:gpt-4o-mini       # match existing default
      config: { temperature: 0 }
  assert:
    - type: llm-rubric
      value: |
        Pillar: P4 Interview Mechanics. Indicator: Push-back craft.
        Persona: {{alt.persona.summary}}.
        Candidate background: {{candidate.background}}.
        Prior turn (flagged textbook): {{candidate.lastMessage}}.
        Agent: {{agent.message}}.

        Score 0-2:
          0 — accepted textbook answer / moved on
          1 — generic challenge ("can you say more?")
          2 — specific, in-persona challenge naming the gap

        JSON: { "score": 0|1|2, "rationale": "<one sentence>", "evidence_span": "<quote>" }

Three real promptfoo capabilities make this work — all already in use here:

{{var}} interpolation inside llm-rubric.value
provider override at suite or assertion level (swap to gpt-4o for nuanced judges)
temperature: 0 + pinned model for reproducible scores
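The rubric above asks the judge for a 0-2 score with a rationale and evidence span, while the custom-assertion pattern in the repo returns `{pass, score, reason}`. A minimal sketch of bridging the two follows. It assumes the judge's raw JSON text is available to a custom assertion or post-processing step; the function name `normaliseJudgeOutput` and the `PASS_LEVEL` cut-off are hypothetical and not calibrated values.

```javascript
// Hypothetical normaliser: judge JSON on a 0-2 scale -> { pass, score, reason }.
// PASS_LEVEL is an assumed cut-off for illustration only.
const PASS_LEVEL = 2;

function normaliseJudgeOutput(rawText) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch (err) {
    return { pass: false, score: 0, reason: `Judge returned invalid JSON: ${err.message}` };
  }
  const level = parsed?.score;
  if (![0, 1, 2].includes(level)) {
    return { pass: false, score: 0, reason: `Judge score out of range: ${JSON.stringify(level)}` };
  }
  const rationale = typeof parsed.rationale === "string" ? parsed.rationale : "";
  const evidence = typeof parsed.evidence_span === "string" ? parsed.evidence_span : "";
  return {
    pass: level >= PASS_LEVEL,
    score: level / 2, // rescale 0-2 to 0-1
    reason: evidence ? `${rationale} Evidence: "${evidence}"` : rationale,
  };
}
```

Failing closed on malformed or out-of-range output keeps a misbehaving judge from silently inflating scores.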

4. Three complements beyond the 37 catalog judges (net-new)

The 37 judges in the metrics page don't cover these distinct failure modes. Two of the complements use a specialised promptfoo assertion type (`context-faithfulness`, `similar`); the third reuses `llm-rubric` with a founder-persona judge prompt.

Complement	Why net-new	promptfoo assertion
Agent-claim groundedness	Distinct from "Rationale traceability" (P6) which grades rubric→verdict. This checks factual claims about candidate / company / persona's past. Catches the hallucination class.	`context-faithfulness` with `contextTransform` — pulls alt memory + JD + org corpus into context
Founder-verdict alignment	Beyond "agent applied rubric correctly" — would the actual founder agree with STRONG_YES/YES/MAYBE/NO? Ground-truth signal we'll need over time.	`llm-rubric` with a founder-persona judge prompt; agreement tracked across runs
Voice-match bulk triage	P1's 8 judges decompose voice well, but for fast pre-screen across many transcripts an embedding similarity score is cheaper and complements the granular ones.	`similar` against a gold transcript
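For the voice-match triage row, the underlying idea is an embedding similarity score compared against a threshold. The sketch below uses plain cosine similarity on precomputed vectors, which is an assumption about the metric. The names `cosineSimilarity` and `triageByVoice` are hypothetical, as is the input shape `{ id, embedding }`. How embeddings are produced is out of scope here.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("Vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical triage: return transcripts whose similarity to the gold
// transcript falls below the threshold, least similar first.
function triageByVoice(transcripts, goldEmbedding, threshold) {
  return transcripts
    .map((t) => ({ id: t.id, similarity: cosineSimilarity(t.embedding, goldEmbedding) }))
    .filter((r) => r.similarity < threshold)
    .sort((x, y) => x.similarity - y.similarity);
}
```

Only the flagged transcripts would then go to the granular P1 judges, which is where the cost saving comes from.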

5. First wave: 5 judges + 2 complements, 2 weeks

#	Pillar	Indicator	Source	Why first
1	P4	Push-back craft	37	Highest leverage on interview quality
2	P4	Push-back authenticity	37	Pairs with #1
3	P4	Signal sufficiency	37	Catches pacing bugs
4	P5	Per-candidate question fit	37	Catches "generic question" failures
5	P6	Rationale traceability	37	Verdict-evidence gap
6	—	Agent-claim groundedness	Net-new	Hallucination — not in 37
7	—	Founder-verdict alignment	Net-new	Calibration signal long-term — start collecting now

Week 1

Create services/promptfoo-evals/taxonomy-evals/judges/ with files 1–5 as llm-rubric yaml
Wire each to the existing conversation-runner-provider.ts output
Reuse existing samples/*.jsonl fixtures (no new test cases needed yet)
Run on Letta + Claude providers · scored, not yet blocking

Week 2

Add complements 6–7 (specialised assertions: context-faithfulness, founder-persona llm-rubric)
Hand-label ~20 transcripts as calibration gold
Compute judge-vs-human κ per judge
Add T3 per-pillar thresholds to gates.json once baselines exist
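The judge-vs-human κ step can be sketched as follows, assuming the plan means Cohen's κ over paired labels (the plan does not name the variant). The function name `cohensKappa` is hypothetical, and the inputs are two equal-length arrays of labels for the same transcripts.

```javascript
// Unweighted Cohen's kappa for two raters labelling the same items.
// Assumption: labels are comparable with === (e.g. 0, 1, 2 or verdict strings).
function cohensKappa(judgeLabels, humanLabels) {
  const n = judgeLabels.length;
  if (n === 0 || n !== humanLabels.length) {
    throw new Error("Label arrays must be non-empty and of equal length");
  }
  const judgeCounts = new Map();
  const humanCounts = new Map();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (judgeLabels[i] === humanLabels[i]) agreements++;
    judgeCounts.set(judgeLabels[i], (judgeCounts.get(judgeLabels[i]) ?? 0) + 1);
    humanCounts.set(humanLabels[i], (humanCounts.get(humanLabels[i]) ?? 0) + 1);
  }
  const observed = agreements / n;
  // Chance agreement: sum over labels of the product of each rater's marginal rate.
  let expected = 0;
  for (const [label, count] of judgeCounts) {
    expected += (count / n) * ((humanCounts.get(label) ?? 0) / n);
  }
  // Kappa is undefined when chance agreement is 1 (both raters used one identical label).
  if (expected === 1) return NaN;
  return (observed - expected) / (1 - expected);
}
```

For ordinal 0-2 scores a weighted κ may be a better fit, since it penalises a 0-vs-2 disagreement more than 1-vs-2; this sketch uses the unweighted form for simplicity.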

6. Calibration & gating

Pinning — every judge declares provider.id + temperature: 0. Rubric text version-controlled with the yaml. Model bump = explicit re-calibration.
Gold set — start with 20 transcripts per priority judge; grow to ~50 over the next phase.
κ threshold — advisory at κ ≥ 0.6 · blocking at κ ≥ 0.75 · per-judge, not global.
gates.json integration — T3 thresholds added after week-2 baseline calibration. Until then, T3 results report only.
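The per-judge κ thresholds above translate into a simple tiering rule. This is a sketch, not the `gates.json` schema: the function names and the tier label `report-only` are hypothetical, while the 0.6 and 0.75 cut-offs come from the list above.

```javascript
// Cut-offs taken from the plan; applied per judge, not globally.
const ADVISORY_KAPPA = 0.6;
const BLOCKING_KAPPA = 0.75;

// Map one judge's kappa to a gating tier.
function gatingTier(kappa) {
  if (!Number.isFinite(kappa)) return "report-only"; // uncalibrated or undefined kappa
  if (kappa >= BLOCKING_KAPPA) return "blocking";
  if (kappa >= ADVISORY_KAPPA) return "advisory";
  return "report-only";
}

// Hypothetical convenience: { judgeId: kappa } -> { judgeId: tier }.
function tiersByJudge(kappaByJudge) {
  return Object.fromEntries(
    Object.entries(kappaByJudge).map(([judge, kappa]) => [judge, gatingTier(kappa)])
  );
}
```

Treating a missing or undefined κ as `report-only` matches the rule that T3 results only report until baseline calibration exists.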

7. Decisions for the team

Decision	Options
Default judge model	`openai:gpt-4o-mini` (current default) · `openai:gpt-4o` for nuanced judges (push-back, rationale traceability, founder-verdict); pin per assertion
Judge rubric authorship	One owner per pillar · round-robin · single owner for v0
First wave (5 + 2)	Proposal in §5 — open to swaps
κ thresholds	0.6 advisory · 0.75 blocking, per judge
When T3 enters `gates.json`	After baseline calibration · sooner with advisory-only thresholds
Outcome pillars (P10–P12, also deferred)	Predictive validity, bias, calibration vs hire-outcome — discuss as a separate v0.1 plan once T3 v0 is shipping

Net-new from this plan: positioning as the deferred T3 track, the grader_type → promptfoo assertion bridge framing, the 3 complements beyond the 37 catalog judges, and the first-wave 5+2 sequencing.
Builds entirely on: Kinnari's catalog (37 T3 indicators in the metrics page) and the existing services/promptfoo-evals/ infrastructure.
Also published at: vibechk evals wiki (Stellar).