Prescreen Prompt Quality & a Systemic Eval Framework

From "fix this one prompt" → "automatically measure every prompt our workflows generate."

Discussion doc for the team · grounded in a live config (Thumpn / Performance Marketing Lead / Meghana) vs a hand-written ideal · pairs with the eval taxonomy in PR #155

The one thing to take away. The interview agent prompt is already assembled from a fixed template (interviewer.md) whose structure is identical to a hand-written "ideal" prompt: the same <role>, <persona>, <communication_style>, <interview_overview>, <conversation_flow>, evaluation, guardrails, and anti-AI blocks. The scaffold is a ~100% match. The entire quality gap lives in the variable content that the generator workflows fill in. So the systemic problem is not the template; it is that we have no automated way to measure what each generator produces.

1. How the final prompt is built

Four workflows produce content; a compiler slots it into the fixed template at interview runtime.

  • create-org → Org memory blocks (company, theses, culture)
  • build-alt → Persona · Comm-style · Alt memory blocks
  • create-job → Job memory block (JD + requirements)
  • create-interview → Interview plan (buckets + rubric)

⬇  Compiler: interviewer.md fixed template (role · flow · evaluation · control · guardrails · anti-AI blocks)

FINAL INTERVIEW AGENT PROMPT  (assembled per session)

The fixed scaffold already matches the ideal. Everything below is about the four variable feeds.
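
A minimal sketch of the compile step described above, to make the "fixed scaffold, variable feeds" split concrete. Everything here is an assumption for illustration: the PromptFeeds shape, the {{slot}} placeholder syntax, and the TEMPLATE string standing in for interviewer.md are hypothetical, and the real compiler may look quite different.

```typescript
// Hypothetical shapes: the real compiler and interviewer.md slots are not shown in this doc.
interface PromptFeeds {
  orgMemoryBlocks: string[];  // from create-org
  persona: string;            // from build-alt
  communicationStyle: string; // from build-alt
  altMemoryBlocks: string[];  // from build-alt
  jobMemoryBlock: string;     // from create-job
  interviewPlan: string;      // from create-interview
}

// Stand-in for interviewer.md: a fixed scaffold with {{slot}} placeholders.
const TEMPLATE = [
  "<role>You are the interviewer.</role>",
  "<persona>{{persona}}</persona>",
  "<communication_style>{{communicationStyle}}</communication_style>",
  "<memory>{{memory}}</memory>",
  "<interview_overview>{{interviewPlan}}</interview_overview>",
].join("\n");

// Fills every placeholder; throws if a feed left a slot empty.
function compilePrompt(feeds: PromptFeeds): string {
  const slots: Record<string, string> = {
    persona: feeds.persona,
    communicationStyle: feeds.communicationStyle,
    memory: feeds.orgMemoryBlocks
      .concat(feeds.altMemoryBlocks, [feeds.jobMemoryBlock])
      .join("\n"),
    interviewPlan: feeds.interviewPlan,
  };
  return TEMPLATE.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    const value = slots[name];
    if (value === undefined || value.trim() === "") {
      throw new Error("Unfilled template slot: " + name);
    }
    return value;
  });
}
```

The point of the sketch: the template never changes between sessions, so any quality difference between two compiled prompts traces back to one of the feeds.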

2. How close is it today?

Scored slot-by-slot against the hand-written ideal for this one config. Overall ≈ 70%. The gaps cluster in build-alt and create-interview.

Prompt slot | Source | Closeness | Main gap
Scaffold (role / flow / eval / control / guardrails / anti-AI) | fixed template | ~100% (match) | Same template as the ideal
Job memory block | create-job | ~90% (strong) | JD ≈ ideal almost verbatim
Org memory blocks | create-org | ~75% (depends) | On deep-research depth
Interview plan | create-interview | ~70% (gaps) | Missed the core bucket; scripted Qs
Persona | build-alt | ~65% (abstract) | Trait labels, no concrete background
Memory blocks | build-alt + memory-analyzer | ~60% (verbose) | 86 blocks, redundant / duplicated
Communication style | build-alt | ~55% (weakest) | Learned from LinkedIn posts → sounds like a job ad

These are directional scores for one config, to show where the leverage is — not a metric. Building the real metric is the point of section 4.

3. Where each workflow falls short (and the fix)

build-alt · biggest lever

  • Comm-style learned from broadcast posts. Signature phrases come out as HIRING:, Apply Now:, hashtag blocks — the agent talks like a job ad, not a 1:1 interviewer.
    Fix: weight conversational sources (chat exports, call transcripts) over posts, or add a "post-voice → chat-voice" transform.
  • Persona too abstract. Jumps to OCEAN + vague labels ("People With Standards"); missing the concrete background ("built Insider IPs, 2,500+ events, Arijit UK stadium").
    Fix: synthesize a Professional Background summary into persona.
  • Memory bloat. 86 blocks with near-duplicates (3 overlapping company narratives, duplicate AAM 10x).
    Fix: dedupe/merge + cap (~20–25) + rank by interview relevance.
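
The dedupe + cap + rank fix could be sketched roughly as below. This is an illustration under stated assumptions, not the build-alt implementation: the MemoryBlock shape, the relevance score, the token-overlap (Jaccard) similarity, and the 0.8 threshold are all hypothetical choices. A real version might use embeddings or merge duplicates instead of dropping them.

```typescript
// Hypothetical block shape; `relevance` is an assumed 0..1 interview-relevance score.
interface MemoryBlock {
  id: string;
  text: string;
  relevance: number;
}

// Unique lowercase word tokens of a block.
function tokenSet(text: string): string[] {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Jaccard similarity between two unique-token lists.
function jaccard(a: string[], b: string[]): number {
  const shared = a.filter((t) => b.indexOf(t) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 1 : shared / union;
}

// Rank by relevance, drop near-duplicates of already-kept blocks, stop at the cap.
function dedupeAndCap(blocks: MemoryBlock[], cap = 25, threshold = 0.8): MemoryBlock[] {
  const ranked = blocks.slice().sort((x, y) => y.relevance - x.relevance);
  const kept: MemoryBlock[] = [];
  const keptTokens: string[][] = [];
  for (const block of ranked) {
    if (kept.length >= cap) break;
    const tokens = tokenSet(block.text);
    const isDuplicate = keptTokens.some((other) => jaccard(tokens, other) >= threshold);
    if (!isDuplicate) {
      kept.push(block);
      keptTokens.push(tokens);
    }
  }
  return kept;
}
```

Ranking before deduping means that when two blocks overlap, the more interview-relevant one survives.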

create-job · strongest

  • JD + requirements are already close to the ideal. Smallest gap.
  • One systemic win: emit the role's non-negotiables (CAC/LTV, experimentation, hands-on, ambiguity) as a structured handoff the interview planner consumes — so buckets map 1:1 to must-haves.
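
One way to picture the structured handoff and the 1:1 mapping it enables. The NonNegotiable and Bucket shapes and the `covers` field are hypothetical names for illustration; the actual create-job output format is not defined in this doc.

```typescript
// Hypothetical handoff emitted by create-job.
interface NonNegotiable {
  id: string;
  label: string;
}

// Hypothetical planner bucket; `covers` lists the non-negotiable ids it addresses.
interface Bucket {
  name: string;
  covers: string[];
}

interface CoverageReport {
  missing: string[];       // must-haves no bucket covers
  doubleCovered: string[]; // must-haves covered by more than one bucket
}

// Checks that each must-have maps to exactly one bucket.
function checkCoverage(mustHaves: NonNegotiable[], buckets: Bucket[]): CoverageReport {
  const missing: string[] = [];
  const doubleCovered: string[] = [];
  for (const m of mustHaves) {
    const count = buckets.filter((b) => b.covers.indexOf(m.id) !== -1).length;
    if (count === 0) missing.push(m.id);
    if (count > 1) doubleCovered.push(m.id);
  }
  return { missing, doubleCovered };
}

// The four non-negotiables named above, as an example handoff.
const exampleMustHaves: NonNegotiable[] = [
  { id: "cac-ltv", label: "CAC/LTV" },
  { id: "experimentation", label: "Experimentation" },
  { id: "hands-on", label: "Hands-on" },
  { id: "ambiguity", label: "Ambiguity" },
];
```

With a handoff like this, "planner dropped a must-have" becomes a mechanical check instead of something a reviewer has to notice.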

create-interview · content gaps

  • Dropped the core bucket. Plan double-covers "build from zero" and omits Retention / Discovery-habit — the actual business moat. No dedicated unit-economics bucket either.
    Fix: feed org strategic theses + job non-negotiables into the planner so the key dimension can't be missed.
  • Scripted, not adaptive. Outputs one verbatim question per bucket; the ideal defines territory + signals + per-candidate entry points.
    Fix: change planner output contract to territories, not fixed questions.
  • Under-fed planner (known: "didn't get job input", "no web research").
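
A possible shape for the "territories, not fixed questions" output contract, as a sketch only. The field names and the "at least two entry points" rule are assumptions made for illustration; the team would need to settle the real contract.

```typescript
// Hypothetical planner output per bucket: ground to explore, not a script.
interface Territory {
  bucket: string;
  territory: string;     // what area the interviewer should explore
  signals: string[];     // what to listen for in the answers
  entryPoints: string[]; // alternative ways in, picked per candidate
}

// Returns a list of contract violations; an empty list means the territory is well-formed.
function validateTerritory(t: Territory): string[] {
  const problems: string[] = [];
  if (t.territory.trim() === "") problems.push("territory is empty");
  if (t.signals.length === 0) problems.push("no signals defined");
  // Assumed rule: a single entry point is effectively one scripted question.
  if (t.entryPoints.length < 2) problems.push("fewer than 2 entry points");
  return problems;
}
```

A validator like this would also slot naturally into the deterministic checks proposed in the next section.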

4. The systemic shift: eval the generators, not the prompt

One hand-fixed prompt does not scale across personas, companies, and roles. We need to score every generated artifact automatically. Proposed three layers — cheapest/fastest first.

L1 Deterministic generation checks

Per workflow step. Cheap, fast, runs in CI on every generation. Pass/fail.

  • Context completeness — did the step receive all required inputs? (e.g. planner got job spec + org theses + persona)
  • Structure — required sections present, every template slot filled & non-empty
  • Hygiene — block-count cap, no duplicate / empty memory blocks, OCEAN present, buckets ≥ N, rubric present

Substrate: final-prompt-components.ts already segments the prompt by source — perfect input for these checks.
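
A rough sketch of what L1 structure and hygiene checks over the segmented prompt might look like. The PromptComponent shape is an assumption (the actual output of final-prompt-components.ts is not shown here), the "memory" slot name is hypothetical, and the context-completeness check is left out because it needs per-step input logs rather than the final prompt.

```typescript
// Assumed shape of one segment from final-prompt-components.ts (hypothetical).
interface PromptComponent {
  source: string; // e.g. "build-alt", "create-job"
  slot: string;   // e.g. "persona", "memory"
  text: string;
}

interface CheckResult {
  name: string;
  pass: boolean;
  detail: string;
}

// Collapse case and whitespace so trivially different copies compare equal.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function runL1Checks(
  components: PromptComponent[],
  requiredSlots: string[],
  maxMemoryBlocks: number
): CheckResult[] {
  const results: CheckResult[] = [];

  // Structure: every required slot has at least one non-empty component.
  const missing = requiredSlots.filter(
    (slot) => !components.some((c) => c.slot === slot && c.text.trim() !== "")
  );
  results.push({
    name: "structure",
    pass: missing.length === 0,
    detail: missing.length === 0 ? "all required slots filled" : "missing or empty: " + missing.join(", "),
  });

  // Hygiene: block-count cap, no empty blocks, no exact duplicates after normalization.
  const memory = components.filter((c) => c.slot === "memory").map((c) => normalize(c.text));
  const empty = memory.filter((t) => t === "").length;
  const duplicates = memory.filter((t, i) => t !== "" && memory.indexOf(t) !== i).length;
  results.push({
    name: "memory-cap",
    pass: memory.length <= maxMemoryBlocks,
    detail: memory.length + " blocks, cap " + maxMemoryBlocks,
  });
  results.push({
    name: "memory-hygiene",
    pass: empty === 0 && duplicates === 0,
    detail: empty + " empty, " + duplicates + " duplicate",
  });
  return results;
}

// Pass/fail for the whole layer: any failed check fails L1.
function passesL1(results: CheckResult[]): boolean {
  return results.every((r) => r.pass);
}
```

Each check returns a detail string so a CI failure says which slot or block caused it.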

L2 LLM-judge on the final prompt

Static checks on the assembled prompt. Scored 0–100 per dimension. Runs on every prompt.

  • AI-tell scan (em dashes, triplets, buzzwords)
  • Rudeness / tone & greeting handling
  • Voice match to source persona
  • Bucket coverage vs job non-negotiables
  • Redundancy / verbosity of memory blocks
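
Some of these dimensions need a model as judge, but the AI-tell scan has a cheap deterministic core that could run before (or alongside) the judge. A sketch under assumptions: the buzzword list, the triplet regex, and the "100 minus 10 per tell" scoring are placeholders chosen for illustration, not an agreed rubric.

```typescript
interface AiTellReport {
  emDashes: number;
  triplets: number;
  buzzwords: number;
  score: number; // 0..100, higher is cleaner
}

// Hypothetical buzzword list; the real one would come from the team's rubric.
const BUZZWORDS = ["synergy", "leverage", "passionate", "rockstar"];

function scanAiTells(text: string): AiTellReport {
  // Count em dash characters (U+2014).
  const emDashes = text.split("\u2014").length - 1;

  // Rough "X, Y, and Z" triplet pattern over single words.
  const tripletMatches = text.match(/\b\w+, \w+,? and \w+\b/g);
  const triplets = tripletMatches === null ? 0 : tripletMatches.length;

  // Count words that appear in the buzzword list.
  const words = text.toLowerCase().split(/[^a-z]+/);
  const buzzwords = words.filter((w) => BUZZWORDS.indexOf(w) !== -1).length;

  // Assumed scoring: 10 points off per tell, floored at 0.
  const score = Math.max(0, 100 - 10 * (emDashes + triplets + buzzwords));
  return { emDashes, triplets, buzzwords, score };
}
```

Tone, voice match, and coverage judgments would still go to the LLM judge; this only removes the part that a regex can settle.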

L3 Behavioral / transcript evals

Run a candidate simulator against the assembled agent; grade the transcript. The ground truth — slowest/costliest.

  • Graded on the PR #155 taxonomy: 9 pillars × 10 gestalts (Authenticity, Specificity, Rigor, Respect, Resilience…)
  • Catches what static checks can't: does it actually push back, adapt, stay in character, resist jailbreaks?

The open question we converged on: are L1 + L2 enough to be confident, or do we need L3 every time?

Proposal to debate → L1 gates every generation (CI, blocking). L2 runs on every prompt (scored, soft-gate). L3 runs sampled / nightly + on any taxonomy or generator-prompt change. We use L3 to validate that L1+L2 correlate with real behavior; if they do, we lean on the cheap layers day-to-day.
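
The proposal can be written down as a small policy function plus a correlation check, sketched below. The trigger names, the soft threshold of 70, and the use of Pearson correlation between an L2 score and an L3 score per prompt are assumptions for discussion, not decisions. Sampling of L3 runs is left out to keep the sketch deterministic.

```typescript
type Trigger = "generation" | "nightly" | "taxonomy-change" | "generator-prompt-change";

interface LayerPlan {
  l1: boolean;
  l2: boolean;
  l3: boolean;
}

// L1 and L2 run on everything; L3 only on nightly runs or relevant changes.
function planLayers(trigger: Trigger): LayerPlan {
  return { l1: true, l2: true, l3: trigger !== "generation" };
}

interface L2Score {
  dimension: string;
  score: number; // 0..100
}

interface GateDecision {
  blocked: boolean;   // hard gate: L1 only
  warnings: string[]; // soft gate: low L2 dimensions
}

function gate(l1Pass: boolean, l2Scores: L2Score[], softThreshold = 70): GateDecision {
  const warnings = l2Scores
    .filter((s) => s.score < softThreshold)
    .map((s) => s.dimension + " scored " + s.score + " (below " + softThreshold + ")");
  return { blocked: !l1Pass, warnings };
}

// Pearson correlation between cheap-layer scores (x) and L3 behavioral scores (y).
// Returns NaN when there are fewer than 2 points or either side has no variance.
function pearson(points: { x: number; y: number }[]): number {
  const n = points.length;
  if (n < 2) return NaN;
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (const p of points) {
    const dx = p.x - meanX;
    const dy = p.y - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}
```

If the correlation stays high across the gold set, that is the evidence for leaning on L1 + L2 day-to-day; if it drops, L3 needs to run more often.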

5. How this fits with PR #155 + what an eval set looks like

Two pieces that snap together

Piece | What it is
PR #155 taxonomy | The rubric: "what good looks like" (9 pillars, 10 gestalts). Powers L2 & L3 scoring.
Component breakdown | The harness input: final-prompt-components.ts splits the prompt by source. Powers L1.
This doc | The runner/gating direction that connects them (the layered model).

#155 is the taxonomy (intentionally not the framework). It's detailed; we adopt it as the L2/L3 rubric and keep L1 lean.

The eval set (so we fix the system, not one alt)

  • 3–5 gold cases: persona × company × role (Meghana/Thumpn, Shreyas/…, +others)
  • Each has: blessed reference prompt + a few sample candidate transcripts (strong / mid / weak / adversarial)
  • Persona-generic by design (matches #155's genericity governance) so it scales to new personas via overlays
  • Re-run on every generator-prompt change → catch regressions before they ship
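
To make the gold set and the regression re-run concrete, here is a hedged sketch. The GoldCase fields mirror the bullets above, but the names, the per-case numeric score, and the tolerance of 2 points are hypothetical. How a case gets its score (L2 judge, L3 transcript grade, or a blend) is deliberately left open.

```typescript
type CandidateTier = "strong" | "mid" | "weak" | "adversarial";

// Hypothetical gold-case record: persona × company × role plus references.
interface GoldCase {
  id: string;
  persona: string;
  company: string;
  role: string;
  referencePrompt: string; // the blessed reference prompt
  transcripts: { tier: CandidateTier; text: string }[];
}

// Case id -> score from the eval run (0..100, assumed).
type ScoreMap = Record<string, number>;

// Which candidate tiers a gold case still lacks a sample transcript for.
function missingTiers(goldCase: GoldCase): CandidateTier[] {
  const required: CandidateTier[] = ["strong", "mid", "weak", "adversarial"];
  return required.filter((tier) => !goldCase.transcripts.some((t) => t.tier === tier));
}

// Case ids whose score dropped by more than `tolerance` since the baseline run.
function findRegressions(baseline: ScoreMap, current: ScoreMap, tolerance = 2): string[] {
  return Object.keys(baseline).filter((id) => {
    const before = baseline[id];
    const after = current[id];
    if (before === undefined) return false;
    if (after === undefined) return true; // case no longer scored: treat as a regression
    return before - after > tolerance;
  });
}
```

Running findRegressions on every generator-prompt change, against the last blessed baseline, is the mechanical form of "catch regressions before they ship".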

6. Decisions to make together

Decision | Options
First fixes to ship (pick 2–3) | ① comm-style source bias (build-alt) · ② memory dedupe + cap · ③ planner bucket coverage from job+org · ④ persona background summary
Gating policy | What blocks a release: L1 only? L1 + L2 soft? L1 hard + L2/L3 advisory?
Adopt #155 taxonomy as the L2/L3 rubric | Yes / yes-but-trim to a v0 subset / defer
Ownership & cadence | Who owns the gold set + graders; per-PR vs nightly vs on-demand
Candidate simulator for L3 | Build now / stub for later / start with curated fixed transcripts

7. Suggested first two weeks