Prescreen Prompt Quality & a Systemic Eval Framework
From "fix this one prompt" → "automatically measure every prompt our workflows generate."
Discussion doc for the team · grounded in a live config (Thumpn / Performance Marketing Lead / Meghana) vs a hand-written ideal · pairs with the eval taxonomy in PR #155
The one thing to take away: the interview agent prompt is already assembled from a fixed template (interviewer.md) whose structure is identical to a hand-written "ideal" prompt, with the same <role>, <persona>, <communication_style>, <interview_overview>, <conversation_flow>, evaluation, guardrails, and anti-AI blocks. The scaffold is a ~100% match, so the entire quality gap lives in the variable content the three generator workflows fill in. The systemic problem is therefore not the template; it is that we have no automated way to measure what each generator produces.
1. How the final prompt is built
Four workflows produce content; a compiler slots it into the fixed template at interview runtime.
weakest: Learned from LinkedIn posts → sounds like a job ad
These are directional scores for one config, to show where the leverage is — not a metric. Building the real metric is the point of section 4.
3. Where each workflow falls short (and the fix)
build-alt (biggest lever)
Comm-style learned from broadcast posts. Signature phrases come out as HIRING:, Apply Now:, hashtag blocks — the agent talks like a job ad, not a 1:1 interviewer. Fix: weight conversational sources (chat exports, call transcripts) over posts, or add a "post-voice → chat-voice" transform.
Persona too abstract. Jumps to OCEAN + vague labels ("People With Standards"); missing the concrete background ("built Insider IPs, 2,500+ events, Arijit UK stadium"). Fix: synthesize a Professional Background summary into persona.
Memory bloat. 86 blocks with near-duplicates (3 overlapping company narratives, duplicate AAM 10x). Fix: dedupe/merge + cap (~20–25) + rank by interview relevance.
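The dedupe + cap + rank fix for memory blocks can be sketched as below. This is a minimal illustration, not the build-alt implementation: the `MemoryBlock` shape, the `relevance` score, the Jaccard word-overlap measure, and the threshold are all assumptions, and the sample texts in any usage would be made up.

```typescript
// Hypothetical shape; the real memory-block type in build-alt may differ.
interface MemoryBlock {
  id: string;
  text: string;
  relevance: number; // assumed interview-relevance score (higher = more relevant)
}

// Lowercased word set, used for a crude near-duplicate measure.
function tokenSet(text: string): Set<string> {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return new Set<string>(words);
}

// Jaccard similarity: shared words divided by total distinct words.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((w) => {
    if (b.has(w)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// Rank by relevance, drop blocks too similar to one already kept, stop at the cap.
function dedupeAndCap(blocks: MemoryBlock[], threshold: number, cap: number): MemoryBlock[] {
  const ranked = blocks.slice().sort((x, y) => y.relevance - x.relevance);
  const kept: { block: MemoryBlock; tokens: Set<string> }[] = [];
  for (const block of ranked) {
    if (kept.length >= cap) break;
    const tokens = tokenSet(block.text);
    const isDuplicate = kept.some((k) => jaccard(k.tokens, tokens) >= threshold);
    if (!isDuplicate) kept.push({ block, tokens });
  }
  return kept.map((k) => k.block);
}
```

Because blocks are ranked before deduping, the most relevant copy of a near-duplicate group is the one that survives. Word overlap is a stand-in; an embedding-based similarity could replace `jaccard` without changing the rest.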
create-job (strongest)
JD + requirements are already close to the ideal. Smallest gap.
One systemic win: emit the role's non-negotiables (CAC/LTV, experimentation, hands-on, ambiguity) as a structured handoff the interview planner consumes — so buckets map 1:1 to must-haves.
create-interview (content gaps)
Dropped the core bucket. Plan double-covers "build from zero" and omits Retention / Discovery-habit — the actual business moat. No dedicated unit-economics bucket either. Fix: feed org strategic theses + job non-negotiables into the planner so the key dimension can't be missed.
Scripted, not adaptive. Outputs one verbatim question per bucket; the ideal defines territory + signals + per-candidate entry points. Fix: change planner output contract to territories, not fixed questions.
Under-fed planner (known: "didn't get job input", "no web research").
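One way to picture the structured handoff from create-job and the "territories, not fixed questions" planner contract is sketched below. Every type and function name here is hypothetical; the sketch only shows how a coverage report could flag a dropped or double-covered non-negotiable.

```typescript
// All names are hypothetical; they illustrate a possible contract only.
interface NonNegotiable {
  id: string;
  label: string;
}

// What create-job could emit for the planner to consume.
interface JobHandoff {
  jobId: string;
  nonNegotiables: NonNegotiable[];
}

// What the planner could emit instead of one verbatim question per bucket.
interface Territory {
  name: string;
  covers: string[]; // ids of the non-negotiables this territory probes
  signals: string[]; // what strong evidence sounds like
  entryPoints: string[]; // per-candidate openers, filled in later
}

// Reports must-haves that no territory covers, and those covered more than once.
function coverageReport(
  handoff: JobHandoff,
  plan: Territory[]
): { uncovered: string[]; doubleCovered: string[] } {
  const counts = new Map<string, number>();
  plan.forEach((t) =>
    t.covers.forEach((id) => counts.set(id, (counts.get(id) || 0) + 1))
  );
  const uncovered: string[] = [];
  const doubleCovered: string[] = [];
  handoff.nonNegotiables.forEach((n) => {
    const c = counts.get(n.id) || 0;
    if (c === 0) uncovered.push(n.id);
    if (c > 1) doubleCovered.push(n.id);
  });
  return { uncovered, doubleCovered };
}
```

A non-empty `uncovered` list is the "dropped the core bucket" failure expressed as data, which makes it checkable before the interview ever runs.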
4. The systemic shift: eval the generators, not the prompt
One hand-fixed prompt does not scale across personas, companies, and roles. We need to score every generated artifact automatically. Proposed three layers — cheapest/fastest first.
L1 Deterministic generation checks
Per workflow step. Cheap, fast, runs in CI on every generation. Pass/fail.
Context completeness — did the step receive all required inputs? (e.g. planner got job spec + org theses + persona)
Hygiene — block-count cap, no duplicate / empty memory blocks, OCEAN present, buckets ≥ N, rubric present
Substrate: final-prompt-components.ts already segments the prompt by source, which makes it a natural input for these checks.
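The L1 checks listed above could look roughly like this. It is a sketch under assumptions: `GenerationSnapshot` is a hypothetical shape, not what final-prompt-components.ts actually exports, and the cap and minimum-bucket values are parameters rather than agreed numbers.

```typescript
// Hypothetical snapshot of one generation run; field names are assumptions.
interface GenerationSnapshot {
  plannerInputs: { jobSpec?: string; orgTheses?: string; persona?: string };
  memoryBlocks: string[];
  hasOcean: boolean;
  bucketCount: number;
  hasRubric: boolean;
}

interface L1Limits {
  maxMemoryBlocks: number;
  minBuckets: number;
}

// Returns a list of failure messages; an empty list means the generation passes L1.
function runL1Checks(snapshot: GenerationSnapshot, limits: L1Limits): string[] {
  const failures: string[] = [];

  // Context completeness: did the planner receive every required input?
  const inputs = snapshot.plannerInputs;
  if (!inputs.jobSpec) failures.push("planner missing job spec");
  if (!inputs.orgTheses) failures.push("planner missing org theses");
  if (!inputs.persona) failures.push("planner missing persona");

  // Hygiene: block-count cap, then empty and exact-duplicate memory blocks.
  if (snapshot.memoryBlocks.length > limits.maxMemoryBlocks) {
    failures.push("memory block count over cap");
  }
  const seen = new Set<string>();
  let hasEmpty = false;
  let hasDuplicate = false;
  for (const raw of snapshot.memoryBlocks) {
    const key = raw.trim().toLowerCase();
    if (key === "") {
      hasEmpty = true;
      continue;
    }
    if (seen.has(key)) hasDuplicate = true;
    seen.add(key);
  }
  if (hasEmpty) failures.push("empty memory block");
  if (hasDuplicate) failures.push("duplicate memory block");

  if (!snapshot.hasOcean) failures.push("OCEAN missing");
  if (snapshot.bucketCount < limits.minBuckets) failures.push("fewer buckets than minimum");
  if (!snapshot.hasRubric) failures.push("rubric missing");
  return failures;
}
```

Returning messages instead of a bare boolean keeps the pass/fail semantics (empty list = pass) while telling the owner of the failing workflow step what to fix.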
L2 LLM-judge on the final prompt
Static checks on the assembled prompt. Scored 0–100 per dimension. Runs on every prompt.
AI-tell scan (em dashes, triplets, buzzwords)
Rudeness / tone & greeting handling
Voice match to source persona
Bucket coverage vs job non-negotiables
Redundancy / verbosity of memory blocks
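Part of the AI-tell dimension can be approximated without a model call. The sketch below is not the LLM judge itself; it is a hypothetical regex pre-scan that could feed or sanity-check the judge's score. The buzzword list is illustrative, and the triplet pattern only catches the simple "A, B, and C" form.

```typescript
interface AiTellReport {
  emDashes: number;
  triplets: number;
  buzzwords: string[];
}

// Illustrative list only; a real list would be agreed with the rubric owners.
const SAMPLE_BUZZWORDS: string[] = ["leverage", "synergy", "delve", "robust"];

// Counts non-overlapping matches of a global regex.
function countMatches(text: string, pattern: RegExp): number {
  const found = text.match(pattern);
  return found ? found.length : 0;
}

function scanAiTells(prompt: string, buzzwords: string[]): AiTellReport {
  const lower = prompt.toLowerCase();
  return {
    // \u2014 is the em dash character.
    emDashes: countMatches(prompt, /\u2014/g),
    // Simple "A, B, and C" lists of single words (Oxford comma optional).
    triplets: countMatches(prompt, /\b\w+, \w+,? and \w+\b/g),
    // Substring match, so "robust" also flags "robustly".
    buzzwords: buzzwords.filter((w) => lower.indexOf(w.toLowerCase()) !== -1),
  };
}

const report = scanAiTells("We delve into growth.", SAMPLE_BUZZWORDS);
console.log(report.buzzwords);
```

Raw counts are kept separate from any 0–100 score on purpose: how counts map to a score is a rubric decision, not something this scan should hard-code.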
L3 Behavioral / transcript evals
Run a candidate simulator against the assembled agent; grade the transcript. The ground truth — slowest/costliest.
Graded on the PR #155 taxonomy: 9 pillars × 10 gestalts (Authenticity, Specificity, Rigor, Respect, Resilience…)
Catches what static checks can't: does it actually push back, adapt, stay in character, resist jailbreaks?
The open question we converged on: are L1 + L2 enough to be confident, or do we need L3 every time? Proposal to debate → L1 gates every generation (CI, blocking). L2 runs on every prompt (scored, soft-gate). L3 runs sampled / nightly + on any taxonomy or generator-prompt change. We use L3 to validate that L1 + L2 correlate with real behavior; if they do, we lean on the cheap layers day-to-day.
5. How this fits with PR #155 + what an eval set looks like
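The proposed gating can be written down as a small decision function. This is a sketch of the proposal as stated, not an agreed policy: the type names, the trigger names, and the soft-gate threshold are assumptions.

```typescript
// Hypothetical result shapes for the two cheap layers.
interface LayerResults {
  l1Failures: string[]; // empty means L1 passed
  l2Scores: { dimension: string; score: number }[]; // each score is 0..100
}

interface GateDecision {
  status: "block" | "warn" | "pass";
  reasons: string[];
}

// L1 is a hard gate; L2 is a soft gate that warns below a threshold.
function decideGate(results: LayerResults, l2SoftThreshold: number): GateDecision {
  if (results.l1Failures.length > 0) {
    return { status: "block", reasons: results.l1Failures.slice() };
  }
  const low = results.l2Scores.filter((s) => s.score < l2SoftThreshold);
  if (low.length > 0) {
    return {
      status: "warn",
      reasons: low.map((s) => s.dimension + " scored " + s.score + " (< " + l2SoftThreshold + ")"),
    };
  }
  return { status: "pass", reasons: [] };
}

type EvalTrigger = "generation" | "nightly" | "taxonomy-change" | "generator-prompt-change";

// L3 runs on nightly and change triggers, plus any generation picked by sampling.
function shouldRunL3(trigger: EvalTrigger, sampled: boolean): boolean {
  return trigger !== "generation" || sampled;
}
```

Keeping the L2 threshold as a parameter leaves room to tighten it once L3 has shown how well the cheap scores track transcript quality.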
Two pieces that snap together
Piece | What it is
PR #155 taxonomy | The rubric: "what good looks like" (9 pillars, 10 gestalts). Powers L2 & L3 scoring.
Component breakdown | The harness input: final-prompt-components.ts splits the prompt by source. Powers L1.
This doc | The runner/gating direction that connects them (the layered model).
#155 is the taxonomy (intentionally not the framework). It's detailed; we adopt it as the L2/L3 rubric and keep L1 lean.
The eval set (so we fix the system, not one alt)
3–5 gold cases: persona × company × role (Meghana/Thumpn, Shreyas/…, +others)
Each has: blessed reference prompt + a few sample candidate transcripts (strong / mid / weak / adversarial)
Persona-generic by design (matches #155's genericity governance) so it scales to new personas via overlays
Re-run on every generator-prompt change → catch regressions before they ship
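A gold case and the regression re-run could be modeled as below. The field names, the single per-case score, and the tolerance are assumptions for illustration; the actual scoring would come from the #155 rubric.

```typescript
type CandidateTier = "strong" | "mid" | "weak" | "adversarial";

// Hypothetical shape for one gold case: persona × company × role.
interface GoldCase {
  id: string;
  persona: string;
  company: string;
  role: string;
  referencePrompt: string; // the blessed reference prompt
  transcripts: { tier: CandidateTier; text: string }[];
}

interface CaseScore {
  caseId: string;
  score: number; // assumed 0..100 aggregate for the case
}

// Lists the candidate tiers a gold case has no sample transcript for.
function missingTiers(goldCase: GoldCase): CandidateTier[] {
  const all: CandidateTier[] = ["strong", "mid", "weak", "adversarial"];
  return all.filter((tier) => !goldCase.transcripts.some((t) => t.tier === tier));
}

// Flags cases whose score dropped by more than the tolerance since the baseline.
function findRegressions(baseline: CaseScore[], current: CaseScore[], tolerance: number): string[] {
  const before = new Map<string, number>();
  baseline.forEach((b) => before.set(b.caseId, b.score));
  const regressed: string[] = [];
  current.forEach((c) => {
    const prev = before.get(c.caseId);
    if (prev !== undefined && c.score < prev - tolerance) regressed.push(c.caseId);
  });
  return regressed;
}
```

Cases with no baseline are skipped rather than flagged, so adding a new gold case does not fail the first run after it lands.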
6. Decisions to make together
Decision | Options
First fixes to ship (pick 2–3) | ① comm-style source bias (build-alt) · ② memory dedupe + cap · ③ planner bucket coverage from job+org · ④ persona background summary
Gating policy | What blocks a release: L1 only? L1 + L2 soft? L1 hard + L2/L3 advisory?
Adopt #155 taxonomy as the L2/L3 rubric | Yes / yes-but-trim to a v0 subset / defer
Ownership & cadence | Who owns the gold set + graders; per-PR vs nightly vs on-demand
Candidate simulator for L3 | Build now / stub for later / start with curated fixed transcripts
7. Suggested first two weeks
wk 1 · Stand up L1 deterministic checks on top of final-prompt-components.ts (context completeness + structure + hygiene). Blocking in CI. Cheapest, immediate value.
wk 1 · Land the two highest-ROI generator fixes: comm-style source bias + memory dedupe/cap (build-alt).
wk 2 · Build the gold eval set (3 cases) + a first L2 grader (AI-tell, voice-match, bucket-coverage) using the #155 rubric subset.
wk 2 · Run L3 once manually on the 3 cases to check whether L1+L2 correlate with transcript quality → decide the gating policy from evidence.
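The correlation check in the last step could start as a plain Pearson coefficient between the cheap-layer score and the L3 transcript score per case. This is a sketch with hypothetical names; with only 3 gold cases the coefficient is very noisy, so it is a directional signal at best and would need more cases before it can justify a gating policy.

```typescript
// One gold case scored by both the cheap layer (L2) and the behavioral layer (L3).
interface ScorePair {
  caseId: string;
  l2: number;
  l3: number;
}

// Pearson correlation between L2 and L3 scores; NaN when either side has no variance.
function pearson(pairs: ScorePair[]): number {
  if (pairs.length < 2) throw new Error("need at least two cases");
  const n = pairs.length;
  let sumX = 0;
  let sumY = 0;
  pairs.forEach((p) => {
    sumX += p.l2;
    sumY += p.l3;
  });
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  pairs.forEach((p) => {
    const dx = p.l2 - meanX;
    const dy = p.l3 - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  });
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}
```

A rank-based measure such as Spearman may suit rubric scores better than Pearson; it is the same idea applied to ranks instead of raw values.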