Plan · alt.inc prescreening

Autoresearch for the Founder-Alt agent

Let an agent rewrite the interview prompt overnight and keep only what scores higher. Start with one facet, expand to the whole experience.

Core idea → we fix the score, the agent climbs it. Start narrow, expand after each win.

How it works — the core loop

From a real conversation to a better prompt. Example: the Thumpn growth-lead interview (agent = "Meghna").

1. Talk to the alt → generate a transcript
   Meghna: "Walk me through what you built for acquisition at Finyman."
   Madhur: "Cold calling, Meta + Google ads, referrals, events..."
   → full interview saved as one transcript
2. Annotate which responses were good vs bad
   👍 good → "That's the list. I want to know what actually happened." (probed)
   👎 bad → over-praising a vague answer instead of pushing
3. Turn the annotations into named metrics
   # what separated good from bad:
   Strictness · Politeness · Probing depth · Persona fidelity · Honesty
4. Write a 0–5 rubric per metric → the judge
   Strictness: 0 = accepts vague answers · 5 = always demands a number or example
   judge LLM scores every transcript on each metric
5. Autoresearch: edit prompt → replay → score → keep if better
   + "Demand a concrete number before moving on."
   Strictness 3 → 4 · composite 3.8 → 4.2 · kept (else revert + log)
6. Ship the winning prompt
   promote behind a flag → A/B on live interviews → keep if it wins
step 5 runs ~100× overnight → wake up to a ranked log + the winning prompt
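A minimal sketch of the keep-if-better loop, to make the control flow concrete. Everything here is an assumption for illustration: the composite is taken as an unweighted mean of the five metrics (the plan does not say how metrics are combined), and `propose_edit` and `replay_and_judge` are hypothetical callables standing in for the editing agent and the replay rig plus judge LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Metric names taken from the annotation step; the keys themselves are assumed.
METRICS = ["strictness", "politeness", "probing_depth", "persona_fidelity", "honesty"]

Scores = Dict[str, float]  # metric name -> 0-5 rubric score for one transcript


def composite(per_transcript: List[Scores]) -> float:
    """Unweighted mean over all metrics and transcripts (assumed aggregation)."""
    total = sum(scores[m] for scores in per_transcript for m in METRICS)
    return total / (len(per_transcript) * len(METRICS))


@dataclass
class LogEntry:
    edit: str
    score: float
    kept: bool


def autoresearch(
    prompt: str,
    propose_edit: Callable[[str], str],               # hypothetical: the editing agent
    replay_and_judge: Callable[[str], List[Scores]],  # hypothetical: replay rig + judge
    iterations: int = 100,
) -> Tuple[str, float, List[LogEntry]]:
    """Try one edit per iteration; keep it only if the composite strictly improves."""
    best_prompt = prompt
    best_score = composite(replay_and_judge(prompt))
    log: List[LogEntry] = []
    for _ in range(iterations):
        edit = propose_edit(best_prompt)
        candidate = best_prompt + "\n" + edit
        score = composite(replay_and_judge(candidate))
        kept = score > best_score
        log.append(LogEntry(edit, score, kept))  # rejected edits are logged too
        if kept:
            best_prompt, best_score = candidate, score
        # otherwise: revert, i.e. keep editing from best_prompt
    log.sort(key=lambda entry: entry.score, reverse=True)  # ranked log
    return best_prompt, best_score, log


# Stub judge for a dry run: rewards prompts that ask for a concrete number.
def fake_judge(prompt: str) -> List[Scores]:
    strictness = 4.0 if "concrete number" in prompt else 3.0
    return [{**{m: 4.0 for m in METRICS}, "strictness": strictness}]


edits = iter(["Praise every answer warmly.", "Demand a concrete number before moving on."])
winner, score, ranked = autoresearch(
    "You are Meghna, interviewing a growth lead.",
    propose_edit=lambda _prompt: next(edits),
    replay_and_judge=fake_judge,
    iterations=2,
)
```

With this stub judge the first edit is reverted and the second is kept. In a real run the stub would be replaced by the replay rig and the judge model, and the ranked log is what the morning spot-check would read.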
That's the loop. Phases 0–1 build it; Phases 2–3 widen it to the full interview. Here's the rollout ↓

Rollout — week by week

What we'll do | How | Why

Phase 0 · Build the eval harness · Week 1
Collect transcripts | Pull ~30 real interviews, scrub PII, tag by candidate type | Realistic, varied cases to score against
Define the score | Lock metrics with Kinnari, write a 0–5 rubric | One number the loop can optimize
Build rig + judge | Replay answers to the agent; a strong model grades output | Automates scoring so we can run it hundreds of times

Phase 1 · Optimize one facet · Weeks 2–3
Scope to persona | Freeze the whole prompt except the persona section | Isolate one variable so gains are clear
Run the loop | Edit → replay → judge → keep if better, ~100× overnight | Finds improvements we wouldn't hand-write
Human gate | Spot-check the top 5 each morning before promoting | Catch metric-gaming before it ships

Phase 2 · Add facets one by one · Weeks 4–6
Expand scope | Unfreeze memory, depth, signal, "don't sound AI" in turn | Compound gains while keeping attribution
Upgrade optimizer | Move to DSPy/MIPRO once many sections tune together | Smarter search when many parts move
Guard regressions | Re-score on a frozen eval set, block any drop | Protect wins already banked

Phase 3 · Optimize the full interview · Week 7+
Go end-to-end | Optimize the whole-interview score, keep prompt variants | Tune the experience, not just parts
Productionize | Nightly report; flag-gated A/B on live interviews | Continuous improvement with a safety net
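Two of the rollout rows are mechanical enough to sketch: "Scope to persona" (only unfrozen sections may be edited) and "Guard regressions" (block a candidate if any metric drops on the frozen eval set). This is a rough illustration under assumptions: the prompt is modeled as a dict of named sections, the section names and scores are made up, and `apply_edit`, `regressions`, and `may_promote` are hypothetical helpers, not an existing API.

```python
from typing import Dict, List, Set


def apply_edit(
    sections: Dict[str, str], name: str, new_text: str, unfrozen: Set[str]
) -> Dict[str, str]:
    """Return a copy of the prompt with one section replaced.

    Raises KeyError for an unknown section and ValueError for a frozen one,
    so the optimizer cannot touch anything outside the current scope.
    """
    if name not in sections:
        raise KeyError(f"unknown section: {name}")
    if name not in unfrozen:
        raise ValueError(f"section '{name}' is frozen in this phase")
    updated = dict(sections)
    updated[name] = new_text
    return updated


def regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> List[str]:
    """Metrics where the candidate falls below the baseline by more than tolerance.

    A metric missing from the candidate counts as a drop.
    """
    return [
        metric
        for metric, base in baseline.items()
        if candidate.get(metric, float("-inf")) < base - tolerance
    ]


def may_promote(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> bool:
    """'Block any drop': promote only if no metric regressed on the frozen set."""
    return not regressions(baseline, candidate, tolerance)


# Illustrative data only: section names and scores are invented for the example.
prompt = {"persona": "You are Meghna.", "memory": "Recall earlier answers."}
phase_1 = apply_edit(prompt, "persona", "You are Meghna, a direct founder.", {"persona"})

banked = {"strictness": 4.0, "politeness": 4.2, "persona_fidelity": 3.9}
trial = {"strictness": 4.3, "politeness": 3.8, "persona_fidelity": 3.9}
dropped = regressions(banked, trial)  # politeness fell, so this trial is blocked
```

The default tolerance of 0 mirrors "block any drop" literally. If judge scores jitter between runs, a small positive tolerance may be more practical, which is a judgment call this sketch leaves open.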
Model → agent on a direct LLM API (no Letta): iterate on Claude/GPT, port to a cheap OSS model later to cut cost. Judge stays on a strong model — a noisy judge breaks the loop.
Reference papers (the techniques behind each phase)