Plan · alt.inc prescreening

Autoresearch for the Founder-Alt agent

Let an agent rewrite the interview prompt overnight and keep only what scores higher. Start with one facet, expand to the whole experience.

Core idea → we lock the score, the agent climbs it. Start narrow, expand after each win.

How it works — the core loop

From a real conversation to a better prompt. Example: the Thumpn growth-lead interview (agent = "Meghna").

Talk to the alt → generate a transcript

Meghna: "Walk me through what you built for acquisition at Finyman."
Madhur: "Cold calling, Meta + Google ads, referrals, events..."

→ full interview saved as one transcript

Annotate which responses were good vs bad

👍 good → "That's the list. I want to know what actually happened." (probed)
👎 bad → over-praising a vague answer instead of pushing

Turn the annotations into named metrics

# what separated good from bad:
Strictness · Politeness · Probing depth · Persona fidelity · Honesty

Write a 0–5 rubric per metric → the judge

Strictness — 0 = accepts vague answers · 5 = always demands a number or example

judge LLM scores every transcript on each metric
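
A minimal sketch of how per-metric judge scores could roll up into one composite. The plan does not say how the composite is computed, so the unweighted mean here is an assumption, as are the names `Rubric`, `render_for_judge`, and `composite`. Only the Strictness anchors come from the plan; the scores in the usage note are made up.

```python
from dataclasses import dataclass
from statistics import mean

# The five metrics named in the plan, as dictionary keys.
METRICS = ["strictness", "politeness", "probing_depth", "persona_fidelity", "honesty"]


@dataclass(frozen=True)
class Rubric:
    """One 0-5 rubric: what a 0 looks like and what a 5 looks like."""
    metric: str
    anchor_0: str
    anchor_5: str

    def render_for_judge(self) -> str:
        # Text that could be pasted into the judge prompt for this metric.
        return (
            f"Score {self.metric} from 0 to 5. "
            f"0 = {self.anchor_0}. 5 = {self.anchor_5}. "
            "Reply with a single integer."
        )


# Anchors taken from the Strictness example in the plan.
STRICTNESS = Rubric(
    metric="strictness",
    anchor_0="accepts vague answers",
    anchor_5="always demands a number or example",
)


def composite(scores: dict[str, int]) -> float:
    """Unweighted mean of the per-metric scores (an assumption, not the plan's formula)."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    for metric in METRICS:
        if not 0 <= scores[metric] <= 5:
            raise ValueError(f"{metric} must be between 0 and 5, got {scores[metric]}")
    return mean(scores[m] for m in METRICS)
```

Usage, with illustrative scores: `composite({"strictness": 3, "politeness": 4, "probing_depth": 4, "persona_fidelity": 4, "honesty": 4})` averages the five values. If some metrics matter more than others, a weighted mean would be the natural swap.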

Autoresearch: edit prompt → replay → score → keep if better

+ "Demand a concrete number before moving on."

Strictness 3 → 4 · composite 3.8 → 4.2 · result: kept (otherwise revert + log)
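
The edit → replay → score → keep-if-better step can be sketched as a simple hill climb. This is a toy under stated assumptions: `toy_propose` and `toy_evaluate` are hypothetical stand-ins for the LLM that proposes edits and for the replay rig plus judge, and the scores they produce are invented for illustration.

```python
import random
from typing import Callable

# Hypothetical candidate edits; in the real loop an LLM would propose these.
CANDIDATE_EDITS = [
    "Demand a concrete number before moving on.",
    "Praise every answer warmly.",
    "Ask one follow-up when an answer is a list without outcomes.",
]


def toy_propose(prompt: str, rng: random.Random) -> str:
    """Stand-in for the editing agent: append one random candidate edit."""
    return prompt + "\n" + rng.choice(CANDIDATE_EDITS)


def toy_evaluate(prompt: str) -> float:
    """Stand-in for replay + judge: a made-up composite on the 0-5 scale."""
    score = 3.0
    if "concrete number" in prompt:
        score += 0.5
    if "follow-up" in prompt:
        score += 0.5
    if "Praise every answer" in prompt:
        score -= 1.0
    return max(0.0, min(5.0, score))


def autoresearch(
    base_prompt: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    iterations: int = 100,
    seed: int = 0,
) -> tuple[str, float, list[dict]]:
    """Keep a candidate only if it scores strictly higher; otherwise revert and log."""
    rng = random.Random(seed)
    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    log = []
    for step in range(iterations):
        candidate = propose(best_prompt, rng)
        score = evaluate(candidate)
        kept = score > best_score
        log.append({
            "step": step,
            "edit": candidate[len(best_prompt):].strip(),
            "score": score,
            "kept": kept,
        })
        if kept:
            best_prompt, best_score = candidate, score
        # If not kept, best_prompt is unchanged, which is the "revert".
    ranked = sorted(log, key=lambda entry: entry["score"], reverse=True)
    return best_prompt, best_score, ranked
```

Calling `autoresearch("You are Meghna.", toy_propose, toy_evaluate)` returns the best prompt found, its score, and the log ranked by score. The strict `>` comparison means ties are reverted, which keeps the prompt from growing with edits that do not help.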

Ship the winning prompt

promote behind a flag → A/B on live interviews → keep if it wins
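
One way the flag-gated A/B split might work is deterministic hashing of an interview ID, so the same interview always lands in the same arm. The function name, the `rollout_fraction` parameter, and the choice of SHA-256 are assumptions for illustration, not part of the plan.

```python
import hashlib


def assign_variant(interview_id: str, rollout_fraction: float, flag_enabled: bool = True) -> str:
    """Return "candidate" or "control" for an interview.

    The flag is the kill switch: when it is off, everyone gets the control prompt.
    """
    if not flag_enabled or rollout_fraction <= 0:
        return "control"
    digest = hashlib.sha256(interview_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < rollout_fraction else "control"
```

For example, `assign_variant("interview-123", 0.5)` sends roughly half of all interview IDs to the winning prompt, and turning the flag off returns every interview to the control prompt without a redeploy.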

↺ Step 5 (the autoresearch loop) runs ~100× overnight → wake up to a ranked log + the winning prompt

That's the loop. Phases 0–1 build it; Phases 2–3 widen it to the full interview. Here's the rollout ↓

Rollout — week by week

What we'll do	How	Why
Phase 0 · Build the eval harness · Week 1
Collect transcripts	Pull ~30 real interviews, scrub PII, tag by candidate type	Realistic, varied cases to score against
Define the score	Lock metrics with Kinnari, write a 0–5 rubric	One number the loop can optimize
Build rig + judge	Replay answers to the agent; a strong model grades output	Automates scoring so we can run it hundreds of times
Phase 1 · Optimize one facet · Weeks 2–3
Scope to persona	Freeze the whole prompt except the persona section	Isolate one variable so gains are clear
Run the loop	Edit → replay → judge → keep if better, ~100× overnight	Finds improvements we wouldn't hand-write
Human gate	Spot-check the top 5 each morning before promoting	Catch metric-gaming before it ships
Phase 2 · Add facets one by one · Weeks 4–6
Expand scope	Unfreeze memory, depth, signal, "don't sound AI" in turn	Compound gains while keeping attribution
Upgrade optimizer	Move to DSPy/MIPRO once many sections tune together	Smarter search when many parts move
Guard regressions	Re-score on a frozen eval set, block any drop	Protect wins already banked
Phase 3 · Optimize the full interview · Week 7+
Go end-to-end	Optimize the whole-interview score, keep prompt variants	Tune the experience, not just parts
Productionize	Nightly report; flag-gated A/B on live interviews	Continuous improvement with a safety net
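
The "Guard regressions" row could be implemented as a per-metric comparison on the frozen eval set. A minimal sketch, assuming scores arrive as per-metric averages; the function names and the `tolerance` parameter are hypothetical, and the numbers in the usage note are illustrative.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {metric: delta} for every metric that dropped by more than `tolerance`."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must score the same metrics")
    return {
        metric: candidate[metric] - baseline[metric]
        for metric in baseline
        if candidate[metric] < baseline[metric] - tolerance
    }


def can_promote(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Block promotion if any metric regressed on the frozen eval set."""
    return not find_regressions(baseline, candidate, tolerance)
```

With a baseline of `{"strictness": 4.0, "politeness": 4.2}` and a candidate of `{"strictness": 4.3, "politeness": 3.9}`, `can_promote` returns `False` because politeness dropped, even though strictness improved. A small nonzero `tolerance` may be worth considering if the judge's scores vary from run to run.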

Model → the agent runs on a direct LLM API (no Letta). Iterate on Claude/GPT first, then port to a cheaper OSS model to cut cost. The judge stays on a strong model, because a noisy judge breaks the loop.

Reference papers (the techniques behind each phase)

autoresearch (Karpathy) · the overnight loop we're porting
OPRO (DeepMind) · "show scores, ask for better prompt" (Phase 1)
DSPy (Stanford) · prompts as compilable parameters
MIPROv2 (Stanford) · multi-section optimizer (Phase 2)
TextGrad (Stanford) · attribute regressions per section
Promptbreeder (DeepMind) · evolve prompts + mutators
LLM-as-a-Judge · judge bias failure modes
G-Eval (Microsoft) · rubric scoring for the judge
Demystifying Evals (Anthropic) · grading discipline
Character-LLM · persona eval protocol
PersonaGym · persona-faithfulness metrics
Reflexion · the "lessons learned" file