PassLane Coaching Engine — R&D Build Spec (Run 4)
The most optimized, efficient coaching engine · code-verified · plan only · 2026-06-19

THE PassLane Coaching Engine — R&D Build Spec (Run 4 — FINAL)

Principal synthesis. Plan only — no code. Every load-bearing primitive re-verified against /Users/arizona/CLAUDE CODE/passlane/app/index.html and questions.json this session. Honors the run-3 floor; integrates all five facet teams' designs and the full adversarial review.

Verification honesty note (correcting the prior draft). The Run-3 draft asserted "all facts verified" and then shipped three load-bearing errors. They are corrected here against the actual source, and a builder must treat the corrected facts as canonical:
- WEAK_MIN_ATTEMPTS does not exist. The only weakness constant is WEAK_THRESHOLD=70 (L1738). weakCategories (L1994–2008) filters acc < WEAK_THRESHOLD (L2006) with no minimum-attempts floor — a category with one wrong answer computes acc=0% and enters the weak set instantly. The honesty gate the design depends on is net-new code, not a reuse. (Closed in §3.)
- Audio is a remote, metered, network layer — not bundled/offline/sunk. Clips stream from AUDIO_CDN='https://passlane-5jv.pages.dev' (L4270), clipSrc${AUDIO_CDN}/audio/${name}.mp3 (L4293), gated by a fetched manifest audio/states-manifest.jsonaudioLib.ids (L4275–4279). The AZ -a answer-clip coverage is in fact complete today (323/323 present in the manifest, so the "45%/147" figure is itself inaccurate), but the engine is multi-state and hasClip(${q.id}-a)-gated per question (L2888), so the spec keeps the per-question branch. (Corrected in §5/§6.)
- warmthTail fires on STRICT equality: missStreak === 3 and rightStreak === 8 (L2875–2876), not . The handoff in §5 now keys on === 3. (Closed in §5.)
- Token counts re-measured from questions.json: stripped prefix ≈ 51k tokens, raw ≈ 73k (the draft's 43.7k/56k were ~15–25% low). Caching still survives any plausible floor; only the headline cents move. (Corrected in §6.)
Confirmed-correct anchors: bank 255,739 bytes / 323 questions; advanceSeconds=2 (L1735); LEITNER_INTERVALS (L1736); READINESS_THRESHOLD=70 (L1737); questionMastery (L1972); computeReadiness (L1980); weakCategories (L1994); isDue (L1916); progress[id] (L1887); the two feedback doors submitAnswer (L3006) / revealUnanswered (L2929); finishRound (L3103); finishExam (L3228); buildQueue (L2029, no difficulty param); isPro() reads spoofable localStorage az_pro==='1' (L2148).

1. The Optimal Coaching Model

The optimal coach for PassLane is a deterministic decision table that almost never calls a model — a pure on-device function decideMove(ctx) → Move that reads the learner's own already-computed local state (Leitner box via progress[id], questionMastery, computeReadiness, weakCategories, the exact wrong letter at submitAnswer, the live missStreak/rightStreak, isDue backlog, streak) and emits at most one coaching move per moment, rendered into the three surfaces the app already paints (feedback seam, Home readiness panel, round/exam debrief).

This is the most efficient design because the app already is an adaptive tutor: it has a knowledge tracer (the box+accuracy blend in questionMastery), a scheduler (Leitner), a weakness ranker (weakCategories), and a readiness scalar (computeReadiness) — all computed locally, telemetry-free, every render. The coach's intelligence is therefore a read/derive over existing primitives, not a parallel brain: it runs in <1 ms, offline, at $0/learner identical at 10k / 100k / 1M, and is byte-for-byte removable.

An LLM enters the system at exactly one point — authoring coach-az.json once, offline, by us (the novelty-judged per-distractor layer) — so judgment is paid for a single time in a batch job and amortized across every learner forever. The only metered path is a learner's own live spoken words (Pro voice), which cannot be precomputed and is the smallest, most-gated, most-cappable tail. Honesty is mechanical at every layer — cite the local data or stay silent — which is simultaneously the pedagogy, the trust moat, and the cost ceiling.

One correction to the "reuse only" claim: the design needs exactly two net-new derivations on top of the existing primitives — (a) an attempts-floor wrapper around weakCategories (because the raw function has no floor and will over-claim "weak" off a single miss), and (b) an untouchedCats detector (the documented blind spot). Both are folded into a single shared pool walk (§3) so the "no parallel pass" principle holds. Everything else is genuine reuse.


2. Coaching Policy & Move Catalog

The brain is a priority cascade, first-match-wins, one move per event, default silence. It fires at exactly three legal surfaces and nowhere else (never during reading/listening — it would fight TTS/mic; never in exam). The cascade is split by entry point because the two feedback doors carry different state.

Signal vector (all local, all already computed)

correct, letter, q.correct, q.id, q.category
Source (verified) submitAnswer L3006
Meaning the hinge; the ONLY site the wrong distractor letter exists
isBlank
Source (verified) revealUnanswered L2929 (no letter in scope)
Meaning didn't engage ≠ got it wrong
mastery = questionMastery(id)
Source (verified) L1972
Meaning per-item 0..1; bands drive teach/drill/leave
box, incorrect, consecutiveUnknown
Source (verified) progress[id] L1887
Meaning shaky vs slammed-to-1 vs blanking
missStreak, rightStreak
Source (verified) session.* L2974–2975, set in warmthTail-adjacent path
Meaning live frustration vs momentum
consecutiveNonAnswers
Source (verified) session L2914 (voice path only)
Meaning hands-free disengagement
dueCount, weakGated, unseenCount
Source (verified) isDue L1916 / gated weakCategories L1994 / untouchedCats (new)
Meaning backlog, worst qualified attempted, blind spot
readiness + history
Source (verified) computeReadiness L1980 + az_coach.rdHist
Meaning whole-mode + delta
streak.*, q.difficulty
Source (verified) L1929 / bank field
Meaning habit timing, scaffold lever

Bands (config constants in COACH_CONFIG, vertical-tunable): SHAKY 0.3–0.6 (teach-pays-most, the 85%-rule zone), FRAGILE <0.3, frustration missStreak ≥ 3, momentum rightStreak ≥ 8. WEAK_MIN_ATTEMPTS = 3 is a net-new constant (see §3) — no "weak" claim renders below it.

The move catalog

0
Move STAY SILENT (~60–70%)
Trigger nothing below fires, or only content would paraphrase q.explanation
Content (deterministic unless noted) provenance chip only
Status at launch LIVE
0b
Move ORIENT (cold-start)
Trigger first ~10 answers of a life, or category attempts < WEAK_MIN_ATTEMPTS
Content (deterministic unless noted) true-at-n=1 framing from pool structure ("First time on Insurable Interest — 6 more in this topic"); claims no mastery
Status at launch LIVE
1
Move MISCONCEPTION REPAIR
Trigger !correct && !isBlank && coachAz[id].distractors[letter] != null
Content (deterministic unless noted) the pre-generated, novelty-passed refutation for that letter; never a runtime LLM call
Status at launch GATED — inert at launch (coach-az.json null-by-default)
2
Move RE-TEACH
Trigger (a) fragile wrong with no repair cell; (b) any blank (the default at the revealUnanswered door — repair is structurally impossible, no letter)
Content (deterministic unless noted) coachAz[id].reteach only if it passed the novelty judge; else silent (no echo)
Status at launch LIVE for blank-routing; prose GATED on content
3
Move SPACED-RECALL NUDGE
Trigger Home, dueCount ≥ 5
Content (deterministic unless noted) "N cards due — a 5-min pass locks them in" → startStudy(null)
Status at launch LIVE
4
Move DRILL THIS TOPIC
Trigger Home weakGated[0] exists; post-answer q.category ∈ weakGated
Content (deterministic unless noted) Phase 1a "Drill your weak topics" → startStudy('weak'); Phase 1b "Insurable Interest — 40% over your last N — drill this topic" → single-category queue
Status at launch LIVE (label gated on engine branch)
5
Move DIFFICULTY GOVERNOR
Trigger frustration missStreak ≥ 3; momentum rightStreak ≥ 8 && readiness ≥ 70
Content (deterministic unless noted) frustration: apply a difficulty ceiling to the next queue (drop difficulty≥4, pad from ≤3) — a real selection change orderForLearning doesn't make. momentum: Home opt-in "want the 12 hardest cards?"
Status at launch LIVE (silent ceiling)
6
Move ENCOURAGE (rationed)
Trigger streak near goal / at risk; whole-mode readiness milestone; first correct after a frustration patch
Content (deterministic unless noted) whole-mode delta only (always true). Per-topic copy branch CUT (the per-cat attribution it needed is cut in §7 — see inefficiency fix).
Status at launch LIVE
7
Move QUIZ NEW
Trigger Home, dueCount < 5 && unseenCount > 0
Content (deterministic unless noted) "N topics you haven't touched" → startStudy(null)
Status at launch LIVE

The decision policy (two feedback doors + Home)

Frequency governor (its own mechanism, not warmthTail reuse): a move fires at most once per N=4 feedback eventsexcept wrong-answer REPAIR, always allowed (the teachable moment is non-negotiable). Across a 10–15-card round the visible coach surfaces 2–3 times. Bookkeeping in az_coach.mute (a field of the single key, §3). Where judgment is needed (per-distractor refutation phrasing) the LLM is used once at authoring time; everywhere else the policy is deterministic.


3. The On-Device Personalization Engine

The personalization brain is a pure function of local state — not a stored or learned model. It derives one typed Intent per moment and reuses the engine's primitives (never a parallel SM-2). Hard rules: reuse questionMastery/isDue/computeReadiness/difficultyScore; wrap (not replace) weakCategories with an attempts floor; the only path into a queue is startStudy/buildQueue; never re-call recordAnswer/recordUnknown (observe box deltas, don't write them).

The honesty gate that does NOT exist today (critical, net-new)

weakCategories (L1994–2008) ranks by acc < WEAK_THRESHOLD with no attempts floor, so a single miss → acc=0% → instant "weak." Every "soft spot / drill this 0%" claim would over-fire off one answer — exactly the over-claim the honesty moat forbids. The fix is a coach-side wrapper, leaving the engine function untouched:

attempts(mode,cat) is computed in the same shared pool walk below (no extra pass). This is net-new code, not a reuse — the spec is explicit about it so a builder doesn't look for a constant that isn't there.

Learner-state model + the ONE new key

Read entirely from existing state; the entire net-new storage is one versioned, fail-safe key (the earlier draft's az_coach_rdx / az_coach_seen are NOT separate keys — they are nested fields of az_coach, resolving the three-key inconsistency the review flagged):

Fail-safe: on load, try{parse}catch → fresh default; then if (v !== COACH_SCHEMA_V || !shapeOK) → freshCoach(). Because the coach is additive, discard-to-default can never destabilize the launch loop — so no migration logic, ever.

Batched writes (no amplification): catMiss and mute mutate in memory per answer (O(1), no I/O) and persist only at round boundaries, piggybacking the existing saveProgress() in startStudy/finishRound. The feedback paint path gains zero synchronous setItem, so "removing the coach can't change timing" is literally true. (Verified: submitAnswer already does two writes/answer via recordAnswersaveProgress and recordStudyActivitysaveStreak; the coach adds none.)

Selection logic — the one core-engine touch

When an Intent routes into study, the only entry is startStudy/buildQueue. buildQueue at L2029 has no difficulty parameter today, so this is a genuine additive object-filter branch (honest scope — not "one line"):

This expresses all three modes through orderForLearning (Leitner bookkeeping + session resets preserved): DRILL single-category = {category}; RECOVER easier = {category, difficultyMax:2}; ADVANCE last-mile = {difficultyMin:4}. String branches ('weak'/'flagged'/null) stay byte-for-byte unchanged. Within any queue, "next item" = lowest questionMastery among due — reused verbatim; the brain never sorts session.queue.

One shared pool walk (folds the duplicate pass the review flagged)

computeReadiness, weakCategories, the attempts floor, untouchedCats, and dueCount all reduce over the same pool. They are computed in one walk per Home render — coachScan(mode) — emitting {readinessPct, weakGated[], untouchedCats[], attempts{}, dueCount} in a single pass:

Cost: one sub-1 ms reduce over 323 items, replacing the prior draft's two passes.

Cold-start (a real n=1 move, not blank provenance)

With empty progress, every data-claim move is evidence-gated and REPAIR is null — so without a floor move, a brand-new user's first session (the highest-intent beat) personalizes nothing. ORIENT is the floor: built from pool counts + isDue + category structure, true at n=1, claims no mastery, teaches the strip affordance once via the seen.firstStrip flag (one lifetime pause). Home at session 0: "Not started — let's find your baseline," one CTA → startStudy(null). No fake diagnostic.

Evidence thresholds (honesty-as-moat, all in COACH_CONFIG): "weak" only at WEAK_MIN_ATTEMPTS ≥ 3 (net-new); "tripped you twice" only at incorrect ≥ 2; a readiness delta only with ≥ 2 in-window snapshots. Below threshold → ORIENT or silent, never a claim. A new vertical tunes via the config pack, no code fork.


4. The Intelligence & Content Pipeline

Three zones push ~99% of coaching value to $0 / 0 ms / offline by exploiting the fixed bank; the live path handles only the one thing that can't be precomputed — a learner's own words.

Zone 0 — the readiness verdict is DETERMINISTIC (no model)

The Pro headline ("are you ready?") is a Zone-0 render, not an Opus call — the aggregation is already done locally (computeReadiness → 0–100, readinessLabel/READINESS_THRESHOLD=70 L1737 buckets it, gated-weakCategories ranks the miss pattern). Template:

≥70 → "Tracking to pass — {weakCat} is the last soft spot ({n}%). Book when you've cleared it." | 45–69 → "Biggest lever: {weakCat} at {n}%. Drilling it moved you +{delta}." | <45 → "Early days — {unseen} topics still untouched."

Free, offline, instant, already correct. Opus is removed from the pipeline entirely — open-ended reasoning over a scalar is the anti-pattern; it earned no justified call. Optional Haiku warmth-phrasing of the pre-computed facts is online+Pro only, fail-closed to the deterministic string.

Zone 1 — the authored layer (offline, one-time), with the pre-filter sizing the paid step

Why two redundancy tests at different stages (the review's fair question): the pre-filter is a cheap deterministic string/concept check run before paying for generation, so the GENERATE step (the only paid step) is sized to survivors, not the full matrix. The post-gen novelty judge is an LLM semantic check that catches paraphrastic echoes the string pre-filter misses. They are complementary, not duplicative — and crucially the pre-filter's purpose is cost sizing, the novelty judge's is final quality. Build instruction: measure the pre-filter kill-rate on a 30-cell sample first; if it kills <60% of distractors, the GENERATE budget must be re-estimated upward before committing (the $0.74 figure assumes a high pre-filter kill-rate). Containment stops fabrication; novelty stops redundancy; a cell ships only if it clears both.

Model routing, WHEN-justified

Authoring coach-az.json (once, by us)
Model Haiku 4.5 + Batch
Why correctness comes from Citations grounding + containment judge + SME, not parameter count; Sonnet/Opus on a gated Batch pass is waste
Every live grounded talk turn (95%+)
Model Haiku 4.5
Why the cited source carries correctness; a down-route rule drops chit-chat over already-cited material back to Haiku
Open-ended multi-turn warmth over the learner's own wording
Model Sonnet 4.6
Why the conversational wedge — warmth is the only thing a bigger model buys here, routed server-side by turn-type
~~End-of-session diagnostic~~
Model ~~Opus~~
Why removed — the verdict is Zone-0 deterministic

Caching, offline, and the data-flow with per-path cost/latency

Cache the whole-bank token-stripped prefix (id+stem+choices+correct+explanation as compact text ≈ 51k tokens measured, ~30% cheaper than raw JSON's ~73k), never per-question (~134 tok can't clear the cache floor). The 51k prefix clears any plausible cache floor (1,024 / 2,048 / 4,096), so caching is robust regardless of the unresolved floor value. TTL: default the 5-min (1.25×) write, escalate to 1-hour only past 2 turns; for a one-shot PTT follow-up, paying uncached input once can win.

Infra minimalism: reuse the existing Worker (worker/src/index.js — bearer-auth + KV rate-limit + CORS + fail-closed); no vector DB — the learner is always on a known id, retrieval is a direct q.explanation lookup, so the bank IS the index. Fail-closed-to-local on timeout: a failed turn costs no budget and degrades to the free vetted explanation.


5. Visibility & Surfacing

PULL by default; the unprompted PUSH is a strict per-surface budget. Lane is silent until the learner's own data crosses a deterministic threshold the flat content provably can't address, and even then spends from a tiny fire-once budget. No new screens — three existing surfaces only.

A. Feedback seam
Anchor (verified) cx-lane sibling, painted in appState==='feedback'
Shows provenance footer + teaching strip + (rarely) ONE in-seam beat
Mode passive + budgeted
B. Readiness panel
Anchor (verified) renderReadiness() L2565/2578, every Home open
Shows the readiness number AS proof — delta sigil + shared "what to do next" + Pro discovery
Mode passive (pull)
C. Round/exam debrief
Anchor (verified) finishRound L3103 / finishExam L3228
Shows ONE earned post-round beat (coverage/repair-recap) + next action
Mode budgeted

The split budget (anti-nag, mechanical): at most ONE in-seam beat (session.cxSeenBeat) AND, independently, ONE round-complete beat (session.cxEndBeat) per round — both in-memory flags reset every startStudy like missStreak. They never collide (one in feedback, one on the debrief). This fixes the single-flag flaw where an early mid-round repair beat silently suppressed the calm end-of-round nudge.

The warmth-clip handoff — corrected for STRICT equality and per-question clip presence. warmthTail plays its lift clip on exactly session.missStreak === 3 (L2875) and _roll on exactly rightStreak === 8 (L2876) — strict equality, not . The handoff is therefore:

Sharper drill trigger: prefer 2 misses within one category this round (from session.byCategory/sessionMissed) over raw missStreak === 3 — decouples from the generic streak clip and yields a better drill prompt.

Cut from v1: the momentum-stretch text beat (overlaps the _roll clip at L2876) and the readiness-milestone in-round beat (a 323-item-pool mean effectively can't cross a label boundary inside one short round).

Transparency is structural, not optional: every proactive beat carries a required one-clause "why" that can only cite local data — "flagging this because your last two Liability answers missed." It's a required field of the beat object: a beat literally cannot render without it. This is the surfacing analog of cite-or-refuse, pinned in the object contract so a future authored line can't silently degrade "your data" → "trust me."

The readiness "proof it works" surface

The readiness number already renders; the only gap is that it's point-in-time. Close it with az_coach.rdHist (a field of the one key, not a separate az_coach_rdx) — a per-mode ring (~5 modes × ~8 snapshots × ~20 bytes ≈ 0.8–1 KB, capped) of {ts, pct}. (Per-category snapshots are not written — §7 cuts per-cat attribution as un-buildable, so the cats:{} field stays empty rather than carry storage nothing consumes.)

Unified throttle key (fixes the double-count the review flagged): snapshots are written in two sitesrenderReadiness() (captures paused/abandoned/exam/reviewMissed sessions a finishRound-only write would drop) and finishExam (mock exams genuinely move mastery via recordAnswer(schedule:false)). Both use ONE idempotency key: dayKey = ${mode}:${UTCDate}` — a snapshot writes only if the ring's latest entry has a different dayKey`. This makes the renderReadiness day-throttle and the finishExam write share one rule, so a round-then-exam on the same day cannot double-count the delta.

The panel shows three additive elements from one shared helper cxReadinessDelta(mode): (1) a delta sigil ▲6/▼3/; (2) the shared nextBestTarget(mode) "what to do next" line — ranks weakest-gated-attempted vs largest-unseen-block by which moves readiness most, so the unseen blind spot closes on both B and C; (3) an honest-floor guard: <2 in-window snapshots → show with "first read today — check back after a session," never a fake ▲0.

Surfacing legality (the autoadvance contract)

advanceSeconds=2 (verified L1735) wipes the feedback card fast, so Phase 1 LEADS with the Home surface (no 2 s guillotine) — provenance chip + Home moves + the silent Move 5 ceiling carry launch value. The passive strip never touches the timer; only an explicit CTA tap calls clearAdvanceTimer. The single sanctioned interrupt is a one-shot pause on the first DATA-bearing card (not the literal first card, which shows only provenance). Feedback-strip prose (Moves 1/2) is a fast-follow gated to dwell (manual/Instant pacing or the first-session pause) — and gated further by the §8 reachability measurement. Accessibility: the passive strip is aria-hidden until engaged (the explanation is already aria-live=polite); only a tapped beat or a focused Home delta announces.


6. Efficiency & Cost Architecture

The cheapest coaching engine is the one that almost never calls a model. Three planes; only the live tail costs money.

Cost-per-learner at scale (arithmetic, re-derived on measured tokens)

Free plane (P0 local strip + P1 authored layer) = $0.00/learner at 10k / 100k / 1M. Both ship inside the binary like questions.json (255,739 bytes, verified) — no network, no model, no per-user server state, no scaling term.

One-time authoring (the only "content" model cost): AZ 323 cells = generate (survivors only) ~$0.37 + novelty/containment judge ~$0.37 ≈ $0.74 one-time, ever (Haiku 4.5 Batch; assumes a high pre-filter kill-rate — re-estimate if the §4 sample shows <60%). Six states ≈ $7.80. The binding cost is not the model — it's SME review of the ~6–26 non-null AZ cells × ~12 min ≈ 1.5–6.5 hrs (~$90–780 AZ; ~$540–4,700 all states), which dwarfs the model pass 100–1000×. Null-by-default minimizes but doesn't remove it — this is the real go/no-go (OQ1).

Live Pro plane (the only variable cost), re-derived on the measured 51k prefix: Sonnet warm turn ≈ ~51k cache-read (×$0.30/M = $0.0153) + ~200 fresh in + ~250 out ≈ ~2.0¢/turn (was 1.7¢ at the under-counted 43.7k). Haiku warm ≈ ~0.7¢/turn.

10k
Learners 10,000
Pro 5% 500
Turns/Pro/mo 60
Live turns/mo 30,000
Sonnet/mo (≈2.0¢) ~$600
Cost/total learner/mo ~$0.060
100k
Learners 100,000
Pro 5% 5,000
Turns/Pro/mo 60
Live turns/mo 300,000
Sonnet/mo (≈2.0¢) ~$6,000
Cost/total learner/mo ~$0.060
1M
Learners 1,000,000
Pro 5% 50,000
Turns/Pro/mo 60
Live turns/mo 3,000,000
Sonnet/mo (≈2.0¢) ~$60,000
Cost/total learner/mo ~$0.060

Expected Sonnet ≈ ~$1.20/Pro seat/mo against $8.99/mo. The guarantee is the cap, and the cap value is now SET:

COACH_CONFIG.LIVE_TURNS_PER_DAY_CAP = 12 (was unset). At 12 turns/day → ~360/mo → worst-case ~$7.20/seat/mo at ~2.0¢/turn, below the $8.99 price (positive margin even at the ceiling). On hitting the daily cap, the seat degrades to the free Zone-0/Zone-1 grounded fallback for the rest of the UTC day — no further metered spend. (The prior draft's 20-turn cap implied ~$10.47/seat, underwater; 12 is chosen specifically to keep the worst case below price. The owner may raise it as a deliberate loss-leader, but the default ships in the black.)

All per-turn cents above are ±~15% on token-count and June-2026 pricing uncertainty (OQ5); the structural conclusion ("almost never calls a model → $0-dominated") is robust to any plausible drift.

Latency budgets

Zone 0 strip <16 ms (sync DOM); readiness verdict sub-ms; Zone 1 lookup 0 ms; Zone 2 first-token ≤1.5 s p50 / ≤3 s p95 warm, 8 s timeout → fail-closed to local (costs no turn).

Biggest win / biggest waste / the free–metered line


7. Measurement & Self-Improvement

Three questions, zero analytics.

Does it work for THIS learner (local effectiveness instrument): snapshot computeReadiness into az_coach.rdHist at the two sites (§5) under the unified dayKey idempotency rule, then show one honesty-gated mode-level delta sentence via the shared cxReadinessDelta() helper on Home, round-complete, and exam-debrief. The gates are cite-or-refuse applied to measurement: coverage floor (n ≥ 25 answered + history ≥ 2 snapshots), no-claim band (|delta| < 3 → silent), sign honesty (a dip is reframed toward action, never hidden — "Readiness dipped 4, you're seeing harder cards — drill {weakCat} to recover?"), and attribution honesty — the sentence claims the learner improved (mode-level, true and motivating), never that the coach caused it.

Per-category attribution is CUT (whole-mode delta is the buildable, honest claim). Consequently — and this closes the review's "dead branch" finding — Move 6's per-topic copy branch is removed (it was gated on a per-cat snapshot the measurement section won't trust), and rdHist.cats is left unwritten (empty), saving the storage nothing consumes. Move 6 ships whole-mode only.

The content self-improvement gates (offline CI, null-by-default): paired CONTAINMENT (any introduced fact → drop) AND NOVELTY (overlap < NOVELTY_MAX && adds_new_info, else null) judges run as a vm/assert/exit-1 build gate on every coach-az.json change — modeled on the codebase's own voice-sandbox/harness.js "correctness is asserted, not vibed" precedent. On a bank edit (src_hash mismatch) or an az_reports entry, affected cells re-run; failing cells revert to null (graceful silence). The corpus monotonically improves without ever watching a user. recall@k ≥ 0.95 on a golden set is the trigger that would authorize any future vector-DB spend — a measurement that prevents premature complexity.

Kill-switch: a fail-open remote per-question_id denylist neutralizes a slipped cell in hours (falls back to the vetted explanation), not releases.

Analytics-free: the only optional population signal is a paper-spec, OFF-by-default, separately-consented, min-aggregated readiness-delta bucket — built only if the owner greenlights it (OQ3); the default holds the no-tracking line, and the local delta already proves it per-learner.

Tunables calibrate-then-freeze: NOVELTY_MAX against a 20–40 hand-labeled cell fixture (report-only until calibrated); WINDOW/WEAK_MIN_ATTEMPTS against synthetic study traces.


8. Build & Optimization Roadmap

Highest-leverage / cheapest first, tied to the run-3 phases. Every item is additive and kill-switch suppressible; removing the layer leaves the study loop byte-for-byte unchanged.

Phase 1 — ships with launch ($0, offline, no model):

  1. Provenance chip (always-on, the genuinely-new trust element).
  2. az_coach versioned single-key scratchpad + the {category, difficultyMin, difficultyMax} buildQueue branch (the one core-engine touch, harness-verified) + COACH_CONFIG constants including the net-new WEAK_MIN_ATTEMPTS=3 and LIVE_TURNS_PER_DAY_CAP=12.
  3. weakCategoriesGated wrapper + the shared single-pass coachScan (readiness sum, weak-gated tally, untouched-cats, attempts, dueCount in one walk).
  4. decideHomeMove — Moves 3 / 4a / 7 + the deterministic readiness verdict + delta sigil via rdHist (day-throttled under the unified dayKey, written in renderReadiness) + shared nextBestTarget.
  5. Move 5 frustration difficulty ceiling (silent queue modifier — felt, never needs dwell) + ORIENT cold-start move.
  6. Flip Move 4 to the single-topic label (4b) once the branch lands; place the two feedback hooks (L3006 / L2929) rendering only when dwell exists; the frequency governor in az_coach.mute.

Phase 1's visible value rides the Home surface + provenance chip + silent better queues — none of which depend on the 2 s feedback window or on any authored content.

Phase 1.5 — the reachability gate (NEW, blocks Phase 2 content spend):

  1. Instrument feedback dwell locally (no analytics): count manual-advance taps vs auto-advance over real sessions. This directly resolves the single biggest threat to Zone-1's value — at advanceSeconds=2 the card flashes, and Moves 1/2 are the only consumers of coach-az.json. Decision rule: if < X% (owner sets X, suggest 30%) of sessions dwell >2 s on feedback, scope the SME-reviewed authored layer to the manual/Instant-pacing cohort only (or add the one-shot first-session feedback pause). Do not fund SME review until this gate reports.

Phase 2 — fast-follow (gated on Phase 1.5):

  1. Size coach-az.json (run the pre-filter on a 30-cell sample → confirm kill-rate → the ~$0.74 pre-filter + judge pass) before funding SME review — get real non-null counts AND the dwell verdict first.
  2. Ship non-null cells as static data → Move 1 repair + Move 2 re-teach prose at the feedback doors (gated to dwell per the Phase-1.5 verdict); the az_reports buffer + denylist loop.
  3. Move 5 last-mile Home opt-in; Move 6 encouragement rationing (whole-mode only).

Phase 2+ / spike-gated:

  1. Zone 2 live voice (STT + proxy + server-verified receipt + per-day budget + the 12-turn cap with free-fallback), Haiku-default talk on the whole-bank 51k cached prefix — blocked by the privacy-copy flip (the moment any turn leaves the device, the disclosure must read "the Pro coach you opted into transmits").

9. Open Questions (genuine only)

  1. SME review funding (the real go/no-go). ~6–26 non-null AZ cells × ~12 min ≈ 1.5–6.5 hrs (~$90–780 AZ; ~$540–4,700 six-state) dwarfs the $0.74 model pass. Fund the authored layer now, or ship Phase 1 (fully valuable without it) and defer Zone 1? Now gated behind the Phase-1.5 dwell measurement — don't decide until reachability is known.
  1. Feedback-seam reachability (promoted from OQ to the Phase-1.5 gate). At advanceSeconds=2 the card flashes. The Phase-1.5 instrument answers it empirically. The remaining owner call: set the dwell threshold X, and decide whether a one-shot first-session feedback pause (one interrupt) is acceptable, or whether feedback prose is a manual/Instant-pacing feature only.
  1. The population beacon. Default off (no-tracking line held). Does the owner want the separately-consented, min-aggregated readiness-delta bucket — the only instrument that can make a defensible causal "PassLane works for people like you" marketing claim — or stay purely local-proof?
  1. Confidence channel. No "I guessed / I knew it" input exists, so the coach can't distinguish a lucky correct from a solid one beyond the box trajectory, and the calibration proxy stays soft-nudge-only. Is a 2-tap confidence chip worth the friction on the highest-intent beat, post-launch?
  1. API assumptions to re-confirm before build: exact cacheable-prefix floor (1,024 / 2,048 / 4,096 — the 51k prefix clears all three, so structurally moot), Citations token-free + ZDR eligibility, measured token counts (stripped ≈51k / raw ≈73k this session — re-confirm at build), and June-2026 pricing (Haiku 4.5 $1/$5, Sonnet 4.6 $3/$15, Batch −50%, cache-read 0.1×). Structural conclusions survive any plausible drift; the per-turn cents are ±~15%.

Files referenced (all absolute): the live engine at /Users/arizona/CLAUDE CODE/passlane/app/index.html, the bank at /Users/arizona/CLAUDE CODE/passlane/app/questions.json, the audio manifest at /Users/arizona/CLAUDE CODE/passlane/app/audio/states-manifest.json (clips stream from AUDIO_CDN, not bundled), run-3 truth at /Users/arizona/CLAUDE CODE/docs/passlane-lane-THIRD-DRAFT.md, run-2 wiring at /Users/arizona/CLAUDE CODE/docs/passlane-coach-interaction-wiring-SPEC.md, run-1 strategy at /Users/arizona/CLAUDE CODE/docs/passlane-ai-companion-MASTER-PLAN.md. New artifacts the build introduces: coach-az.json (authored layer, null-by-default), the single az_coach local key (nesting rdHist/catMiss/seen/mute), COACH_CONFIG constants (incl. net-new WEAK_MIN_ATTEMPTS=3, LIVE_TURNS_PER_DAY_CAP=12), the weakCategoriesGated wrapper, the shared coachScan single-pass derive, and one buildQueue object-filter branch.

Private working document — unlisted, not indexed. PassLane / Somos LLC.