# V3 Session Handoff (2026-05-08)

> Personal handoff doc. Untracked. Paste this into a new Claude session as context when resuming V3 work in the new monorepo location.

---

## TL;DR

- I'm 이승은 (Lee Seungeun) at GoodGangLabs. I authored all of V1/V2/V3.
- **As of 2026-05-08 my working dir is `/dataset/kemix-engine/package/face/animasync-face-v3/`** (moved from `-se` on senior's instruction: single team repo).
- `/dataset/kemix-engine-se/` is the OLD dir, kept temporarily as a backup. Once new-repo work is verified end-to-end, `-se` can be deleted.
- `/dataset/kemix-engine/` was previously Dasol's clone. Senior asked me to switch to it directly so we both work in the same checkout.
- Code, presets, small data → committed to GitHub. Big data (audio, checkpoints, avatars, training caches) → gitignored, on disk only.
- Long-form monologue dataset (351 `long_*` scenarios) regenerated 2026-05-07 with **eleven_v3 model + smoothing pipeline + voice locking**.
- **Open architectural question in flight:** how to handle dialogues vs monologues in expression-trajectory model training. (See §3 Active work.)

---

## 1. Where to work

### Working dir (new, as of 2026-05-08)

```
/dataset/kemix-engine/package/face/animasync-face-v3/
```

This is the **shared monorepo working dir**. Same dir Dasol uses. Owned by `gpuuser`, but `se` has read/write on the V3 subtree. We sync via GitHub push/pull as before.

### Old dir: DO NOT use anymore

```
/dataset/kemix-engine-se/   # OLD self-clone, superseded 2026-05-08
```

Kept as a safety backup. Delete after confirming new repo works.

### Even older dir: long superseded

```
/dataset/AnimaSync-mic-fix/
```

The AnimaSync website repo (`mic-streaming` branch). V3 originally lived inside it but was extracted on 2026-05-04 (pushed as a standalone repo, then consolidated into the monorepo on 2026-05-06; see §5). Untouched since.
### Other repos / dirs that still matter

- `/dataset/text-to-face-se/LAM_Audio2Expression/`: LAM teacher (external dependency, hardcoded in some scripts)
- `/dataset/mead-expression-training/e2f/distill/`: V2 ONNX models (external dependency)
- `/dataset/animasync-face-v1-staging/`, `-v2-staging/`, `-v3-staging/`: superseded
- 3 separate GitHub repos `animasync-face-v1/v2/v3` still exist for history (kept intentionally per Dasol)

---

## 2. Git workflow

```bash
cd /dataset/kemix-engine
git pull    # start of session
# Edit V3 files under package/face/animasync-face-v3/
git add package/face/animasync-face-v3/
git commit -m "v3: short description"
git push
```

**Conventions:**

- Prefix V3 commits with `v3:` so Dasol can scan the log easily
- Solo on V3 → committing direct to `main` is fine
- For risky/experimental work → branch: `git checkout -b feat/v3-`, push branch, open PR
- Pull at start of day (and any time Dasol pushes)
- gitignore protects the big data dirs: `git status` stays clean even with several GB of audio/checkpoints in the working tree
- **Never touch `package/motion/`**. That's Dasol's. He has uncommitted work there.

---

## 3. Active V3 work (state as of 2026-05-08)

### Recently completed (the long thread, 2026-05-06 → 2026-05-08)

**Voice / TTS pipeline (tts.py, generate_audio.py):**

- Switched all monologues to **female voices only** (no hash-based stray voice selection).
- Added `FEMALE_BY_BASE` (one canonical female voice per base emotion: anger / sadness / joy / neutral / surprise).
- `dominant_base_for_turns()`: picks the voice for a monologue based on the most-frequent base emotion across its turns. Voice character now matches the overall emotional arc.
- Switched default model to **`eleven_v3`** (`DEFAULT_EL_MODEL`).
- Verified audio tag map against ElevenLabs official list. Final mapping in `tts.py:tag_map`. Joy uses `[happily]` (not `[cheerfully]`).
- Voice settings tuned for v3: `style = 0.20 + 0.25·|A|` capped 0.50; `stability = 0.50 - 0.10·|A|` floored 0.40.
- `_pad_short_text()`: v3 fails on <12-char input, so the helper pads with `...`
- `QuotaExhausted` exception class. Quota errors (401/402/403/billing) trigger non-retryable batch abort. 429/5xx/network errors retry with capped 60s backoff (max 6 retries).
- `synth_all` latches abort flag on QuotaExhausted; queued workers exit cleanly.
- Diagnostic dump to `/tmp/tts_quota_error.txt` on quota errors.

**Manifest / disk-state hygiene (generate_audio.py):**

- Manifest rebuild from disk every batch (overlay this batch's rows on top of disk-confirmed rows). Replaces old append-with-dedupe approach which broke on stray legacy rows.
- Pre-flight log: `[plan] N turns will be skipped (already on disk)`, `M turns will be synthesized (~K chars / ~$X)`.
- `--dry-run` flag.
- `scenario_already_done()` filters `*.raw.mp3` orphans (mid-synthesis files left by killed processes).
- `--scenarios` parser strips whitespace.

**Smoothing pipeline for monologue inter-turn transitions (abc_experiment.py):**

Goal: in monologues, transitions between consecutive turns should feel like one continuous emotional arc, not back-to-back separate utterances. Single-turn expressiveness must NEVER drop.

Three smoothing layers plus a brow-specific boundary rule, all flags exposed via CLI:

1. **Causal running-mean VAD damping** (Step 1, real-time-safe):
   ```
   damped_vad[i] = γ·raw_vad[i] + (1-γ)·running_mean[i-1]
   running_mean[i] = β·running_mean[i-1] + (1-β)·raw_vad[i]
   ```
   First turn unchanged; only triggers when `len(turns) > 1`. Bug fixed: update order matters. Compute `damped` BEFORE updating `running_mean`, otherwise the current turn folds into "past" and the pull weakens.
   - Default: `γ=0.3, β=0.7`
2. **Symmetric Gaussian VAD smoothing** (offline only, `σ=30` frames).
3. **Cosine-eased crossfade between turns** (default 96 frames, centered for offline).
4. **Brow pass-through-neutral** (channels 0-4): when `|brow_delta_across_boundary| > 0.40`, route brow values through 0 instead of linear blend (eased prev→0→hold→0→next). Avoids zigzag artifact when emotions like surprise→sad both have raised brows but in different positions. Trigger is on **delta**, not magnitude: similar-but-high values do NOT trigger (triggering on them was the previous bug).

CLI flags added: `--vad-damp-gamma`, `--vad-damp-beta`. Defaults work.

**Demos rendered (current settings):** 8 monologues with smoothing applied at `data/viewer/`:

- long_002, long_004, long_013, long_018, long_028, long_035, long_046, long_067

There's also a "MIDDLE" parameter set tested for comparison (`γ=0.45, σ=25, fade=72`): slightly less smoothing, slightly more snap. Both visually acceptable. **Final-settings decision pending.**

**Long-form dataset regen:**

- All 351 `long_*` scenarios regenerated end-to-end with eleven_v3 + voice locking. Sitting in `data/audio_preview/`.
- ElevenLabs subscription was upgraded mid-flight; quota cliff handled cleanly via QuotaExhausted batching.

### Pending decisions (waiting on me)

1. **Lock in final smoothing parameters.** Current (`γ=0.3, σ=30, fade=96`) vs MIDDLE (`γ=0.45, σ=25, fade=72`). Both look fine. Need to pick one to bake into the training-target generation in `data_pipeline.py`.
2. **Architectural question about dialogues:** can the expression+lipsync model learn the difference between "monologue (smooth across turns)" and "dialogue (snap allowed)"?
   - Confirmed: MicroALBERT (text→emotion+VAD) **cannot** learn smoothness because its input is text-only and stateless.
   - Confirmed: expression+lipsync model (audio + emotion/VAD seq → blendshape trajectory) **can** learn smoothness because its training targets ARE the post-processed smoothed blendshapes.
   - **Open question:** if we train the trajectory model on both monologues (smoothed targets) and dialogues (split per speaker, also looking like "monologue with gaps"), does the model get confused?
     The dialogue-split data has random emotion shifts (because the speaker is reacting to an unseen interlocutor), but the structural shape is the same as a monologue.
   - **Three options on the table:**
     - **(a)** Let the input itself disambiguate: long inter-turn silences in audio + reactive VAD trajectory in dialogues vs continuous arcs in monologues. Make targets match the input regime: smooth monologue targets, independent per-turn dialogue targets. The model learns the pattern from data alone.
     - **(b)** Add an explicit `is_monologue` bit as model input. Cheap belt-and-suspenders.
     - **(c)** Train trajectory model on monologues only. Use dialogues for other tasks (single-turn fidelity, emotion-from-context) via a separate dataset/task.
   - **My instinct:** option (a) + maybe (b). User (me) is leaning toward "monologues-only" if it turns out option (a) signal is too weak. **Not yet decided.** Resuming this discussion is the natural next step.
3. **Then:** once smoothing settings + dialogue strategy are locked, regenerate the full dataset training targets via `data_pipeline.py` with the same smoothing settings as the viewer renderings (consistency between training data and runtime trajectory).

### Older threads (still open, lower priority)

- **#18: Port approved eye_motion into data_pipeline.py** (mechanical, ~30 min). The eye_motion module (blinks + iris drift) was built and visually approved in `abc_experiment.py`. Production path needs the same logic.
- **Option E: channel-masked parametric overlay for happy-surprise mouth diversity.** Designed but not yet implemented. (See archived design notes in commit b0c0178 and earlier handoff revs if needed.)

---

## 4. V3 architecture

### What V3 is

Not a single trained model but a **system** combining:

1. A learned emotion classifier (**MicroALBERT**) → emotion label + VAD coordinates from text
2. An **authored expression engine** → 52-dim ARKit blendshapes per frame
3. **LAM lipsync** integration → mouth shapes for speech

### Emotion taxonomy

- 16 emotions = **5 Base + 11 Sub** (defined in `data/emotion/emotion_labels.json`)
- 5 levels (L1–L5) along VAD intensity; authored anchors at L1, L3, L5
- 47 user-authored presets in `expression_presets.json`
- VAD anchors: `data/emotion/emotion_vad_anchors.json` (16 emotions × 5 levels)

### Channel system (52 ARKit blendshapes)

- `LIPSYNC_ONLY`: driven by LAM only
- `EXPRESSION_ONLY`: driven by expression engine only
- `SHARED_CHANNELS`: both touch; need merge strategy

Defined in `scripts/compiler/constants.py`. Critical for `merge_lam_compiler()` in `data_pipeline.py`.

### Compiler stack (`scripts/compiler/`)

| file | role |
|---|---|
| `parametric.py` | Closed-form VAD → blendshape rules |
| `archetype.py` | RBF interpolation over preset anchors w/ emotion-family boosting |
| `expressive.py` | Active path. Within-emotion RBF over L1/L3/L5 presets. `EXPRESSIVE_SIGMA = 0.35`. |
| `blend.py` | Top-level compiler combining all layers |
| `data_pipeline.py` | Audio + scenarios → training samples. Houses `merge_lam_compiler()` and `speech_gate()`. **Will need smoothing pipeline ported in once final params are locked.** |
| `eye_motion.py` | Blinks + iris drift |
| `lam_wrapper.py` | LAM lipsync model integration |
| `abc_experiment.py` | A/B/C variant rendering. **Smoothing pipeline lives here for now.** |
| `generate_audio.py` | ElevenLabs TTS pipeline (eleven_v3) |
| `tts.py` | Voice / model / settings / tag-map / quota-handling |

### Compile philosophy (load-bearing)

**"User's authored presets ARE ground truth. VAD chooses intensity (L1/L3/L5) within the same emotion family, NOT cross-emotion blend, NOT parametric."**

Option E (mouth-channel parametric overlay) is a *targeted* relaxation, only on valence-coloring channels.
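
A minimal sketch of the within-emotion idea from the compile philosophy: blend only the L1/L3/L5 presets of a single emotion, weighted by a Gaussian RBF on intensity. Only the RBF weighting and `EXPRESSIVE_SIGMA = 0.35` come from these notes. The scalar intensity axis, the anchor positions `0.0 / 0.5 / 1.0`, and the function name `blend_within_emotion` are assumptions made for illustration and may not match what `expressive.py` actually does.

```python
import math

EXPRESSIVE_SIGMA = 0.35  # value quoted in the compiler-stack table

# ASSUMPTION: L1 / L3 / L5 sit at these positions on a 0..1 intensity axis.
LEVEL_POSITIONS = (0.0, 0.5, 1.0)


def blend_within_emotion(presets, intensity, sigma=EXPRESSIVE_SIGMA):
    """Gaussian-RBF blend of ONE emotion's L1/L3/L5 presets (hypothetical helper).

    presets:   three equal-length lists of blendshape values (L1, L3, L5)
    intensity: scalar in [0, 1], assumed to be derived from VAD elsewhere
    """
    # Unnormalized Gaussian weight per authored level.
    weights = [
        math.exp(-((intensity - pos) ** 2) / (2.0 * sigma ** 2))
        for pos in LEVEL_POSITIONS
    ]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize so weights sum to 1

    # Weighted sum per channel; no other emotion's presets are involved.
    n_channels = len(presets[0])
    return [
        sum(w * preset[c] for w, preset in zip(weights, presets))
        for c in range(n_channels)
    ]
```

Because the weights are normalized rather than solved for exact interpolation, the output at an anchor is pulled slightly toward its neighbors. Whether the production path reproduces authored presets exactly at L1/L3/L5 is not stated here and should be checked against `expressive.py`.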
### MicroALBERT

- 2-model on-device: MicroALBERT (text → emotion+VAD) + lipsync (audio → blendshapes)
- Target: <20 MB on-device
- Training stack at `models/microalbert/`
- Teacher checkpoint: `checkpoints/klue_teacher_clean_ctx2/best.pt` (~2.7 GB, gitignored)
- Known F1 plateau cause: context-utterance mismatch. Fix path = Option A (context concat + speaker tokens).

### V3 generator model (Phase 2, separate)

**ExpressionVAE** for synthetic dataset generation. Not deployed in runtime.

### Trajectory model role (NEW thinking, 2026-05-08)

The expression+lipsync model (the second of the on-device pair) is the natural place for monologue smoothing to be *learned* rather than post-processed. Training targets generated by `data_pipeline.py` should already be smoothed. The model internalizes the pattern from its targets. This makes runtime post-processing optional (or minimal). See pending decisions in §3.

---

## 5. Migration history

### 2026-05-04

Pushed V1, V2, V3 as 3 separate repos under hard 1-hour deadline. V3 pushed from `/dataset/AnimaSync-mic-fix/`.

### 2026-05-06 (move #1)

Dasol consolidated the 3 separate repos into the monorepo `GoodGangLabs/kemix-engine`. V3 lives at `package/face/animasync-face-v3/`. I cloned my own copy to `/dataset/kemix-engine-se/` and moved 4.85 GB of gitignored on-disk artifacts into it. Used as personal working dir for two days.

### 2026-05-08 (move #2, TODAY)

Senior said we should both work in the same repo: `/dataset/kemix-engine/`.
Migrated everything from `-se` into it via rsync:

- All modified tracked code (9 files: compiler scripts, presets, gitignore, tools HTMLs)
- All new untracked files (4: SESSION_HANDOFF.md, presets backup, 2 data files)
- All gitignored on-disk artifacts (~5 GB):
  - `data/audio_preview/` (1.3 GB: paid eleven_v3 mp3s, the regenerated 351 monologues)
  - `data/viewer/` (28 MB: rendered demo bundles)
  - `data/wikipedia_ko/` (828 MB: KO Wikipedia HF cache)
  - `data/_pilot_2026-05-07/` (4.4 MB)
  - `data/v3_training/` (208 KB)
  - `avatar/` (110 MB: VRM/GLB files)
  - `checkpoints/` (2.7 GB: `klue_teacher_clean_ctx2`)
- Skipped: `.claude/` (local IDE config), `__pycache__/`
- **Did not touch `package/motion/`** at any point.

`-se` is kept as a backup until everything in the new repo is confirmed working.

---

## 6. Reference: known issues & external deps

### Pre-existing dead reference in abc_experiment.py

```python
ONNX_V2 = '/dataset/mead-expression-training/e2f/distill/emotion_face_v8_brow09.onnx'
```

File doesn't exist. Pre-existing bug, unrelated to migrations. Worth fixing eventually.

### External dependencies (hardcoded paths in scripts)

| path | exists? |
|---|---|
| `/dataset/text-to-face-se/LAM_Audio2Expression/` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_int8.onnx` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_streaming_int8.onnx` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_v8_brow09.onnx` | ❌ |

### ElevenLabs notes

- `eleven_v3` is GA as of Feb 2026. Audio-tag support requires it.
- Free tier: pay-as-you-go API access requires a card on file. Without one, API returns 402.
- Restricted API keys: subscription endpoint may return `missing_permissions` (404). Use the TTS endpoint to test functionality.

### Past leak warning

On 2026-05-07 I (Claude) once echoed an API key into a Bash command via ``. User rotated the key.
**Never echo or paste keys in commands.** Always use env-var indirection (`$ELEVENLABS_API_KEY`) and let the user export it themselves.

---

## 7. Personal & process notes

- **Name:** 이승은 (Lee Seungeun). NOT 정승은.
- **All V1/V2/V3 model authorship is mine.** Git log may show other names, so don't infer authorship from the log.
- **Linux user:** `se`. Some files in older `/dataset/` paths are owned by `gpuuser` from before my `se` account existed.
- **Don't translate dark engineering idioms** ("hit by a bus" etc.) into Korean; use neutral business terms instead.
- **Today is 2026-05-08.**
- **Current ElevenLabs status:** subscription upgraded; full quota available.

---

## 8. Quick-start checklist for a new session

```bash
# 1. Confirm working dir
cd /dataset/kemix-engine/package/face/animasync-face-v3
pwd

# 2. Verify git state
cd /dataset/kemix-engine && git status && git log --oneline -5

# 3. Pull anything Dasol pushed since last session
git pull

# 4. Smoke-check on-disk artifacts survived the migration
ls package/face/animasync-face-v3/data/audio_preview/ | head -3   # should list MP3s
ls package/face/animasync-face-v3/checkpoints/                    # klue_teacher_clean_ctx2/
ls package/face/animasync-face-v3/avatar/                         # .vrm/.glb files
ls package/face/animasync-face-v3/data/viewer/ | head             # demo bundles

# 5. Resume open thread, most likely in this order:
#    a. Decide smoothing params (current vs MIDDLE) for production
#    b. Resolve dialogue-vs-monologue training data strategy (§3 pending decision #2)
#    c. Port smoothing pipeline into data_pipeline.py with chosen params
#    d. Regenerate full dataset training targets
#    e. (Lower priority) Task #18 eye_motion port; Option E happy-surprise mouth fix
```
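
For step 5c, the causal running-mean VAD damping from §3 can be sketched as below. This is a reference sketch, not the code in `abc_experiment.py`: it assumes the index `i` runs over turns, that each turn carries one `(V, A, D)` triple, and that the running mean is seeded with the first turn's raw VAD. The function name `damp_vad` is hypothetical. The two update formulas, the "damp before update" ordering, and the defaults `γ=0.3, β=0.7` are taken from §3.

```python
def damp_vad(raw_vad, gamma=0.3, beta=0.7):
    """Causally damp per-turn VAD toward a running mean of past turns.

    raw_vad: list of (V, A, D) tuples, one per turn (assumed layout).
    Returns a new list of damped (V, A, D) tuples of the same length.
    """
    # Single-turn input: damping never triggers.
    if len(raw_vad) <= 1:
        return [tuple(v) for v in raw_vad]

    damped = [tuple(raw_vad[0])]      # first turn is left unchanged
    running_mean = list(raw_vad[0])   # ASSUMPTION: seed with the first turn

    for vad in raw_vad[1:]:
        # 1) Damp against the mean of PAST turns only.
        damped.append(tuple(
            gamma * x + (1.0 - gamma) * m
            for x, m in zip(vad, running_mean)
        ))
        # 2) Only then fold the current raw turn into the running mean.
        #    Swapping steps 1 and 2 is the ordering bug described in section 3.
        running_mean = [
            beta * m + (1.0 - beta) * x
            for x, m in zip(vad, running_mean)
        ]
    return damped


# Example: valence flips from +1 to -1 and stays there.
turns = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
smoothed = damp_vad(turns)
# Valence moves roughly 1.0 -> 0.4 -> -0.02 instead of snapping to -1.0.
```

Keeping the function pure (raw VAD in, damped VAD out) should make it easier to call from both the viewer renderer and `data_pipeline.py` with identical parameters, which is the consistency requirement in pending decision #3.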