# V3 Session Handoff (2026-05-08)

> Personal handoff doc. Untracked. Paste this into a new Claude session as context when resuming V3 work in the new monorepo location.

---

## TL;DR

- I'm 이승은 (Lee Seungeun) at GoodGangLabs. I authored all of V1/V2/V3.
- **As of 2026-05-08 my working dir is `/dataset/kemix-engine/package/face/animasync-face-v3/`** (moved from `-se` on senior's instruction — single team repo).
- `/dataset/kemix-engine-se/` is the OLD dir, kept temporarily as a backup. Once new-repo work is verified end-to-end, `-se` can be deleted.
- `/dataset/kemix-engine/` was previously Dasol's clone. Senior asked me to switch to it directly so we both work in the same checkout.
- Code, presets, small data → committed to GitHub. Big data (audio, checkpoints, avatars, training caches) → gitignored, on disk only.
- Long-form monologue dataset (351 `long_*` scenarios) regenerated 2026-05-07 with **eleven_v3 model + smoothing pipeline + voice locking**.
- **Open architectural question in flight:** how to handle dialogues vs monologues in expression-trajectory model training. (See §3 Active work.)

---

## 1. Where to work

### Working dir (new, as of 2026-05-08)
```
/dataset/kemix-engine/package/face/animasync-face-v3/
```
This is the **shared monorepo working dir**. Same dir Dasol uses. Owned by `gpuuser`, but `se` has read/write on the V3 subtree. We sync via GitHub push/pull as before.

### Old dir — DO NOT use anymore
```
/dataset/kemix-engine-se/    # OLD self-clone, superseded 2026-05-08
```
Kept as a safety backup. Delete after confirming new repo works.

### Even older dir — long superseded
```
/dataset/AnimaSync-mic-fix/
```
The AnimaSync website repo (`mic-streaming` branch). V3 originally lived inside it, was pushed out as its own repo on 2026-05-04, and was consolidated into the monorepo on 2026-05-06 (see §5). Untouched since.

### Other repos / dirs that still matter
- `/dataset/text-to-face-se/LAM_Audio2Expression/` — LAM teacher (external dependency, hardcoded in some scripts)
- `/dataset/mead-expression-training/e2f/distill/` — V2 ONNX models (external dependency)
- `/dataset/animasync-face-v1-staging/`, `-v2-staging/`, `-v3-staging/` — superseded
- 3 separate GitHub repos `animasync-face-v1/v2/v3` still exist for history (kept intentionally per Dasol)

---

## 2. Git workflow

```bash
cd /dataset/kemix-engine

git pull              # start of session
# Edit V3 files under package/face/animasync-face-v3/
git add package/face/animasync-face-v3/<files>
git commit -m "v3: short description"
git push
```

**Conventions:**
- Prefix V3 commits with `v3:` so Dasol can scan the log easily
- Solo on V3 → committing directly to `main` is fine
- For risky/experimental work → branch: `git checkout -b feat/v3-<thing>`, push branch, open PR
- Pull at start of day (and any time Dasol pushes)
- gitignore protects the big data dirs — `git status` stays clean even with several GB of audio/checkpoints in the working tree
- **Never touch `package/motion/`** — that's Dasol's. He has uncommitted work there.

---

## 3. Active V3 work (state as of 2026-05-08)

### Recently completed (the long thread, 2026-05-06 → 2026-05-08)

**Voice / TTS pipeline (tts.py, generate_audio.py):**
- Switched all monologues to **female voices only** (no hash-based stray voice selection).
- Added `FEMALE_BY_BASE` (one canonical female voice per base emotion: anger / sadness / joy / neutral / surprise).
- `dominant_base_for_turns()` — picks the voice for a monologue based on the most-frequent base emotion across its turns. Voice character now matches the overall emotional arc.
- Switched default model to **`eleven_v3`** (`DEFAULT_EL_MODEL`).
- Verified audio tag map against ElevenLabs official list. Final mapping in `tts.py:tag_map`. Joy uses `[happily]` (not `[cheerfully]`).
- Voice settings tuned for v3: `style = 0.20 + 0.25·|A|` capped 0.50; `stability = 0.50 - 0.10·|A|` floored 0.40.
- `_pad_short_text()` — v3 fails on <12-char input, pads with `...`
- `QuotaExhausted` exception class. Quota errors (401/402/403/billing) trigger non-retryable batch abort. 429/5xx/network errors retry with capped 60s backoff (max 6 retries).
- `synth_all` latches abort flag on QuotaExhausted; queued workers exit cleanly.
- Diagnostic dump to `/tmp/tts_quota_error.txt` on quota errors.
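
A minimal sketch of the rules listed above, for orientation only. The function names `voice_settings`, `dominant_base`, `classify_status`, and `backoff_seconds` are hypothetical stand-ins, not the actual `tts.py` API. The exponential backoff base and the tie-breaking rule are assumptions; the source only fixes the 60s cap, the 6-retry limit, and the status-code split.

```python
from collections import Counter

MAX_RETRIES = 6            # from the notes above
BACKOFF_CAP_S = 60.0       # from the notes above
QUOTA_STATUSES = {401, 402, 403}


def voice_settings(arousal: float) -> dict:
    """style = 0.20 + 0.25*|A| capped at 0.50; stability = 0.50 - 0.10*|A| floored at 0.40."""
    a = abs(arousal)
    return {
        "style": min(0.50, 0.20 + 0.25 * a),
        "stability": max(0.40, 0.50 - 0.10 * a),
    }


def dominant_base(turn_bases: list) -> str:
    """Most frequent base emotion across turns (ties go to the first seen: an assumption)."""
    return Counter(turn_bases).most_common(1)[0][0]


def classify_status(status: int) -> str:
    """Map an HTTP status to 'abort' (quota), 'retry' (transient), or 'fail'."""
    if status in QUOTA_STATUSES:
        return "abort"
    if status == 429 or 500 <= status < 600:
        return "retry"
    return "fail"


def backoff_seconds(attempt: int, base: float = 2.0) -> float:
    """Assumed exponential schedule, capped at 60s. attempt is 0-indexed."""
    return min(BACKOFF_CAP_S, base * (2 ** attempt))
```

With these assumptions, six retries would wait 2, 4, 8, 16, 32, and 60 seconds. Note that with `|A| <= 1` the style cap of 0.50 is never reached (the formula tops out at 0.45), so the cap only matters if arousal can exceed 1.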

**Manifest / disk-state hygiene (generate_audio.py):**
- Manifest rebuild from disk every batch (overlay this batch's rows on top of disk-confirmed rows). Replaces old append-with-dedupe approach which broke on stray legacy rows.
- Pre-flight log: `[plan] N turns will be skipped (already on disk)`, `M turns will be synthesized (~K chars / ~$X)`.
- `--dry-run` flag.
- `scenario_already_done()` filters `*.raw.mp3` orphans (mid-synthesis files left by killed processes).
- `--scenarios` parser strips whitespace.
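
The rebuild-and-overlay idea can be sketched as below. This is an illustration under assumptions: the manifest is modeled as a dict keyed by a turn id, and `is_orphan`, `rebuild_manifest`, and `parse_scenarios` are hypothetical names rather than the functions in `generate_audio.py`.

```python
from pathlib import Path


def is_orphan(path: Path) -> bool:
    """A '*.raw.mp3' file is a mid-synthesis leftover from a killed process."""
    return path.name.endswith(".raw.mp3")


def rebuild_manifest(disk_rows: dict, batch_rows: dict) -> dict:
    """Start from rows confirmed on disk, then overlay this batch's rows.

    Batch rows win on key collisions, so stale legacy rows that are not on
    disk never survive a rebuild.
    """
    merged = dict(disk_rows)
    merged.update(batch_rows)
    return merged


def parse_scenarios(arg: str) -> list:
    """Split a comma-separated --scenarios value and strip whitespace."""
    return [name.strip() for name in arg.split(",") if name.strip()]
```

Usage: `parse_scenarios("long_002, long_004 ")` yields `["long_002", "long_004"]`, which is the whitespace case the parser fix addresses.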

**Smoothing pipeline for monologue inter-turn transitions (abc_experiment.py):**

Goal: in monologues, transitions between consecutive turns should feel like one continuous emotional arc, not back-to-back separate utterances. Single-turn expressiveness must NEVER drop.

Three smoothing layers plus a brow-specific boundary rule (item 4), all flags exposed via CLI:
1. **Causal running-mean VAD damping** (Step 1, real-time-safe):
   ```
   damped_vad[i] = γ·raw_vad[i] + (1-γ)·running_mean[i-1]
   running_mean[i] = β·running_mean[i-1] + (1-β)·raw_vad[i]
   ```
   First turn unchanged; only triggers when `len(turns) > 1`. Bug fixed: update order matters — compute `damped` BEFORE updating `running_mean`, otherwise current turn folds into "past" and pull weakens.
   - Default: `γ=0.3, β=0.7`
2. **Symmetric Gaussian VAD smoothing** (offline only, `σ=30` frames).
3. **Cosine-eased crossfade between turns** (default 96 frames, centered for offline).
4. **Brow pass-through-neutral** (channels 0-4): when `|brow_delta_across_boundary| > 0.40`, route brow values through 0 instead of linear blend (eased prev→0→hold→0→next). Avoids zigzag artifact when emotions like surprise→sad both have raised brows but in different positions. Trigger is on **delta**, not magnitude — similar-but-high values do NOT trigger (which was the previous bug).

CLI flags added: `--vad-damp-gamma`, `--vad-damp-beta`. Defaults work.
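
The damping, crossfade, and brow-trigger rules above can be sketched as follows. This is a simplified scalar version, not the code in `abc_experiment.py`: it treats VAD as one number per turn, seeds `running_mean` with the first turn's value (an assumption), and uses hypothetical function names.

```python
import math


def damp_vad(raw, gamma=0.3, beta=0.7):
    """Causal running-mean damping. The first turn passes through unchanged."""
    if len(raw) <= 1:
        return list(raw)
    damped = [raw[0]]
    running_mean = raw[0]  # assumed seed
    for x in raw[1:]:
        # Compute the damped value BEFORE updating the running mean,
        # so the current turn is not folded into its own "past".
        damped.append(gamma * x + (1 - gamma) * running_mean)
        running_mean = beta * running_mean + (1 - beta) * x
    return damped


def cosine_ease(t):
    """Ease-in/ease-out weight: 0 at t=0, 1 at t=1."""
    return 0.5 * (1.0 - math.cos(math.pi * t))


def crossfade(prev_val, next_val, n_frames=96):
    """Cosine-eased blend from prev_val to next_val over n_frames."""
    if n_frames < 2:
        return [next_val]
    out = []
    for i in range(n_frames):
        w = cosine_ease(i / (n_frames - 1))
        out.append((1 - w) * prev_val + w * next_val)
    return out


def needs_neutral_pass(prev_brow, next_brow, threshold=0.40):
    """Trigger on the delta across the boundary, not on magnitude."""
    return abs(next_brow - prev_brow) > threshold
```

For example, `damp_vad([0.0, 1.0, 1.0])` gives roughly `[0.0, 0.3, 0.51]`: the second turn is pulled toward the history, and the pull relaxes as the running mean catches up. `needs_neutral_pass(0.8, 0.9)` is false, which is the similar-but-high case that must not trigger.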

**Demos rendered (current settings):** 8 monologues with smoothing applied at `data/viewer/`:
- long_002, long_004, long_013, long_018, long_028, long_035, long_046, long_067

There's also a "MIDDLE" parameter set tested for comparison (`γ=0.45, σ=25, fade=72`) — slightly less smoothing, slightly more snap. Both visually acceptable. **Final-settings decision pending.**

**Long-form dataset regen:**
- All 351 `long_*` scenarios regenerated end-to-end with eleven_v3 + voice locking. Sitting in `data/audio_preview/`.
- ElevenLabs subscription was upgraded mid-flight; quota cliff handled cleanly via QuotaExhausted batching.

### Pending decisions (waiting on me)

1. **Lock in final smoothing parameters.** Current (`γ=0.3, σ=30, fade=96`) vs MIDDLE (`γ=0.45, σ=25, fade=72`). Both look fine. Need to pick one to bake into the training-target generation in `data_pipeline.py`.

2. **Architectural question about dialogues:** can the expression+lipsync model learn the difference between "monologue (smooth across turns)" vs "dialogue (snap allowed)"?
   - Confirmed: MicroALBERT (text→emotion+VAD) **cannot** learn smoothness — text-only stateless input.
   - Confirmed: expression+lipsync model (audio + emotion/VAD seq → blendshape trajectory) **can** learn smoothness because its training targets ARE the post-processed smoothed blendshapes.
   - **Open question:** if we train the trajectory model on both monologues (smoothed targets) and dialogues (split per speaker, also looking like "monologue with gaps") — does the model get confused? The dialogue-split data has random emotion shifts (because the speaker is reacting to an unseen interlocutor), but the structural shape is the same as a monologue.
   - **Three options on the table:**
     - **(a)** Let the input itself disambiguate — long inter-turn silences in audio + reactive VAD trajectory in dialogues vs continuous arcs in monologues. Make targets match the input regime: smooth monologue targets, independent per-turn dialogue targets. The model learns the pattern from data alone.
     - **(b)** Add an explicit `is_monologue` bit as model input. Cheap belt-and-suspenders.
     - **(c)** Train trajectory model on monologues only. Use dialogues for other tasks (single-turn fidelity, emotion-from-context) via a separate dataset/task.
   - **Claude's instinct (from the last session):** option (a), possibly with (b) on top. **My lean (user):** monologues-only, i.e. option (c), if the option (a) signal turns out to be too weak. **Not yet decided.** Resuming this discussion is the natural next step.

3. **Then:** once smoothing settings + dialogue strategy are locked, regenerate the full dataset training targets via `data_pipeline.py` with the same smoothing settings as the viewer renderings (consistency between training data and runtime trajectory).
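
To make options (a) and (b) from decision 2 concrete, here is a rough sketch. Everything in it is hypothetical: the frame layout, the helper names, and the idea of passing the smoother in as a callable are illustration only, and nothing here exists in `data_pipeline.py` yet.

```python
def add_regime_bit(frames, is_monologue):
    """Option (b): append a constant is_monologue bit to every input frame."""
    bit = 1.0 if is_monologue else 0.0
    return [list(frame) + [bit] for frame in frames]


def build_targets(per_turn_targets, is_monologue, smooth_fn):
    """Option (a): make targets match the input regime.

    Monologues get smoothed targets across turn boundaries; dialogue turns
    are kept as independent per-turn targets (snap allowed).
    """
    if is_monologue:
        return smooth_fn(per_turn_targets)
    return [list(turn) for turn in per_turn_targets]
```

The two are compatible: (a) decides what the targets look like, (b) only adds one input channel so the model does not have to infer the regime from silences alone.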

### Older threads (still open, lower priority)

- **#18 — Port approved eye_motion into data_pipeline.py** (mechanical, ~30 min). The eye_motion module (blinks + iris drift) was built and visually approved in `abc_experiment.py`. Production path needs the same logic.
- **Option E — channel-masked parametric overlay for happy-surprise mouth diversity.** Designed but not yet implemented. (See archived design notes in commit b0c0178 and earlier handoff revs if needed.)

---

## 4. V3 architecture

### What V3 is
Not a single trained model — a **system** combining:
1. A learned emotion classifier (**MicroALBERT**) → emotion label + VAD coordinates from text
2. An **authored expression engine** → 52-dim ARKit blendshapes per frame
3. **LAM lipsync** integration → mouth shapes for speech

### Emotion taxonomy
- 16 emotions = **5 Base + 11 Sub** (defined in `data/emotion/emotion_labels.json`)
- 5 levels (L1–L5) along VAD intensity; authored anchors at L1, L3, L5
- 47 user-authored presets in `expression_presets.json`
- VAD anchors: `data/emotion/emotion_vad_anchors.json` (16 emotions × 5 levels)

### Channel system (52 ARKit blendshapes)
- `LIPSYNC_ONLY` — driven by LAM only
- `EXPRESSION_ONLY` — driven by expression engine only
- `SHARED_CHANNELS` — both touch; need merge strategy

Defined in `scripts/compiler/constants.py`. Critical for `merge_lam_compiler()` in `data_pipeline.py`.
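
A sketch of how the three channel sets might drive a per-frame merge. The set contents below are placeholder ARKit names, not the real assignments in `constants.py`, and taking the maximum on shared channels is an assumed strategy chosen only to keep the example short. `merge_frame` is a hypothetical name and is not `merge_lam_compiler()`.

```python
# Placeholder assignments; the real sets live in scripts/compiler/constants.py.
LIPSYNC_ONLY = {"jawOpen", "mouthFunnel"}
EXPRESSION_ONLY = {"browInnerUp", "eyeWideLeft"}
SHARED_CHANNELS = {"mouthSmileLeft"}


def merge_frame(lam: dict, expr: dict) -> dict:
    """Merge one frame of LAM lipsync output with expression-engine output."""
    merged = {}
    for name in LIPSYNC_ONLY:
        merged[name] = lam.get(name, 0.0)      # expression engine is ignored here
    for name in EXPRESSION_ONLY:
        merged[name] = expr.get(name, 0.0)     # LAM is ignored here
    for name in SHARED_CHANNELS:
        # Assumed strategy: keep the stronger activation of the two sources.
        merged[name] = max(lam.get(name, 0.0), expr.get(name, 0.0))
    return merged
```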

### Compiler stack (`scripts/compiler/`)
| file | role |
|---|---|
| `parametric.py` | Closed-form VAD → blendshape rules |
| `archetype.py` | RBF interpolation over preset anchors w/ emotion-family boosting |
| `expressive.py` | Active path. Within-emotion RBF over L1/L3/L5 presets. `EXPRESSIVE_SIGMA = 0.35`. |
| `blend.py` | Top-level compiler combining all layers |
| `data_pipeline.py` | Audio + scenarios → training samples. Houses `merge_lam_compiler()` and `speech_gate()`. **Will need smoothing pipeline ported in once final params are locked.** |
| `eye_motion.py` | Blinks + iris drift |
| `lam_wrapper.py` | LAM lipsync model integration |
| `abc_experiment.py` | A/B/C variant rendering. **Smoothing pipeline lives here for now.** |
| `generate_audio.py` | ElevenLabs TTS pipeline (eleven_v3) |
| `tts.py` | Voice / model / settings / tag-map / quota-handling |

### Compile philosophy (load-bearing)
**"User's authored presets ARE ground truth. VAD chooses intensity (L1/L3/L5) within the same emotion family — NOT cross-emotion blend, NOT parametric."**
Option E (mouth-channel parametric overlay) is a *targeted* relaxation, only on valence-coloring channels.
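
As an illustration of "VAD chooses intensity within the same emotion family", here is a minimal Gaussian RBF blend over the three authored levels. It is a sketch under assumptions: intensity is modeled as a scalar in [0, 1] with L1/L3/L5 placed at 0.0/0.5/1.0, and the function names are hypothetical. Only the value `EXPRESSIVE_SIGMA = 0.35` comes from the notes; the real distance metric in `expressive.py` may differ.

```python
import math

EXPRESSIVE_SIGMA = 0.35                          # value quoted for expressive.py
ANCHOR_POS = {"L1": 0.0, "L3": 0.5, "L5": 1.0}   # assumed anchor positions


def rbf_weights(intensity, sigma=EXPRESSIVE_SIGMA):
    """Normalized Gaussian weights of the three presets at a given intensity."""
    raw = {
        level: math.exp(-((intensity - pos) ** 2) / (2 * sigma ** 2))
        for level, pos in ANCHOR_POS.items()
    }
    total = sum(raw.values())
    return {level: w / total for level, w in raw.items()}


def blend_presets(presets, intensity):
    """Weighted sum of same-emotion presets; never mixes across emotions."""
    w = rbf_weights(intensity)
    n = len(presets["L1"])
    return [sum(w[level] * presets[level][i] for level in ANCHOR_POS) for i in range(n)]
```

Because the weights are normalized and only same-emotion presets enter the sum, the output always stays inside the range spanned by the authored L1/L3/L5 values for that emotion.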

### MicroALBERT
- 2-model on-device: MicroALBERT (text → emotion+VAD) + lipsync (audio → blendshapes)
- Target: <20 MB on-device
- Training stack at `models/microalbert/`
- Teacher checkpoint: `checkpoints/klue_teacher_clean_ctx2/best.pt` (~2.7 GB, gitignored)
- Known F1 plateau cause: context-utterance mismatch. Fix path = Option A (context concat + speaker tokens).

### V3 generator model (Phase 2, separate)
**ExpressionVAE** for synthetic dataset generation. Not deployed in runtime.

### Trajectory model role (NEW thinking, 2026-05-08)
The expression+lipsync model (the second of the on-device pair) is the natural place for monologue smoothing to be *learned* rather than post-processed. Training targets generated by `data_pipeline.py` should already be smoothed. The model internalizes the pattern from its targets. This makes runtime post-processing optional (or minimal). See pending decisions in §3.

---

## 5. Migration history

### 2026-05-04
Pushed V1, V2, V3 as 3 separate repos under a hard 1-hour deadline. V3 pushed from `/dataset/AnimaSync-mic-fix/`.

### 2026-05-06 (move #1)
Dasol consolidated the 3 separate repos into the monorepo `GoodGangLabs/kemix-engine`. V3 lives at `package/face/animasync-face-v3/`. I cloned my own copy to `/dataset/kemix-engine-se/` and moved 4.85 GB of gitignored on-disk artifacts into it. Used as personal working dir for two days.

### 2026-05-08 (move #2 — TODAY)
Senior said we should both work in the same repo: `/dataset/kemix-engine/`. Migrated everything from `-se` into it via rsync:
- All modified tracked code (9 files: compiler scripts, presets, gitignore, tools HTMLs)
- All new untracked files (4: SESSION_HANDOFF.md, presets backup, 2 data files)
- All gitignored on-disk artifacts (~5 GB):
  - `data/audio_preview/` (1.3 GB — paid eleven_v3 mp3s, the regenerated 351 monologues)
  - `data/viewer/` (28 MB — rendered demo bundles)
  - `data/wikipedia_ko/` (828 MB — KO Wikipedia HF cache)
  - `data/_pilot_2026-05-07/` (4.4 MB)
  - `data/v3_training/` (208 KB)
  - `avatar/` (110 MB — VRM/GLB files)
  - `checkpoints/` (2.7 GB — `klue_teacher_clean_ctx2`)
- Skipped: `.claude/` (local IDE config), `__pycache__/`
- **Did not touch `package/motion/`** at any point.

`-se` is kept as a backup until everything in the new repo is confirmed working.

---

## 6. Reference: known issues & external deps

### Pre-existing dead reference in abc_experiment.py
```python
ONNX_V2 = '/dataset/mead-expression-training/e2f/distill/emotion_face_v8_brow09.onnx'
```
File doesn't exist. Pre-existing bug, unrelated to migrations. Worth fixing eventually.

### External dependencies (hardcoded paths in scripts)
| path | exists? |
|---|---|
| `/dataset/text-to-face-se/LAM_Audio2Expression/` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_int8.onnx` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_streaming_int8.onnx` | ✅ |
| `/dataset/mead-expression-training/e2f/distill/emotion_face_v8_brow09.onnx` | ❌ |
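
The table can be re-checked at the start of a session with a few lines of Python. The paths are copied from the table; `missing_paths` is a hypothetical helper, not an existing script in the repo.

```python
from pathlib import Path

EXTERNAL_DEPS = [
    "/dataset/text-to-face-se/LAM_Audio2Expression/",
    "/dataset/mead-expression-training/e2f/distill/emotion_face_int8.onnx",
    "/dataset/mead-expression-training/e2f/distill/emotion_face_streaming_int8.onnx",
    "/dataset/mead-expression-training/e2f/distill/emotion_face_v8_brow09.onnx",
]


def missing_paths(paths):
    """Return the subset of paths that do not exist on this machine."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    for path in missing_paths(EXTERNAL_DEPS):
        print(f"MISSING: {path}")
```

On the current box this should report only the `emotion_face_v8_brow09.onnx` entry, matching the dead reference noted above.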

### ElevenLabs notes
- `eleven_v3` is GA as of Feb 2026. Audio-tag support requires it.
- Free tier: pay-as-you-go API access requires a card on file. Without one, API returns 402.
- Restricted API keys: subscription endpoint may return `missing_permissions` (404). Use the TTS endpoint to test functionality.

### Past leak warning
On 2026-05-07 I (Claude) once echoed an API key into a Bash command via `<bash-input>`. User rotated the key. **Never echo or paste keys in commands.** Always use env-var indirection (`$ELEVENLABS_API_KEY`) and let the user export it themselves.
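
On the Python side, the same rule can be followed by reading the key from the environment and never printing it in full. A minimal sketch; `get_api_key` and `redact` are hypothetical helpers, not functions in `tts.py`.

```python
import os


def get_api_key(env=None) -> str:
    """Read the key from ELEVENLABS_API_KEY; fail loudly if it is not exported."""
    env = os.environ if env is None else env
    key = env.get("ELEVENLABS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set; ask the user to export it.")
    return key


def redact(key: str) -> str:
    """Safe form for logs: mask everything except the last 4 characters."""
    if len(key) <= 8:
        return "********"
    return "********" + key[-4:]
```

Any log line or diagnostic dump that needs to identify the key should go through `redact` rather than interpolating the key itself.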

---

## 7. Personal & process notes

- **Name:** 이승은 (Lee Seungeun). NOT 정승은.
- **All V1/V2/V3 model authorship is mine.** Git log may show other names — don't infer authorship from log.
- **Linux user:** `se`. Some files in older `/dataset/` paths owned by `gpuuser` from before my `se` account.
- **Don't translate dark engineering idioms** ("hit by a bus" etc.) into Korean — neutral business terms.
- **Today is 2026-05-08.**
- **Current ElevenLabs status:** subscription upgraded; full quota available.

---

## 8. Quick-start checklist for a new session

```bash
# 1. Confirm working dir
cd /dataset/kemix-engine/package/face/animasync-face-v3
pwd

# 2. Verify git state
cd /dataset/kemix-engine && git status && git log --oneline -5

# 3. Pull anything Dasol pushed since last session
git pull

# 4. Smoke-check on-disk artifacts survived the migration
ls package/face/animasync-face-v3/data/audio_preview/ | head -3   # should list MP3s
ls package/face/animasync-face-v3/checkpoints/                    # klue_teacher_clean_ctx2/
ls package/face/animasync-face-v3/avatar/                         # .vrm/.glb files
ls package/face/animasync-face-v3/data/viewer/ | head             # demo bundles

# 5. Resume open thread — most likely in this order:
#    a. Decide smoothing params (current vs MIDDLE) for production
#    b. Resolve dialogue-vs-monologue training data strategy (§3 pending decision #2)
#    c. Port smoothing pipeline into data_pipeline.py with chosen params
#    d. Regenerate full dataset training targets
#    e. (Lower priority) Task #18 eye_motion port; Option E happy-surprise mouth fix
```
