# V3 Pipeline Decisions — 2026-04-20

Summary of decisions made today. Captures the architecture and data strategy for V3 expression/lipsync model training.

---

## 1. What was done today

| Item | Status |
|---|---|
| MicroALBERT dataset cleanup (labels + VAD) | ✅ Complete |
| MicroALBERT teacher training | ✅ Complete (macro_f1=0.66, best @ epoch 11) |
| Context-aware input (`[SELF]`/`[OTHER]`, `context_window=2`) | ✅ Complete |
| 47 expression presets authored (16 emotions × L1/L3/L5, neutral single) | ✅ Complete |
| V2 emotion crossfade PoC | ✅ Validated (no retraining needed) |
| V3 pipeline architecture | ✅ Decided (this doc) |

---

## 2. V2 Crossfade PoC — Key Findings

**Confirmed via emotion-demo:**
- V2's FiLM layer handles **input-space emotion blending** gracefully even though it was only trained on pure one-hot emotion vectors.
- Mid-sentence emotion tags work: `[happy:80] ... [sad:40] ...` produces smooth transitions.
- Transition window: **±800ms smoothstep** around each tag boundary (total ~1.6s).
- V2 does **not** need retraining for crossfade. Input blending suffices.

**Bug fixed:** `processFileWithTimeline()` added to wrapper (demo was only using the first emotion tag).

**Conclusion:** V3 can use the same input-blending approach. The crossfade works at model-input level, not model-output level.
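
The input-level crossfade described above can be sketched as follows. This is a minimal illustration, not the code in the emotion-demo wrapper: the function names, the 5-dim one-hot layout, and the example frame times are assumptions, and tag intensities such as `:80` are ignored. Only the ±800 ms smoothstep window comes from the PoC.

```python
import numpy as np

HALF_WINDOW_S = 0.8  # ±800 ms around each tag boundary, as in the PoC


def smoothstep(x):
    """Smoothstep on [0, 1]: 3x^2 - 2x^3, clamped outside that range."""
    x = np.clip(x, 0.0, 1.0)
    return x * x * (3.0 - 2.0 * x)


def crossfade(prev_vec, next_vec, times_s, boundary_s, half_window_s=HALF_WINDOW_S):
    """Blend two emotion input vectors for each frame time (seconds)."""
    prev_vec = np.asarray(prev_vec, dtype=np.float64)
    next_vec = np.asarray(next_vec, dtype=np.float64)
    t = np.atleast_1d(np.asarray(times_s, dtype=np.float64))
    # Map [boundary - w, boundary + w] onto [0, 1] before smoothing.
    x = (t - (boundary_s - half_window_s)) / (2.0 * half_window_s)
    w = smoothstep(x)[:, None]  # 0 = fully prev, 1 = fully next
    return (1.0 - w) * prev_vec + w * next_vec


# Illustrative 5-dim one-hot inputs; which index means which emotion is assumed.
emotion_a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
emotion_b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
frame_times = np.array([0.0, 1.2, 2.0, 2.8, 4.0])  # tag boundary at 2.0 s
blended = crossfade(emotion_a, emotion_b, frame_times, boundary_s=2.0)
```

Outside the window the input stays a pure one-hot vector; inside it the two entries trade weight smoothly while still summing to 1, which is the blended input the FiLM layer was observed to tolerate.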

---

## 3. V3 Runtime Architecture

```
Text (multi-sentence with emotion flow)
         ↓
    MicroALBERT → [(emotion_1, VAD_1), (emotion_2, VAD_2), ...]
         ↓
    Per-frame interpolation (smoothstep crossfade between segments)
         ↓
    V3 model input: (audio_features, 19-dim conditioning)
         where 19-dim = [one_hot_16_emotion + VAD(V,A,D)]
         ↓
    V3 model (FiLM-conditioned) → 52 blendshapes per frame
         ↓
    Avatar
```

- V3 takes **19-dim conditioning** (V2 used 5-dim).
- Crossfade mechanism identical to V2 PoC — per-frame interpolation of the input vector.
- No "intensity level" discrete mapping needed at runtime. VAD is continuous.
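
A minimal sketch of how the 19-dim conditioning vector could be assembled and interpolated per frame. The helper names, the use of an integer emotion index (the ordering of the 16 emotions is not specified here), and the example VAD values are assumptions for illustration only.

```python
import numpy as np

NUM_EMOTIONS = 16
COND_DIM = NUM_EMOTIONS + 3  # one-hot emotion + (V, A, D) = 19


def make_conditioning(emotion_index, vad):
    """Build [one_hot_16_emotion + VAD(V, A, D)] for one segment."""
    if not 0 <= emotion_index < NUM_EMOTIONS:
        raise ValueError(f"emotion_index must be in [0, {NUM_EMOTIONS})")
    vad = np.asarray(vad, dtype=np.float32)
    if vad.shape != (3,):
        raise ValueError("vad must have exactly 3 values: V, A, D")
    cond = np.zeros(COND_DIM, dtype=np.float32)
    cond[emotion_index] = 1.0
    cond[NUM_EMOTIONS:] = vad
    return cond


def blend_conditioning(cond_a, cond_b, weight):
    """Linear blend of two conditioning vectors; weight=0 gives cond_a."""
    return (1.0 - weight) * cond_a + weight * cond_b


# Hypothetical segments: indices 3 and 7 stand in for two different emotions.
seg_a = make_conditioning(3, [0.8, 0.5, 0.2])
seg_b = make_conditioning(7, [-0.6, -0.3, -0.4])
halfway = blend_conditioning(seg_a, seg_b, 0.5)
```

Because the one-hot part and the VAD part are interpolated together, a frame halfway through a transition carries 0.5 on each emotion slot and the midpoint VAD; no discrete intensity level appears anywhere in the vector.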

---

## 4. Rule-Based Compiler (Training Target Generator)

Generates 52 blendshape values for any `(emotion, VAD)` input. Uses a **hybrid layer approach** per `docs/research/vad-to-arkit-blendshape-mapping.md §6`:

```
Input: emotion, VAD=[V, A, D]
         ↓
    [Parametric layer] — continuous functions of V, A, D
        mouthSmile = sigmoid(V_sensitivity * V + ...)
        browDown   = sigmoid(-V_sensitivity * V + ...)
        eyeWide    = sigmoid(A_sensitivity * A + ...)
        etc.
         ↓
    [Archetype layer] — RBF interpolation over 47 hand-authored presets
         ↓
    Blend: 60% parametric + 40% archetype
         ↓
Output: 52 blendshapes
```

**Why hybrid:**
- Parametric captures smooth VAD-to-blendshape gradients (e.g., "higher valence → wider smile").
- Archetype locks in the specific emotional character of each of the 47 presets (sulk's pout vs apology's press are visually distinct despite similar VAD).
- Neither alone is sufficient; together they complement.

**Preset format:** 47 blendshape vectors authored via the blendshape editor tool, exported as JSON.
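
The hybrid blend can be sketched as below, under several stated assumptions. The sensitivity matrix, bias, kernel width `sigma`, and the random placeholder data are all hypothetical; the real values live in the research doc and the exported preset JSON. The archetype layer here uses a normalized Gaussian kernel average, which is a simple stand-in for RBF interpolation and does not reproduce presets exactly at their anchors. The sketch also keys only on VAD and ignores the discrete emotion label.

```python
import numpy as np

NUM_BLENDSHAPES = 52
PARAMETRIC_WEIGHT = 0.6
ARCHETYPE_WEIGHT = 0.4


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def parametric_layer(vad, sensitivity, bias):
    """Continuous functions of V, A, D: one sigmoid per blendshape.

    sensitivity has shape (52, 3), bias has shape (52,).
    """
    return sigmoid(sensitivity @ vad + bias)


def archetype_layer(vad, anchor_vads, presets, sigma=0.25):
    """Gaussian-kernel weighted average of the hand-authored presets."""
    d2 = np.sum((anchor_vads - vad) ** 2, axis=1)
    # Subtract the minimum distance for numerical stability before exp().
    weights = np.exp(-(d2 - d2.min()) / (2.0 * sigma**2))
    weights /= weights.sum()
    return weights @ presets


def compile_blendshapes(vad, sensitivity, bias, anchor_vads, presets):
    vad = np.asarray(vad, dtype=np.float64)
    parametric = parametric_layer(vad, sensitivity, bias)
    archetype = archetype_layer(vad, anchor_vads, presets)
    blended = PARAMETRIC_WEIGHT * parametric + ARCHETYPE_WEIGHT * archetype
    return np.clip(blended, 0.0, 1.0)


# Placeholder data only; real presets come from the exported JSON.
rng = np.random.default_rng(0)
anchor_vads = rng.uniform(-1.0, 1.0, size=(47, 3))
presets = rng.uniform(0.0, 1.0, size=(47, NUM_BLENDSHAPES))
sensitivity = rng.normal(size=(NUM_BLENDSHAPES, 3))
bias = np.zeros(NUM_BLENDSHAPES)

frame = compile_blendshapes([0.5, 0.3, 0.1], sensitivity, bias, anchor_vads, presets)
```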

---

## 5. Training Data Strategy

### 5.1 Channel split (V3 training targets)

| Channel type | Count | Target source |
|---|---|---|
| **LIPSYNC** (jaw, mouthClose, mouthFunnel, tongueOut, etc.) | 14 | **LAM model inference** (authoritative) |
| **EXPRESSION** (eyes, brows, cheeks, nose) | 22 | **Rule-based compiler** (driven by the 47 authored presets) |
| **SHARED** (mouthSmile, mouthFrown, mouthStretch, etc.) | 16 | **LAM base + gated emotion delta** |

SHARED channel gating per `V3_IMPLEMENTATION_PLAN_v2.md §3.4`: when LAM's mouth is actively articulating (e.g., `/u/` sounds with high `mouthFunnel`), attenuate the emotion overlay to preserve lipsync accuracy.
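
One possible shape for the per-frame merge, as a sketch only. The index layout below is a placeholder (real ARKit channel order differs), the gate formula `1 - max(articulation activity)` is an assumption rather than the rule in the implementation plan, and `emotion_delta` is assumed to be supplied by the compiler as an offset from the LAM base.

```python
import numpy as np

NUM_BLENDSHAPES = 52
# Placeholder index layout matching the 14 / 22 / 16 split.
LIPSYNC_IDX = np.arange(0, 14)
EXPRESSION_IDX = np.arange(14, 36)
SHARED_IDX = np.arange(36, 52)


def speech_gate(lam_frame, articulation_idx):
    """1.0 when the mouth is idle, falling to 0.0 at full articulation."""
    activity = np.clip(np.max(lam_frame[articulation_idx]), 0.0, 1.0)
    return 1.0 - activity


def merge_frame(lam_frame, compiler_frame, emotion_delta, articulation_idx=LIPSYNC_IDX):
    """Combine LAM and compiler outputs according to the channel split."""
    out = np.empty(NUM_BLENDSHAPES)
    out[LIPSYNC_IDX] = lam_frame[LIPSYNC_IDX]            # LAM is authoritative
    out[EXPRESSION_IDX] = compiler_frame[EXPRESSION_IDX]  # compiler only
    gate = speech_gate(lam_frame, articulation_idx)
    shared = lam_frame[SHARED_IDX] + gate * emotion_delta[SHARED_IDX]
    out[SHARED_IDX] = np.clip(shared, 0.0, 1.0)
    return out


compiler_frame = np.full(NUM_BLENDSHAPES, 0.5)
emotion_delta = np.full(NUM_BLENDSHAPES, 0.3)

silent_lam = np.full(NUM_BLENDSHAPES, 0.2)
silent_lam[LIPSYNC_IDX] = 0.0        # mouth idle: full emotion overlay
speaking_lam = silent_lam.copy()
speaking_lam[0] = 1.0                # one lipsync channel fully active

merged_silent = merge_frame(silent_lam, compiler_frame, emotion_delta)
merged_speaking = merge_frame(speaking_lam, compiler_frame, emotion_delta)
```

In the silent frame the SHARED channels receive the full delta; in the speaking frame the gate closes and they fall back to the LAM base, which is the attenuation behavior described above.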

### 5.2 Data sources

| Data | Source | Cost |
|---|---|---|
| **Audio** | `edge-tts` on ~7,000 seed scenario turns (already labeled with emotion+VAD) | $0 |
| **Emotion+VAD conditioning** | Already labeled in seed scenarios | $0 |
| **LIPSYNC targets** | LAM model inference on TTS audio | Compute only |
| **EXPRESSION targets** | Rule-based compiler (emotion + VAD → 52 blendshapes) | Compute only |
| **SHARED channel targets** | LAM output + compiler delta with speech gate | Compute only |
| **Additional real data (base emotions)** | MEAD dataset (4,488 clips, 5 base emotions) | Already available |

**Total training triples:** ~15-20K `(audio_features, 19-dim conditioning, 52 blendshapes)`.

**Total data-acquisition cost:** ~$0 (no ElevenLabs, no manual data collection, no recording). Compute for training is costed separately in §6.5.

### 5.3 Key decisions

- **Audio does NOT need to be emotionally expressive.** FiLM conditioning drives expression; audio drives lipsync. Neutral `edge-tts` output is sufficient for training data generation.
- **Seed scenario reuse:** the 7,000 turns in `seed_train_final.jsonl` (already emotion+VAD labeled) are the primary training source. No new data authoring needed.
- **MEAD inclusion:** Real emotional audio (MEAD's 4,488 clips) is mixed in during training for the 5 base emotions, especially early epochs, to teach the SHARED channels real prosody-expression correlation.

### 5.4 Training smoothness (per-frame interpolation during data generation)

To ensure training targets have natural transitions (not abrupt switches at turn boundaries):

```
For each multi-turn scenario:
  1. Build per-frame emotion+VAD timeline with smoothstep crossfade at each boundary
  2. Run rule-based compiler per frame (on interpolated emotion+VAD)
  3. Merge with LAM output per channel-type rules
```

This bakes crossfade smoothness into the training targets themselves, so V3 learns to produce smooth transitions naturally.
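
Step 1 of the procedure above might look like the following sketch. The frame rate, the reuse of the ±800 ms window from the V2 PoC, and the function names are assumptions; the source only states that a smoothstep crossfade is applied at each boundary.

```python
import numpy as np


def smoothstep(x):
    x = np.clip(x, 0.0, 1.0)
    return x * x * (3.0 - 2.0 * x)


def build_timeline(segment_conds, boundaries_s, duration_s, fps=30.0, half_window_s=0.8):
    """Per-frame conditioning for a multi-turn scenario.

    segment_conds: (S, D) array, one conditioning vector per turn.
    boundaries_s:  S - 1 ascending times (seconds) where turns change.
    Returns an array of shape (num_frames, D).
    """
    segment_conds = np.asarray(segment_conds, dtype=np.float64)
    if len(boundaries_s) != len(segment_conds) - 1:
        raise ValueError("need exactly one boundary between consecutive turns")
    times = np.arange(int(round(duration_s * fps))) / fps
    # Start fully on the first turn, then fade toward each later turn in order.
    timeline = np.repeat(segment_conds[:1], len(times), axis=0)
    for k, boundary in enumerate(boundaries_s):
        x = (times - (boundary - half_window_s)) / (2.0 * half_window_s)
        w = smoothstep(x)[:, None]
        timeline = (1.0 - w) * timeline + w * segment_conds[k + 1]
    return timeline


# Three hypothetical turns with 2-dim vectors to keep the example readable.
turns = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
timeline = build_timeline(turns, boundaries_s=[2.0, 5.0], duration_s=7.0)
```

Each row of `timeline` would then be passed to the rule-based compiler (step 2) before merging with LAM output (step 3). Applying the fades sequentially keeps the result continuous even when two windows overlap on a short turn.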

---

## 6. Fine-Tuning Plan

### 6.1 Starting point
- V2 checkpoint (audio features → 52 blendshapes, 5-dim FiLM)
- Replace FiLM: `Linear(5, 64)` → `MLP(19, 64, 64)` (+704 parameters)

### 6.2 Loss composition
- Supervised loss on generated training targets (primary)
- Anchor-snap loss (per `V3_EXPRESSION_PRESET_FEASIBILITY §anchor-snap`): every training step includes a batch of `(anchor_VAD[i], preset_blendshapes[i])` pairs with high weight. **This guarantees preset reproducibility at canonical VAD coordinates**, regardless of training randomness.
- Consistency loss on ±0.05 VAD jitter (smoothness in VAD space).
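
How the three terms might combine, sketched in NumPy for readability (real training would use framework tensors with autodiff). The weights `w_anchor` and `w_consistency`, the use of mean squared error for every term, the audio features attached to the anchor batch, and the toy linear model are all assumptions; only the three loss components and the ±0.05 VAD jitter come from the plan.

```python
import numpy as np

VAD_SLICE = slice(16, 19)  # last 3 of the 19 conditioning dims


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def loss_terms(model, batch, anchor_batch, rng,
               w_anchor=10.0, w_consistency=1.0, jitter=0.05):
    """Return each loss component plus the weighted total."""
    audio, cond, target = batch
    anchor_audio, anchor_cond, presets = anchor_batch

    pred = model(audio, cond)
    supervised = mse(pred, target)

    # Anchor-snap: canonical conditioning should reproduce its preset.
    anchor_snap = mse(model(anchor_audio, anchor_cond), presets)

    # Consistency: a small VAD perturbation should barely move the output.
    jittered = cond.copy()
    jittered[:, VAD_SLICE] += rng.uniform(-jitter, jitter, size=(len(cond), 3))
    consistency = mse(model(audio, jittered), pred)

    total = supervised + w_anchor * anchor_snap + w_consistency * consistency
    return {"supervised": supervised, "anchor_snap": anchor_snap,
            "consistency": consistency, "total": total}


# Toy linear stand-in for the V3 model, for demonstration only.
rng = np.random.default_rng(0)
W_audio = rng.normal(scale=0.1, size=(8, 52))
W_cond = rng.normal(scale=0.1, size=(19, 52))


def toy_model(audio, cond):
    return audio @ W_audio + cond @ W_cond


batch = (rng.normal(size=(4, 8)), rng.uniform(size=(4, 19)), rng.uniform(size=(4, 52)))
anchor_batch = (np.zeros((2, 8)), rng.uniform(size=(2, 19)), rng.uniform(size=(2, 52)))
terms = loss_terms(toy_model, batch, anchor_batch, rng)
```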

### 6.3 Curriculum
Per `V3_IMPLEMENTATION_PLAN_v2.md §6`:
- Epoch 1-20: 50% MEAD / 50% compiler-generated
- Epoch 41+: 20% MEAD / 80% compiler-generated
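
A small sketch of the schedule as a lookup, assuming per-batch mixing and hypothetical helper names. Only the two ranges listed above are encoded; the function returns `None` for epochs 21-40, since their ratio is not stated in this summary.

```python
from typing import Optional, Tuple


def mead_fraction(epoch: int) -> Optional[float]:
    """Fraction of each batch drawn from MEAD for a 1-indexed epoch."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    if epoch <= 20:
        return 0.5
    if epoch >= 41:
        return 0.2
    return None  # epochs 21-40: ratio not listed in this document


def batch_split(batch_size: int, epoch: int) -> Tuple[int, int]:
    """Return (num_mead, num_compiler_generated) samples for one batch."""
    fraction = mead_fraction(epoch)
    if fraction is None:
        raise ValueError(f"no MEAD ratio specified for epoch {epoch}")
    num_mead = round(batch_size * fraction)
    return num_mead, batch_size - num_mead


early = batch_split(32, epoch=1)
late = batch_split(32, epoch=50)
```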

### 6.4 Backbone
- Freeze or low-LR (×0.1) during initial epochs.
- Unfreeze for final epochs once FiLM is stable.

### 6.5 Expected cost
- ~8 hours on one A100 (~$30).
- Total end-to-end: 3 days, <$80.

---

## 7. Anchor File (v1.5)

`data/emotion/emotion_vad_anchors.json` is **frozen at v1.5**. Any further changes invalidate downstream compiler output and training targets.

Changes in v1.5:
- Gratitude D lowered (`-0.15` to `+0.05`) to separate from joy in VAD space (L2 distance 0.24 → 0.37).

---

## 8. Next Steps (Pipeline Build Order)

1. **Build rule-based compiler** (Hybrid: parametric + archetype, 60/40 blend).
   - Parametric layer: load V/A/D sensitivity tables from research docs.
   - Archetype layer: RBF interpolation over 47 presets JSON.
   - Input: `(emotion, VAD)` → Output: 52 blendshapes.
2. **Verify compiler output** on each of the 47 preset VAD coordinates — should return blendshapes close to the original presets (anchor reproducibility test).
3. **Generate training data pipeline:**
   - Iterate over seed scenario turns
   - `text → edge-tts → audio`
   - `LAM inference → lipsync channels`
   - `compiler(emotion, VAD) → expression channels`
   - Merge per channel-split rules, save as `.npy` triples.
4. **Fine-tune V2 → V3** with anchor-snap loss and MEAD curriculum.
5. **Validate:**
   - Preset reproducibility test (same VAD coords → same face, variance < 0.01).
   - Visual-distinctness test (all 16 emotions at L3 look meaningfully different).
   - End-to-end smoothness test (multi-emotion text → smooth face animation).
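
The anchor reproducibility check in step 2 could be automated along these lines. This is a sketch: the function name, the use of maximum absolute error as the closeness metric, and the `0.01` tolerance applied to the compiler are assumptions, and the toy compiler and two-anchor data are placeholders for the real compiler and the 47 presets.

```python
import numpy as np


def check_anchor_reproducibility(compile_fn, anchor_vads, presets, tol=0.01):
    """Return (anchor_index, max_abs_error) for every anchor that misses."""
    failures = []
    for i, (vad, preset) in enumerate(zip(anchor_vads, presets)):
        output = np.asarray(compile_fn(vad), dtype=np.float64)
        error = float(np.max(np.abs(output - preset)))
        if error >= tol:
            failures.append((i, error))
    return failures


# Placeholder anchors and presets (two instead of 47, for brevity).
anchor_vads = np.array([[0.8, 0.5, 0.2], [-0.6, -0.3, -0.4]])
presets = np.array([np.full(52, 0.3), np.full(52, 0.7)])


def toy_compiler(vad):
    """Stand-in compiler that returns the nearest preset unchanged."""
    nearest = np.argmin(np.sum((anchor_vads - vad) ** 2, axis=1))
    return presets[nearest]


failures = check_anchor_reproducibility(toy_compiler, anchor_vads, presets)
```

An empty `failures` list means every anchor reproduced its preset within tolerance; otherwise the list identifies which presets drifted and by how much.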

---

## 9. Related Files

| File | Purpose |
|---|---|
| `data/emotion/emotion_vad_anchors.json` | 47-preset VAD coordinates (v1.5, frozen) |
| `docs/research/vad-to-arkit-blendshape-mapping.md` | Parametric layer sensitivity specs (§5-6) |
| `docs/research/emotion-blendshape-patterns.md` | Reference expression patterns per emotion |
| `docs/V3_DECISION_SUMMARY.md` | Previous architecture decisions |
| `docs/V3_IMPLEMENTATION_PLAN_v2.md` | Channel split rules (§3.4), curriculum (§6) |
| `docs/V3_EXPRESSION_PRESET_FEASIBILITY.md` | Anchor-snap loss mechanism |
| `tools/blendshape-editor.html` | Tool used to author the 47 presets |
| `examples/guide/lipsync-wasm-wrapper.js` | V2 inference wrapper (19-dim setEmotion update needed) |
| `/dataset/AnimaSync/examples/emotion-demo/` | V2 crossfade PoC (edge-tts + processFileWithTimeline) |
| `/dataset/mead-expression-training/e2f/distill/precomputed_features/` | MEAD 4,488 clips (base emotion training data) |
