# V3 Expression Preset Feasibility Review

**Date:** 2026-04-14
**Context:** Crisis-of-confidence check on whether V3's VAD-driven architecture can produce consistent facial expression presets / baselines / checkpoints.
**Audit method:** Three specialist agents in parallel (avatar/blendshape expert, VAD/dimensional-emotion expert, architecture planner), synthesized into this document.

---

## TL;DR

**Pure VAD → face will fail.** 3-dimensional VAD is theoretically insufficient to uniquely determine 52 ARKit blendshapes. The existing crisis of confidence is **correct** and validated by published literature, prior internal analysis, and architecture theory.

**BUT**: the already-documented V3 architecture is a **hybrid** (16-class emotion label + 3-dim VAD = 19-dim conditioning), which is the correct approach. The team has not been overreaching; it has been **working on the wrong subproblem** (text→VAD paraphrase augmentation) while the real blocker (48 hand-authored archetype blendshape vectors) has been sitting unauthored.

**Recommendation:** finish the in-flight text augmentation run (data still useful for MicroALBERT), then pivot hard to archetype authoring for 1 week. Do NOT ship pure-VAD conditioning. Do NOT train ExpressionVAE yet.

---

## The core question

> Can we use VAD values as the driver for consistent, well-thought-out facial expression "presets" / "baselines" / "checkpoints" in the V3 avatar system?

The concern: given the chain
```
text → MicroALBERT → VAD [V, A, D] → AnimaSync lipsync → 52 blendshapes
```
…if the VAD bottleneck is too narrow, presets will be inconsistent (similar VAD produces different-looking faces) or collapsed (different emotions averaged into a single bland face).

---

## Expert consensus

### 1. Avatar/Blendshape Expert — "Pure VAD will produce mean-collapsed faces"

- **Architectural underdetermination**: VAD is a 3D space; ARKit's expression subspace (excluding lipsync channels) is ~30D. Many distinct expressions share one VAD coordinate, so the 3→30 mapping is one-to-many and no deterministic function of VAD alone can recover it. Trained with a regression loss, FiLM will learn the **conditional mean** of all expressions sharing a VAD coordinate.
- **Confirmed empirically**: prior internal analysis (`project_v3_vad_intermediary_analysis.md`) showed that VAD alone recovers only ~40% of the correct expression overall, with contempt at 15%, embarrassment at 10%, and relief at 25%. This is architectural, not a hyperparameter issue.
- **Predicted failure modes**:
  1. Mode collapse at VAD centroids → mushy default face for mid-V mid-A
  2. Drift at coordinate boundaries → nearby VAD points produce *similar but not identical* faces, breaking "preset reproducibility"
  3. Asymmetric / temporal features destroyed (contempt smirk, embarrassment gaze oscillation, relief phase transition)
  4. Annotation noise in VAD (inter-rater 0.5–0.7 on V, worse on D) poisons the learned map — retraining yields different outputs at the same coordinates
  5. Audio prosody dominance → FiLM ignored, high-bandwidth audio features drown out 3-dim VAD signal
- **Verdict:** Don't ship pure VAD. It will look averaged, drift between retraining runs, and the preset guarantee fails.
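
The conditional-mean argument above can be checked on a toy case. The sketch below assumes two made-up 4-channel expressions (the channel names are real ARKit blendshapes, but the values are illustrative, not measured archetypes) that carry the same VAD label. It shows that the prediction minimizing mean squared error is their channel-wise average, which matches neither expression.

```python
from statistics import fmean

# Hypothetical 4-channel targets (illustrative values, not real archetypes).
# Channels: [mouthSmileLeft, mouthSmileRight, mouthPressLeft, browDownLeft]
SMIRK = [0.8, 0.0, 0.5, 0.3]   # asymmetric, contempt-like
SMILE = [0.6, 0.6, 0.0, 0.0]   # symmetric, mild joy


def mse(pred, targets):
    """Mean squared error of one prediction against several targets."""
    return fmean(
        (p - t) ** 2 for target in targets for p, t in zip(pred, target)
    )


def conditional_mean(targets):
    """Channel-wise mean: the MSE-optimal output when the inputs are identical."""
    return [fmean(channel) for channel in zip(*targets)]


# Both faces are labelled with the same VAD coordinate, so a VAD-only model
# sees identical inputs and can emit only one output for both.
targets = [SMIRK, SMILE]
mean_face = conditional_mean(targets)
```

Here `mse(mean_face, targets)` is lower than `mse(SMIRK, targets)` or `mse(SMILE, targets)`, so a regression loss rewards the half-smirk, half-smile blend over either real expression.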

### 2. VAD / Dimensional Emotion Expert — "VAD is provably lossy for faces"

- **An information-theoretic limit, not a measurement quibble**: Cowen & Keltner (2017, *PNAS*); Cowen et al. (2019, *Nature*) demonstrated that facial and vocal expressions occupy a ~**28-dimensional** semantic space. Valence + arousal capture ~30–40% of variance; adding dominance recovers another ~5–10%. The remaining variance corresponds to categorical distinctions (awe vs surprise, amusement vs joy, sympathy vs sadness) that **VAD cannot separate**.
- **Classic nuance-collapse examples**:
  - *Shy joy* vs *peaceful joy* vs *mischievous joy*: all cluster at V≈+0.6, A≈0.2, D≈0.1, yet AU12+AU6+gaze-aversion vs AU12+AU6+AU14 are visually distinct.
  - *Controlled anger* vs *explosive anger*: differ on regulation/inhibition (Gross's process model), not captured by arousal alone.
  - *Sad/angry/joy tears*: share lacrimation but completely different upper-face AUs.
- **Annotation reliability**: LLM-generated or hand-assigned VAD is *categorically different* from psychophysical norms (ANEW / Warriner 2013 / NRC-VAD). LLM-VAD shows compression toward neutral and D is the noisiest dimension (Mohammad 2018, ACL).
- **Prior art that actually worked** (emotion-conditioned facial animation; note that none of the successes is purely VAD-driven):
  - **EMOCA / EMOTE** (Daněček et al. 2022, 2023): emotion *category* + intensity. Convincingly expressive.
  - **MEAD** (Wang et al. 2020): 8 discrete categories × 3 intensities.
  - **DiffPoseTalk / EmoTalk** (Peng et al. 2023): one-hot emotion + intensity scalar.
  - **Pure VAD-driven facial synthesis**: no convincingly successful published system. The closest (Ringeval et al. AVEC) uses VAD for *recognition*, not *generation*.
- **Recommendation**: Option (b) — emotion label as primary conditioning, VAD as continuous intensity/modulation modifier. This matches EMOTE/EmoTalk/MEAD and your existing V3 design in `V3_DECISION_SUMMARY §2.1`.

### 3. Architecture Planner — "You're solving the wrong subproblem"

- **Key finding**: The team has been focused on text→VAD paraphrase augmentation for 2+ weeks. This is NOT the V3 blocker. The V3 blocker is **48 hand-authored archetype blendshape vectors** that have never been committed.
- **Your existing plan is already correct**: `V3_DECISION_SUMMARY`, `V3_GENERATOR_MODEL_PLAN.md`, `emotion_vad_anchors.json v1.4`, and `docs/research/vad-to-arkit-blendshape-mapping.md` together specify a hybrid architecture with:
  - 16-class emotion embedding + 3-dim VAD = 19-dim FiLM conditioning
  - 80 pre-existing VAD anchors in `emotion_vad_anchors.json`
  - Rule-based compiler spec in `vad-to-arkit-blendshape-mapping.md §6.1`
  - Anchor-snap regularization loss (not yet implemented)
- **Critical gap**: the 48 archetype blendshape vectors (16 emotions × 3 intensities) that make presets *deterministic by construction* have never been authored.

---

## The architecture that actually works

### Three layers, each with a clear job

1. **Anchor Bank (deterministic, zero ML)**
   - 48 hand-authored archetype blendshape vectors
   - One per (emotion, intensity) combination: 16 × 3 = 48
   - Stored as `ARCHETYPES: dict[tuple[str, int], np.ndarray]`, where each value is a vector of shape `(52,)`
   - **This is the "VAD = [0.6, 0.3, 0.2] always produces this face" guarantee.** Presets work because the archetype IS the preset, not because we hope FiLM learned it.

2. **Rule-based Parametric Compiler** (`vad_to_blendshapes_rule()`)
   - Deterministic function: VAD + emotion label → 52-dim blendshape vector
   - Spec already exists in `docs/research/vad-to-arkit-blendshape-mapping.md §6.1`
   - Sigmoid-scaled quadratic per channel with conflict resolution rules (§5.4)
   - Blends smoothly between archetypes via RBF interpolation in VAD space
   - **Fills continuous space between the 48 anchors.**

3. **FiLM Lipsync Model (current V2 backbone)**
   - Neural model adds audio-driven temporal realism on top of the deterministic base
   - Conditioned on 19-dim (16-class emotion label + 3 VAD)
   - Trained with **anchor-snap regularization**: at every training step, a batch of the 48 `(anchor_VAD[i], ARCHETYPES[i])` pairs is included with high loss weight
   - Trained on rule-generated targets + MEAD real clips (4,488 available at `/dataset/mead-expression-training/e2f/distill/precomputed_features/`)
   - **This is what pins the model to the preset bank at canonical VAD coordinates** (up to the residual training error on the anchor pairs) while still learning dynamic lipsync behavior from audio.
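
Layers 1 and 2 can be sketched together as a lookup table plus a blend in VAD space. This is a minimal illustration under stated assumptions, not the §6.1 spec: the anchor coordinates, the 3-channel vectors, the kernel width `sigma`, and the name `rbf_blend` are all hypothetical, and plain lists stand in for the 52-dim `np.ndarray` vectors. It uses normalized Gaussian weights, a kernel-smoothing variant of RBF blending.

```python
import math

# (emotion, intensity) -> (anchor VAD, blendshape vector).
# Toy 3-channel vectors and made-up coordinates; the real bank would hold
# 48 entries of 52 ARKit channels each.
ARCHETYPES = {
    ("joy", 1): ((0.4, 0.2, 0.1), [0.2, 0.1, 0.0]),
    ("joy", 2): ((0.6, 0.4, 0.2), [0.5, 0.3, 0.0]),
    ("joy", 3): ((0.8, 0.7, 0.3), [0.9, 0.6, 0.1]),
}


def rbf_blend(emotion, vad, sigma=0.1):
    """Blend one emotion's archetypes using normalized Gaussian weights."""
    anchors = [v for (name, _), v in ARCHETYPES.items() if name == emotion]
    if not anchors:
        raise KeyError(f"no archetypes for emotion {emotion!r}")
    weights = [
        math.exp(-math.dist(vad, anchor_vad) ** 2 / (2 * sigma ** 2))
        for anchor_vad, _ in anchors
    ]
    total = sum(weights)
    if total == 0.0:
        # Query is so far from every anchor that all weights underflow:
        # fall back to the nearest archetype.
        nearest = min(anchors, key=lambda a: math.dist(vad, a[0]))
        return list(nearest[1])
    n_channels = len(anchors[0][1])
    return [
        sum(w * vec[c] for w, (_, vec) in zip(weights, anchors)) / total
        for c in range(n_channels)
    ]
```

Because the blend is restricted to archetypes of the requested emotion, the label selects the face family and VAD only moves within it. Normalized weighting reproduces an archetype at its own anchor only approximately (here to within about 0.003 per channel). An exact match would need a small `sigma` or a solved interpolation system.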

### Why this gives consistent presets

- **Same emotion label → same "face family"** (identity is discrete, not collapsed through a lossy VAD bottleneck)
- **Similar VAD within the same label → smooth modulation** (FiLM learns intensity/modulation, not identity)
- **Anchor-snap loss enforces preset reproducibility** at canonical coordinates (up to the residual training error)
- **Rule-based compiler provides a deterministic fallback** — if FiLM output drifts, the compiler result is always available
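
A minimal sketch of the 19-dim conditioning vector, assuming a one-hot encoding of the label followed by the raw VAD triple. The label names are placeholders, since this document does not list the 16 classes, and `build_conditioning` is a hypothetical helper name.

```python
NUM_EMOTIONS = 16
# Placeholder label names; the real 16-class list is defined elsewhere.
EMOTIONS = [f"emotion_{i:02d}" for i in range(NUM_EMOTIONS)]


def build_conditioning(emotion, vad):
    """Return a 19-dim vector: 16-way one-hot label followed by (V, A, D)."""
    if len(vad) != 3:
        raise ValueError("vad must be a (valence, arousal, dominance) triple")
    one_hot = [0.0] * NUM_EMOTIONS
    one_hot[EMOTIONS.index(emotion)] = 1.0  # raises ValueError if unknown
    return one_hot + [float(x) for x in vad]
```

Two emotions at the same VAD coordinate now produce different conditioning vectors, which is the property pure VAD lacks.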

### Why pure VAD fails

- Different emotions at the same VAD coordinate collapse to the **conditional mean face**
- Preset reproducibility across retraining runs is not guaranteed
- 3D is not enough degrees of freedom to parameterize 52 blendshapes meaningfully
- Audio prosody has 10–100× the bandwidth of VAD → the model can learn to ignore VAD entirely

---

## Action plan (2–4 weeks)

### Week 1 — Freeze the deterministic layer

- **Days 1–2**: Hand-author `ARCHETYPES[48]`, one 52-dim blendshape vector per (emotion, intensity). Use tables in `docs/research/vad-to-arkit-blendshape-mapping.md §4` + `emotion-blendshape-patterns.md`. Plan one afternoon per 4 emotions with the avatar viewer (`run_all_emotions.py`), i.e. four sessions for 16 emotions; the risk register below argues that 2 days is optimistic.
- **Days 3–4**: Implement `vad_to_blendshapes_rule()` per research §6.1 (parametric + archetype RBF blend). Unit tests:
  - Same VAD input → identical output (determinism)
  - ±0.05 VAD perturbation → L1 difference ≤ 0.05 (smoothness)
  - Archetype VAD coordinate → archetype vector ± ε (anchor reproducibility)
- **Day 5**: Visual QA — render all 48 archetypes on VRM avatars. This is your **"preset bank"** deliverable, shippable on its own.
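
The three unit tests for the compiler can be sketched as reusable checks. This is a hedged illustration: `toy_compiler` is a stand-in so the checks run, not the §6.1 compiler. The sketch also assumes "±0.05 VAD" means a single-axis perturbation and "L1 difference" means the sum over channels. All helper names are hypothetical.

```python
def l1(a, b):
    """Sum of absolute channel differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def check_determinism(compiler, vad, emotion):
    """Same input twice must give bit-identical output."""
    return compiler(vad, emotion) == compiler(vad, emotion)


def check_smoothness(compiler, vad, emotion, step=0.05, tol=0.05):
    """Moving any single VAD axis by +/- step must change output by <= tol."""
    base = compiler(vad, emotion)
    for axis in range(3):
        for sign in (-1.0, 1.0):
            moved = list(vad)
            moved[axis] += sign * step
            if l1(base, compiler(tuple(moved), emotion)) > tol:
                return False
    return True


def check_anchor(compiler, anchor_vad, emotion, archetype, eps=1e-3):
    """At an anchor coordinate the output must match the archetype within eps."""
    return l1(compiler(anchor_vad, emotion), archetype) <= eps


def toy_compiler(vad, emotion):
    """Stand-in compiler with two channels, used only to exercise the checks."""
    v, a, _d = vad
    smile = min(max(0.5 * v, 0.0), 1.0)
    brow = min(max(0.3 * a, 0.0), 1.0)
    return [smile, brow]
```

Passing the compiler in as an argument lets the same checks run later against the real `vad_to_blendshapes_rule()` and against the fine-tuned FiLM model.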

### Week 2 — Generate training target dataset

- Run `generate_v3_targets.py` (spec in `V3_SENIOR_BRIEFING §3 Step 4`) over MEAD 4,488 + TTS synthesis
- Output: `teacher_outputs_v3/*.npy` — (audio, 19-dim conditioning, 52-dim blendshape) triples
- Target: ~15–20K triples (5 base emotions from MEAD + 11 sub-emotions from rule compiler on TTS audio)
- **This is the missing training data that has prevented the lipsync model from being V3-ready.**

### Week 3 — Fine-tune FiLM (5 → 19 dimensions)

- Initialize from V2 checkpoint (`checkpoints_anime_v2i_smooth_e/epoch_230.pt`)
- Loss: `L_supervised (w_real=3, w_syn=1) + L_consistency (±0.05 VAD jitter) + L_disentangle + 3-way channel weights` per `V3_DECISION_SUMMARY §5.3`
- **Critical**: include **anchor-snap loss** — at every training step, a batch of `(anchor_VAD[i], ARCHETYPES[i])` pairs with high weight, forcing the model to match the preset bank at canonical points
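
The anchor-snap term can be sketched as an extra, heavily weighted loss over the anchor pairs. This is a simplified illustration, not the loss from `V3_DECISION_SUMMARY §5.3`. It omits the audio input and the consistency and disentangle terms, uses a mean L1 distance, and picks `anchor_weight=10.0` arbitrarily. How audio is paired with anchors during training is not specified here.

```python
def l1_mean(pred, target):
    """Mean absolute error across channels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def total_loss(model, batch, anchors, anchor_weight=10.0):
    """Supervised loss plus a heavily weighted term on the anchor pairs.

    model:   callable(conditioning) -> blendshape list (audio omitted here)
    batch:   list of (conditioning, target) training pairs
    anchors: list of (anchor_conditioning, archetype_vector) pairs
    """
    supervised = sum(l1_mean(model(c), t) for c, t in batch) / len(batch)
    snap = sum(l1_mean(model(c), t) for c, t in anchors) / len(anchors)
    return supervised + anchor_weight * snap
```

With this weighting, an error at an anchor costs ten times as much as the same error on an ordinary sample, so the optimizer is pushed to fit the preset bank first.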

### Week 4 — Validate and ship V3-alpha

- Tests from `V3_DECISION_SUMMARY §8`
- Add **preset reproducibility test**: same (audio, emotion label, VAD) input 100 times → L1 variance < 0.01
- Add **visual-distinctness test**: 16 emotions at same intensity should look meaningfully different (human-rated or automated AU extraction)
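
Both added tests reduce to simple distance statistics. The sketch below is one possible automated form, under stated assumptions. It measures reproducibility as the largest L1 gap from the first run, a stricter stand-in for the variance criterion. It measures distinctness as the smallest pairwise L1 distance between emotions. The distinctness threshold is left to the caller because none is specified here, and the function names are hypothetical.

```python
from itertools import combinations


def l1(a, b):
    """Sum of absolute channel differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def max_repeat_deviation(model, inputs, runs=100):
    """Largest L1 gap between the first output and each repeated run."""
    reference = model(inputs)
    return max(l1(reference, model(inputs)) for _ in range(runs - 1))


def min_pairwise_distance(faces):
    """Smallest L1 distance between any two emotions' output vectors.

    faces: dict mapping emotion name -> blendshape vector
    """
    return min(l1(faces[a], faces[b]) for a, b in combinations(faces, 2))
```

A nonzero `max_repeat_deviation` on a model in inference mode would point to nondeterminism such as active dropout. A very small `min_pairwise_distance` would flag two emotions that have collapsed onto nearly the same face.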

---

## Stop vs Keep

### STOP

- **Text paraphrase augmentation past ~10K turns.** The current canary + full run (~$3) should finish, since it is already in motion and the data is useful for MicroALBERT. But do NOT chase 15K or 20K: returns diminish, and the work is upstream of the wrong problem.
- **Any plan to train lipsync on "text → face" directly (bypassing VAD).** The VAD bottleneck is fine IF the preset bank is deterministic. Removing VAD removes the intensity/modulation control surface.
- **Tuning `emotion_vad_anchors.json` coordinates.** Freeze as v1.4-final. Every adjustment invalidates downstream archetype authoring.
- **ExpressionVAE (Approach B).** Deferred to V3.1. The research doc explicitly warns that it needs 20K+ clips at minimum; you do not have them. Ship Approach A first.

### KEEP

- `emotion_vad_anchors.json v1.4` — freeze and tag as final
- 847 seed scenarios + test/val splits
- MEAD 4,488 precomputed features — the only real (audio, blendshape) ground truth available
- V3 channel classification (LIPSYNC / EXPRESSION / SHARED) per `project_v3_channel_classification.md`
- MicroALBERT plan — train on external datasets (KOTE + AI Hub ≈ 88K) with seed scenarios as eval set
- Current paraphrase augmentation pipeline (`augment_openai.py`) — finish the canary and full run, produce 10K rows, then stop

---

## Risk register

1. **Archetype authoring is harder than it looks.** Budget 1 person-week with visual iteration, not 2 days.
2. **Rule compiler + FiLM may "rule-replicate"** (neural just copies rule output). Mitigation: style modes / multi-gain vectors per emotion from `V3_SENIOR_BRIEFING §2.4`.
3. **MicroALBERT VAD regression quality** (~0.65–0.75 correlation) means the face engine receives ±0.15 VAD noise at inference. **This is why anchor-snap loss matters**: outputs for VAD queries near an anchor stay pinned close to that archetype instead of drifting.
4. **Overreach risk**: attempting Approach B (ExpressionVAE) in parallel with Approach A (archetype bank + FiLM) will stall both. Defer B to V3.1.
5. **MEAD teacher_outputs** may not preserve `speech_attenuation` behavior on SHARED channels. Verify before Week 2, else real MEAD data pulls FiLM toward "open-mouth always" and contaminates closed-mouth presets.
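
The ±0.15 figure in risk 3 can be sanity-checked against the quoted correlation range. The sketch assumes the regressor behaves like a least-squares linear fit, for which the residual standard deviation is the label standard deviation times sqrt(1 − r²). The spread of the VAD labels is not stated in this document, so the helper solves for the spread that would make the two numbers consistent.

```python
import math


def residual_std(label_std, r):
    """Residual std of a least-squares fit with correlation r."""
    return label_std * math.sqrt(1.0 - r * r)


def implied_label_std(noise_std, r):
    """Label std that would yield the given residual noise at correlation r."""
    return noise_std / math.sqrt(1.0 - r * r)


low = implied_label_std(0.15, 0.65)
high = implied_label_std(0.15, 0.75)
```

Under that assumption the ±0.15 estimate corresponds to a per-dimension label spread of roughly 0.20 to 0.23. If the real label distribution is much wider, the inference-time noise is larger than budgeted.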

---

## Critical files for implementation

- `/dataset/AnimaSync-mic-fix/data/emotion/emotion_vad_anchors.json` — freeze at v1.4
- `/dataset/AnimaSync-mic-fix/docs/research/vad-to-arkit-blendshape-mapping.md` — rule compiler spec
- `/dataset/AnimaSync-mic-fix/docs/V3_GENERATOR_MODEL_PLAN.md` — generator architecture
- `/dataset/AnimaSync-mic-fix/docs/V3_DECISION_SUMMARY.md` — hybrid conditioning spec
- `/dataset/AnimaSync-mic-fix/docs/V3_SENIOR_BRIEFING.md` — style modes, gain vectors
- `/dataset/mead-expression-training/e2f/config_e2f.py` — lipsync model config
- `/dataset/mead-expression-training/e2f/distill/model.py` — FiLM backbone to extend

---

## Bottom line

**Your instinct to question the VAD bottleneck was correct.** Had you trained a pure-VAD lipsync model on the 10K text dataset, the result would have been mean-collapsed and inconsistent, and you would have spent 2–3 weeks discovering this empirically.

**But the V3 architecture you already designed is correct.** Hybrid conditioning (16-class emotion label + 3-dim VAD) + deterministic archetype bank + rule-based compiler + FiLM with anchor-snap loss is the right answer. It matches what works in the literature (EMOTE, EmoTalk, MEAD) and is consistent with on-device constraints.

**The fix is 1 week of archetype authoring, not an architecture rewrite.** Finish the text augmentation currently in flight, then pivot immediately to Week 1 of this plan.