# V3 Face — Remaining Work TODO

> Living TODO for the rest of V3 face. Update inline as items complete.
> Authored 2026-05-13, end of lipsync-jitter session.

---

## Current status (2026-05-13)

- **Phase 1 (lipsync) — DONE** (best_lipsync.pt @ epoch 66, val L1 = 0.0147).
- **Lipsync jitter** — resolved via inference-time `crisp_mouth` filter (V1-style soft-threshold gate). See `V3_LIPSYNC_JITTER_DECISION.md`. Visually approved in viewer.
- **Filter location decision** — Option 1 (export-time wrapper). Deferred until export stage. See §3.
- Files touched this session:
  - `scripts/compiler/data_pipeline.py` — added `--lipsync-smooth` (V2 jitter-gate) flag for optional GT smoothing. Not used yet; available if Option F resurfaces.
  - `models/v3_face/infer.py` — added `--crisp`, `--crisp-threshold`, `--crisp-scale`, `--crisp-sigma`. Use these for any future inference check.

---

## 1. Phase 2 — Train expression branch (NEXT)

Freeze lipsync, train expression branch on top.

```bash
cd /dataset/kemix-engine/package/face/animasync-face-v3
PYTHONPATH=. python3 -m models.v3_face.train \
    --focus expression --freeze_lipsync \
    --resume models/v3_face/checkpoints/best_lipsync.pt \
    --epochs 150 --wandb --wandb_run_name "v8_phase2" \
    --device cuda:0
```

What to watch:
- Val L1 on expression channels (target < 0.020, the same order of magnitude as the lipsync val L1 of 0.0147).
- Brow trajectory smoothness in viewer — if brows jitter, that's a separate problem (prosody-driven brow movement, not lipsync). Velocity warmup + brow velocity weight (0.45) already in config.
- Make sure lipsync output is bit-identical to phase 1. `freeze_lipsync()` should guarantee this; spot-check on `long_001` by diffing pred lipsync channels against phase-1 prediction.
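
The bit-identical spot-check on the lipsync channels could be sketched roughly as below. This is a standard-library-only illustration: `max_abs_diff`, the toy frame data, and the channel indices are all hypothetical. In practice the two predictions would be loaded from wherever `infer.py` writes them, and the lipsync indices would come from the project's own channel lists.

```python
def max_abs_diff(pred_a, pred_b, channels):
    """Largest absolute difference over the given channels, frame by frame.

    pred_a / pred_b: sequences of frames, each frame a sequence of channel values.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions have different frame counts")
    worst = 0.0
    for frame_a, frame_b in zip(pred_a, pred_b):
        for c in channels:
            worst = max(worst, abs(frame_a[c] - frame_b[c]))
    return worst


# Toy data (assumption): 2 frames x 4 channels; channels 0 and 1 stand in for lipsync.
phase1 = [[0.10, 0.20, 0.00, 0.00], [0.30, 0.40, 0.00, 0.00]]
phase2 = [[0.10, 0.20, 0.05, 0.07], [0.30, 0.40, 0.02, 0.09]]
lipsync_channels = [0, 1]

# Bit-identical means the difference is exactly zero, not merely small.
frozen_ok = max_abs_diff(phase1, phase2, lipsync_channels) == 0.0
print("lipsync frozen:", frozen_ok)
```

Any nonzero difference on the lipsync channels would suggest the freeze did not hold (or that inference is nondeterministic), which is worth resolving before trusting the phase 2 checkpoint.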

When done:
- Run inference with `--crisp` to confirm lipsync still smooth and expression branch is producing meaningful brow/cheek/eye motion.
- Visual check on dialogue + monologue variety (`long_001`, `long_046`, `long_001_p0`, plus some `solo_` and `daily_` from the viewer dropdown).

---

## 2. MicroAlbert distillation (still pending from earlier)

Text → VAD student model. Teacher at `checkpoints/klue_teacher_clean_ctx2/best.pt`. No student trained yet. Out of scope for this session — pick up after phase 2 is locked.

Reference: `docs/V3_EMOTION_DETECTION_PLAN.md`.

---

## 3. Export wrapper — Option 1 (DEFERRED until export stage)

**Decision (2026-05-13):** the long-term answer for lipsync smoothness is to bake the filter into the deployed ONNX graph, not run it as separate post-processing code. Defer until V3 is ready to export.

### Why Option 1 (export wrapper) over alternatives

| Approach | Pros | Cons |
|---|---|---|
| **Option 1 — export wrapper** (chosen) | One artifact, no retrain, decoupled iteration, reversible, standard practice (NVIDIA Audio2Face etc.) | Filter wasn't part of training loss |
| Train-time bake (Option B) | Loss sees the filter | Retrain phase 1 + phase 2, slight collapse risk |
| VQ-VAE rewrite | Smoothness from codebook manifold, SOTA direction | ~1 week, real architectural change |
| Diffusion | Smoothest by construction | ~2 weeks, hard to fit 20MB cap |
| GT smoothing (Option F) | V2-style "fix the data, no post-process" | Uncertain — regression variance is structural, not learned from teacher HF. May not help. |

### What to build when we get there

A torch wrapper module exporting `model + crisp_mouth` together to ONNX:

```python
import torch
from torch import nn

# LIPSYNC_ONLY / SHARED_CHANNELS are the project's channel-index lists;
# import them from wherever the v3_face package defines them.


class V3FaceDeployable(nn.Module):
    """Wraps the trained model with a causal, streaming-safe crisp_mouth filter."""

    def __init__(self, model, alpha=0.5, edge0=0.09, edge1=0.36, scale=1.0):
        super().__init__()
        self.model = model
        self.alpha = alpha
        self.edge0 = edge0
        self.edge1 = edge1
        self.scale = scale
        # Indices of the 31 filtered channels (lipsync-only + shared).
        self.register_buffer("crisp_idx",
            torch.tensor(sorted(set(LIPSYNC_ONLY) | set(SHARED_CHANNELS)), dtype=torch.long))

    def forward(self, audio, cond, prev_smoothed):
        """
        prev_smoothed: (B, 31), per-channel EMA state from the previous frame.
                       Pass zeros on the first frame. Caller threads this
                       through across calls for streaming.
        Returns: (out, next_smoothed)
        """
        out = self.model(audio, cond)            # (B, T, 52)
        # streaming-safe crisp_mouth on the 31 lipsync channels
        lip = out.index_select(-1, self.crisp_idx)   # (B, T, 31)
        smoothed = []
        prev = prev_smoothed
        # NOTE: trace-based ONNX export unrolls this Python loop, which fixes T
        # at the length of the dummy input used for export.
        for t in range(lip.shape[1]):
            cur = self.alpha * lip[:, t, :] + (1.0 - self.alpha) * prev
            smoothed.append(cur)
            prev = cur
        smoothed = torch.stack(smoothed, dim=1)      # (B, T, 31)
        # smoothstep gate (stateless)
        u = ((smoothed - self.edge0) / (self.edge1 - self.edge0)).clamp(0.0, 1.0)
        gate = u * u * (3.0 - 2.0 * u)
        crisped = (smoothed * gate * self.scale).clamp(0.0, 1.0)
        out = out.index_copy(-1, self.crisp_idx, crisped)
        # prev now holds the EMA state after the last frame of this call.
        return out, prev
```

Key differences from V1's offline `crisp_mouth` for streaming:
1. EMA instead of Gaussian pre-smooth — causal, stateful via `prev_smoothed`.
2. Absolute `edge0`/`edge1` (no per-utterance max normalization) — stateless gate.
3. State threaded as input/output of forward so callers can chunk frames freely.
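
To make the state-threading point concrete, here is a minimal single-channel sketch in plain Python (no torch). `crisp_stream` and `smoothstep_gate` are hypothetical helper names and the signal is made up. The sketch only illustrates that a causal EMA followed by an absolute-edge gate gives the same result whether frames arrive in one call or in chunks, as long as the state is carried across calls.

```python
def smoothstep_gate(x, edge0, edge1):
    """Smoothstep of x between absolute edges; 0 below edge0, 1 above edge1."""
    u = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return u * u * (3.0 - 2.0 * u)


def crisp_stream(frames, prev, alpha=0.5, edge0=0.09, edge1=0.36, scale=1.0):
    """Causal EMA + stateless gate for one channel. Returns (outputs, next_state)."""
    out = []
    for x in frames:
        prev = alpha * x + (1.0 - alpha) * prev          # EMA update (stateful)
        y = prev * smoothstep_gate(prev, edge0, edge1) * scale
        out.append(min(max(y, 0.0), 1.0))                # clamp to [0, 1]
    return out, prev


signal = [0.0, 0.1, 0.3, 0.5, 0.2, 0.05, 0.4, 0.6, 0.1]  # toy channel values

# One call over the whole utterance, starting from zero state.
full, _ = crisp_stream(signal, prev=0.0)

# Same signal in chunks of 4 frames, threading the EMA state between calls.
chunked, state = [], 0.0
for start in range(0, len(signal), 4):
    out, state = crisp_stream(signal[start:start + 4], state)
    chunked.extend(out)

print(chunked == full)
```

Because each frame goes through the same arithmetic in the same order in both paths, the comparison holds exactly. A Gaussian pre-smooth or a per-utterance max normalization would not have this property, since both need frames that have not arrived yet.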

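Calibrate `alpha`, `edge0`, `edge1` against the current offline `--crisp` output on `long_001` / `long_046` so the streaming version looks the same. Single-utterance sanity check: with the full audio passed in one call, the deployable wrapper output should match the `infer.py --crisp` output within a small tolerance ε. The two filters differ by design (EMA vs. Gaussian, absolute vs. normalized edges), so expect a close match rather than an exact one, and fix the value of ε as part of calibration.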
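
One way the calibration could be approached is a small grid search that minimizes mean L1 against the offline output. The sketch below is single-channel, plain Python, and entirely illustrative: `calibrate`, the grids, and the toy signal are assumptions, and the "reference" here is generated by the streaming filter itself as a stand-in for real `infer.py --crisp` output.

```python
from itertools import product


def crisp_stream(frames, alpha, edge0, edge1):
    """Causal EMA followed by an absolute-edge smoothstep gate (one channel)."""
    prev, out = 0.0, []
    for x in frames:
        prev = alpha * x + (1.0 - alpha) * prev
        u = min(max((prev - edge0) / (edge1 - edge0), 0.0), 1.0)
        out.append(min(max(prev * u * u * (3.0 - 2.0 * u), 0.0), 1.0))
    return out


def mean_l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def calibrate(raw, reference, alphas, edge0s, edge1s):
    """Return ((alpha, edge0, edge1), error) with the lowest mean L1 to reference."""
    best_params, best_err = None, float("inf")
    for alpha, e0, e1 in product(alphas, edge0s, edge1s):
        if e1 <= e0:
            continue  # gate needs edge0 < edge1
        err = mean_l1(crisp_stream(raw, alpha, e0, e1), reference)
        if err < best_err:
            best_params, best_err = (alpha, e0, e1), err
    return best_params, best_err


raw = [0.0, 0.1, 0.3, 0.5, 0.2, 0.05, 0.4, 0.6, 0.1]   # toy raw model output
reference = crisp_stream(raw, 0.5, 0.09, 0.36)          # stand-in for offline --crisp

best, err = calibrate(raw, reference,
                      alphas=[0.3, 0.5, 0.7],
                      edge0s=[0.05, 0.09],
                      edge1s=[0.30, 0.36])
print(best, err)
```

Against real offline output the minimum error will not be zero. The residual at the best setting is a reasonable starting point for choosing ε.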

ONNX export:
```python
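import torch

T = 64  # assumed chunk length in frames; replace with the real streaming chunk size
deploy = V3FaceDeployable(model).eval()
dummy_audio = torch.zeros(1, T, 80)
dummy_cond  = torch.zeros(1, T, 19)
dummy_prev  = torch.zeros(1, 31)
# Caveat: tracing unrolls the Python EMA loop in forward(), so the filter is
# baked in for exactly T frames. Either keep T fixed at inference, or script /
# vectorize the loop before relying on the dynamic T axis declared below.
torch.onnx.export(deploy, (dummy_audio, dummy_cond, dummy_prev),
                  "v3_face.onnx",
                  input_names=["audio", "cond", "prev_smoothed"],
                  output_names=["blendshapes", "next_smoothed"],
                  dynamic_axes={"audio": {1: "T"}, "cond": {1: "T"},
                                "blendshapes": {1: "T"}},
                  opset_version=17)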
```

Quantize to int8 if size > 20 MB after export.
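
A possible shape for the size check and the int8 step, assuming ONNX Runtime's dynamic quantization is acceptable for this model (not yet verified for quality). `exceeds_cap` and `quantize_if_needed` are hypothetical helper names, and treating the 20 MB cap as 20 × 1024 × 1024 bytes is an assumption.

```python
import os

SIZE_CAP_MB = 20.0  # size budget from this document (assumed to mean MiB)


def exceeds_cap(size_bytes, cap_mb=SIZE_CAP_MB):
    """True if a file of size_bytes is over the cap."""
    return size_bytes / (1024 * 1024) > cap_mb


def quantize_if_needed(onnx_path, out_path):
    """Return the path of the model to ship: original if small enough, else int8."""
    if not exceeds_cap(os.path.getsize(onnx_path)):
        return onnx_path
    # Imported lazily so the size check works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    # Dynamic quantization stores weights as int8; activations are quantized at runtime.
    quantize_dynamic(onnx_path, out_path, weight_type=QuantType.QInt8)
    return out_path
```

Usage would be along the lines of `quantize_if_needed("v3_face.onnx", "v3_face.int8.onnx")`, followed by rerunning the single-utterance sanity check on the quantized model, since quantization can shift outputs near the gate edges.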

### If Option 1 doesn't pan out

Fallback plan: Option B (train-time fixed low-pass layer). Same wrapper concept, but the filter is registered as a model layer from epoch 0 of phase 1, so the loss sees it during training. Conceptually slightly cleaner, but it means retraining phase 1 and phase 2, so it is only worth it if Option 1's output diverges visibly from what we see now in the offline `--crisp` viewer.

---

## 4. Loose ends

- **Curate viewer subset** for senior demo once phase 2 is done. Current viewer has phase-1-only predictions; refresh after phase 2.
- **Smoothing params lock-in** (from earlier session): `γ=0.3, σ=30, fade=96` are the current defaults baked into `data_pipeline.py`. Decision pending whether to keep these or switch to MIDDLE (`γ=0.45, σ=25, fade=72`). Visual check after phase 2 with current settings; switch only if expression brows feel wrong.
- **`data/v3_training_smoke/`** was offered for an F smoke test but never run. Safe to delete if it exists.

---

## Quick resume command

```bash
cd /dataset/kemix-engine/package/face/animasync-face-v3
git pull
# Then phase 2:
PYTHONPATH=. python3 -m models.v3_face.train --focus expression --freeze_lipsync \
    --resume models/v3_face/checkpoints/best_lipsync.pt \
    --epochs 150 --wandb --wandb_run_name "v8_phase2" --device cuda:0
```
