// on_device_ml · narration · the alignment problem

Proportional highlight today. Forced alignment is the whole point.

SvelteKit + foliate-js · FastAPI · Kokoro TTS · ZAI (ask-the-book) · GPU with CPU fallback
A narrated reader is easy to fake and hard to do right. The gap between the two is one problem: word-timing alignment.
13× faster on GPU, but the same script runs unchanged on a CPU-only box

Kokoro TTS runs ~13× faster on CUDA (~1.3s per 400-char chunk vs ~16s on CPU). The interesting decision isn't the speedup; it's that the same run.sh runs unchanged on both. The script sets ONNX_PROVIDER=CUDAExecutionProvider, and ONNX Runtime silently falls back to the CPU provider if CUDA isn't usable. No system CUDA toolkit is needed; the nvidia-* pip wheels supply the runtime libs.

So the dev box with a GPU and the CPU-only VPS run identical code paths. The capability detection is pushed into the runtime, not the operator — which means a deploy never breaks because the target lacks a GPU.
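
A minimal sketch of that fallback idea, assuming the provider name arrives through the ONNX_PROVIDER environment variable as described above. The helper names (provider_chain, providers_from_env) are hypothetical and not taken from the project. In a real server the `available` list would come from onnxruntime.get_available_providers(), and the resulting chain would be passed as the `providers` argument of onnxruntime.InferenceSession; both are omitted here so the sketch runs without a GPU or ONNX Runtime installed.

```python
import os

CPU = "CPUExecutionProvider"


def provider_chain(requested, available):
    """Return providers in priority order, always ending with the CPU provider."""
    chain = []
    # Only keep the requested provider if this machine actually offers it.
    if requested and requested != CPU and requested in available:
        chain.append(requested)
    # The CPU provider is the unconditional last resort.
    chain.append(CPU)
    return chain


def providers_from_env(available, env=None):
    """Read ONNX_PROVIDER (assumed variable name, per the text) and build the chain."""
    env = os.environ if env is None else env
    return provider_chain(env.get("ONNX_PROVIDER"), available)
```

On a GPU box the chain is CUDA first, then CPU; on the VPS the same call collapses to CPU only, so the operator never edits the script per target.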

// the TTS contract — audio out, timings as best-available
POST /tts {text, voice?, speed?}
  → {sample_rate, audio(base64 WAV),
     words:[{i,start,end}]}   # timings = the alignment problem
# kokoro-onnx create() returns audio only.
# timings.py distributes duration by word length — a placeholder.
# true karaoke needs a forced-aligner run over the rendered audio.
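
The placeholder described in the comments above can be sketched roughly as follows. This is an illustration of "distribute duration by word length", not the contents of timings.py: the function name, the whitespace tokenization, and the rounding to milliseconds are all assumptions. Only the output shape ({i, start, end}) comes from the contract.

```python
def proportional_timings(text, duration):
    """Split `duration` seconds across words in proportion to character count."""
    words = text.split()  # assumed tokenization: plain whitespace split
    total_chars = sum(len(w) for w in words)
    if not words or duration <= 0:
        return []

    timings = []
    cursor = 0.0
    for i, word in enumerate(words):
        # Each word gets a share of the audio equal to its share of the characters.
        span = duration * len(word) / total_chars
        timings.append({
            "i": i,
            "start": round(cursor, 3),
            "end": round(cursor + span, 3),
        })
        cursor += span
    return timings
```

The weakness is visible in the code: nothing here looks at the audio. Pauses, stressed syllables, and numerals spoken as long phrases all drift, which is why a forced aligner over the rendered audio is the real fix.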

// structural decisions worth knowing

  • the alignment gap
    The defining limitation, stated honestly: highlight timings are proportional, not phoneme-accurate. kokoro-onnx's create() returns audio only; timings.py distributes the duration by word length. That is good enough to prove highlight tracking. The roadmap's hard sub-problem, true karaoke, means running a forced aligner over the rendered audio and using its word boundaries instead. The placeholder is named, not hidden.
  • DRM guard, not DRM removal
    Tome detects DRM-locked files, badges them, and blocks narration. It does not attempt removal. This is a deliberate legal line — the reader works on files you can lawfully open, and it says so mechanically rather than as a disclaimer.
  • license isolation via vendoring
    foliate-js (LGPL-2.1) is vendored as separate files under static/ rather than bundled or depended upon. That keeps the LGPL isolated from the rest of the app's license. A small decision that respects a license boundary without fighting it.
  • per-book narration profiles
    Voice and speed are saved per book and restored on reopen, alongside reading position. So the voice you chose for a dense technical book doesn't bleed into the bedtime novel. State is keyed to the artifact, not the user session — the right place for reader preferences.
  • sentence chunking + prefetch
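    A 5000-char section would synthesize ~5 min of audio (~14MB WAV) in one inline call. At the per-chunk rates above, and assuming synthesis time scales linearly with length, that is roughly 16s on GPU and over 3 min on CPU before playback can start, which is too slow. P3 chunks by sentence and prefetches so the whole chapter narrates continuously. The streaming architecture exists because the latency math demanded it, not because streaming is fashionable.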
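
The chunking and size arithmetic from the sentence-chunking bullet can be sketched as below. This is a hedged illustration, not the project's P3 code: the function names, the regex sentence splitter, and the 400-char budget (borrowed from the benchmark chunk size) are assumptions. The WAV estimate assumes 24 kHz, 16-bit mono PCM, which reproduces the ~14MB figure for 5 minutes of audio. Prefetching is not shown.

```python
import re


def chunk_sentences(text, max_chars=400):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_chars is kept whole, not split.
    return chunks


def wav_bytes(seconds, sample_rate=24000, bytes_per_sample=2, channels=1):
    """Approximate PCM payload size; the format defaults are assumptions."""
    return int(seconds * sample_rate * bytes_per_sample * channels)
```

With chunks this small, the first one can start playing after a single synthesis call while the next is requested in the background, so time-to-first-audio depends on one chunk rather than on the whole section.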