// on_device_ml · narration · the alignment problem
Proportional highlight today. Forced alignment is the whole point.
SvelteKit + foliate-js · FastAPI · Kokoro TTS · ZAI (ask-the-book) · GPU with CPU fallback
A narrated reader is easy to fake and hard to do right. The gap between the two is word-timing alignment.
13× faster on GPU, but the same script runs unchanged on a CPU-only box
Kokoro TTS runs ~13× faster on CUDA (~1.3s per 400-char chunk vs ~16s on CPU). The interesting decision isn't the speedup — it's that the same run.sh runs unchanged on both. It sets ONNX_PROVIDER=CUDAExecutionProvider, and ONNX Runtime silently falls back to CPU if CUDA isn't usable. No system CUDA toolkit needed; the nvidia-* pip wheels supply the runtime libs.
So the dev box with a GPU and the CPU-only VPS run identical code paths. The capability detection is pushed into the runtime, not the operator — which means a deploy never breaks because the target lacks a GPU.
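A minimal sketch of the same decision made explicit, for readers who want to see it rather than trust the runtime. It is not what run.sh or kokoro-onnx do internally; `pick_providers` and `available_providers` are hypothetical names. The only real library call is `onnxruntime.get_available_providers()`, and the sketch degrades to CPU if onnxruntime is not installed.

```python
import os
from typing import List, Sequence

CPU = "CPUExecutionProvider"


def pick_providers(requested: str, available: Sequence[str]) -> List[str]:
    """Return an ordered provider list: the requested one if usable, CPU always last."""
    providers = []
    if requested != CPU and requested in available:
        providers.append(requested)
    providers.append(CPU)  # CPU is the unconditional fallback
    return providers


def available_providers() -> List[str]:
    """Ask ONNX Runtime what this build can use; assume CPU-only if it is missing."""
    try:
        import onnxruntime as ort
        return ort.get_available_providers()
    except ImportError:
        return [CPU]


if __name__ == "__main__":
    # Same environment variable the text describes; the default mirrors run.sh.
    requested = os.environ.get("ONNX_PROVIDER", "CUDAExecutionProvider")
    print(pick_providers(requested, available_providers()))
```

On a CPU-only box the list collapses to the CPU provider alone, which is the behaviour the deploy relies on.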
// the TTS contract: audio out, timings as best-available
POST /tts {text, voice?, speed?}
→ {sample_rate, audio(base64 WAV),
words:[{i,start,end}]} # timings = the alignment problem
# kokoro-onnx create() returns audio only.
# timings.py distributes duration by word length — a placeholder.
# true karaoke needs a forced-aligner run over the rendered audio.
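What "distributes duration by word length" can look like, as a sketch. This is not the contents of timings.py: `proportional_timings` is a hypothetical name, whitespace tokenization is an assumption, and the output only mirrors the `words:[{i,start,end}]` shape from the contract above.

```python
import re
from typing import Dict, List


def proportional_timings(text: str, duration_s: float) -> List[Dict[str, float]]:
    """Give each word a slice of the audio proportional to its character count."""
    words = re.findall(r"\S+", text)  # assumption: words are whitespace-separated
    total_chars = sum(len(w) for w in words)
    timings: List[Dict[str, float]] = []
    if total_chars == 0:
        return timings

    elapsed_chars = 0
    for i, word in enumerate(words):
        start = duration_s * elapsed_chars / total_chars
        elapsed_chars += len(word)
        end = duration_s * elapsed_chars / total_chars
        timings.append({"i": i, "start": start, "end": end})
    return timings


if __name__ == "__main__":
    for t in proportional_timings("a bcd", 4.0):
        print(t)
```

The weakness is visible in the code: nothing here looks at the audio. Pauses, punctuation, and slow or fast syllables all get averaged away, which is why a forced aligner over the rendered waveform is the real fix.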
// structural decisions worth knowing
- the alignment gap
  The defining limitation, stated honestly: highlight timings are proportional, not phoneme-accurate. kokoro-onnx's create() returns audio only; timings.py distributes the duration by word length. Good enough to prove highlight tracking. The roadmap's hard sub-problem, true karaoke, is swapping in a forced aligner over the rendered audio. The placeholder is named, not hidden.
- DRM guard, not DRM removal
  Tome detects DRM-locked files, badges them, and blocks narration. It does not attempt removal. This is a deliberate legal line: the reader works on files you can lawfully open, and it says so mechanically rather than as a disclaimer.
- license isolation via vendoring
  foliate-js (LGPL-2.1) is vendored as separate files under static/ rather than bundled or depended upon. That keeps the LGPL isolated from the rest of the app's license. A small decision that respects a license boundary without fighting it.
- per-book narration profiles
  Voice and speed are saved per book and restored on reopen, alongside reading position. So the voice you chose for a dense technical book doesn't bleed into the bedtime novel. State is keyed to the artifact, not the user session, which is the right place for reader preferences.
- sentence chunking + prefetch
  A 5000-char section would synthesize ~5 min of audio (~14MB WAV) in one inline call, which is too slow. P3 chunks by sentence and prefetches so the whole chapter narrates continuously. The streaming architecture exists because the latency math demanded it, not because streaming is fashionable.
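The latency math behind sentence chunking, using the figures already quoted: 5000 chars at ~400 chars per chunk is roughly 13 chunks, so an inline call costs about 17 s on GPU or about 3.5 min on CPU before any audio plays. Chunked, the wait is one chunk (~1.3 s or ~16 s). Each 400-char chunk is about 24 s of audio, so even CPU synthesis stays ahead of playback. A sketch of that shape follows; `chunk_by_sentence` and `narrate` are hypothetical names, the sentence regex is a naive assumption, and `synthesize` stands in for whatever calls the TTS.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_by_sentence(text: str, max_chars: int = 400) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


def narrate(chunks: List[str], synthesize: Callable[[str], T],
            lookahead: int = 1) -> Iterator[T]:
    """Yield audio in order while later chunks synthesize in the background.

    One worker keeps synthesis serialized, so a single GPU is not contended.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = [pool.submit(synthesize, c) for c in chunks[:lookahead + 1]]
        next_idx = len(pending)
        while pending:
            audio = pending.pop(0).result()
            if next_idx < len(chunks):
                pending.append(pool.submit(synthesize, chunks[next_idx]))
                next_idx += 1
            yield audio


if __name__ == "__main__":
    demo = chunk_by_sentence("One. Two! Three?", max_chars=9)
    # str.upper is a stand-in for a real TTS call.
    print(list(narrate(demo, str.upper)))
```

The consumer plays each yielded chunk while the worker fills the queue behind it, so only the first chunk's synthesis time is felt by the listener.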