Tome — Narrated E-Reader with Uni-fetch sourcing

// on_device_ml · narration · the alignment problem

Proportional highlight today. Forced alignment is the whole point.

SvelteKit + foliate-js · FastAPI · Kokoro TTS · ZAI (ask-the-book) · GPU with CPU fallback

A narrated reader is easy to fake and hard to do right. The gap between the two is word-level timing alignment.

13×

faster on GPU — but the same script runs unchanged on a CPU-only box

Kokoro TTS runs ~13× faster on CUDA (~1.3s per 400-char chunk vs ~16s on CPU). The interesting decision isn't the speedup — it's that the same run.sh runs unchanged on both. It sets ONNX_PROVIDER=CUDAExecutionProvider, and ONNX Runtime silently falls back to CPU if CUDA isn't usable. No system CUDA toolkit needed; the nvidia-* pip wheels supply the runtime libs.

So the dev box with a GPU and the CPU-only VPS run identical code paths. The capability detection is pushed into the runtime, not the operator — which means a deploy never breaks because the target lacks a GPU.
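The fallback order can be made explicit in a few lines. This is a minimal sketch, not Tome's code: `choose_providers` and `providers_from_env` are hypothetical helpers, and the text above says ONNX Runtime already performs the fallback on its own. The sketch only shows the decision being made from the `ONNX_PROVIDER` variable and a list of providers the machine actually offers.

```python
import os

# Hypothetical helpers for illustration; not taken from Tome's source.
CPU = "CPUExecutionProvider"


def choose_providers(requested, available):
    """Return an ordered provider list: the requested one if usable, then CPU."""
    providers = []
    if requested in available:
        providers.append(requested)
    if CPU not in providers:
        providers.append(CPU)  # CPU is always the last resort
    return providers


def providers_from_env(available, env=os.environ):
    """Read ONNX_PROVIDER (as run.sh sets it) and resolve it against `available`."""
    requested = env.get("ONNX_PROVIDER", CPU)
    return choose_providers(requested, available)
```

In a real process, `available` would likely come from `onnxruntime.get_available_providers()`, and the resulting list would be passed as `providers=` when constructing an `onnxruntime.InferenceSession`. On a CPU-only box the requested CUDA provider simply drops out of the list, so the same script keeps working.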

// the TTS contract — audio out, timings as best-available
POST /tts {text, voice?, speed?}
  → {sample_rate, audio(base64 WAV),
     words:[{i,start,end}]}   # timings = the alignment problem
# kokoro-onnx create() returns audio only.
# timings.py distributes duration by word length — a placeholder.
# true karaoke needs a forced-aligner run over the rendered audio.
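The placeholder described above, duration distributed by word length, can be sketched as follows. The function name `proportional_timings` and the whitespace tokenization are assumptions; only the `{i, start, end}` shape comes from the contract. The duration would be the rendered clip length, for example sample count divided by `sample_rate`.

```python
def proportional_timings(text, duration):
    """Split `duration` seconds across whitespace-separated words by character count.

    Illustrative only: longer words get proportionally more time, with no
    knowledge of pauses, punctuation, or how the words are actually spoken.
    """
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    timings = []
    elapsed_chars = 0
    for i, word in enumerate(words):
        start = duration * elapsed_chars / total_chars
        elapsed_chars += len(word)
        end = duration * elapsed_chars / total_chars
        timings.append({"i": i, "start": start, "end": end})
    return timings
```

Each word's window starts where the previous one ends, so the highlight never stalls or skips, but it drifts wherever speech rate departs from character count. That drift is what a forced aligner would remove.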

// structural decisions worth knowing

the alignment gap
The defining limitation, stated honestly: highlight timings are proportional, not phoneme-accurate. kokoro-onnx's create() returns audio only, so timings.py distributes the clip duration across words by word length. That is good enough to prove highlight tracking works. The roadmap's hard sub-problem, true karaoke, means running a forced aligner over the rendered audio and replacing those estimates with measured word boundaries. The placeholder is named, not hidden.
DRM guard, not DRM removal
Tome detects DRM-locked files, badges them, and blocks narration. It does not attempt removal. This is a deliberate legal line — the reader works on files you can lawfully open, and it says so mechanically rather than as a disclaimer.
license isolation via vendoring
foliate-js (LGPL-2.1) is vendored as separate files under static/ rather than bundled or depended upon. That keeps the LGPL isolated from the rest of the app's license. A small decision that respects a license boundary without fighting it.
per-book narration profiles
Voice and speed are saved per book and restored on reopen, alongside reading position. So the voice you chose for a dense technical book doesn't bleed into the bedtime novel. State is keyed to the artifact, not the user session — the right place for reader preferences.
sentence chunking + prefetch
A 5000-char section would synthesize ~5 min of audio (~14MB WAV) in one inline call — too slow. P3 chunks by sentence and prefetches so the whole chapter narrates continuously. The streaming architecture exists because the latency math demanded it, not because streaming is fashionable.
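The chunking step can be sketched as below. This is an assumption-laden illustration, not Tome's implementation: `chunk_sentences` is a hypothetical name, the sentence splitter is a simple punctuation regex, and the 400-character default is borrowed from the benchmark chunk size quoted earlier. By those same figures, a 5000-char section is about 13 chunks, so the first audio is ready after one chunk's synthesis (~1.3s on GPU, ~16s on CPU) rather than after the whole section (~16s on GPU, ~200s on CPU).

```python
import re


def chunk_sentences(text, max_chars=400):
    """Pack whole sentences into chunks of at most `max_chars` where possible.

    A single sentence longer than `max_chars` is kept whole rather than cut
    mid-sentence, so a chunk can exceed the limit in that case.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Prefetch then amounts to requesting chunk n+1 from the TTS endpoint while chunk n is playing, so playback stays continuous as long as synthesis of a chunk takes less time than the chunk takes to play.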
OK... go ahead and ask...
What if you wanted instant access to every book on the planet?