Most "personalized" children's books aren't actually personal. My son loved stories where he was the hero — not a character that kind of looked like him, but him. Creating those stories took real effort, and nothing on the market did it properly. A weekend script using LLMs proved parents wanted this. Getting it to production quality — truly personalized, visually consistent, reliably good — turned out to be a genuinely hard engineering problem.
Parents upload a photo, pick a theme, and receive a fully illustrated 12-page book in under 4 minutes — with their child as the hero, consistent likeness across every page, voice narration, and a PDF. They see a story preview before they're charged, can refine it with page-level feedback, and only commit once they're happy.
How It Works
The system decomposes book generation into discrete pipeline stages, each with one job: character analysis extracts a visual description from the photo; story generation builds an outline then writes each page sequentially; illustration generates 12 images using that character description as a persistent reference; validation checks narrative logic before illustrations run and visual identity after each image; assembly produces the final PDF and narration.
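In code, that decomposition looks roughly like this. This is a sketch with hypothetical names: the placeholder bodies stand in for model calls, and the validation hooks described below are omitted here for brevity.

```python
# Each stage is a plain function over a shared context dict, so every
# stage has one job and its output can be stored and inspected.

def character_analysis(ctx: dict) -> dict:
    # A vision model turns the uploaded photo into a reusable description.
    return {"character_sheet": f"visual description from {ctx['photo']}"}

def story_generation(ctx: dict) -> dict:
    # Outline first, then each page written sequentially against it.
    outline = [f"beat {i}" for i in range(1, 13)]
    return {"outline": outline, "pages": [f"page text for {b}" for b in outline]}

def illustration(ctx: dict) -> dict:
    # Every image prompt carries the same character sheet as a reference.
    return {"images": [f"img({ctx['character_sheet']!r}, {p!r})" for p in ctx["pages"]]}

def assembly(ctx: dict) -> dict:
    return {"pdf": "book.pdf", "narration": "narration.mp3"}

STAGES = [character_analysis, story_generation, illustration, assembly]

def run_pipeline(ctx: dict) -> dict:
    for stage in STAGES:
        ctx.update(stage(ctx))  # merge each stage's output into the context
    return ctx
```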
Every step's output is stored. If anything fails mid-run, the pipeline resumes exactly where it left off.
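Resumption then falls out of persisting each stage's output before advancing. A sketch, reusing STAGES from above; the on-disk layout here is purely illustrative:

```python
import json
import os

def run_resumable(book_id: str, ctx: dict, store_dir: str = "runs") -> dict:
    # A re-run skips any stage whose output already exists on disk,
    # so the pipeline picks up exactly where the last run stopped.
    os.makedirs(store_dir, exist_ok=True)
    for stage in STAGES:
        path = os.path.join(store_dir, f"{book_id}.{stage.__name__}.json")
        if os.path.exists(path):
            with open(path) as f:      # completed on a previous run
                ctx.update(json.load(f))
            continue
        out = stage(ctx)
        with open(path, "w") as f:     # persist before advancing
            json.dump(out, f)
        ctx.update(out)
    return ctx
```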
The Hard Parts
Consistency. Keeping the same child recognizable across 12 independently generated illustrations came down to how the character description is structured and carried between steps: extracted once, then injected unchanged into every illustration prompt. Getting it wrong produces a beautiful first page and an unrecognizable character by the end; I got it wrong several times before getting it right.
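Sketched, with the template wording being my illustration rather than the exact prompt: the character block is verbatim and always first, so the model anchors on identity before composing the scene.

```python
CHARACTER_BLOCK = (
    "Character reference (reproduce exactly, do not reinterpret):\n"
    "{sheet}\n\n"
)

def build_page_prompt(sheet: str, page_text: str, style: str) -> str:
    # Identity first, scene second; only the scene varies per page.
    return (
        CHARACTER_BLOCK.format(sheet=sheet)
        + f"Illustration style: {style}\n"
        + f"Scene: {page_text}"
    )
```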
Validation layers. Narrative logic is checked before illustrations are generated; visual identity is checked after each image. Catching failures early, before the most expensive operations, was a core design principle: a story problem caught before image generation costs a fraction of one discovered after.
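The ordering principle, as a sketch with stubbed checks:

```python
def narrative_logic_ok(pages: list[str]) -> bool:
    # Cheap text-model check: continuity, no dropped plot threads.
    return all(p.strip() for p in pages)            # placeholder heuristic

def passes_identity_check(image: str, sheet: str) -> bool:
    return True                                     # placeholder vision judge

def illustrate_book(pages: list[str], sheet: str) -> list[str]:
    # The narrative gate runs BEFORE any image exists: a failure here
    # costs one cheap text call instead of 12 expensive image calls.
    if not narrative_logic_ok(pages):
        raise ValueError("story failed narrative validation; regenerate text")
    images = []
    for page in pages:
        image = f"img[{page}]"                      # placeholder generation
        if not passes_identity_check(image, sheet): # post-image identity gate
            image = f"img[{page}]#retry"            # placeholder single retry
        images.append(image)
    return images
```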
Feedback loops. Users can refine the story before committing, with global or page-specific feedback injected into a re-run. After completion, individual pages can be regenerated selectively, and every generation attempt is stored — users can activate any variant or roll back for free.
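The bookkeeping that makes rollback free is small. A sketch with hypothetical names: feedback is appended to the re-run prompt, and "activate" and "roll back" are pointer updates, not regenerations.

```python
from dataclasses import dataclass, field

def apply_feedback(base_prompt: str, global_fb: str = "", page_fb: str = "") -> str:
    # Global and page-specific notes are injected into the re-run prompt.
    notes = [fb for fb in (global_fb, page_fb) if fb]
    if not notes:
        return base_prompt
    return base_prompt + "\nReader feedback to incorporate:\n- " + "\n- ".join(notes)

@dataclass
class PageVariants:
    attempts: list[str] = field(default_factory=list)  # every generated image ref
    active: int = 0                                    # which one the book shows

    def add(self, image_ref: str) -> None:
        self.attempts.append(image_ref)
        self.active = len(self.attempts) - 1           # newest becomes active

    def rollback(self, index: int) -> None:
        self.active = index                            # free: nothing regenerated
```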
Model assignment. Creative steps use the most capable models. Validation and classification use faster, cheaper ones — a model optimized for rule-checking produces more reliable validation than a creative model asked to follow rules.
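Concretely, this is just a routing table; the model names below are placeholders, not recommendations.

```python
MODEL_FOR_STEP = {
    "story_outline":   "large-creative-model",  # needs range and voice
    "story_pages":     "large-creative-model",
    "illustration":    "image-generation-model",
    "narrative_check": "small-fast-model",      # rule-following, not creativity
    "identity_judge":  "small-vision-model",
    "classification":  "small-fast-model",
}

def model_for(step: str) -> str:
    return MODEL_FOR_STEP[step]
```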
Evaluation
The hardest quality dimension to measure was character identity fidelity. I built a quantitative evaluation framework using vision models as judges, scoring generated images against source photos — enabling systematic A/B testing of prompt strategies with results that could be compared, not just eyeballed.
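A sketch of the judge loop, assuming a vision API that accepts two images and returns a number; the rubric wording and API shape here are illustrative:

```python
JUDGE_RUBRIC = (
    "Compare the child in IMAGE A (source photo) with the hero in "
    "IMAGE B (generated illustration). Score identity fidelity 0-10 "
    "across face shape, hair, skin tone, and distinctive features. "
    "Reply with a single number."
)

def vision_judge(prompt: str, images: list[bytes]) -> str:
    return "7"                          # stand-in for the real judge API call

def identity_score(source_photo: bytes, generated: bytes) -> float:
    reply = vision_judge(JUDGE_RUBRIC, [source_photo, generated])
    return float(reply.strip()) / 10.0  # normalize to [0, 1]

def score_strategy(pairs: list[tuple[bytes, bytes]]) -> float:
    # Mean score over a fixed test set is what makes two prompt
    # strategies comparable rather than eyeballed.
    return sum(identity_score(s, g) for s, g in pairs) / len(pairs)
```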
This drove threshold calibration: the identity score that decides whether a page passes or triggers a retry. Set too high, you waste money regenerating images parents would have accepted. Set too low, you ship pages that break the magic. Prompt changes that looked like obvious improvements sometimes made things measurably worse — the framework was what made it possible to tell the difference.
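The decision itself is tiny; the work was choosing the numbers. The values below are illustrative, not the tuned ones:

```python
IDENTITY_THRESHOLD = 0.75  # too high: regenerate pages parents would have accepted
MAX_RETRIES = 2            # caps spend when a page keeps scoring low

def decide(score: float, attempt: int) -> str:
    if score >= IDENTITY_THRESHOLD:
        return "accept"
    if attempt < MAX_RETRIES:
        return "retry"
    return "accept_best_so_far"  # ship the highest-scoring attempt, flag it
```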
What I Learned
Decomposition is the core skill. The gap between "ask an AI to make a book" and "produce a consistently high-quality personalized book" isn't filled by better prompting — it's filled by better problem decomposition.
Evaluation unlocks iteration. Without objective measurement, every change is a guess. Building the eval framework was slower than shipping features, but it's what made meaningful iteration possible.
Design for the distribution, not the happy path. AI output quality is a distribution. What matters is that the distribution is good enough that retry logic handles edge cases without exploding costs.
The UX of uncertainty is part of the product. Showing a preview before charging, letting parents refine before committing, surfacing progress during generation — this is inseparable from the technical architecture.
