Guess What
A voice-driven AI guessing game for children where the AI teaches the rules, generates the puzzles, and adapts to each player in real time
Problem
Children are the most honest users in the world. If an interface is confusing, they don’t politely persevere — they disengage, talk over it, or declare it broken. I wanted to build a real-time AI voice experience that was genuinely fun and self-contained enough that a child could play it without an adult explaining the rules. The AI had to teach, respond, adapt, and recover — all through natural conversation.
Experience design
I designed an animal guessing game where an AI voice assistant teaches the rules, dynamically generates visual puzzles, provides real-time feedback, and adapts to each child’s pace and progress. The image starts blurred and gradually reveals itself, so that children who struggle get natural hints without the AI having to break character and explain that it’s giving them a hint.
The core design challenge was conversational authority: who leads? In every prototype session, children talked over the AI, missed its questions, and occasionally ignored it entirely when something on screen caught their attention. The game had to be resilient to this without feeling like it was controlling them. I gave the AI explicit awareness of conversation state — whether a child was engaged, struggling, or had gone quiet — and let it decide when to wait, when to prompt, and when to gently redirect.
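The "wait, prompt, or redirect" decision can be sketched as a small state classifier. This is an illustrative sketch, not the production logic: the names (`classifyPlayer`, `nextAction`) and the thresholds (8 seconds of silence, three wrong guesses) are hypothetical, and the real system surfaced this state to the model through its prompt rather than hard-coding it.

```typescript
// Hypothetical sketch of conversation-state awareness.
type PlayerState = "engaged" | "struggling" | "quiet";

interface Signals {
  msSinceLastSpeech: number; // time since the child last spoke
  wrongGuessStreak: number;  // consecutive incorrect guesses
}

function classifyPlayer(s: Signals): PlayerState {
  if (s.msSinceLastSpeech > 8000) return "quiet";   // illustrative threshold
  if (s.wrongGuessStreak >= 3) return "struggling"; // illustrative threshold
  return "engaged";
}

function nextAction(state: PlayerState): "wait" | "prompt" | "redirect" {
  switch (state) {
    case "engaged":    return "wait";     // let the child lead
    case "struggling": return "prompt";   // offer a nudge
    case "quiet":      return "redirect"; // gently pull attention back
  }
}
```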
Prototyping with children
Early testing sessions surfaced problems I hadn’t anticipated. Children would talk over the AI mid-sentence, and when both voices were active simultaneously the children became confused about who was leading — they’d answer the AI’s question before it finished, then get a response that didn’t match what they’d said. I had to implement voice activity detection to hold the AI’s response until the child had genuinely finished speaking.
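The "hold the reply until the child is genuinely done" rule amounts to requiring a stretch of continuous silence after the last detected speech. A minimal sketch, assuming a stream of timestamped VAD frames; the 700 ms hold time and the function names are illustrative, not the production values.

```typescript
// Hypothetical hang-over rule: the AI may only reply after the child
// has been continuously quiet for SILENCE_HOLD_MS.
const SILENCE_HOLD_MS = 700; // illustrative threshold

interface VadFrame {
  t: number;       // timestamp in ms
  speech: boolean; // did the VAD detect the child's voice in this frame?
}

// Returns the earliest timestamp at which the AI may respond,
// or null if the child never stays quiet long enough.
function earliestReplyTime(frames: VadFrame[]): number | null {
  let lastSpeechEnd: number | null = null;
  for (const f of frames) {
    if (f.speech) {
      lastSpeechEnd = f.t; // child is still talking: reset the clock
    } else if (lastSpeechEnd !== null && f.t - lastSpeechEnd >= SILENCE_HOLD_MS) {
      return f.t;          // quiet long enough — safe to speak
    }
  }
  return null;
}
```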
A subtler problem: the AI occasionally changed its mind mid-explanation. It would start describing the puzzle one way, then revise itself — and children noticed immediately. “It said it was a bird but now it’s saying something different.” This wasn’t a factual error; it was a tone and commitment problem in the prompt. I tightened the prompts so the AI committed to a direction before speaking rather than thinking out loud.

The third issue was the hardest to catch: the AI was giving away the answer inside its own prose. A description like “this large, pink bird is something you might see at a zoo” contained too many clues. I had to engineer the AI to generate genuinely ambiguous descriptions that guided without revealing.
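The real fix for answer leakage lived in the prompt, but a simple guard like the following can catch regressions while iterating on prompts. This is a hypothetical sketch: the `GIVEAWAYS` list and function name are illustrative, and a naive substring check is only a first line of defense.

```typescript
// Illustrative check that a generated description doesn't contain
// giveaway terms for the hidden answer. The word lists are hypothetical.
const GIVEAWAYS: Record<string, string[]> = {
  flamingo: ["flamingo", "pink", "bird"],
  elephant: ["elephant", "trunk", "tusk"],
};

function leaksAnswer(description: string, answer: string): boolean {
  const text = description.toLowerCase();
  // Fall back to the answer itself if no giveaway list exists.
  return (GIVEAWAYS[answer] ?? [answer]).some((term) => text.includes(term));
}

// leaksAnswer("this large, pink bird is at the zoo", "flamingo") → true
```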
Technical architecture
The most critical technical constraint was latency. Children tolerate silence less than adults — any gap over a second in a voice interaction feels broken to a five-year-old. I chose WebRTC with OpenAI’s Realtime API specifically because it achieves sub-100ms voice response times without a round-trip through a server. Image generation was the other bottleneck; I built Netlify blob-based caching so repeated plays of the same puzzle didn’t wait for a fresh DALL-E call.
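The cache-or-generate path can be sketched as follows. To keep the example self-contained, an in-memory `Map` stands in for Netlify blob storage, and `generateImage` is a placeholder for the real image-generation call; all names here are illustrative.

```typescript
// Sketch of the caching layer: serve a repeated puzzle instantly,
// pay the image-generation latency only on first play.
const imageCache = new Map<string, string>(); // puzzleKey → image data/URL

async function generateImage(puzzleKey: string): Promise<string> {
  // Placeholder: the real app calls the image-generation API here.
  return `generated:${puzzleKey}`;
}

async function getPuzzleImage(puzzleKey: string): Promise<string> {
  const cached = imageCache.get(puzzleKey);
  if (cached !== undefined) return cached;      // repeat play: no API call
  const fresh = await generateImage(puzzleKey); // first play: generate once
  imageCache.set(puzzleKey, fresh);
  return fresh;
}
```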
- Stack: Next.js + TypeScript — rapid iteration and easy Vercel deployment
- Voice: WebRTC + OpenAI Realtime API — sub-100ms voice response, essential for natural child interaction
- Image generation: DALL-E + GPT-IMAGE-1 — unique puzzle images generated on demand per game session
- Audio feedback: Custom FFT analysis — visual waveform so children can see the AI is listening
- Game state: AI function calling — autonomous state transitions without hard-coded game logic
- Caching: Netlify blob storage — repeated puzzle images served instantly without re-generating
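The function-calling approach in the stack above can be sketched as a tool the model invokes to drive the game's state machine. The tool name, phases, and schema below are hypothetical, not the exact definitions used in the game.

```typescript
// Hypothetical game-state tool exposed to the model via function calling.
type GamePhase = "teaching" | "guessing" | "revealing" | "celebrating";

const advancePhaseTool = {
  type: "function",
  name: "advance_phase", // illustrative name
  description: "Move the game to the next phase when the child is ready.",
  parameters: {
    type: "object",
    properties: {
      phase: {
        type: "string",
        enum: ["teaching", "guessing", "revealing", "celebrating"],
      },
    },
    required: ["phase"],
  },
} as const;

// The app applies the transition when the model calls the tool:
// the model, not hard-coded logic, decides when the game moves on.
function handleToolCall(
  name: string,
  args: { phase: GamePhase },
  current: GamePhase
): GamePhase {
  if (name === advancePhaseTool.name) return args.phase;
  return current; // unknown tool: no state change
}
```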
Result
Children naturally conversed with the AI to play the game. The progressive reveal mechanic created genuine anticipation — children leaned forward as the image sharpened, and the ones who struggled got hints embedded in the visual rather than in the AI’s words. The voice interaction felt responsive enough that children treated it as a real interlocutor rather than a button to press.
The project demonstrated that multi-modal AI systems — voice, image generation, reasoning, and function calling working in concert — can create experiences that feel coherent and magical when the orchestration is invisible. The design work was as much about failure modes as features: knowing what the AI does when it gets confused, when a child goes quiet, or when the image takes longer than expected made the difference between a demo and something children actually wanted to play again.
Prototyping with children in the room.