[Tokyo Tech Translated] google patents llm latency hiding
small model fills the silence while the big model thinks
today's selection orbits a single, clever idea: hiding latency. not by making the model faster, but by making the wait invisible. a patent filing from google shows how, and the surrounding japanese tech commentary picks apart the mechanics with a mix of admiration and unease.
@itarutomy, speculative execution for llms
google filed a patent for a technique that pairs a small on-device model with a large server-side model to achieve zero perceived latency in ai responses. the goal is to eliminate the awkward pause while the ai "thinks." when a user asks "explain general relativity," a lightweight model on the phone (under 100b parameters) instantly starts speaking a fact-free preamble like "general relativity is a complex topic." meanwhile, a heavy server model (200b+ parameters) generates the real answer in parallel. the moment the preamble ends, the large model cleanly takes over with "a theory proposed by einstein..."
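the handoff described above can be sketched in a few lines. this is a minimal illustration, not the patent's implementation: both model calls and the text-to-speech step are hypothetical stand-ins that just sleep, and the names `small_model_preamble`, `large_model_answer`, `speak` and `respond` are made up for this example. the point it shows is the ordering: the server request starts first, the preamble plays while it is in flight, and the real answer is spoken as soon as the preamble ends.

```python
import asyncio

PREAMBLE = "general relativity is a complex topic."
ANSWER = "a theory proposed by einstein..."


async def small_model_preamble(prompt: str) -> str:
    # hypothetical stand-in for the on-device model: fast, fact-free intro
    await asyncio.sleep(0.01)
    return PREAMBLE


async def large_model_answer(prompt: str) -> str:
    # hypothetical stand-in for the server-side model: slow, real answer
    await asyncio.sleep(0.05)
    return ANSWER


async def speak(text: str, spoken: list, seconds: float) -> None:
    # hypothetical stand-in for text-to-speech playback of `text`
    await asyncio.sleep(seconds)
    spoken.append(text)


async def respond(prompt: str) -> list:
    spoken = []
    # start the server request first so it runs while the preamble plays
    answer_task = asyncio.create_task(large_model_answer(prompt))
    preamble = await small_model_preamble(prompt)
    await speak(preamble, spoken, seconds=0.03)
    # by now the answer is usually ready; otherwise this waits for it
    answer = await answer_task
    await speak(answer, spoken, seconds=0.0)
    return spoken


if __name__ == "__main__":
    print(asyncio.run(respond("explain general relativity")))
```

with these made-up timings the preamble (about 0.04s in total) covers most of the 0.05s server call, so the gap the listener hears shrinks to the small remainder rather than the full wait.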
the system adapts to network conditions. if the estimated server delay is high, the text-to-speech rate of the preamble is automatically slowed to buy time. the small model is fine-tuned exclusively to generate these filler intros, so by design it makes no factual claims and has nothing to hallucinate. the design is likened to speculative execution in cpus: it hides latency the way a processor hides pipeline stalls. source: https://x.com/itarutomy/status/2070723799629119497
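the rate adjustment is simple arithmetic, and a small sketch makes it concrete. the post does not give the patent's actual formula, so everything here is an assumption: the function name `preamble_speech_rate`, the idea of stretching the preamble to exactly match the estimated delay, and the `min_rate` floor that stops the voice from slowing to an unnatural crawl.

```python
def preamble_speech_rate(
    preamble_seconds: float,
    estimated_delay_seconds: float,
    min_rate: float = 0.75,
) -> float:
    """return a tts rate multiplier, where 1.0 is normal speed.

    preamble_seconds: how long the preamble takes to speak at normal speed.
    estimated_delay_seconds: how long the server answer is expected to take.
    min_rate: assumed lower bound so the speech never sounds dragged out.
    """
    if preamble_seconds <= 0:
        raise ValueError("preamble_seconds must be positive")
    # the preamble already covers the wait: speak at normal speed
    if estimated_delay_seconds <= preamble_seconds:
        return 1.0
    # slow down so the preamble lasts about as long as the expected delay
    rate = preamble_seconds / estimated_delay_seconds
    # never go below the floor, even if that leaves a short gap
    return max(min_rate, rate)
```

for a preamble that takes 2.0 seconds at normal speed, an estimated delay of 2.5 seconds gives a multiplier of 0.8, while an estimated delay of 4.0 seconds would call for 0.5 and is clamped to the 0.75 floor.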
the patent reads like a pragmatic fix for a real product bottleneck. voice assistants that pause for two seconds feel broken. this makes them feel instantaneous, even when they aren't. the unease comes from the fine print: a model trained to speak authoritatively while saying nothing at all. the filler is designed to sound like the start of a real answer. the user is being managed, not informed, for that first breath. japanese tech twitter noticed the elegance, and the sleight of hand.
more at falsifylab.com

