Virtual Humans: The Video Evolution for Metaverse-Bound Voice Assistants, Brand Ambassadors, and Media Personalities

News | Published March 30, 2022 | Updated July 31, 2024

Text-To-Speech (TTS) is the tech du jour for most voice assistants. Whether someone interacts with Alexa, Siri, Google, or another assistant, the response is typically TTS audio played through a smart speaker, mobile phone, or car speaker. The current paradigm of speaking to a black box and receiving a disembodied voice response works with today's interaction models, but it doesn't translate well to the metaverse we see on the horizon.

Enter a host of new start-up companies all in a race to develop “Virtual Humans” or “Digital Twins.” They are creating what will most likely be the next generation of conversational interfaces based on more natural, authentic, and humanistic digital interactions. So why Virtual Humans, and why now? A few technology drivers and socioeconomic factors have created the perfect storm for real-time video synthesis and Virtual Humans.

 

TECHNOLOGY DRIVERS
Compared to conversational TTS responses, video synthesis solutions undoubtedly require heavier workloads (CPU+GPU) to generate video and larger payloads (file sizes) to deliver it. However, steadily increasing CPU and GPU performance, along with broader availability of that hardware, has sped up video synthesis both in the cloud and at the edge. Advances in batch processing and smart caching have also enabled real-time video synthesis that rivals TTS solutions for conversational speed, so the bottleneck of generating ultra-realistic video on the fly has largely been addressed. That leaves delivering the video in real time, and broadband speeds over both Wi-Fi and 5G now make that practical for most homes, businesses, and schools.
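To make the caching point concrete, here is a minimal sketch, assuming a hypothetical synthesize_video renderer, of how frequently repeated responses could be served from a cache rather than re-rendered on every request. The function names are illustrative only and do not belong to any particular vendor's API.

```python
import hashlib

# Illustrative sketch only: a tiny in-memory cache of synthesized video clips,
# keyed by the response text. Not any vendor's actual implementation.
_clip_cache: dict[str, bytes] = {}

def synthesize_video(text: str) -> bytes:
    """Stand-in for a GPU-backed video synthesis call (hypothetical)."""
    return f"<rendered clip for: {text}>".encode()

def get_clip(text: str) -> bytes:
    """Serve frequently repeated responses from the cache; render and cache on a miss."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _clip_cache:
        _clip_cache[key] = synthesize_video(text)
    return _clip_cache[key]

# A common prompt ("What are your opening hours?") only pays the rendering cost once;
# repeat requests skip the GPU work entirely.
clip = get_clip("We are open from 9 a.m. to 6 p.m.")
```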


HELP (AND CONTENT) WANTED
Businesses that rely on employees to engage with customers, such as hotels, banks, and quick-service restaurants, are struggling to hire and retain staff. A shortage of available, qualified employees can damage customers' perception of the brand and become a real drain on revenue. Enter Virtual Humans, which can handle basic requests quickly and consistently. In Korea, both 7-11 and KB Bank have installed AI kiosks that rely on a Virtual Human to interact with customers; the 7-11 implementation supports fully unstaffed operation.

Another promising vertical for Virtual Humans is media, both broadcast media and social media (influencers). Whether streaming news 24 hours a day or staying relevant on TikTok, the need is the same: generate more video content, faster. Once again, Asia has taken the lead with Virtual Humans. Television stations such as MBN and LG HelloVision supplement their live broadcasts with Virtual Human versions of their lead anchors, who provide regular news updates throughout the day. Using either API calls or an intuitive “what you type is what you get” web interface, producers can create Virtual Human videos in minutes without a camera, crew, lights, or make-up. It is a time-saving, cost-saving tool whose output can be intermixed with live programming throughout the day to keep content fresh.
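For readers curious what the API route might look like in practice, here is a hedged sketch of a text-to-video request. The endpoint URL, model identifier, payload fields, and response field are placeholders chosen for illustration, not DeepBrain AI's documented API.

```python
import requests

# All names below are assumptions for illustration, not a documented vendor API.
API_URL = "https://api.example.com/v1/virtual-human/videos"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "news-anchor",  # hypothetical Virtual Human model identifier
    "script": "Good evening. Here are tonight's top stories.",
    "resolution": "1080p",
}

# Submit the script and wait for the service to return a link to the rendered video.
resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("video_url"))  # assumed field containing the rendered clip
```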

“What is our strategy for the metaverse?” That question is being asked in conference rooms across every sector. It is easy to imagine how brands leveraging today's 2D Virtual Humans to take orders, assist customers, and share the news will quickly evolve into early pioneers of the 3D world and the metaverse. Watch throughout the year for big announcements in this space.

 

