The Alexa Conference was hosted in Chattanooga, TN in January. COLAB sent a team to learn more about what was happening in the voice community. We discovered some interesting Skills, some fascinating concepts, and some amazing stats about the market for voice experiences and voice-enabled devices. Here are the highlights:
The Future Is Multimodal
So much of the thinking that goes into building an Alexa Skill or a Google Action is about the “voice first experience” – that is, how the application works when there is ONLY a voice-directed interface. This is how it should be, given that any application with voice-directed features should be entirely usable by voice navigation.
That being said, a clear take-away from the conference is that the future of most (if not all) user interfaces will be multimodal – meaning both visual and voice direction will not only be supported but will function interchangeably. Perhaps it seems like an obvious outcome of our modern technologies, but this vision of a multimodal future is a really big deal that will significantly impact many aspects of our lives, likely in ways we cannot yet conceive.
This is clearly the direction Amazon is headed. Last September they unveiled the Alexa Presentation Language (APL), which according to the Amazon developer site is “a new design language and tools that make it easy to create visually rich Alexa skills for tens of millions of Alexa devices with screens.” Essentially, APL will enable developers to build multimodal Skills for Alexa that will work on devices with screens.
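To make the idea concrete, here is a minimal sketch of how a Skill response might stay voice first while adding a visual layer only when the device has a screen. It is an illustration, not Amazon's reference implementation: the helper names (buildDocument, supportsApl, buildResponse) and the token value are our own assumptions, while the "Alexa.Presentation.APL" interface name and the RenderDocument directive type follow Amazon's published request and response format. The sketch also assumes the message contains no characters that would need escaping in SSML.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface AplDocument {
  type: "APL";
  version: string;
  mainTemplate: { parameters: string[]; items: object[] };
}

interface SkillResponse {
  outputSpeech: { type: "SSML"; ssml: string };
  directives?: object[];
}

// A minimal APL document: one Text component bound to a data source.
// "${payload.message}" is an APL data-binding expression, not a
// TypeScript template literal, so it stays in a plain string.
function buildDocument(): AplDocument {
  return {
    type: "APL",
    version: "1.0",
    mainTemplate: {
      parameters: ["payload"],
      items: [{ type: "Text", text: "${payload.message}" }],
    },
  };
}

// True when the requesting device reports support for APL.
function supportsApl(supportedInterfaces: Record<string, unknown>): boolean {
  return "Alexa.Presentation.APL" in supportedInterfaces;
}

// Always answer by voice; attach the visual layer only on screen devices.
function buildResponse(
  message: string,
  supportedInterfaces: Record<string, unknown>
): SkillResponse {
  const response: SkillResponse = {
    outputSpeech: { type: "SSML", ssml: `<speak>${message}</speak>` },
  };
  if (supportsApl(supportedInterfaces)) {
    response.directives = [
      {
        type: "Alexa.Presentation.APL.RenderDocument",
        token: "highlightsToken", // hypothetical token value
        document: buildDocument(),
        datasources: { payload: { message } },
      },
    ];
  }
  return response;
}
```

In a real Skill, the supported interfaces would be read from the incoming request (under context.System.device.supportedInterfaces). The design choice worth noticing is that the spoken response never depends on the screen, so the same Skill works on an Echo Dot and an Echo Show.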
Thinking ahead, it seems clear that multimodal experiences with devices like Amazon’s Echo Show and Echo Spot will raise the expectations among users that other digital applications and interfaces offer similarly dynamic experiences. As website designers and developers, this has us thinking about how the future of the Web will arc towards these types of multimodal experiences.
It wasn’t that long ago that we built websites for personal computers, and then built an entirely separate website for mobile smartphones. Not until concepts like responsive design were introduced did we conceptualize how those very different modalities and experiences may be integrated into one system that works everywhere.
Today, we build websites to accommodate accessibility concerns, structuring our content and code to be better handled by screen readers and other assistive devices. A future Web that is designed for multimodal experiences may make tools like screen readers obsolete. It may make keyboards obsolete – especially when you start to think “voice first” about application and device development. And let’s be honest, desktop mice should be obsolete by now.
As our Programming Lead, Francis Yaconiello, points out, “If Amazon were to build an Alexa phone, it would be a voice first, multimodal experience.”
We’re pretty sure Amazon is already thinking about this.
Smart Speaker Adoption
Another thing that caught our attention at the conference was the rate of consumer adoption of smart speaker technology, especially compared to adoption rates of other technologies. 2018 data presented by VoiceBot.ai shows some amazing statistics on the number of years, counted from commercial introduction, that each technology took to be adopted by 50% of the U.S. population.
- Computers took over 20 years to be adopted by 50% of the US population
- Radio took over 15 years to be adopted by 50% of the US population
- Television took over 10 years to be adopted by 50% of the US population
- Smartphones took about 7 years to be adopted by 50% of the US population
- Smart Speakers have taken only 5 years to reach 50% adoption in the US
We found this data to be very surprising, despite the fact that a majority of Americans now carry around smartphones that serve integral daily functions for us, paving the way for other smart devices. The data speaks to the speed with which new technologies are emerging, serving a useful function for consumers, and finding a place in our daily lives.
Throughout the conference we learned about some incredible Skills that have been developed, each story interesting in different ways.
MyPetDoc is “the world's first veterinarian-driven AI” – an Alexa Skill which aggregates pre-defined veterinarian advice based on the symptoms you describe. One of the interesting twists is that if you want additional advice and guidance beyond what the app can answer, it can connect you with a live veterinarian 24 hours a day. The Skill itself is free, but there is a fee for speaking to a live vet. MyPetDoc has built a network of vets and vet techs who can answer questions and take calls, creating something of a niche contractor-based economy that can help connect underemployed veterinary professionals with a distributed customer base. Another interesting monetization twist is that the Skill will suggest products to help with your pet’s ailments, which are cross-sold with Amazon. Yes, the singularity is near.
Multimodal Makes Tech Inclusive
Shanthan Kesharaju integrated AWS DeepLens so that users without the vocal ability to interact with his 1-2-3 Math Alexa Skill could still benefit from the Skill. The DeepLens integration allows the user to answer questions from the Skill by placing numbered or lettered blocks in front of the device’s camera lens. Shanthan’s video submission for the AWS DeepLens Challenge shows a family with an autistic teenage son who could not interact with the Skill due to communication limitations, but was able to answer questions by holding up numbered blocks to show the correct answer. This amazing integration of several technologies (voice interaction, image recognition AI, and smart device) demonstrates the potential for what these types of multimodal experiences can offer.
J. Edgar Hoover and Branded Voice Fonts
Noelle LaCharite, former AWS engineer and current Director, Developer Evangelism at Microsoft, used publicly available vocal recordings of J. Edgar Hoover to create a voice font as a conversational interface to The JFK Files. The Hoover Bot is the “voice” she created – just like Alexa or Siri, except that it sounds like the former FBI Director. Beyond the novelty of using Hoover’s voice to navigate the FBI files related to JFK’s assassination, this is not so interesting in and of itself, but the concept of creating fonts out of human voices, and how that can impact brand perception, is fascinating ground to consider.
For anyone who is familiar with “deepfakes” or Mark Zuckerberg’s use of Morgan Freeman’s voice for his smart home AI system, the idea of creating a voice from a well-known person is not new. It does raise some interesting questions about how the voice you hear in an interaction will shape your impression of that interaction. The choice of the word “font” is telling because “voice font” has the same application in a voice experience as a “graphic font” has in a visual user experience. Voice interfaces are and will be shaped and designed by the language we choose and the voice we apply to convey that language, just as graphic interfaces are now designed by the colors, fonts, and graphic styling we apply to them.
Some testing of custom voice fonts compared to the default Alexa voice has shown that people respond better to custom voices. This makes a lot of sense intuitively in thinking about traditional design practices: context matters. Whether it is a graphic font choice or a voice font choice, users will prefer interface designs that appropriately pair content and style in the same meaningful context.
In the near future, brands will need to make decisions about the appropriate voice style for their identity. Brands already develop clear guidelines around “brand voice,” which dictate language style and usage, but new considerations around the literal, auditory tone of a brand’s voice are on the horizon. Services like Amazon Polly already provide some options for different voices (even in different languages), but our guess is that brands will likely move towards establishing their own voices. You may soon be talking to George Clooney in the morning as you ask him to pour you a Nespresso Caramelizio.