Voice-driven features – Advanced technologies for added-value services

The emergence of e-commerce platforms, combined with the mobile revolution, has changed the way millions of users around the world discover, select and order goods. These platforms are evolving at a fast pace into complete environments in which end-users (i.e., customers) can place orders within a few seconds. For example, some e-commerce platforms now let you fill your cart, remove items, and place the order simply by asking your voice assistant. This user experience has been imagined, designed and deployed so that users can buy more items in less time, which shortens the customer journey.

From a technical point of view, these voice-driven features rely on advanced technologies, such as Speech-to-Text, Text-to-Speech, Spoken Language Understanding, and Dialogue Management, which require rare and expensive skills to use. Consequently, building solutions based on such technologies is currently reserved for the few companies that have acquired these skills in-house. Moreover, voice technologies are tightly dependent on the target language: the effort required to provide a feature in one language is repeated when the feature is added to a new language, as if the feature were developed from scratch. It is disappointing to see that such sophisticated, expensive and complicated technologies, which could add greater user value in high-priority sectors such as education, health and security, are used to build applications whose usefulness for the majority of end-users has not been fully demonstrated.

It is very common in the history of computer science, and high-tech in general, that smart technologies are not leveraged to deliver concrete added value for end-users. A well-known example that has affected Internet users is the CAPTCHA program popularized by Luis von Ahn, who realized that he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of the most precious resource: human brain cycles. He subsequently founded a company that provides a login security layer relying on humans to extract text from blurred images. This security layer asks people to transcribe two words: the first word is known and is used to conduct the actual security verification, whereas the second word is unknown and, in most cases, cannot be extracted with existing OCR techniques. Without being aware of it, users passing through this security layer provide candidate transcriptions “for free”. The most frequently proposed transcription is then adopted as the probably correct one. This was the origin of the enhanced version called reCAPTCHA. In the same fashion, Luis von Ahn also created Duolingo, a language training application that relies on the community to translate text while learning a foreign language in a playful manner.
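The two-word scheme described above can be illustrated with a minimal Python sketch. This is not the actual reCAPTCHA implementation: the function names, the case-insensitive comparison, the choice to count a vote only when the control word is answered correctly, and the `min_votes` threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def normalize(word: str) -> str:
    """Compare words case-insensitively and ignore surrounding spaces (assumption)."""
    return word.strip().lower()


def record_answer(control_truth: str, control_answer: str,
                  unknown_answer: str, votes: Counter) -> bool:
    """Check the known (control) word and, if it matches, count a vote
    for the user's transcription of the unknown word.

    Returns True when the user passes the security check.
    """
    passed = normalize(control_answer) == normalize(control_truth)
    if passed:
        # Assumption: only users who solved the control word contribute a vote.
        votes[normalize(unknown_answer)] += 1
    return passed


def consensus(votes: Counter, min_votes: int = 3) -> Optional[str]:
    """Return the most frequently proposed transcription, or None if no
    candidate has reached the (hypothetical) min_votes threshold yet."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

For example, starting from an empty `Counter()`, each call to `record_answer` both verifies the user and accumulates evidence for the unknown word; `consensus` then returns a transcription only once enough users agree on it.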

While defining the COMPRISE use cases, we prioritized usefulness. The consortium was not eager to build fancy applications with little or no benefit to end-users. This was challenging since the market for voice-based applications was still in its infancy, with very few concrete examples. Six different applications have been designed and are being developed, including:
⦁ a Notes app,
⦁ an e-Health app,
⦁ and the e-Commerce app described below.

As a use case provider in the COMPRISE project, Netfective decided to build upon mature technologies (i.e., e-commerce platforms and voice technologies) to propose an innovative e-commerce experience that lets end-users improve their language skills through a gamified approach. Concretely, Netfective is building a multi-platform, privacy-aware, multilingual mobile application for e-commerce platform owners. The development of the application should be clear and feasible, so that it does not require rare or expensive experts. The operating costs must be transparent and measurable. Last but not least, platform owners should not be locked into any vendor-specific environment, technology or cloud.

The main idea behind our use case is to provide a layer of building blocks for a new online experience which combines health education (thanks to the Open Food Facts API), gamification through simple games and quizzes, and language education thanks to Machine Translation, Speech-to-Text, and Text-to-Speech.

The use case is COMPRISE-based. In other words, it eliminates third-party services as much as possible and consequently allows users to keep control of the whole process, including the voice features.

If you wish to get more information about the COMPRISE tools, check our Software page. Step-by-step instructions are provided to show you how to integrate your app into the COMPRISE ecosystem and get access to all voice-based components.

Written by:
Dr. Youssef RIDENE
Netfective Technology
https://www.linkedin.com/in/ridene/

Developer survey: Since you are here and interested in our project, could you please spare a moment to share your concerns and answer 12 questions related to developing voice-enabled apps?