We have all been there. Standing in front of a huge shelf of products wondering where your favorite item is. Or needing to buy a gift for someone but unsure what they would like best. Wouldn’t it be great to have some help right at that very moment? That’s now possible thanks to retail brands’ investments in voice assistants. Many B2C-focused industries are now looking for ways to incorporate “voice” into their customer engagement plans by leveraging intelligent automation solutions.
Statistics from Voicebot show that the number of U.S. smart speaker owners increased by 40% in 2018, and the total number of smart speakers in use grew to 133 million. The number of applications that can accept voice input has also grown rapidly to meet rising demand. As of September 2019, there were 67,000 skills published for Alexa in the US alone, with the UK, India, and Canada following.
These applications need to listen to and interpret voice commands given in a wide variety of accents and intonations. Think about a voice-enabled solution at a retail store in a mall. Can you imagine the number of languages and accents it would have to listen to, interpret, and respond to accurately in a day? Mind-boggling, isn’t it? For an application to succeed in such a scenario, it must be tested thoroughly across accents, and that’s known as VUI (Voice User Interface) testing.
VUI testing mainly focuses on evaluating the voice application (a.k.a. skill or action): its overall user experience, its ability to understand varied accents, and how it interprets voice commands and the user’s intent. In a nutshell, VUI testing can help organizations to:
- Validate if the skill will provide an excellent user experience.
- Verify if it understands the end user’s intent and provides helpful responses.
- Review the intent schema, utterances, and list of custom slots (various attributes of an intent).
Testing a voice app is quite different from testing a mobile or web application. Testers typically perform unit testing, system testing, usability testing, performance testing, and regression testing for those apps, and all of these can also be executed for voice applications. The difference is that for voice applications, voice itself becomes a major focus of the overall testing process.
For Voice App testing, it is important to verify the following factors in addition to standard testing protocols:
- The skill should understand the language, accent and way a request is phrased.
- The skill should understand all requests within the context of a skill’s functionality and should be able to handle an out-of-context request appropriately.
- If the skill doesn’t understand the end user’s request, it should handle the scenario gracefully. Rather than throwing an unexpected error, it should give a clear message conveying what is expected from the end user, or offer hints that help the user continue the interaction in a more meaningful way.
- Since there is no limit to how users can phrase requests, a user’s input is unpredictable. Therefore, both positive and negative scenarios should be tested to ensure the system responds appropriately.
As Edwin Catmull at Pixar said, “If you aren’t experiencing failure, then you are making a far worse mistake. You are being driven by the desire to avoid it.”
It is extremely important to test any voice app rigorously to ensure it works correctly. After all, you want to be the one to find the bug before your customer does.
VUI Testing – Manual or Automated?
Like most web or mobile applications, voice apps can also be tested manually using:
- Physical devices
- Simulators (Alexa skill or Google action)
Although it’s possible to test these apps manually, manual testing alone isn’t enough due to the following challenges:
- The composition of a phrase or sentence (a.k.a. an utterance) differs from user to user, so a large set of inputs is needed to cover these scenarios. This can mean a near-infinite number of test cases, which is realistically impossible to cover manually.
- Manual testing is quite time consuming.
- Not all languages, voices, and accents can be covered with manual testing.
- An important component of voice app testing is regression testing: repeatedly running a test suite whenever the program under test or its execution environment changes. Voice apps constantly update their Natural Language Processing (NLP) models, so regression testing plays a critical role in making sure existing functionality is not broken. Executing the same regression test suite again and again by hand is time consuming, which makes it wise to automate the suite.
In summary, automation not only makes it possible to test the maximum number of utterances in just a few minutes, it also expedites the release process with much higher quality.
VUI Test Automation – Tool Analysis
Choosing the right VUI test automation tool is essential. There are various automation tools available for testing Voice apps, both open-source and paid. Tool selection will primarily depend on your specific requirements and other factors including:
- Tool Setup time
- Features supported
- Technical support
- Level of accuracy
- Scripting and reporting capabilities
Let’s look at two commonly used tools, Botium and Bespoken with the above factors in mind.
Botium is an open-source framework designed to automate testing of chatbot applications. It can also be used for voice app testing, in two ways. Note that the connectors below can be used with both Botium CLI and Botium Box.
Option 1: Botium framework using the web-speech-api connector:
This option allows users to test the application using conversation files (*.convo). The Botium framework looks for all files with the *.convo extension and uses them to test the app.
It is very easy to create a conversation file, either manually or using an emulator. The following command can be used to record and save the convo file:
$ botium-cli emulator browser --convos=./spec/convo --config=./botium.json
Once the conversation file is created, the Web Speech API outputs the conversation audio through the desktop/laptop speakers, which acts as the input for Alexa or Google Home. The device’s spoken response is captured through the desktop/laptop microphone and pushed back into Botium, where it is converted into text and compared with the response text specified in the convo file. The comparison report is then generated using Mochawesome.
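To make the convo format concrete, here is a minimal sketch of a BotiumScript conversation file; the skill name and responses are hypothetical, and the first line is simply the conversation’s name:

```text
open skill and ask a question

#me
alexa, open demo skill

#bot
welcome to demo skill, how can i help you
```

Each `#me` turn is spoken to the device, and each `#bot` turn is the response Botium expects to hear back.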
Although this option makes it easy to set up and create conversation files, it has some major limitations. The approach depends on good external speakers and can produce many spurious test failures if the speakers aren’t up to the task. It also requires physical hardware (microphone and speakers), resulting in stability issues.
Option 2: Botium Framework using botium-connector-alexa-avs:
This approach uses a virtual device instead of a physical one. It reads the same conversation files and connects to the virtual device. It then converts text to speech (using Google Cloud Text-to-Speech or Amazon Polly), which is fed as input to Alexa via the Alexa Voice Service (AVS). The answer is then converted back to text (using Google Cloud Speech-to-Text or Amazon Transcribe) and compared with the convo file to produce the final test result.
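For orientation, a botium.json for this setup might look roughly like the sketch below. The PROJECTNAME and CONTAINERMODE capabilities are standard Botium configuration; the ALEXA_AVS_* capability names shown are illustrative and should be verified against the botium-connector-alexa-avs README, and real AVS client credentials and cloud API keys must be added as well:

```json
{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "My Voice App",
      "CONTAINERMODE": "alexa-avs",
      "ALEXA_AVS_TTS": "GOOGLE_CLOUD_TEXT_TO_SPEECH",
      "ALEXA_AVS_STT": "GOOGLE_CLOUD_SPEECH"
    }
  }
}
```

The same convo files from Option 1 can then be run against the virtual device with no physical speakers or microphone involved.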
This approach is better than Option 1 above because:
- Convo files can be created easily.
- It is useful for verifying workflows.
- Reports can be generated using Mochawesome.
- CI/CD is possible.
- Support is available on GitHub.
- It allows initiating a conversation using a simulator, which can also be saved as a convo file.
Although we feel this is a better approach, it does have some drawbacks:
- It needs the Google Cloud or AWS APIs to convert speech to text and text to speech. Both are paid services (free for the first year). Note: Google Cloud works better with Botium and is recommended over AWS.
- If there is a long description in the application response, it may have problems accurately converting the text to speech.
- The installation and setup processes are tedious.
- Only a defined set of utterances can be verified: the list of utterances needs to be created manually before it can be used in the script.
Bespoken is a paid tool (with a 30-day free trial) that includes Amazon Polly in the package. It can perform three types of testing: unit testing, end-to-end testing, and usability testing. It also has a feature to monitor the application. Test scripts are written as YAML files, so the user needs to understand the YAML structure before creating a script.
Once the YAML file (which also includes the trigger word) is created, the statements from the file are converted into audio using Amazon Polly text-to-speech. The audio is then sent to a virtual Alexa or Google Home device as input. The response from the device includes audio (the vocal response) and metadata (used for display, if the device also provides visual output). The response is converted into text using speech-to-text and compared with the expected output specified in the YAML script to generate the test report.
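As an illustration, a Bespoken end-to-end test script might look like the following sketch. The skill name and responses are hypothetical, and the exact schema (configuration keys, matching syntax) should be checked against the Bespoken documentation:

```yaml
---
configuration:
  locale: en-US
  voiceId: Joanna

---
- test: Launch the skill and ask for help
- open demo skill: welcome to demo skill
- help: "you can ask me about *"
```

Each interaction line pairs an utterance with the expected response, and a `*` wildcard can stand in for variable parts of the reply.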
Considering it’s a paid tool, it is no surprise that Bespoken has a comparatively longer list of pros:
- Easy installation and faster execution with higher accuracy.
- Chat and email support are available.
- Reports can be generated using Jest.
- If any words are misinterpreted during conversion (TTS/STT), this can be rectified by defining homophones in the testing.json file. This feature is not available in Botium.
- CI/CD is possible.
- Dynamic content can be handled using RegEx.
- For checking utterances, the Bespoken team performs usability testing with different accents and provides a report.
- Unit test cases can also be created using Bespoken.
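As a sketch of the homophones feature mentioned above, a testing.json maps each canonical word to the ways the speech-to-text engine commonly mishears it, so those variants are accepted during comparison. The words below are illustrative examples, and the exact file layout should be checked against the Bespoken documentation:

```json
{
  "homophones": {
    "lettuce": ["let us"],
    "bin": ["been", "ben"]
  }
}
```

With this in place, a response transcribed as “let us” still matches an expected output containing “lettuce”.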
The fact that it’s a paid tool is a slight disadvantage for Bespoken, along with the need for the tester to understand YAML. Compared with Bespoken’s YAML scripts, however, Botium’s convo files are much easier for testers to create.
Comparison (Bespoken vs. Botium)
Here are our observations in a tabular format for a quick analysis:
| Factor | Bespoken | Botium | Remarks |
| --- | --- | --- | --- |
| Tool setup time | Low | High | |
| Ease of use | Easy | Hard | |
| Cost | Low | Low | Botium is an open-source tool; however, the Google Cloud API is paid. |
| Features supported | High | Low | Botium offers fewer features and more limitations. |
| Accuracy | High | Low | Botium is comparatively less accurate. |
Although Bespoken appears to be the winner over Botium in our analysis, your tool selection will be driven by your overall testing requirements along with other parameters, such as testing time and the number of testing resources available. If you are evaluating voice and NLP for client engagement and interaction, we can help! Contact us to get started.