Shannen Dorothee Tioniwar
To educate users about the data collection process, I was inspired to use an offline method that lets users give consent for each type of available data. To widen my options for placing boundaries on Alexa, I explored how Alexa voice systems process user utterances and what data are collected from users.
Since Alexa is cloud-based software, a clone can be produced on separate hardware that operates as a smart speaker. Moreover, Alexa operates through skills, each of which performs a specific task or voice command. Skills are built with “The Alexa Skills Kit”, which is accessible from the Amazon developer site. This kit lets you build new skills for Alexa and allows users to train Alexa with new types of abilities.
How it works: The process starts by giving your Echo device a command to process. According to Shelly Palmer (2017), a tech consultant: “Alexa works through an Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) engines that enable a system to instantly recognize and respond to voice requests.” The device saves these commands in the form of recordings, keeping “approximately 60 seconds of audio (both before and after the wake word is mentioned) in memory for pre-processing for fast responses” (Palmer, 2017). These recordings are then processed through feature extraction (see fig 2.5) within the Amazon Alexa services, which allows them to be classified into the available skill set through the cloud.
A dialog with the user is a conversation with multiple turns in which Alexa asks questions and the user responds. The conversation is tied to a specific intent representing the user’s overall request.
Alexa then sends the extracted user request to AWS Lambda through a trigger. AWS Lambda, offered by Amazon Web Services, is a service that lets you run code in the cloud without managing servers. The code there can inspect the request, take any necessary actions (such as looking up information online), and then send back a response or action. Users may also add a custom skill, whose code is hosted as an AWS Lambda function or a web service. This prompted me to look further into the importance of these APIs and how they relate to the data collection process.
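The request and response exchange described above can be sketched as a small Lambda handler. This is a minimal illustration only, written against the JSON envelope that Alexa sends to a custom skill, without any SDK. The intent name WeatherForecastIntent, the slot name city, and the spoken replies are assumptions made for this example and do not come from a real skill.

```python
def build_response(text, end_session=True):
    """Wrap plain text in the JSON envelope Alexa expects back from a skill."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the request sent by Alexa."""
    request = event.get("request", {})

    if request.get("type") == "LaunchRequest":
        return build_response("Welcome. Ask me about the weather.", end_session=False)

    if request.get("type") == "IntentRequest":
        intent = request.get("intent", {})
        # "WeatherForecastIntent" and "city" are hypothetical names.
        if intent.get("name") == "WeatherForecastIntent":
            city = intent.get("slots", {}).get("city", {}).get("value")
            if city is None:
                # The slot is empty, so keep the session open and ask for it.
                return build_response("Which city?", end_session=False)
            # A real skill would look the forecast up online at this point.
            return build_response(f"Looking up the weather for {city}.")

    return build_response("Sorry, I did not understand that.")
```

Note that the handler only receives the structured request (intent and slot values) and never the audio recording, which is why the question of where recordings are stored has to be asked separately.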
Up to this point, some fundamental questions remain unanswered. Where are these recordings stored: in the Echo Dot’s memory space, or in the main Alexa voice services system? Another growing concern, mentioned by Matthew Hamilton, is “the ability for these data to be hacked.” This is supported by Shelly Palmer’s (2017) remark: “Anything that can be hacked will be hacked.” This lack of transparency endangers not only the privacy of users’ data but also their trust in the device. Without an in-depth investigation into the Amazon Alexa command and skill model, users would not know Alexa’s information architecture or its vulnerabilities.
The process starts with a user command, in the form of a sentence, being recorded and converted from speech to text. The converted words are tied to a specific intent representing the user’s overall request, as shown in fig 3.5. In this example, the user asks, “What is the weather like in Seattle today?”
Feature extraction then turns the dialogue model from the first row into the second row, highlighting the intents (or keywords) in the user’s command. This transforms the keywords into an executable sentence accepted by the Alexa service, called a searchAction.object property value. The sentence can then search for related APIs, which also means that it needs more underlying data to carry out the command. For example, when a user asks Alexa for a weather forecast for a given area and time, the WeatherForecast keyword (called a class) is highlighted as a command to search its specified API library.
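To make the step from sentence to intent more concrete, here is a toy sketch in Python. It is not how Alexa’s NLU engine works internally; it is a simple pattern match standing in for it. The sample utterance, the intent name WeatherForecastIntent, and the slot names city and date are all assumptions made for illustration.

```python
import re

# Hypothetical sample utterance with slot placeholders in curly braces.
SAMPLE_UTTERANCE = "what is the weather like in {city} {date}"
INTENT_NAME = "WeatherForecastIntent"  # hypothetical intent name


def utterance_to_regex(sample):
    """Turn a sample utterance into a regex with one named group per slot."""
    # re.split with a capture group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", sample)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)      # literal words of the utterance
        else:
            pattern += f"(?P<{part}>.+)"    # slot value spoken by the user
    return re.compile("^" + pattern + "$", re.IGNORECASE)


def parse(text):
    """Return the intent and slot values for a transcribed sentence, or None."""
    cleaned = text.strip().rstrip("?.!")
    match = utterance_to_regex(SAMPLE_UTTERANCE).match(cleaned)
    if match is None:
        return None
    return {"intent": INTENT_NAME, "slots": match.groupdict()}


result = parse("What is the weather like in Seattle today?")
print(result)
```

Even in this toy version, the parsed result makes visible which pieces of the sentence are kept as data: the city and the date.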
The executable sentence then reaches into the API (described in open-source documentation that is accessible to developers), which holds underlying data, “location” and “start date”, that are collected directly from users. To cater to different users, slots are used to fill data into these properties in order to answer a user command. Note that each attribute, “location” for example, stores further sub-data such as ZIP code, longitude, and latitude in order to determine the specified “location”. In addition, the “location” attribute pulls different sub-data from the APIs for different commands. For instance, the location property has multiple sub-properties that map to slot types, which may look like the following:
WeatherForecast.location.addressLocality.name: a city (AMAZON.US_CITY).
WeatherForecast.location.addressRegion.name: a region, such as a US state (AMAZON.US_STATE).
WeatherForecast.location.addressCountry.name: a country name (AMAZON.Country).
This means that the device needs to know where Seattle is: the city, state, and country are all data that the device collects. These properties are intended to gather, validate, and confirm slot values. The conversation between the user and Alexa continues until all slots required for the intent are filled and confirmed according to the rules defined in the dialogue model. Informing and reminding users of voice recognition systems about the underlying raw data being collected would therefore prove beneficial.
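The slot-filling conversation can be sketched as a simple loop. The following Python is a simulation of the idea only and does not use Alexa’s actual dialog interface: the three slot types are taken from the list above, while the function names, the shortened slot names, and the prompt wording are assumptions. The confirmation step is left out for brevity.

```python
# Slot names and types based on the WeatherForecast location properties above.
REQUIRED_SLOTS = {
    "addressLocality": "AMAZON.US_CITY",
    "addressRegion": "AMAZON.US_STATE",
    "addressCountry": "AMAZON.Country",
}


def next_missing_slot(filled):
    """Return the first required slot that has no value yet, or None."""
    for name in REQUIRED_SLOTS:
        if name not in filled:
            return name
    return None


def run_dialog(initial_slots, answers):
    """Keep asking until every required slot is filled.

    initial_slots: slot values already present in the first utterance.
    answers: the user's replies to follow-up questions, in order.
    """
    filled = dict(initial_slots)
    pending = list(answers)
    prompts = []
    while (slot := next_missing_slot(filled)) is not None:
        if not pending:
            raise ValueError(f"No answer available for slot '{slot}'")
        prompts.append(f"Alexa asks for {slot} ({REQUIRED_SLOTS[slot]})")
        filled[slot] = pending.pop(0)
    return filled, prompts


filled, prompts = run_dialog(
    {"addressLocality": "Seattle"}, ["Washington", "United States"]
)
print(prompts)
print(filled)
```

At the end of the loop, the filled dictionary is a complete record of what the user has disclosed for this one command, which is the kind of information a transparent device could show back to the user.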
Therefore, one may argue that Amazon’s black-boxed data collection is intended to grow its API collection, in which each command word can possess predetermined datasets that were collected from users. Since these APIs are open source and made for developers, this intermediary device should act in a transparent manner, informing users about the data collection that takes place in voice-recognised commands. Hence, I think offline voice recognition, together with a dedicated storage system, is better suited to this intermediary device in order to ensure trust and transparency in the flow of information.