Training Conversational AI Agents on Noisy Data
Chatbots, virtual assistants, robots, and more: conversational AI is already highly visible in our daily lives. Companies looking to increase engagement with customers while reducing costs are investing heavily in the space. The numbers are clear: the conversational AI agents industry is expected to grow through at least 2025. By that time, organizations that leverage AI in their customer engagement platforms are predicted to increase operational efficiency by 25%.

The global pandemic has only accelerated these expectations, as conversational AI agents have been critical to businesses navigating a virtual world while still wanting to remain connected with customers. Conversational AI helps companies overcome digital communication's impersonal nature by providing a tailored, humanized experience for each customer. These changes redefine the way brands engage and, given the successful proof of concept, will undoubtedly become the new normal even post-pandemic.

Building conversational AI for real-world applications is still challenging, however. Mimicking the flow of human speech is extremely difficult: AI must account for different languages, accents, colloquialisms, pronunciations, turns of phrase, filler words, and other variability. This effort requires a vast collection of high-quality data. The problem is that this data is often noisy, filled with irrelevant entities that can misconstrue intent. Understanding the role data plays, and the mitigation steps for managing noisy data, is essential to reducing errors and failure rates.
Data Collection and Annotation for Conversational AI Agents
To understand the complexities of creating a conversational agent, let's walk through a typical process for building one with voice capabilities (such as Siri or Google Home); a simplified code sketch of the full pipeline follows the list.
- Data Input. The human agent speaks a command, comment, or question, which the system captures as an audio file. Using speech recognition machine learning (ML), the computer converts this audio to text.
- Natural Language Understanding (NLU). The model uses entity extraction, intent recognition, and domain identification (all techniques for understanding human language) to interpret the transcribed text.
- Dialogue Management. Because speech recognition can be noisy, statistical modeling is used to map out distributions over the human agent's likely goals. This is known as dialogue state tracking.
- Natural Language Generation (NLG). Structured data is converted into natural language.
- Data Output. Text-to-speech synthesis converts the natural language text from the NLG stage into the audio output. If accurate, the output addresses the human agent's original request or comment.
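The pipeline can be pictured as five functions chained together. Below is a minimal, runnable Python sketch of that structure; every function name, keyword rule, and response template is a hypothetical placeholder, since a real system would plug in actual ASR, NLU, and TTS components.

```python
# A toy end-to-end pipeline. The ASR and TTS stages are stubbed out, and the
# NLU and dialogue-state logic are deliberately naive placeholders.

def speech_to_text(audio: bytes) -> str:
    """Stage 1 (Data Input): convert captured audio to text (stubbed)."""
    return "where is the closest store"

def understand(text: str) -> dict:
    """Stage 2 (NLU): keyword matching standing in for intent recognition."""
    intent = "find_store" if "store" in text else "unknown"
    return {"intent": intent, "utterance": text}

def track_dialogue_state(belief: dict, nlu: dict) -> dict:
    """Stage 3 (Dialogue Management): update a distribution over likely goals."""
    belief = dict(belief)
    belief[nlu["intent"]] = belief.get(nlu["intent"], 0.0) + 1.0
    total = sum(belief.values())
    return {goal: score / total for goal, score in belief.items()}

def generate_response(belief: dict) -> str:
    """Stage 4 (NLG): turn the most likely goal into natural language."""
    templates = {"find_store": "The nearest store is two blocks away.",
                 "unknown": "Sorry, could you rephrase that?"}
    return templates[max(belief, key=belief.get)]

def text_to_speech(text: str) -> bytes:
    """Stage 5 (Data Output): synthesize audio from the response (stubbed)."""
    return text.encode("utf-8")

belief: dict = {}
nlu_result = understand(speech_to_text(b"<audio frames>"))
belief = track_dialogue_state(belief, nlu_result)
audio_out = text_to_speech(generate_response(belief))
```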
Let's explore NLU a bit further, as this is a critical step in managing noisy data. NLU typically requires the following steps (a short code sketch follows the list):
- Define Intents. What is the human agent's goal? For example, "Where is my order?", "View lists", or "Find store" are all examples of intents, or purposes.
- Utterance Collection. Different utterances working toward the same goal must be collected, mapped, and validated by data annotators. For example, "Where's the closest store?" and "Find a store near me" have the same intent but are different utterances.
- Entity Extraction. This technique is applied to parse out critical entities in the utterance. If you have a sentence like, "Are there any vegetarian restaurants within 3 miles of my house?", then "vegetarian" would be a type entity, "3 miles" a distance entity, and "my house" a reference entity.
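To make these steps concrete, here is a small sketch assuming scikit-learn is available. The tiny utterance collection, intent labels, and distance regex are all invented for illustration; a production NLU model would be trained on a large set of validated utterances.

```python
# A toy NLU layer: intent classification plus regex entity extraction.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Utterance collection: different phrasings mapped to the same intent label.
utterances = ["where is my order", "track my package",
              "where's the closest store", "find a store near me",
              "view my shopping lists", "show my lists"]
intents = ["order_status", "order_status",
           "find_store", "find_store",
           "view_lists", "view_lists"]

# Intent recognition: a bag-of-words classifier over the collected utterances.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(utterances, intents)
print(classifier.predict(["find the nearest store"])[0])  # likely "find_store"

# Entity extraction: a toy pattern that pulls out a distance entity.
DISTANCE = re.compile(r"(\d+(?:\.\d+)?)\s*(miles?|km)")
query = "are there any vegetarian restaurants within 3 miles of my house"
print(DISTANCE.search(query).group(0))                    # "3 miles"
```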
Given these steps, what are the challenges in designing dialogue? First, there's no straightforward way to collect human intents in a way that's universal for everyone. Second, it's difficult to model real-world conversation flow, which varies from person to person by geography, age, and other individual factors. Finally, data collection can be noisy and costly.

A lot of automatic speech recognition (ASR) data contains noise, where the machine misunderstands specific words or phrases in the audio file. For example, "I would like one" becomes "I would like I'm on," which is meaningless. Human speech is natural and unscripted; we often use filler words that are irrelevant to our intent. "Oh yeah, I think, yeah, this is better" contains several unneeded filler phrases that can cloud the interpretation of meaning. Humans also phrase things with high variability, depending on their location, upbringing, and experience.

When we look at the stats on noisy data, we find that AI is correct in an average of 53% of cases and makes minor errors in 30% of cases. In the remaining 17% of cases, AI makes significant errors, demonstrating that noisy data is still a problem for businesses launching conversational AI agents.
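One common mitigation step is to strip filler phrases from ASR transcripts before they reach intent classification. The sketch below is a deliberately simple, rule-based version; the filler list is illustrative, and real systems typically learn which tokens are disfluencies rather than relying on a fixed list.

```python
# A toy transcript cleaner: strips punctuation and common filler phrases.
import re

# Multi-word fillers are listed first so they are removed before their parts.
FILLERS = ["you know", "i think", "i mean", "oh", "um", "uh", "yeah"]

def strip_fillers(transcript: str) -> str:
    """Remove punctuation and filler phrases that carry no intent."""
    text = re.sub(r"[^\w\s]", " ", transcript.lower())
    for phrase in FILLERS:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    return " ".join(text.split())

print(strip_fillers("Oh yeah, I think, yeah, this is better"))  # this is better
```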
Designing Dialogues for Social Robots

In many cases, a conversational agent's goal is to interact with humans as a peer, not as a device. This means communicating using speech and gesture, providing useful services, and leveraging natural language to maintain a natural conversation flow. How, then, do we develop social robots that can interact with people?

One way to approach creating a social robot with personality is through flowchart-based visual programming. Flowchart blocks represent back-end functions, such as talking, shaking hands, and moving to a point, and they catalog the flow of interaction. Content authors can use the flowchart to easily combine speech, gesture, and emotion to build engaging interactions. (A toy code sketch of this approach appears at the end of this section.)

Erica (the ERATO Intelligent Conversational Android) was built using this method. Her content authors iteratively added content over months to develop her as a character rather than just a question-answering device. She can now perform over 2,000 behaviors and over 50 topic sequences.

Another approach to designing a social robot is teleoperation. The Nara Experiment employed a robot at the tourist center in Nara, Japan, to act as a tour guide for visitors. Human tour guides created offline content for the robot (for example, background information on the local Todaiji Temple), and engineers programmed the robot with the information ahead of time. The team contrasted this method with teleoperation: when a teleoperator controlled the robot remotely, results were more accurate than when the robot relied on offline data. The problem was that this method wasn't very scalable, content entry was slow and error-prone, and multimodal behaviors were challenging to control.

While interesting case studies, these experiments prompt questions about more scalable alternatives to dialogue design. Would it not be more efficient to collect in-situ data from real human-to-human interactions?
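To make the flowchart idea concrete, here is a toy sketch of flowchart-based dialogue authoring. The block names, utterances, gestures, and transitions are all invented; a system like Erica's would attach far richer multimodal behavior to each block.

```python
# A toy flowchart interpreter: each block pairs speech with a gesture and
# maps user replies to the next block.
from dataclasses import dataclass, field

@dataclass
class Block:
    say: str                                          # utterance for this block
    gesture: str = "none"                             # accompanying gesture
    transitions: dict = field(default_factory=dict)   # user reply -> next block

FLOWCHART = {
    "greet": Block("Hello! Can I help you?", gesture="wave",
                   transitions={"yes": "offer_tour", "no": "farewell"}),
    "offer_tour": Block("Todaiji Temple is a short walk away. Interested?",
                        gesture="point",
                        transitions={"yes": "farewell", "no": "farewell"}),
    "farewell": Block("Enjoy your visit!", gesture="bow"),
}

def run(script, start="greet"):
    """Walk the flowchart, printing each block's speech and gesture."""
    block_id, replies = start, iter(script)
    while True:
        block = FLOWCHART[block_id]
        print(f"[{block.gesture}] {block.say}")
        if not block.transitions:          # terminal block reached
            break
        block_id = block.transitions.get(next(replies, "no"), "farewell")

run(["yes", "no"])
```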
Learning by Imitation for Social Robots
If we could crowdsource human behaviors, we could collect higher-quality data more passively and cost-efficiently. We could observe human interactions, abstract typical behavior elements, and generate robot interactions based on this. One such team explored the validity of this idea by setting up a camera shop scenario. Let's walk through their methodology (a toy sketch of the resulting model follows the list):
- Data Collection. The team collected data on the multimodal behaviors of human customers and shopkeepers, covering three critical categories: speech, locomotion, and proxemics formation.
- Speech: Using automatic speech recognition, the model captured the typical utterances (for example, "How many megapixels does this camera have?" or "What is the resolution?") and used hierarchical clustering to map these utterances to intents.
- Locomotion: Sensors captured tracking data on typical locations where humans congregate, such as the service counter, and distinct trajectories, such as from the door to the camera display. Clustering was used to determine the frequency of each position and trajectory.
- Proxemics Formation: Sensors captured typical formations of customer and shopkeeper; for example, face-to-face, or the shopkeeper presenting a product. In addition, when a customer spoke or moved, that interaction was discretized into customer-shopkeeper action pairs.
- Model Training. The team then trained the model using the customer action (including the utterance, motion, and proxemics) and the shopkeeper's corresponding response. For example, the customer action might include asking, "How much does this cost?" while facing the shopkeeper; the shopkeeper would then reply, "It's $300."
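Here is a toy sketch of that training idea under heavy simplification: customer utterances are vectorized, grouped with hierarchical clustering (as in the study), and each cluster is mapped to its most frequent shopkeeper response. The four action pairs and the lookup logic are invented for illustration.

```python
# Toy imitation learning from customer-shopkeeper action pairs.
import numpy as np
from collections import Counter, defaultdict
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

pairs = [  # (customer utterance, observed shopkeeper response)
    ("how much does this camera cost", "It's $300."),
    ("how much is the cost of this camera", "It's $300."),
    ("how many megapixels does this camera have", "It has 24 megapixels."),
    ("what megapixels does this camera have", "It has 24 megapixels."),
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([utterance for utterance, _ in pairs]).toarray()

# Hierarchical clustering groups utterances that share an intent.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# For each cluster, remember the most common shopkeeper response.
responses = defaultdict(Counter)
for label, (_, reply) in zip(labels, pairs):
    responses[label][reply] += 1
policy = {label: c.most_common(1)[0][0] for label, c in responses.items()}

# Answer a new customer action by finding the nearest cluster centroid.
centroids = np.array([X[labels == c].mean(axis=0) for c in sorted(set(labels))])
query = vectorizer.transform(["how much does it cost"]).toarray()
nearest = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
print(policy[nearest])  # expected: "It's $300."
```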
After the team trained the model, they tested the robot on the camera shop floor, where it accurately handled 216 different interactions. While a long way from being a human replica, the robot in this case study demonstrates the complexities involved in attempting to mimic human speech and behavior.
Moving Forward with Conversational AI
What do we take away from these examples? Building conversational agents is difficult. Data is noisy and hard to capture, and imitating human language is a formidable challenge. That's why it's essential to design data collection workflows that capture high-quality data. An in-situ approach to data collection is best for capturing natural conversation, although more progress is still needed to reduce the error rate further.

The problem of noisy data is a constant one. Using ML-assisted validation to reject noisy utterances from the outset, and leveraging abstraction and data-driven techniques, can reduce noise. Unlocking the business value of conversational AI agents will mean investing heavily in data and developing more accurate ML approaches to solving the natural language problem.

At 色导航, we have been helping companies successfully create their conversational AI agents, getting them from experiment to full deployment by helping them navigate the complexities of data collection and annotation.