Diverse Intent Training Data Library
CHALLENGE
The biggest problem with conversational bots today is that they reliably understand only certain kinds of English, because they are trained on biased data. We wanted to expand Salesforce bots' ability to recognize diverse English varieties by feeding them linguistically inclusive training data.
SOLUTION
Leverage sociolinguistic field methods to responsibly collect diverse data, and create a library of utterances for all teams at Salesforce to use.
PROCESS
- Prioritizing historically underrepresented varieties
  - African-American English
  - Chicano English
  - Southern U.S. English
  - English as a second language
  - We also collected demographic data: gender identity, race, education level, age (which all affect language use)
- Building a library
  - We gave participants explicit instructions not to overthink spelling, punctuation, or grammar, because doing so would erase much of the linguistic variation we wanted to see in the data. We partnered with a third-party vendor to reach research participants and collect the data.
  - We started by defining what we mean by a chatbot, provided a sample scenario, and asked participants what they would type in that situation. We then asked for a couple of variations on how else they would say it.
- Setting a standard
  - We leveraged research and research methodology from other fields so that this work could be done at scale.
- Drafting a framework
  - Basing our work on sociolinguistics gave us a lot to draw on for how to responsibly collect diverse data.
  - Starting from a language-first approach, and asking open-ended questions to elicit as much language as possible
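As a rough illustration of what one entry in such a library might look like, here is a minimal Python sketch. It is not the actual Salesforce schema: the class names, field names, the `order_status` intent label, and the sample utterances are all invented for this example. The only things taken from the process above are the ideas themselves: a scenario prompt, a first response plus variations, the participant's language variety and demographic data, and text stored exactly as typed, with no spelling or punctuation cleanup.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Demographics:
    # Self-reported fields mentioned in the text; the types are an assumption.
    gender_identity: str
    race: str
    education_level: str
    age: int


@dataclass
class UtteranceRecord:
    scenario: str                # hypothetical prompt shown to the participant
    intent: str                  # hypothetical intent label for that scenario
    variety: str                 # self-identified English variety
    demographics: Demographics
    utterance: str               # first response, stored verbatim
    variations: List[str] = field(default_factory=list)

    def all_utterances(self) -> List[str]:
        # The first response followed by the "how else would you say it" answers.
        return [self.utterance, *self.variations]


def training_pairs(records: List[UtteranceRecord]) -> List[Tuple[str, str]]:
    """Flatten records into (text, intent) pairs without normalizing the text."""
    return [(text, r.intent) for r in records for text in r.all_utterances()]


def coverage_by_variety(records: List[UtteranceRecord]) -> Counter:
    """Count utterances per variety, to spot which groups are still thin."""
    counts = Counter()
    for r in records:
        counts[r.variety] += len(r.all_utterances())
    return counts


# Invented placeholder data, for illustration only.
records = [
    UtteranceRecord(
        scenario="You ordered something online and want to know where it is.",
        intent="order_status",
        variety="Southern U.S. English",
        demographics=Demographics("(self-reported)", "(self-reported)",
                                  "(self-reported)", 34),
        utterance="hey wheres my order??",
        variations=["can yall tell me where my order is"],
    )
]

print(training_pairs(records))
print(coverage_by_variety(records))
```

Keeping the variety and demographic fields next to each utterance is what would let a team check coverage before training, rather than discovering gaps after a bot is deployed.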