
Diverse Intent Training Data Library

Link to presentation deck.

CHALLENGE

The biggest problem with conversational bots these days is that they can only really listen to and understand a certain kind of English, because they are trained on biased data. We wanted to expand Salesforce bots' ability to recognize diverse English varieties by feeding them linguistically inclusive training data.

SOLUTION

Leverage sociolinguistic field methods to responsibly collect diverse data, and create a library of utterances for all teams at Salesforce to use.

PROCESS

  • Prioritizing historically underrepresented varieties

    • African-American English

    • Chicano English

    • Southern U.S. English

    • English as a second language

    • We also collected demographic data: gender identity, race, education level, and age (all of which affect language use)

  • Building a library

    • We went in with very explicit instructions not to overthink spelling, punctuation, or grammar, because doing so would erase much of the linguistic variety and variation we wanted to see in the data. We then partnered with a third-party vendor to help us reach these research participants and collect the data.

    • We started by defining what we mean by a chatbot, provided a sample scenario, and then asked participants what they would type in that situation. Then we asked for a couple of variations on how else they would say it.

  • Setting a standard

    • By leveraging research and methodology from other fields, we were able to do this at scale.

  • Drafting a framework

    • Basing our work on sociolinguistics gave us a lot to draw on for how to responsibly collect diverse data.

    • Starting from a language-first approach and asking open-ended questions to elicit as much language as possible

Marlinda Galapon