ELMo’s World: Using Deep Learning to Interpret Words with Many Meanings

by Isha Salian

Bank. Fair. Duck.

Out of context, each word has multiple meanings. A bank is where you keep your money or the side of a river. Fair is where you climb into a Ferris wheel or your assessment of the ride. Duck is how you avoid injury from an incoming, same-named waterfowl.

Humans typically don’t have a problem figuring out which meaning of a word applies where. But natural language processing models are a different story.

AI tools to parse text have been around for years, but they run into a stumbling block when it comes to words with multiple meanings.

Researchers from the Allen Institute for Artificial Intelligence and the University of Washington are getting past this hurdle with a neural network that determines the meaning of a word depending on the context in which it appears.

Reading Forwards and Backwards 

NLP models are typically trained on data structured as word vectors, which encode basic elements of each word’s meaning and syntax. The algorithm assumes that each word has a single vector representation — but that’s not how the English language works.
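To see why that single-vector assumption breaks down, here is a minimal sketch in Python, with a toy vocabulary and made-up numbers: a static embedding table maps every occurrence of “bank” to the same vector, whether the sentence is about rivers or money.

import numpy as np

# Hypothetical static embedding table: one fixed vector per word (values are made up)
embedding_table = {
    "bank":  np.array([0.21, -0.47, 0.88]),
    "river": np.array([0.05,  0.61, 0.13]),
    "money": np.array([0.72, -0.10, 0.33]),
}

def embed(sentence):
    # Every occurrence of a word is looked up in the same table, regardless of context
    return [embedding_table[w] for w in sentence.lower().split() if w in embedding_table]

print(embed("The river bank flooded"))
print(embed("The bank froze my money"))  # "bank" comes out identical both times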

The researchers broke this assumption with their neural network, called ELMo, which can create an infinite number of vectors per word. (ELMo stands for Embeddings from Language Models, not the furry red Sesame Street character, explained Matthew Peters, lead author on the paper.)

ELMo loves to read: Not your toddler’s Elmo, but ELMo, the neural network using bidirectional language models.

To attach a vector to each potential meaning of a word, the team used a bidirectional language model. Regular language models try to predict the next word that will appear in a sentence. If a fragment reads “The people sat down on the …,” the algorithm will predict words like bench or grass.

Making the model bidirectional means it has a second, backwards-looking algorithm that takes the end of a sentence and tries to predict the word that came before it. This is useful when the word a model is trying to parse comes at the start of a sentence, with the relevant context coming later.

“It’s ‘He lies to his teacher’ versus ‘He lies on the sofa,’” said Peters.
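The sketch below, written in PyTorch, shows the general shape of such a bidirectional language model. It is a toy illustration with made-up dimensions, not the authors’ architecture: one LSTM reads left to right and predicts the next word, a second reads right to left and predicts the previous word, and concatenating their hidden states gives each word a representation that depends on its full sentence context.

import torch
import torch.nn as nn

class TinyBiLM(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # reads left to right
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # reads right to left
        self.next_word = nn.Linear(hidden_dim, vocab_size)  # predicts the word that comes next
        self.prev_word = nn.Linear(hidden_dim, vocab_size)  # predicts the word that came before

    def forward(self, token_ids):
        x = self.embed(token_ids)                            # (batch, words, embed_dim)
        fwd_out, _ = self.fwd_lstm(x)                        # forward context for each word
        bwd_out, _ = self.bwd_lstm(torch.flip(x, dims=[1]))  # run over the reversed sentence
        bwd_out = torch.flip(bwd_out, dims=[1])              # realign with the original word order
        contextual = torch.cat([fwd_out, bwd_out], dim=-1)   # one context-dependent vector per word
        return self.next_word(fwd_out), self.prev_word(bwd_out), contextual

model = TinyBiLM()
tokens = torch.randint(0, 1000, (1, 6))         # a fake six-word sentence
next_logits, prev_logits, word_vectors = model(tokens)
print(word_vectors.shape)                        # torch.Size([1, 6, 256])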

To test ELMo’s skill, the team evaluated the algorithm on six different NLP tasks, including sentiment analysis and question answering. Compared to previous techniques using the same training data, ELMo achieved a new state-of-the-art result every time — in some cases with an improvement of 25 percent over the prior leading model.

“In NLP, it is significant to see a single method improve performance for such a wide variety of tasks,” Peters said.

ELMo Takes on the World of Semi-Supervised Learning

With natural language processing, the type of training data matters. A model used for a Q&A system, for instance, can’t be trained on any old text. It typically requires training on a large dataset of annotated question and answer pairs to learn how to properly respond.

Annotating data can be time-consuming and expensive. So the researchers first chose to train ELMo on a large, unlabeled academic dataset of around a billion words, then adapt it to a smaller, annotated dataset for a specific task like Q&A. This method of leveraging lots of unlabeled data in combination with a small portion of labeled data is known as semi-supervised learning.
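A rough sketch of that recipe, again in PyTorch with placeholder data and dimensions (an illustration of the general pattern, not the authors’ code): the pretrained language-model encoder is frozen, and only a small task-specific layer is trained on the scarce labeled examples.

import torch
import torch.nn as nn

# Step 1 (stand-in): an encoder whose weights we pretend were pretrained on ~1B unlabeled words
embed = nn.Embedding(1000, 64)
pretrained_encoder = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
for module in (embed, pretrained_encoder):
    for p in module.parameters():
        p.requires_grad = False          # keep the expensive pretrained weights fixed

# Step 2: a small task-specific head trained on a handful of labeled examples
task_head = nn.Linear(128, 2)            # e.g. positive/negative sentiment
optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a tiny, fake labeled batch
tokens = torch.randint(0, 1000, (4, 6))  # 4 labeled sentences, 6 words each
labels = torch.randint(0, 2, (4,))       # 4 sentiment labels
hidden, _ = pretrained_encoder(embed(tokens))
logits = task_head(hidden.mean(dim=1))   # pool over words, then classify
loss = loss_fn(logits, labels)
loss.backward()                          # gradients flow only into task_head
optimizer.step()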

Reducing the reliance on labeled, annotated data makes it easier for researchers to apply their NLP models to real-world problems.

“In our case, we chose a benchmark unlabeled academic dataset to train the language model,” Peters said. But researchers can modulate the algorithm to work on any other unlabeled dataset as well to suit a field of specialization, like biomedical articles, legal contracts or other languages.

ELMo language model results: ELMo enhanced the performance of neural models compared to previous state-of-the-art (SOTA) baselines across six benchmark NLP tasks. From left to right, the tasks are: textual entailment, named entity recognition, question answering, coreference resolution, semantic role labeling and sentiment classification.

The researchers powered their training and inference with NVIDIA Tesla V100 and K80 GPUs through Amazon Web Services.

In a follow-up paper, the researchers applied the ELMo model to answer geometry questions, using just a few hundred labeled examples. This labeling could take just a few hours of human work, yet lead to significant improvements in the NLP model’s performance.

ELMo is available as an open source library. Peters says other NLP researchers are already incorporating the model into their own work, including for languages other than English.
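For example, ELMo ships with the open-source AllenNLP toolkit; a typical usage looks roughly like the sketch below, where the options and weights paths are placeholders for the files published with the library and the exact API may vary by version.

from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: substitute the options/weights files published with the AllenNLP release
options_file = "path/to/elmo_options.json"
weight_file = "path/to/elmo_weights.hdf5"

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["He", "lies", "to", "his", "teacher"],
             ["He", "lies", "on", "the", "sofa"]]
character_ids = batch_to_ids(sentences)   # ELMo builds word representations from characters
output = elmo(character_ids)

# output["elmo_representations"][0] holds one context-dependent vector per word,
# so the two occurrences of "lies" come out with different representations.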