Zero-Shot Learning in NLP
Introduction
Natural Language Processing (NLP) has been progressing rapidly, driven by the advancement of neural architectures such as the Transformer and the emergence of large-scale pre-trained models like BERT. While Transformers have enabled models with higher capacity, pre-training has made them applicable to a wide range of NLP tasks. Recent transformer-based language models, such as RoBERTa, ALBERT, and OpenAI GPTx, have shown a powerful ability to learn universal language representations. However, in many real-world scenarios, the scarcity and cost of labeled data remain limiting factors. Methods that reduce the dependence on large amounts of labeled data are therefore attractive alternatives in applied machine learning. Transfer learning approaches this problem by focusing on transferring knowledge from one domain to another.
Our application
In 2020 we built an application based on ZSL for Riksbanken (the Swedish central bank) to track the real-time economic effects of COVID-19. It tracked and analyzed the content of all press releases published by the companies on the main Swedish stock markets, looking for signs of fear, mentions of COVID-19-related effects, and other indicators of the economic effects of the pandemic. Some examples of the output are shown further below.
What is Zero-Shot Learning?
Transfer learning finds its inspiration in the human capacity to generalize from experience. Humans are very good at using previous knowledge to handle new situations. For instance, a person who speaks Swedish can use that experience to learn a similar language such as Norwegian.
Zero-shot learning (ZSL) is a form of transfer learning that aims to learn patterns from labeled data in order to detect classes that were never seen during training. Since the lack of labeled data and the need to scale to new classes are recurring problems in machine learning applications, ZSL has gained much attention in recent years thanks to its ability to predict unseen classes.
How is ZSL used?
In computer vision, ZSL is generally used for image classification, where information about a novel class, commonly in the form of a textual description or attributes, is used to find the relation between the description of the class and its visual representation. For example, someone who has seen a horse but has never seen a zebra could likely recognize one, knowing that a zebra looks like a horse with black and white stripes. Zero-shot learning assumes a semantic relationship between the seen and unseen classes.
In NLP, zero-shot learning tackles problems such as text classification. However, the relationship between unseen and seen examples is usually more complex when applied to text. Language is often ambiguous, context-dependent, and changes over time, which means that the way we express an emotion like “anger” might be very different from how we express “surprise”. It is therefore difficult to match a novel class to a learned class, since their similarities can be hard to find. In this regard, ZSL classification models look for patterns that connect the semantic space of a label with the feature space of a text.
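One simple way to picture this is to embed both the text and the candidate label names in the same semantic space and rank the labels by similarity. Below is a minimal sketch of that idea; the sentence-transformers model and the example labels are chosen purely for illustration and are not the approach we use later in this post.

from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence encoder that maps texts and labels
# into the same vector space would serve the same purpose.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

text = "The company reports a sharp drop in quarterly revenue."
candidate_labels = ["anger", "surprise", "fear", "optimism"]

# Embed the text and the label names into the same semantic space,
# then rank labels by cosine similarity to the text.
text_emb = encoder.encode(text, convert_to_tensor=True)
label_embs = encoder.encode(candidate_labels, convert_to_tensor=True)
scores = util.cos_sim(text_emb, label_embs)[0]

for label, score in sorted(zip(candidate_labels, scores.tolist()), key=lambda x: -x[1]):
    print(f"{label:>10}: {score:.3f}")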
ZSL for text classification
In 2019, Yin et al. presented an approach to ZSL for text classification in which they formulate the problem as textual entailment: each label is converted into a hypothesis sentence, and the model then determines whether the hypothesis is entailed by the input text. For example, if we want to evaluate whether a text can be classified under the topic “health”, we can state the hypothesis “this text is about health”. The objective is to determine whether the hypothesis is true (entailment) or false (contradiction). Taking advantage of pre-trained natural language inference (NLI) models, we can embed both the premise and the hypothesis. This idea is not new, but with the popularity of transformer architectures, more works are adopting pre-trained models for embeddings. Combined with large pre-trained models like BERT and RoBERTa, this method is very effective.
For NLI datasets, transformer architectures use inputs in a sequence-pair structure. This means that the input of the model consists of both the premise and the hypothesis as separate texts, and the output indicates the probability of each category: entailment, contradiction, and neutral. In the example below, we use the BART model from HuggingFace’s library to calculate the probability that the premise “To mitigate the spread of Covid-19 everyone is advised to avoid social events” belongs to the class “pandemic”. Even though the model hasn’t been trained specifically to classify “pandemic”, and the premise does not include the word “pandemic”, the model is able to recognize the relationship between the premise and the hypothesis.
from transformers import BartTokenizer, BartForSequenceClassification

# load BART pre-trained model
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')

premise = 'To mitigate the spread of Covid-19 everyone is advised to avoid social ' \
          'events'
label = "pandemic"
hypothesis = f"this text is about {label}"

# encode and predict
input_ids = tokenizer.encode(premise, hypothesis, return_tensors="pt")
logits = model(input_ids)[0]

# we use the probability of "entailment" [2] as the probability of the label
# being true
entail_logits = logits[:, [0, 2]]
probabilities = entail_logits.softmax(dim=1)
true_prob = probabilities[:, 1].item() * 100
print(f"The probability that the label is true is: {true_prob:0.2f}%")
The probability that the label is true is: 78.64%
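The same entailment-based approach is also exposed through HuggingFace’s zero-shot-classification pipeline, which builds the hypothesis and normalizes the scores for us. A minimal sketch, where the set of candidate labels is chosen only for illustration:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

premise = "To mitigate the spread of Covid-19 everyone is advised to avoid social events"
# Candidate labels are illustrative; the pipeline turns each one into a hypothesis
# (by default "This example is {label}.") and scores it with the NLI model.
result = classifier(premise, candidate_labels=["pandemic", "sports", "finance"])

for label, score in zip(result["labels"], result["scores"]):
    print(f"{label:>10}: {score:.3f}")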
Similar to the previous example, we used BART to analyze the number of press releases related to the COVID-19 pandemic between October 2019 and July 2020. The graph below shows the results of this analysis. As one might expect, the number of press releases related to the pandemic increased dramatically after the World Health Organization (WHO) declared the coronavirus outbreak a pandemic on March 11th, 2020.
When we classify the same press releases by sentiment, fearful or optimistic (see figure below), we observe that the WHO pandemic declaration also marks an important change in press sentiment. The tendency shifts from predominantly optimistic before the declaration to more fearful after it. These examples demonstrate the practicality and power of ZSL methods to classify text without additional training for specific classes.
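A minimal sketch of how such a sentiment classification could be set up with the same pipeline; the press-release snippets and the exact label wording below are illustrative, not the data or labels from our application:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical press-release snippets, used only to illustrate the idea.
press_releases = [
    "Due to the ongoing restrictions we expect a significant drop in revenue.",
    "Demand has recovered faster than anticipated and we raise our full-year outlook.",
]

for text in press_releases:
    result = classifier(text, candidate_labels=["fearful", "optimistic"])
    # The highest-scoring label is taken as the sentiment of the press release.
    print(f"{result['labels'][0]:>10} ({result['scores'][0]:.2f})  {text}")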
Below, we show normalized graphs of optimistic sentiment in press releases and the OMXS30 stock index (tracking the performance of the 30 most traded companies on the Stockholm stock exchange).
Finally
In this blog post, we have given a brief overview of ZSL and its application in NLP, particularly in text classification. Zero-shot learning is a rapidly growing technique that has attracted great interest in NLP tasks. It is a powerful tool when labeled training data is limited, which is a common problem in real-world applications. We will continue working on this topic and are always open to discussions. Don’t hesitate to contact us!