Speech emotion recognition
The amount of digitally stored speech, in the form of interviews, lectures, debates, radio talk show archives, and podcasts, is growing rapidly. Owing to this, Speech Emotion Recognition (SER) has become one of the most active research topics in computational linguistics over the last two decades. SER, as the name suggests, refers to the automatic detection of emotion from audio samples.
This post focuses on the work we have done in the field of SER in a project conducted with journalists from SvD, one of Sweden’s top daily newspapers. Read about the project as a whole in the introductory post: Empowering Journalists at SvD with AI Tools for Podcast Analysis.
What is sentiment analysis?
Sentiment analysis from audio refers to the process of analyzing spoken language in order to determine the emotional state or attitude of the speaker. It can be done in one of two ways: either by making use of text or by simply using the raw audio.
Sentiment analysis using text involves analyzing written language, such as social media posts, reviews, or emails. This can be done using a variety of NLP techniques, such as identifying positive or negative keywords or using pre-trained models to classify the overall sentiment as positive, negative, or neutral. The input for this type of sentiment analysis is written text and it can be performed on any written text regardless of the language it is written in.
Sentiment analysis using raw audio, referred to as SER, on the other hand, involves analyzing spoken language, such as speeches, podcasts, or phone conversations. SER involves identifying human emotion and affective states from speech, making use of the fact that voice often reflects the underlying emotion via tone and pitch. Traditionally, sentiment analysis of audio has involved transcribing the audio data into text and then applying NLP and machine learning techniques to determine the sentiment. SER can additionally draw on non-verbal cues such as tone of voice, intonation, and pauses. The input for this type of sentiment analysis is audio data, and it can be performed on any audio data regardless of the language spoken in it.
As an initial step, an attempt was made to build an emotion classification model. A classifier in machine learning is an algorithm that orders or categorizes data into one or more of a set of “classes”. In our case, we wanted the model to be able to classify the audio into different “emotion classes”.
In order to do this, we made use of four datasets commonly used in this field of work:
- 1. TESS – Toronto emotional speech set 
- 2. SAVEE – Surrey Audio-Visual Expressed Emotion 
- 3. RAVDESS – The Ryerson Audio-Visual Database of Emotional Speech and Song 
- 4. CREMA-D – Crowd-sourced Emotional Multimodal Actors Dataset 
All of these are labeled datasets, containing audio samples annotated with the correct emotion. The available labels are disgust, happy, fear, angry, neutral, sad and surprise. The combined dataset consisted of 12,162 audio samples.
As features, we used the VGGish embeddings of the audio samples. VGGish is a VGG-like audio classification model that was trained on a large YouTube dataset (an initial version of YouTube-8M). The audio embeddings obtained from the model are 128-dimensional vectors which can be used to represent the audio clip’s semantics.
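VGGish produces one 128-dimensional embedding per frame of roughly 0.96 seconds of audio, so a longer clip yields a sequence of vectors. The post does not say how these were combined into a single feature vector per sample. A minimal sketch, assuming simple mean pooling over frames (the function name `pool_embeddings` and the toy input are hypothetical), could look like this:

```python
from statistics import fmean

EMBEDDING_DIM = 128  # size of each VGGish embedding


def pool_embeddings(frames):
    """Average per-frame embeddings into one clip-level vector.

    `frames` is a non-empty list of 128-dimensional vectors, one per
    VGGish frame. Mean pooling is an assumption made for this sketch,
    not a method confirmed by the post.
    """
    if not frames:
        raise ValueError("need at least one frame embedding")
    if any(len(frame) != EMBEDDING_DIM for frame in frames):
        raise ValueError("every frame must have 128 dimensions")
    # zip(*frames) groups the values dimension by dimension.
    return [fmean(dim_values) for dim_values in zip(*frames)]


# Toy input: three frames of constant values (not real VGGish output).
toy_frames = [[0.0] * EMBEDDING_DIM, [1.0] * EMBEDDING_DIM, [2.0] * EMBEDDING_DIM]
clip_vector = pool_embeddings(toy_frames)
```

Mean pooling keeps the feature size fixed at 128 regardless of clip length, which is convenient for classifiers that expect fixed-length input.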
In order to simplify the problem, we reduced the number of classes. We did so by grouping the emotions as follows:
- 1. “Positive” – happy, surprise
- 2. “Negative” – disgust, fear, angry, sad
- 3. “Neutral” – neutral
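The grouping above can be written as a simple lookup table. The sketch below is illustrative only; the names `EMOTION_TO_GROUP` and `group_labels` are hypothetical and not taken from the project code.

```python
# Mapping from the seven original labels to the three coarse classes,
# following the grouping listed above.
EMOTION_TO_GROUP = {
    "happy": "Positive",
    "surprise": "Positive",
    "disgust": "Negative",
    "fear": "Negative",
    "angry": "Negative",
    "sad": "Negative",
    "neutral": "Neutral",
}


def group_labels(labels):
    """Replace each fine-grained emotion label with its coarse class."""
    return [EMOTION_TO_GROUP[label] for label in labels]


grouped = group_labels(["happy", "sad", "neutral", "fear"])
```

Note that the grouping is unbalanced by construction: four of the seven original labels fall into the “Negative” class.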
With the new classes and the embeddings as features, we trained an XGBoost model.
We also trained models using the original classes, without any grouping. For this we used a neural network architecture.
We tested some of these models on Swedish-language podcasts. We used episodes from SvD’s Alex & Sigge for these experiments. Instead of using the entire audio as input to the model, we broke the audio into consecutive 3-second windows.
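Splitting audio into consecutive windows amounts to computing start and end sample indices. The sketch below assumes non-overlapping windows and drops a shorter trailing window; the post specifies neither detail, and the function name and the 16 kHz sample rate are hypothetical.

```python
def window_bounds(num_samples, sample_rate, window_seconds=3.0):
    """Return (start, end) sample indices of consecutive windows.

    Windows do not overlap. A shorter trailing window is dropped,
    which is an assumption; padding it would be another option.
    """
    window_size = int(round(window_seconds * sample_rate))
    if window_size <= 0:
        raise ValueError("window must contain at least one sample")
    return [
        (start, start + window_size)
        for start in range(0, num_samples - window_size + 1, window_size)
    ]


# A 10-second clip at 16 kHz (hypothetical values) gives three full windows.
bounds = window_bounds(num_samples=160_000, sample_rate=16_000)
```

Each `(start, end)` pair can then be used to slice the waveform before passing the slice to the model.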
One observation from these results was that, despite performing well on English-language audio, the models did not generalize well to Swedish-language data. One likely reason is the difference in speaking styles and intonation patterns between the languages. Another possible reason is that the datasets used for training were created in a lab setting and may not capture real-life conversation styles of the kind often heard in podcasts.
This led us to explore other datasets and models that are closer to “real life” conversations.
Laughter is one of the most fundamental forms of human expression and it is often associated with positive emotions. Using laughter as a substitute for “positive” emotion, therefore, seemed like a reasonable approach. The idea was to auto-detect segments of laughter in audio clips and calculate indicators such as the number of times laughter occurred in a clip, the length of the laughter segments, etc.
The model used for laughter detection was a pre-trained ResNet model based on the paper “Robust Laughter Detection in Noisy Environments”. The model was trained on AudioSet, which is a collection of over two million human-labeled 10-second sound clips drawn from YouTube videos. This model takes an audio sample as input and extracts laughter segments from it.
This model was run on all the episodes of the shows Alex & Sigge and Fråga Agnes Wold.
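Once laughter segments have been detected, indicators such as the number of segments and their lengths are straightforward to compute. The following is a minimal sketch that assumes segments arrive as `(start, end)` pairs in seconds. That format, the function name, and the extra `share_of_clip` indicator are assumptions rather than details from the project.

```python
def laughter_indicators(segments, clip_seconds):
    """Summarise laughter segments given as (start, end) pairs in seconds."""
    durations = [end - start for start, end in segments]
    total = sum(durations)
    return {
        "count": len(durations),
        "total_seconds": total,
        "mean_seconds": total / len(durations) if durations else 0.0,
        "share_of_clip": total / clip_seconds if clip_seconds > 0 else 0.0,
    }


# Made-up segments for illustration, not output from the real model.
stats = laughter_indicators(
    [(2.0, 3.5), (10.0, 12.0), (40.0, 40.5)], clip_seconds=60.0
)
```

Normalising by clip length, as `share_of_clip` does, makes episodes of different durations easier to compare.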
Valence – Arousal – Dominance
When it comes to sentiment analysis, one approach (the one discussed thus far) is to identify and categorize emotions such as “anger” and “happiness”. Another approach focuses on capturing the underlying dimensions of these emotions: Valence, Arousal and Dominance.
Valence: A measure of how pleasant or unpleasant one feels about something. For example, both sadness and fear are unpleasant emotions, and both score low on the valence scale. Joy, however, is a pleasant emotion and therefore has a high valence.
Arousal: A measure of how energized one feels. It is not the same thing as the intensity of the emotion, although the two often go together. For example, both anger and rage are unpleasant emotions, but rage is the more energized state and therefore has a higher arousal value. Boredom, which is also an unpleasant state, has a low arousal value.
Dominance: A measure of how controlling and dominant versus controlled or submissive one feels. For instance, while both fear and anger are unpleasant emotions, anger is a dominant emotion, while fear is a submissive emotion.
In order to capture the three dimensions, we made use of Audeering’s wav2vec 2.0-based pre-trained model. This model was created by fine-tuning Wav2Vec2-Large-Robust on the MSP-Podcast dataset. The model takes an audio sample as input and produces one score each for valence, arousal, and dominance in the range [0, 1].
Instead of using entire podcast episodes as input to the model, we used individual sentences spoken during the podcast. The podcast audio was broken down into sentences by making use of its metadata.
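The post does not describe the metadata format. As a sketch, assume each sentence comes with start and end timestamps in seconds under hypothetical `"start"` and `"end"` keys, and that the audio is sampled at 16 kHz. Converting sentences to sample ranges could then look like this:

```python
def sentence_sample_ranges(sentences, sample_rate):
    """Convert sentence timestamps (seconds) to (start, end) sample indices.

    `sentences` is a list of dicts with hypothetical "start" and "end"
    keys; the real metadata layout is not described in the post.
    """
    ranges = []
    for sentence in sentences:
        start = int(sentence["start"] * sample_rate)
        end = int(sentence["end"] * sample_rate)
        if end > start:  # skip empty or malformed entries
            ranges.append((start, end))
    return ranges


# Made-up timestamps for illustration.
toy_metadata = [
    {"start": 0.0, "end": 2.5},
    {"start": 2.5, "end": 6.0},
    {"start": 6.0, "end": 6.0},  # zero-length entry, dropped
]
ranges = sentence_sample_ranges(toy_metadata, sample_rate=16_000)
```

Each range selects the waveform slice for one sentence, which can then be scored separately for valence, arousal, and dominance.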
This blog post summarizes the work we have done so far in the field of SER. There is great potential to build on this work, given the wide range of use cases to which it can be applied.
 Schuller, B.W., 2018. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM, 61(5), pp.90-99.
 Gillick, J., Deng, W., Ryokai, K. and Bamman, D., 2021. Robust Laughter Detection in Noisy Environments. In Interspeech (pp. 2481-2485).
 Mehrabian, A., 1980. Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies.
 Islam, M.R., Moni, M.A., Islam, M.M., Rashed-Al-Mahfuz, M., Islam, M.S., Hasan, M.K., Hossain, M.S., Ahmad, M., Uddin, S., Azad, A. and Alyami, S.A., 2021. Emotion recognition from EEG signal focusing on deep learning and shallow learning techniques. IEEE Access, 9, pp.94601-94624.