University of Michigan Song and Speech Emotion Dataset
The University of Michigan Song and Speech Emotion Dataset (UMSSED) is a corpus of emotional singing and speaking performances in English, collected to study emotional expression and perception in song and speech. The UMSSED consists of 168 high-quality audio-visual recordings of three performers (1 female, 2 male) singing and speaking semantically neutral sentences in four emotions: angry, happy, neutral, and sad. The emotional content (categorical and dimensional) of the recordings was evaluated using Amazon Mechanical Turk. Evaluations were collected for three types of stimuli: audio-only, video-only, and audio-visual. Both the audio-visual recordings and the evaluations are freely available for scientific research and general use under a non-commercial Creative Commons license. Please use the following reference when you cite the UMSSED: Biqiao Zhang, Emily Mower Provost, Robert Swedberg, Georg Essl. “Predicting Emotion Perception Across Domains: A Study of Singing and Speaking.” AAAI. Austin, TX, USA, January 2015.
All three performers have completed coursework in the School of Music, Theater & Dance that included training in spoken and sung theatrical production. The performers recorded both domains in the same location under consistent visual and acoustic conditions. The vocal data were recorded with an Electro-Voice N/D 357 microphone, and the video data were recorded with a high-definition Canon Vixia HF G10 camcorder.
Our dataset uses fixed lexical content. We identified seven semantically neutral sentences and embedded each sentence into four passages, each associated with a target emotion from the set of angry, happy, neutral, and sad. This embedding allowed us to create an environment that would facilitate emotionally evocative performances. The consistency of the lexical content of the embedded target sentence allows for an analysis of emotion content while controlling for variation in lexical content. Seven stylistically neutral melodies were composed in a singable range to match the seven passages for the singing performances. Across the four emotional variations of a passage, the target sentence was set to the exact same melody; the remainder of the passage included minor differences across the variations to allow for differences in the lexical information. This resulted in 168 excerpts in total (2 domains of vocal expression × 3 performers × 7 sentences × 4 target emotions). We segmented out the target sentence from the remainder of the passage for both speaking and singing performances. The average durations of the target sentences were 3.04 ± 0.87 seconds for the singing recordings and 1.57 ± 0.37 seconds for the speaking recordings.
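The full factorial design above can be sketched in a few lines of Python. The identifier format below (domain, performer code, sentence index, emotion) is a hypothetical naming scheme for illustration only, not the dataset's actual file names:

```python
from itertools import product

domains = ["song", "speech"]
performers = ["F1", "M1", "M2"]   # hypothetical performer codes (1 female, 2 male)
sentences = range(1, 8)           # seven semantically neutral sentences
emotions = ["angry", "happy", "neutral", "sad"]

# Every excerpt is one (domain, performer, sentence, emotion) combination.
excerpts = [
    f"{d}_{p}_s{s:02d}_{e}"
    for d, p, s, e in product(domains, performers, sentences, emotions)
]

print(len(excerpts))  # 2 x 3 x 7 x 4 = 168
```

Because every factor is fully crossed, each performer contributes exactly 56 excerpts (2 × 7 × 4), and each emotion appears in exactly 42 excerpts.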
The recordings of the target sentences (referred to as utterances) were evaluated using Amazon Mechanical Turk. The evaluation included the original audio-visual clips, the audio information only, and the video information only, which resulted in 504 stimuli (168 × 3). The evaluators assessed the emotion content across the dimensions of valence (positive/negative emotional states), activation (energy or stimulation level), and dominance (passive vs. dominant) using a 9-point Likert scale. The evaluators also assessed the primary emotion of each clip from the set of angry, happy, neutral, sad, and other. A total of 10,531 evaluations were collected from 183 unique evaluators. Each utterance was evaluated by 20.9 ± 1.7 participants.
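A common way to consume evaluations of this kind is to aggregate per utterance: average the dimensional ratings and take a majority vote over the categorical labels. The sketch below assumes a flat record format (utterance ID, valence, activation, dominance, category) that is hypothetical, not the dataset's actual release format:

```python
from collections import defaultdict
from statistics import mean, mode

# Hypothetical evaluation records: (utterance_id, valence, activation,
# dominance, category), with dimensions on a 9-point Likert scale.
evaluations = [
    ("song_F1_s01_happy", 7, 6, 5, "happy"),
    ("song_F1_s01_happy", 8, 7, 6, "happy"),
    ("song_F1_s01_happy", 6, 5, 5, "neutral"),
]

by_utterance = defaultdict(list)
for utt, v, a, d, cat in evaluations:
    by_utterance[utt].append((v, a, d, cat))

summaries = {}
for utt, ratings in by_utterance.items():
    vals, acts, doms, cats = zip(*ratings)
    summaries[utt] = {
        "valence": mean(vals),      # mean dimensional ratings
        "activation": mean(acts),
        "dominance": mean(doms),
        "label": mode(cats),        # majority-vote categorical emotion
    }

print(summaries["song_F1_s01_happy"])
```

With roughly 21 evaluators per utterance, ties in the categorical vote are rare but possible; a real pipeline would need an explicit tie-breaking rule rather than relying on `mode`.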
Contact Biqiao Zhang: firstname.lastname@example.org
Reference: Biqiao Zhang, Emily Mower Provost, Robert Swedberg, Georg Essl. “Predicting Emotion Perception Across Domains: A Study of Singing and Speaking.” AAAI. Austin, TX, USA, January 2015. [pdf]