Corpus approach and Machine learning

When I began investigating the musicality of speech, I set out to analyze, extract and abstract all kinds of musical structures from speech recordings. Nevertheless, I thought I should keep the time aspect of speech intact. I felt that it is not the bits and pieces (isolated sounds or rhythmical motifs) but the continuous gestures through time that make speech what it is, and that it was somehow fundamental to keep this foundation. Repeat any short sound and it becomes a mechanical pulse or rhythmic motif, no matter where the sound came from in the first place. So, to force myself away from just recycling old musical preconceptions with these new sounds, I decided to see how far I could get with continuous speech.
However, after working for some time with speech recordings as the sole material for creating music, I felt that sticking to the original time structure of speech was limiting which features I could explore and what the music was going to sound like. For one thing, the distribution of events in time generally has a much faster pace in speech than in typical musical structures, and is perhaps less varied as well. Maybe this is due to the larger and slower movements required for walking, dancing and bodily expression in general, which are somehow closely connected to the expression and appreciation of music. In order to zoom in and focus on some particular quality I found interesting, I felt a need to slow down or stop the rushing by of ordinary time. In music and other sound-based arts this can be done through repetition, variation, or juxtaposition of similar elements in a collage. As Nelson Goodman points out about the role of theme and variation in art in “Languages of Art” (Goodman, 1976), modification, elaboration, differentiation and transformation of motifs and patterns are processes of constructive search, and such progressive variation is a typical way of advancing knowledge. This seems to be especially true of how artworks explore the world, constructing their own languages through these processes of differentiation, but in a more general sense progressive variation is perhaps how knowledge is expanded even in science. Of course, this can also describe how a topic might be explored through conversation, but then these processes relate to the exploration of thoughts and ideas rather than the material sound structures of speech, which are what I am interested in.

Corpus approach

Segments as coordinates in a space of mean pitch (vertical), speech rate (horizontal) and vocal effort (colour)

To address these issues I needed to change the way in which I used speech recordings. I decided to organize the recordings in a database (a corpus), automatically analyzing and labeling each recording with information about prosodic qualities such as mean fundamental frequency (pitch), pitch slope, speech rate (tempo), mean amplitude (loudness) and voice quality (vocal effort), as well as duration. This organization allows files to be selected based on prosodic, i.e. musical, qualities without specifying exactly which file to use. In addition, segmenting each file into shorter units based on phrases (breath groups), rhythmical motifs (stress groups) and syllables (vowels), and analyzing each unit, allows a complete reorganization of the segments based on their prosodic/musical qualities rather than their original order in the recording. This opens up a much more musical way of using the material, with a variable degree of fragmentation and distance from the original speech structures. One possibility is to explore fragments that occupy the same area of the prosodic space, creating sequences that make more sense musically than lexically and thus shifting the listening focus to their musical structures. This can involve repetition and progressive variation of shorter or longer segments, more in line with a typical musical exploration of the material.
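The idea of selecting segments by their position in prosodic space, rather than by file name, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Segment` fields, the invented feature values, the semitone-based distance and the rescaling of vocal effort are not taken from the actual database.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    file_id: int        # index of the source recording
    index: int          # position of the segment within that recording
    mean_pitch: float   # mean fundamental frequency in Hz
    speech_rate: float  # syllables per second
    effort: float       # vocal effort, here assumed to be scaled 0..1


# Hypothetical corpus entries; all values are invented for illustration.
corpus = [
    Segment(0, 0, 110.0, 4.0, 0.2),
    Segment(0, 1, 180.0, 6.0, 0.7),
    Segment(1, 0, 115.0, 4.5, 0.3),
    Segment(1, 1, 240.0, 7.0, 0.9),
]


def nearest(segments, pitch, rate, effort, k=2):
    """Return the k segments closest to a target point in prosodic space."""
    def distance(seg):
        # Compare pitch in semitones so that intervals, not raw Hz, matter.
        dp = 12 * math.log2(seg.mean_pitch / pitch)
        dr = seg.speech_rate - rate
        # Rough rescaling so effort is comparable to the other axes (assumption).
        de = (seg.effort - effort) * 10
        return math.sqrt(dp * dp + dr * dr + de * de)
    return sorted(segments, key=distance)[:k]


# Ask for low-pitched, slow, low-effort material without naming any file.
selection = nearest(corpus, pitch=112.0, rate=4.2, effort=0.25)
```

Here the query returns the two low-pitched segments, which come from different recordings, so material is grouped by how it sounds rather than by where it came from.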

Excerpt from database of analyzed segments

Machine Learning

A corpus approach also makes it possible to use statistical analysis of the original sequences of segments in order to create new sequences that are statistically probable. This results in sequences that can display some of the same overall structural features without using the exact same order of individual segments. There is a whole field of machine learning techniques concerned with different ways of generating new structures based on a given input, which in music has resulted in a range of semi-automatic improvisation software. However, from the outset many of these algorithms make assumptions about what music is and can be, and how it can be structured. In line with my wish to explore the specific traits of speech structures, I did not want to introduce too many musical constraints at this stage. I therefore opted for one of the simplest machine learning techniques, known as Markov chains. In short, this technique estimates the likelihood of each transition between the states in a chain, and from that model one can generate new chains whose transitions are as probable as those of the original.
When file and segment indices are included in this analysis, it is possible to move gradually from the original order to more and more fragmented sequences of segments, simply by weighting which features are used for sequence generation.
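A first-order Markov chain of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pitch sequence is invented, each segment is reduced to a single pitch number, and the function names `train` and `generate` are hypothetical rather than part of the actual system.

```python
import random
from collections import defaultdict


def train(sequence):
    """Map each state to the list of states that followed it in the input.

    Followers are stored with repetitions, so a state that followed more
    often is proportionally more likely to be picked later.
    """
    model = defaultdict(list)
    for current, following in zip(sequence, sequence[1:]):
        model[current].append(following)
    return model


def generate(model, start, length, rng):
    """Walk the model from a start state, choosing followers at random."""
    state = start
    output = [state]
    for _ in range(length - 1):
        followers = model.get(state)
        if not followers:  # dead end: this state was never followed by anything
            break
        state = rng.choice(followers)
        output.append(state)
    return output


# Hypothetical pitch numbers, one per segment, in their original order.
pitches = [60, 62, 60, 64, 62, 60, 62, 64, 60]
model = train(pitches)
new_sequence = generate(model, start=60, length=8, rng=random.Random(1))
```

Every transition in `new_sequence` also occurs somewhere in the original, but the overall order is new. The states need not be single pitches. They could, for example, be tuples that also include duration or file and segment indices, which is one way the choice of features could shift the output between the original order and more fragmented sequences.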

Example of Markov model trained with transitions between pitches: the leftmost pitch numbers have been followed by the pitches on the same line, with repeated numbers being more likely.

A sequence of speech segments generated using a Markov model of likely transitions between pitches and durations. The result is phrase intonation structures that sound quite probable and familiar in length, rhythm and melody, while the semantic content is of course nonsense.

Interaction and Improvisation

Creating a model of the likelihood of transitions between segments also makes it possible to query the model with a live sound input. The system then finds the most likely continuation as a response to that input. Conceptually, making the system responsive in this way opens up the possibility of exploring the musical features of speech recordings directly with a piano or voice, based on improvisation and musical logic. In this sense the store of speech recordings becomes a repertoire of learned language that can be used in a kind of Dadaist speech recognition system, one that reacts to prosodic content instead of lexical meaning.
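Querying such a model with an incoming sound can be sketched as follows. The sketch assumes that the live input has already been reduced to a single pitch value by some analysis stage that is not shown. It matches that value to the nearest state the model knows and returns the follower seen most often. The names `train` and `respond` and the data are hypothetical.

```python
from collections import Counter, defaultdict


def train(sequence):
    """Count how often each state was followed by each other state."""
    model = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        model[current][following] += 1
    return model


def respond(model, heard_pitch):
    """Return the known state nearest to the input and its likeliest follower."""
    nearest_state = min(model, key=lambda state: abs(state - heard_pitch))
    follower, _count = model[nearest_state].most_common(1)[0]
    return nearest_state, follower


# Hypothetical pitch numbers standing in for the analyzed speech corpus.
corpus_pitches = [60, 62, 60, 64, 62, 60, 62, 64, 60]
model = train(corpus_pitches)

# A slightly sharp note from the piano or voice is matched to state 60,
# whose most frequent continuation in the corpus is 62.
state, answer = respond(model, heard_pitch=60.4)
```

Always returning the single most frequent follower makes the response deterministic. Sampling from the counts instead would keep some of the unpredictability described in the next section.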

The time perception in improvised interaction

In addition to truly integrating improvisation into my exploration of the musical traits of everyday speech, introducing the unknown response of an interactive and interpretive system somehow leads to a very different perception of time. I do not know if this holds primarily from the performer's point of view, but not knowing how the system will respond puts me in a very attentive state, ready to interpret and respond, thereby accentuating the present. This is true of an unscripted real-life dialogue as well, as opposed to listening to a monologue, a public speech or even the presentation of a documented dialogue, all of which create an expectation of a dramaturgical forward motion through time. This quality is perhaps a central aspect of my search into how the dialogical aspect of improvised music seems to relate to improvised speech.


Goodman, N. (1976). Languages of art: An approach to a theory of symbols. Indianapolis: Hackett.