Design and development

This chapter offers some reflections on issues relating to the design, development and performance of the Orchestra of Speech software instrument system and performance concept developed in this project, as well as a discussion of its possibilities and limitations.

The system works as a playable musical instrument, but can also be thought of as an instrument in a wider sense of the word, like a prism or microscope that can zoom in and show different musical structures that are part of speech. It is not meant to represent the kind of ground-breaking technical innovation associated with, say, a new cutting-edge synthesis technique or instrument control paradigm, but must rather be viewed as a case of how to put together – how to compose – a selection of techniques in order to realise a particular artistic vision. The innovation lies in the personal combination of ideas and system, and in the musical outcomes made possible by this particular combination.

Design issues

As described earlier in the technical overview, the system was essentially developed from scratch with few initial specifications given. The features and the ideas about the instrument changed repeatedly in response to testing and performances, making any functional requirements a moving target. Nevertheless, over time it stabilised into a set of features that seems to fulfil the most important artistic needs identified during development. These include functions for creating several different polyphonic layers from the same speech source, functions for indeterminacy and for generating alternative combinations and sequences of segments, and the ability to use live sound input to provoke responses and engage in interplay. Though developed into a fully functional and stable performance system, a first design like this will still have the character of a prototype. In future work, this system might very well be developed further, or perhaps a new system can be built from scratch based on the experiences gained in this work.

Interface design and control intimacy

Early in the development of the software instrument, issues regarding performance were often of a technical or practical character: how to integrate the different functions into a manageable whole, and how to control the instrument through an appropriate interface. I knew that I needed a very intuitive way of controlling this complex instrument I was developing in order to achieve any meaningful performance results. The problem of control is a common issue when designing so-called digital musical instruments (DMIs), and depends on what Moore first conceptualized as control intimacy (Moore, 1988). The conventional personal computer interfaces of screens, keyboards and pointer devices are far too narrow, slow to use and single-task oriented to be used for interacting with a complex digital musical system. Even when using external controllers with physical knobs and switches, I find that there can be an unsatisfactory spatial and cognitive split of attention between using the physical controllers for input and getting visual feedback from a screen. In addition, one of the most important concerns with regard to control intimacy is to develop the most appropriate mappings between user gestures and instrument response. This is central to achieving the kind of embodiment of the instrument that is critical for performing successfully with any instrument (Fels, 2004). For many parts of the system this was achieved by assembling a setup of various general-purpose MIDI controllers and then gradually developing the most appropriate mapping strategies through several rounds of trials and revisions.
But after experimenting with a lot of different ways to interact with the large collections of recorded speech segments, including using MIDI keyboards, tablet computers, touch screens etc., I found that to really be able to put the screen away and overcome the cognitive gap between hand and eye, I had to make some additional purpose-built physical controllers with just the right kind of layout and controls that I needed. When placing a combined setup of these controllers on the music stand of the piano, my experience was that I actually managed to integrate the act of performing with this new complex instrument into my existing musical relationship with keyboard instruments, and could seamlessly switch between playing the acoustic piano and the digital instrument even within the same musical gesture and the same musical line of thought. The additional way of interacting with the instrument through using actual musical utterances as sound input, either from voice or piano, made this cognitive integration even tighter. After an extended period of rehearsing, developing and internalizing a repertoire of musical possibilities, I was able to treat the system as an extension of my musicality.

This is not to say that this is a general musical instrument with an interface that will be intuitive and easy to use for any untrained person. Like any instrument it needs practice to master, and like most complex digital systems this instrument is based on certain preconceived notions about what kind of things will be interesting to do, and what kind of music will result from it, with aesthetic choices embedded in every step in the design and construction of the instrument. There is an obvious overlap between the way I already approach improvisation and think musically on the keyboard, and the kind of musical output that is possible to create with the digital instrument system. The idea was never to design a general-purpose musical instrument, but to realize a personal artistic vision, an extension of musical ideas that I was unable to realize on keyboard instruments alone.

Performance issues

In addition to these technical aspects of control and instrument design, one overall concern relating to this instrument has been how to integrate the new approaches to music making presented by this project into a performance practice honed over many years in the role of an improvising keyboard player. During early system testing I was essentially just playing back individual recordings of speech from beginning to end, trying as best I could to keep up and make some interesting musical arrangements or ‘translations’ on the fly. This felt insufficient. I was anxious that this whole approach to live “orchestration” of recordings was flawed, and that it was going to be more like superficial “remixing” than the creative process of musical exploration I associate with improvisation. One of the ways I addressed this was to bring the sound into the same sonic space as the piano. Using transducers attached to acoustic instruments as ‘acoustic’ loudspeakers, I found that the electronic computer instrument came closer to the sonic realm of physical acoustic instruments, and it became easier to relate to my role as a performer. During trials with transducers mounted on the sound board of the piano, I found that when the digital instrument sounded through the piano and was controlled while sitting at the piano, I could actually draw on my close relation with the piano and somehow transfer this embodied way of thinking music into performing with the new instrument system.

Example: piano dialogue study, performing simultaneously with the software system and piano, one hand on each, in a kind of dialogical exploration of the musical possibilities suggested by the analyses coming out of the software:


But the system was still quite slow and impractical to operate, and not intuitive enough to let me pursue new musical ideas appearing in the moment. For instance, to change musical character, I had to know exactly which recording to use. To speed up changes, I started using pre-composed lists of selected recordings that had the kind of character I wanted, but though this made navigation quicker it also narrowed the options considerably during performances. A turn came when I started to organise recordings into corpora, databases with whole collections of analysed and segmented recordings describing the character of every segment in the corpus. Then it became possible to sort and find recordings based on musical criteria like tempo and register, and much easier to navigate large collections of many different recordings. It was also much easier to juxtapose several recordings and make collages of similar segments on the fly, opening up other ways of organizing recordings based on association rather than narrative. However, the most significant change with this corpus approach was the added ability to make statistical models based on the descriptors of the segments in the corpus. Such a model describes the likelihood of any transition in the corpus, and can be used to create alternative sequences and patterns that are statistically just as likely as the original sequences, and therefore share the same overall characteristics.

Initially I had reservations about fragmenting the speech sounds too much, as the particular timing of how events combine into phrases is one of the most important characteristics of the speech gestures that I was trying to hold on to and explore. I had after all wanted to investigate the musical structures implied by these original speech gestures, and not just to dress up any old habitual ideas with fragmented speech sounds. However, with such statistically probable alternative sequences, the overall characteristics of the original timing and intonation patterns are actually preserved, even when the gesture itself is rearranged and made up of many shorter fragments. This is because no transition between any pitch, duration or other feature is used that does not already exist in the original sequences, and how often each transition occurs is also determined by its statistical frequency in the original material. This way, typical patterns of successions of long and short syllables, of high and low pitches, phrase pauses etc., are actually carried over when new sequences are generated. If an alternative organisation less typical of speech is imposed, such as cycles of repeated segments or collages of similar sounds, then a gradual transition between such music-associated organisation and plain speech organisation is still possible. This way such formal distinctions can be dealt with musically and reflected upon in the music itself.
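
The property described above, that generated sequences only reuse transitions found in the original material and in the same proportions, can be illustrated with a minimal sketch. It assumes a first-order Markov chain over segments that have been reduced to coarse pitch and duration classes; the corpus data, the class labels and the function names are all hypothetical and are not taken from the actual system.

```python
import random
from collections import Counter, defaultdict

# Hypothetical corpus: each speech segment reduced to a
# (pitch class, duration class) pair, in original order.
segments = [
    ("low", "short"), ("high", "long"), ("low", "short"),
    ("mid", "short"), ("high", "long"), ("low", "long"),
    ("low", "short"), ("high", "long"), ("mid", "short"),
]

def build_transitions(sequence):
    """Count how often each state is followed by each other state."""
    transitions = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current][following] += 1
    return transitions

def generate(transitions, start, length, rng):
    """Random walk through the model, weighted by the observed counts."""
    result = [start]
    while len(result) < length:
        options = transitions.get(result[-1])
        if not options:  # dead end: state has no observed continuation
            break
        states = list(options)
        weights = [options[s] for s in states]
        result.append(rng.choices(states, weights=weights, k=1)[0])
    return result

model = build_transitions(segments)
new_sequence = generate(model, segments[0], 12, random.Random(1))

# Every step in the new sequence is a transition that occurs in the original.
observed = set(zip(segments, segments[1:]))
assert all(pair in observed for pair in zip(new_sequence, new_sequence[1:]))
```

Because the walk is weighted by the observed counts, frequent successions in the source remain frequent in the output, while successions that never occurred cannot appear at all.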

Another important feature made possible by using analysed corpora and statistical models is the ability to trigger segments with live sound input, using speech, song or even a musical instrument as a kind of acoustic “controller”. Live sound input can be used to query the statistical models, returning the most likely continuation based on the sequences already in the corpus. In effect, this creates a kind of rudimentary, dadaistic “speech recognition” system, listening and producing musically probable (while of course semantically nonsensical) utterances in response to live sound input. This way, the active interpretation of and reaction to unexpected responses could also be placed right in the centre of the musical focus on improvisation.
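
As a rough illustration of how such a query might work, the sketch below matches an analysed live input to the nearest segment class in a corpus and returns the continuation that most often followed it. The descriptor values, the distance scaling and all names are assumptions made for the example; the actual analysis and matching in the system may well differ.

```python
from collections import Counter, defaultdict

# Hypothetical segment classes with representative descriptors:
# (pitch as MIDI note number, duration in seconds).
descriptors = {
    "a": (50.0, 0.12),
    "b": (62.0, 0.30),
    "c": (55.0, 0.10),
    "d": (48.0, 0.45),
}

# Hypothetical order in which these classes occur in the corpus.
sequence = ["a", "b", "c", "a", "b", "d", "c", "b", "a"]

# Count the observed transitions between classes.
transitions = defaultdict(Counter)
for current, following in zip(sequence, sequence[1:]):
    transitions[current][following] += 1

def nearest_state(pitch, duration):
    """Find the class whose descriptors are closest to the input.

    Pitch is scaled by an octave and duration by half a second so that
    both contribute comparably; this scaling is an arbitrary assumption.
    """
    def distance(state):
        p, d = descriptors[state]
        return ((p - pitch) / 12.0) ** 2 + ((d - duration) / 0.5) ** 2
    return min(descriptors, key=distance)

def respond(pitch, duration):
    """Return (matched class, most likely continuation or None)."""
    state = nearest_state(pitch, duration)
    options = transitions.get(state)
    if not options:
        return state, None
    return state, options.most_common(1)[0][0]

# A live sound analysed as roughly MIDI pitch 51 lasting 0.15 seconds.
matched, continuation = respond(51.0, 0.15)
```

Here the input is closest to class "a", which in this toy corpus is always followed by "b", so "b" is the segment that would be triggered in response.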

Example: Improvised interaction with the Markov model:

With these features, the system became much more responsive and intuitive to relate to as a performance instrument, reducing the conceptual gap between controlling a computer program and interacting in the ‘here and now’ world of acoustic instruments. This meant that I could integrate both software instrument and piano into the same performance setup, and use them as extensions of each other.


Fels, S. (2004). Designing for Intimacy: Creating New Interfaces for Musical Expression. Proceedings of the IEEE, 92(4), 672–685.

Moore, F. R. (1988). The dysfunctions of MIDI. Computer Music Journal, 12(1).
