Voice Quality

In addition to intonation (pitch), intensity (amplitude) and speech rate (rhythm, tempo), voice quality (also referred to as phonation) is generally considered part of prosodic expression. Somewhat simplified, it can be described as a continuum from whispering through normal to pressed voice, and it is an important cue in speech for interpreting the overall message of an utterance.
After implementing ways of expressing the first three prosodic elements musically, I found that I also wanted some expression of voice quality in order to properly discern the different prosodic characters.

The phenomenon seems to be well understood through the source-filter model (Fant, 1960), which describes how a tense voice results in a fundamental waveform with what in a sound engineer’s terms may be described as a narrow phase width, while a relaxed voice produces waveforms with a full phase width, approaching a sine waveform. The software Cantor Digitalis uses this model to synthesize a natural-sounding voice that can utilize the whole spectrum from relaxed to pressed voice. Nevertheless, it seems very hard to analyze this reliably, since a pressed voice in effect creates a new formant in the source waveform that cannot effectively be distinguished from the lower filter formants, which in turn define the vowels.
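
As a rough illustration of how the width of the source pulse relates to the spectrum, here is a minimal Python sketch. It is a toy example only, not Fant's model and not the glottal model used in Cantor Digitalis. The raised-cosine pulse shape, the function names and the open_fraction parameter are all assumptions made for the sake of illustration.

```python
import numpy as np


def toy_source_period(open_fraction, samples_per_period=1024):
    """One period of a toy source waveform (hypothetical, for illustration).

    A raised-cosine pulse fills `open_fraction` of the period and the rest
    is silent. With open_fraction=1.0 the period is a sine plus an offset.
    """
    n_open = max(2, int(round(open_fraction * samples_per_period)))
    period = np.zeros(samples_per_period)
    t = np.arange(n_open) / n_open
    period[:n_open] = 0.5 * (1.0 - np.cos(2.0 * np.pi * t))
    return period


def relative_harmonic_levels_db(period, n_harmonics=4):
    """Levels of harmonics 1..n in dB, relative to the first harmonic."""
    # For exactly one period, FFT bin k corresponds to harmonic k.
    spectrum = np.abs(np.fft.rfft(period))
    harmonics = spectrum[1:n_harmonics + 1]
    # Clamp very small values to avoid log10(0) for the pure sine case.
    return 20.0 * np.log10(np.maximum(harmonics, 1e-12) / harmonics[0])


# Compare a full-width (relaxed-like) and a narrow (tense-like) pulse.
for fraction in (1.0, 0.6, 0.3):
    levels = relative_harmonic_levels_db(toy_source_period(fraction))
    print(fraction, np.round(levels, 1))
```

In this sketch the full-width pulse has practically no energy above the first harmonic, while the narrower pulses keep harmonics 2-4 much closer in level to the first, which is the flatter spectrum associated with a tense voice.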

There are some advanced phase-based techniques, which have also been implemented in IRCAM’s AudioSculpt (Degottex & Obin, 2014), but these turned out to be far too complicated for a musician like me to recreate. Instead, I have implemented a simpler analysis based on scripts available for the Praat software. This analysis measures the spectral slope across the two lowest harmonics by simply subtracting the second harmonic’s amplitude from the first. A relaxed voice tends to give a drop of about 12-20 dB, while a pressed voice gives a flatter spectrum, with values around 0 dB or even negative. I modified this technique to include the four lowest harmonics, subtracting the mean of harmonics 2-4 from the first harmonic, in order to make the analysis less prone to fluctuations between different speakers, vowels and registers.
This turned out to be a fairly lightweight, fast and not too error-prone method.
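
The measure described above can be sketched in a few lines of Python. This is not the Praat script I use, only a minimal illustration under some assumptions: the fundamental frequency f0 is already known for the frame, each harmonic is found by peak picking within 20 % of f0 around its expected frequency, and the mean of harmonics 2-4 is taken over their dB values. All function names are hypothetical.

```python
import numpy as np


def harmonic_levels_db(frame, sample_rate, f0, n_harmonics=4):
    """Peak level in dB near each of the first n harmonics of f0.

    Assumes the frame is long enough that every search band contains
    at least one FFT bin.
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    levels = []
    for k in range(1, n_harmonics + 1):
        # Search band of +/- 0.2 * f0 around the expected harmonic (assumption).
        band = (freqs >= (k - 0.2) * f0) & (freqs <= (k + 0.2) * f0)
        peak = spectrum[band].max()
        levels.append(20.0 * np.log10(max(peak, 1e-12)))
    return np.array(levels)


def h1_minus_h2(levels_db):
    """Two-harmonic measure: first harmonic minus second, in dB."""
    return levels_db[0] - levels_db[1]


def h1_minus_mean_h2_to_h4(levels_db):
    """Four-harmonic variant: first harmonic minus the mean of 2-4, in dB."""
    return levels_db[0] - np.mean(levels_db[1:4])


# Usage with a synthetic test tone standing in for a voiced speech frame.
sample_rate, frame_length, f0 = 16000, 2048, 125.0
t = np.arange(frame_length) / sample_rate
amplitudes = [1.0, 0.1, 0.1, 0.1]  # harmonics 2-4 at 20 dB below the first
frame = sum(a * np.sin(2.0 * np.pi * k * f0 * t)
            for k, a in zip(range(1, 5), amplitudes))
levels = harmonic_levels_db(frame, sample_rate, f0)
print(h1_minus_h2(levels), h1_minus_mean_h2_to_h4(levels))
```

Large positive values then point towards a relaxed voice, and values near zero or below towards a pressed one. On real speech the result depends on a reliable f0 estimate, which this sketch takes as given.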

I was then faced with the challenge: how can this measure of voice quality be expressed musically in a meaningful way, using my existing set of instruments and software?
As for the mechanical piano, there is no way at all to alter the sound quality other than as part of the soft-loud continuum. In the current synthesis software I am using either additive synthesis or noise-based subtractive synthesis, neither of which can easily be adapted to source-style waveform manipulation.
I did some tests using a simple overdrive, but this does not convey the same harshness that a pressed voice does, and so becomes something else. This is something I need to look into in the future, possibly by reconsidering my synthesis techniques altogether.


Degottex, G., & Obin, N. (2014). Phase distortion statistics as a representation of the glottal source: Application to the classification of voice qualities. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 1633–1637).

Fant, G. (1960). Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations. The Hague, Netherlands: Mouton.