When I started to look at musical traits of speech and how I could go about measuring and analyzing them, I soon realized I needed some kind of fundamental unit from which I could extract basic information like pitch and speech rate. Even though linguists subdivide speech into phonemes, the smallest unit that makes sense musically is the syllable, which is more akin to the concept of a note in music. That syllables make deep sense as a fundamental unit of speech can also be seen from the fact that a number of written languages have signs for syllables rather than phonemes.
Syllables are usually centered around a vowel, so in order to segment speech into syllables I needed the ability to detect vowels in a speech stream. Existing speech analysis software like Praat can already detect syllables through offline analysis, but since I needed to process speech and generate sound in real time, I had to make something myself. This brought me straight into some heavy technical considerations (for a musician at least), which were nevertheless interesting to figure out.
I set out to program a real time syllable detector by looking at already existing techniques, using the IRCAM-developed FTM and MuBu libraries for the Max/MSP programming environment. The task is not as trivial as it may seem, as the standard envelope following and onset detection used for acoustic instruments are not very reliable when it comes to speech – plosives and fricatives can be quite loud and carry a lot of acoustic energy, but we do not perceive them as fundamental musical units the way we do vowels. Syllable detection is, however, called for in many tasks like automatic speech-to-text transcription, and several techniques have been proposed (Mermelstein, 1975; Howitt, 2000; Prasanna, Reddy, & Krishnamoorthy, 2009). Common to the different approaches is a focus on the mid frequency region where the vowels’ strongest formants are present (roughly 300-3000 Hz). In addition, unvoiced phonemes need to be discarded (typically s, f, p, t, k), and one way to do that is to segment into voiced and unvoiced regions by measuring periodicity. At the same time, voiced consonants (z, v, g, b, d, m, n, l) also need to be filtered out, which can be a bit more difficult, but by discarding lower frequencies these too can be somewhat suppressed.
I tried to model several existing techniques to compare their performance and to understand how they worked. Many different sources of speech were used for testing, but for the sake of comparison I will use the same clip in the illustrations: a female speaker saying the sentence “and at last the north wind gave up the attempt” in English.
I started by looking at an unaltered amplitude envelope, obtained by averaging the logarithmic amplitude over 150 samples (≈ 3.4 ms at a 44.1 kHz sample rate) and smoothing with a low pass filter with a cutoff frequency of 5 Hz.
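As a rough offline sketch of that first step (the synthetic input tone stands in for the recorded sentence, and the one-pole smoothing filter is my choice; the original ran as a real-time Max/MSP patch):

```python
import numpy as np
from scipy.signal import butter, lfilter

SR = 44100
FRAME = 150                                   # 150 samples ≈ 3.4 ms at 44.1 kHz

# Synthetic stand-in for the recorded sentence used in the illustrations.
t = np.arange(SR) / SR
x = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)

# Average the logarithmic amplitude over 150-sample frames.
n_frames = len(x) // FRAME
frames = x[:n_frames * FRAME].reshape(n_frames, FRAME)
env = (20 * np.log10(np.abs(frames) + 1e-10)).mean(axis=1)   # dB, floored

# Smooth with a low pass filter with a 5 Hz cutoff, at the envelope frame rate.
frame_rate = SR / FRAME                       # ≈ 294 envelope values per second
b, a = butter(1, 5 / (frame_rate / 2))
env_smooth = lfilter(b, a, env)
```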
The last ’t’ clearly present in the envelope is one of the elements we want to suppress, as well as the ’s’ in ‘last’ after the second syllable and ‘th’ in ‘north’ after the third.
I then applied a simple 5th order bandpass filter at 300-900 Hz to the source to only listen to the frequency region of the first formant, and used a simple differentiator to mark peaks in the resulting envelope.
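A 5th order bandpass like this is straightforward to sketch with a standard Butterworth design (the test tones here are mine; the original filtered the speech source):

```python
import numpy as np
from scipy.signal import butter, sosfilt

SR = 44100
# 5th-order Butterworth bandpass over the first-formant region, 300-900 Hz.
sos = butter(5, [300, 900], btype='bandpass', fs=SR, output='sos')

t = np.arange(SR) / SR
in_band = sosfilt(sos, np.sin(2 * np.pi * 600 * t))    # passes nearly unchanged
out_band = sosfilt(sos, np.sin(2 * np.pi * 100 * t))   # strongly attenuated

rms = lambda s: np.sqrt(np.mean(s ** 2))
```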
The peaks are neat, and though some of the unvoiced phonemes are reduced somewhat, there is still a detected peak for nearly every phoneme, including some loud ones (the ‘n’ of the word ‘wind’ in the 9th peak, for instance). That led me to add some more features to the simple peak picking differentiator (which just measures the direction of the envelope slope and marks a peak when it transitions from increasing to decreasing). In addition to a level threshold, I added a hold phase of 50 ms to avoid double triggers caused by level fluctuations within a vowel, and a requirement that the envelope dip at least 3 dB below the last peak before the next peak is looked for. That did help filter out some of the soft unvoiced sounds, but the loud ones remained.
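The augmented peak picker can be sketched like this; the 50 ms hold and 3 dB dip are from the text, while the threshold value and frame rate are illustrative:

```python
import numpy as np

def pick_peaks(env, frame_rate, threshold=-40.0, hold_ms=50.0, dip_db=3.0):
    """Slope-direction peak picking with three guards: a level threshold,
    a hold time after each peak, and a required dip below the last peak
    before a new one may be detected."""
    hold = int(hold_ms * frame_rate / 1000)
    peaks, last_level, last_pos, armed = [], -np.inf, -10**9, True
    for i in range(1, len(env) - 1):
        if not armed and env[i] < last_level - dip_db:
            armed = True                      # envelope dipped 3 dB: re-arm
        if (armed and env[i] >= threshold and i - last_pos >= hold
                and env[i] > env[i - 1] and env[i] >= env[i + 1]):
            peaks.append(i)
            last_level, last_pos, armed = env[i], i, False
    return peaks

# Two clear bumps in a -60 dB floor should give exactly two peaks.
env = np.full(30, -60.0)
env[[4, 5, 6]] = [-30.0, -20.0, -30.0]
env[[19, 20, 21]] = [-30.0, -18.0, -30.0]
detected = pick_peaks(env, frame_rate=100.0)
```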
I then turned to the frequency domain to have a better look at the sonic content of the different phonemes. In this context it makes most sense to look at the sound in a way that resembles how the ear perceives it, which is not how computers represent it. Digitally, sound is usually represented in the frequency domain on a linear scale via the short time Fourier transform. But the ear senses loudness over a range of frequency bands that are not spaced linearly, and to model this the spectrum is usually expressed on a quasi-logarithmic scale like the Bark or mel scale instead of the linear hertz scale. For speech analysis the mel scale is convenient, since the important formant region is so well represented.
As we can see, the bottom bands are quite heavy with the fundamental and strong lower partials, the middle bands vary distinctly with the different vowels and voiced consonants, and some high frequency sounds like ’s’, ’t’ and ‘th’ can also clearly be seen at the top. To zoom in on the vowels, we can of course get rid of the top bands altogether to avoid the high frequencies of ’s’ and ’t’. And since voiced consonants like ’l’ (between the first two syllables) and ‘n’ (at the end of ‘wind’ in the middle of the sentence) have all their energy concentrated in the bottom bands, we can also attenuate the lower bands completely.
The resulting envelope from averaging mel bands 5-16 (ca. 500-3200 Hz) is clearly more peaked than the one obtained by simply bandpass filtering the 300-900 Hz region:
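An offline sketch of such a mel-band envelope follows; note that the band count and filterbank layout here are my assumptions, not the exact MuBu filterbank, so the band indices only approximate the 5-16 / 500-3200 Hz range in the text:

```python
import numpy as np

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

SR, N_FFT, N_BANDS = 44100, 1024, 24          # band count is assumed

# Band edges spaced evenly on the mel scale up to Nyquist; each band b
# spans edges[b]..edges[b+2], so neighbouring bands overlap.
edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(SR / 2), N_BANDS + 2))
bin_freqs = np.fft.rfftfreq(N_FFT, 1.0 / SR)

def mel_band_energies(frame):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    bands = np.zeros(N_BANDS)
    for b in range(N_BANDS):
        mask = (bin_freqs >= edges_hz[b]) & (bin_freqs < edges_hz[b + 2])
        if mask.any():
            bands[b] = spec[mask].mean()
    return bands

# Envelope value for one frame: average level of the mid mel bands.
t = np.arange(N_FFT) / SR
tone = np.sin(2 * np.pi * 1000 * t)           # a 1 kHz tone lands mid-range
env_value = 10 * np.log10(mel_band_energies(tone)[5:17].mean() + 1e-12)
```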
Peaks are still detected for some unvoiced sounds, though. When tested with recordings of different speakers and languages, triggering blips on every detected peak, this solution actually sounded quite good, but too many false peaks from unvoiced sounds were detected. This led me on to the measure of periodicity as a way to mask out the unvoiced sounds.
One widely used type of pitch detection algorithm, like the YIN algorithm (de Cheveigné & Kawahara, 2002) available as modules in Max/MSP and the FTM/Gabor library, works by detecting periodic repetitions in a stream of sound in order to estimate a probable fundamental frequency. To detect a repetition, this operation requires a buffer of at least twice the period of the fundamental. That means that detecting pitches down to, say, 50 Hz introduces a delay of at least 40 ms (as one period of 50 Hz is 20 ms long). When designing a syllable detector for real time use, one aim is to keep the latency at a minimum, but depending on how much better the detector performs, this still might be worth it.
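The core of a YIN-style periodicity measure is a cumulative-mean-normalized difference function. A minimal sketch (not the actual Max/MSP modules, and skipping YIN's threshold and interpolation refinements) might look like this:

```python
import numpy as np

def periodicity(frame, sr, f_min=50.0):
    """Return (periodicity in 0..1, rough f0 estimate). Needs a buffer of
    at least two periods of f_min: detecting down to 50 Hz (20 ms period)
    therefore costs at least 40 ms of latency."""
    max_lag = int(sr / f_min)                  # longest lag to examine
    w = len(frame) - max_lag                   # comparison window length
    lags = np.arange(1, max_lag)
    d = np.array([np.sum((frame[:w] - frame[tau:tau + w]) ** 2) for tau in lags])
    d_norm = d * lags / np.cumsum(d)           # cumulative-mean normalization
    best = int(np.argmin(d_norm))
    return 1.0 - d_norm[best], sr / lags[best]

sr = 8000
t = np.arange(400) / sr
p_sine, f0 = periodicity(np.sin(2 * np.pi * 200 * t), sr)
rng = np.random.default_rng(0)
p_noise, _ = periodicity(rng.standard_normal(400), sr)
```

A steady 200 Hz sine gives a periodicity near 1, while white noise scores much lower; the f0 estimate may land on a subharmonic in this stripped-down version.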
I started out by using the periodicity measure simply as a multiplier for the mid mel band level envelope used above, along with a pre-emphasis (high pass) filter with a coefficient of 0.97, commonly used in speech processing to flatten the spectrum so the formants are represented as roughly equally loud. The quick changes between different frequency regions in speech result in a very noisy periodicity measure, so this did not go without errors.
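The pre-emphasis filter itself is essentially a one-liner; a minimal sketch:

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """y[n] = x[n] - 0.97 * x[n-1]: a first-order high pass that flattens
    the natural downward spectral tilt of voiced speech, so that higher
    formants come out roughly as loud as the first."""
    y = x.astype(float).copy()
    y[1:] -= coeff * x[:-1]
    return y

flattened = pre_emphasis(np.ones(100))        # DC is almost entirely removed
```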
It reduced some unwanted sounds, and in the envelope below we can see that although there are still small peaks for some of the unvoiced sounds, only vowels are now detected as peaks. Some drawn-together syllables, like the first one (‘and at’, pronounced like the single syllable ’n’at’), are now detected as one peak, but that might be the price to pay for fewer false hits on unvoiced sounds.
I then based my further approach on an even more refined use of the periodicity measure, described by Nicolas Obin in his “Syll-O-Matic” (Obin, Lamare, & Roebel, 2013). The algorithm is designed for offline analysis and includes several more stages, but at least one part is appropriate for real time use: the way it uses voiced regions to mask out the unvoiced sounds, with a multi-band voiced/unvoiced mask where each mel band is determined voiced or unvoiced before the bands are summed. To achieve this I used an ‘analysis by synthesis’ approach, resynthesizing the harmonics of the detected fundamental and thus recreating a spectrum containing only the voiced parts. This spectrum is then compared to the original spectrum to determine how much of each band’s energy comes from the voiced harmonics, before the bands are summed and smoothed into an envelope.
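The masking and summing step can be sketched as follows; the 0.5 voiced-energy ratio is my assumption, and the harmonic band energies are taken as given from the resynthesis stage:

```python
import numpy as np

def voiced_masked_envelope(band_energies, harmonic_band_energies, ratio=0.5):
    """Per-band voiced/unvoiced mask in the spirit of Syll-O-Matic: a band
    counts as voiced when the resynthesized harmonics account for at least
    `ratio` of its energy; only voiced bands are summed into the envelope."""
    voiced = harmonic_band_energies >= ratio * band_energies
    return float(np.sum(band_energies * voiced)), voiced

# Three equally loud bands: two dominated by harmonics, one by noise.
level, mask = voiced_masked_envelope(np.array([1.0, 1.0, 1.0]),
                                     np.array([0.9, 0.6, 0.1]))
```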
When using the whole voiced spectrum like this, it becomes important to take psychoacoustics into account to get an envelope that resembles our perception of loudness, which – like our perception of frequency – is not linear. In the implementation above the specific loudness is calculated as Obin proposes, using an exponent of 0.23, but this is only an approximation, as the exponent has been shown to differ across the cochlear bands of the ear (Fletcher, 1933; Robinson & Dadson, 1956; Zwicker & Fastl, 1999).
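As a worked example of what that 0.23 exponent does (the band energies are made up):

```python
import numpy as np

# Specific loudness per band, N_b = E_b ** 0.23 (Obin's approximation).
band_energy = np.array([1e-4, 1e-2, 1.0])     # energies spanning 40 dB
specific_loudness = band_energy ** 0.23
# The 40 dB energy range is compressed to less than a decade of loudness,
# mimicking the compressive response of the ear.
ratio = specific_loudness[2] / specific_loudness[0]
```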
To model perceived loudness in more detail, I made a multi-band scaling filter where the level of each mel band is scaled according to the hearing thresholds for each critical band as described in the updated ISO 226:2003 standard. The standard details the so-called equal-loudness contours, which reflect how loud the different frequencies actually have to be in order for us to perceive them as equally loud. With this, the lower and upper bands are attenuated in roughly the same way the ear does it.
In the comparison below, we can clearly see how the masked envelope (bottom) finally lacks the unvoiced sounds altogether, and the peaks (represented by the grey bars at the bottom) only appear at vowels.
This solution performed really well when tested with a range of speech recordings, but it can sometimes fail at very low, creaky pitches, typically at the ends of sentences, where the pitch detector cannot estimate a fundamental properly. Another weakness was stuttering or dropouts caused by the fast-changing sounds of speech, which also cause sporadic errors in the pitch detection algorithm feeding the resynthesis stage.
Though I was pleased with the performance of this syllable detector, I also developed a solution that I have not seen described anywhere, but which at first looked promising and simpler to implement. In speech-to-text analysis, a process called the mel frequency cepstrum is used extensively for analyzing and recognizing different phonemes. In this process the non-linear but perceptually more correct mel frequency spectrum described above is transformed into a spectrum of a spectrum – a kind of imaginary domain humorously dubbed the ‘cepstrum’ by anagram (Bogert, Healy, & Tukey, 1963). In short, the cepstrum describes the shape of the spectrum, and when looking at the lower cepstrum bands (mel frequency cepstrum coefficients, or mfcc’s for short), there is a clear correlation between voiced sounds and the second coefficient. So by simply having the second mfcc (with an appropriate threshold) act as a gate for the loudness envelope of the formant region, we could have a nice, sturdy syllable detector.
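A sketch of that gate follows. The DCT-based mfcc computation is standard, but both threshold values here are placeholders – as noted below, they needed per-recording tuning:

```python
import numpy as np
from scipy.fft import dct

def mfcc(log_mel_bands, n_coeffs=13):
    # MFCCs are the DCT-II of the log mel band energies; coefficient 1
    # (the 'second mfcc') captures the overall spectral tilt.
    return dct(np.asarray(log_mel_bands, float), type=2, norm='ortho')[:n_coeffs]

def gated_level(level_db, log_mel_bands, level_thresh=-40.0, mfcc_thresh=0.0):
    """Pass the formant-region level through only when both the level and
    the second mfcc exceed their thresholds; otherwise gate it to silence."""
    is_open = level_db > level_thresh and mfcc(log_mel_bands)[1] > mfcc_thresh
    return level_db if is_open else -np.inf

voiced_like = np.linspace(0.0, -60.0, 24)     # low-frequency-heavy: tilt down
noise_like = np.linspace(-60.0, 0.0, 24)      # high-frequency-heavy: tilt up
```

A downward spectral tilt (energy concentrated in the low bands, as in voiced sounds) gives a positive second coefficient; an upward tilt gives a negative one, and the gate closes.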
The loud (red) regions of the second mfcc clearly correlate with the presence of voiced speech sounds. This solution also worked particularly well for the problematic creaky low last syllables mentioned above, and was faster and less prone to fluctuations than the periodicity (pitch detection) measure. In preliminary testing it performed well, and about the only weakness seemed to be some false detections of loud plosives. But since this solution is based on the shape of the spectrum, different recording environments affecting the spectrum (and different resolutions, like the narrow spectrum of telephone transmissions) also influenced its performance. I also had to set two thresholds – one for the general level and another for the mfcc – so while it could work well for one recording, I had to fine-tune the settings for the next, and that turned out to be too cumbersome for real time performance.
Finally, I turned to a very interesting approach based on measuring glottal activity as an indicator of vowel presence (Yegnanarayana & Gangashetty, 2011). This elegant solution bypasses the short time Fourier transform altogether and instead looks at a characteristic feature of our voice: glottal pulses. These are like any other short impulses in that they contain all frequencies, including the zero frequency. By making a zero-frequency resonator and looking at the small fluctuations around 0 Hz, it is possible to filter out almost all other sound and keep only the glottal pulse train, i.e. the voiced parts of speech. I implemented this technique to detect glottal activity, and used that measure to gate the level of a formant-region bandpassed (500-2500 Hz) loudness envelope. For real time use, this proved more stable and robust than any of the other approaches, and paired with a peak picking process detecting onsets and endpoints rather than only peaks of syllable nuclei, this is the solution on which I have based my further development of real time speech analysis tools.
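A minimal offline sketch of that zero-frequency filtering idea is given below. The window length and the number of trend-removal passes are my choices; the original paper derives the window from the average pitch period:

```python
import numpy as np

def zero_frequency_filter(s, sr, win_ms=10.0):
    """Pass the differenced signal through two cascaded resonators with
    poles at 0 Hz (each is a double integrator), then remove the resulting
    polynomial trend by twice subtracting a local mean. What survives is a
    near-sinusoidal oscillation locked to the glottal pulses."""
    x = np.diff(s, prepend=s[0])               # difference removes any DC offset
    y = x.astype(float)
    for _ in range(2):                         # two 0 Hz resonators in cascade
        out = np.zeros_like(y)
        for n in range(len(y)):
            out[n] = y[n] + 2.0 * out[n - 1] - out[n - 2]
        y = out
    w = int(win_ms * sr / 1000) | 1            # odd-length mean window (assumed)
    kernel = np.ones(w) / w
    for _ in range(2):                         # trend removal, applied twice
        y = y - np.convolve(y, kernel, mode='same')
    return y

# A 100 Hz glottal-pulse stand-in: one impulse every 10 ms at 8 kHz.
sr = 8000
s = np.zeros(sr)
s[::sr // 100] = 1.0
zff = zero_frequency_filter(s, sr)
```

For the pulse train, the filtered output oscillates at roughly the pulse rate, and its short-time strength can then gate the formant-region loudness envelope as described above.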
From top to bottom: original audio; glottal pulses; gated loudness envelope; detected syllables.
Bogert, B. P., Healy, M. J. R., & Tukey, J. W. (1963). The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In Proceedings of the symposium on time series analysis (Vol. 15, pp. 209–243).
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Fletcher, H. (1933). Loudness, Its Definition, Measurement and Calculation. The Journal of the Acoustical Society of America, 5(2), 82.
Howitt, A. W. (2000). Automatic Syllable Detection for Vowel Landmarks. (Doctoral thesis). Massachusetts Institute of Technology, Cambridge, Mass.
Mermelstein, P. (1975). Automatic segmentation of speech into syllabic units. The Journal of the Acoustical Society of America, 58(4), 880–883.
Obin, N., Lamare, F., & Roebel, A. (2013). Syll-O-Matic: an Adaptive Time-Frequency Representation for the Automatic Segmentation of Speech into Syllables. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 6699–6703). Vancouver, Canada. https://doi.org/10.1109/ICASSP.2013.6638958
Prasanna, S. R. M., Reddy, B. V. S., & Krishnamoorthy, P. (2009). Vowel Onset Point Detection Using Source, Spectral Peaks, and Modulation Spectrum Energies. IEEE Transactions on Audio, Speech, and Language Processing, 17(4), 556–565.
Robinson, D. W., & Dadson, R. S. (1956). A re-determination of the equal-loudness relations for pure tones. British Journal of Applied Physics, 7(5), 166–181.
Yegnanarayana, B., & Gangashetty, S. V. (2011). Epoch-based analysis of speech signals. Sadhana, 36(5), 651–697.
Zwicker, E., & Fastl, H. (1999). Loudness. In Psychoacoustics (pp. 203–238). https://doi.org/10.1007/978-3-662-09562-1_8