Even though consonants are not treated as part of the prosodic expression, they can still contribute to the perceived articulation and pointedness of speech. For that reason, and also to have a natural percussive counterpoint to the rhythms extracted from syllables and stress, I wanted to be able to analyse the rhythms of consonants. The first thing I did was simply to listen to the inversion of my zero-frequency-based syllable detector, which presumably gives me all unvoiced sounds. But when listening to this I realized that I was not looking for every consonant and unvoiced sound, just the percussive ones that contribute to the rhythm, like plosives (p, t, k) and fricatives (f, s, sh, etc.). Since some phase vocoders available today have really good transient detection, I also tried listening to a vocoder’s transient output, but it seemed to react to any short spike, including the glottal stops usually found at vowel onsets and many other tiny transients, among them many voiced bursts. So instead I turned to a set of spectral descriptors known as moments. These are standard mathematical measures describing the shape of a distribution, and can be applied to each time frame of an audio spectrum analysis to get a running measure of its spectral centroid, standard deviation, skewness (spectral slope or tilt) and kurtosis (pointedness vs flatness). These have been shown to give good cues for fricatives and plosives (Forrest, Weismer, Milenkovic, & Dougall, 1988), with performance comparable to the cepstral analysis frequently used in automatic speech recognition (Bunnell, Polikoff, & McNicholas, 2004).
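To make the four moments concrete, here is a minimal sketch of how they could be computed for a single analysis frame. It is not the code I used: the function name `spectral_moments` and its arguments are made up for illustration, and it assumes the frame is already available as a list of bin frequencies (in Hz) and matching magnitudes. The kurtosis here is the plain normalised fourth moment, so its values are not necessarily on the same scale as the output of any particular analysis tool.

```python
import math

def spectral_moments(freqs, mags):
    """Return (centroid, deviation, skewness, kurtosis) for one spectral frame.

    freqs: bin centre frequencies in Hz (illustrative input format)
    mags:  non-negative magnitudes, one per bin
    """
    total = sum(mags)
    if total <= 0.0:
        # A completely silent frame has no meaningful spectral shape.
        return 0.0, 0.0, 0.0, 0.0

    # Normalise so the spectrum can be treated as a distribution over frequency.
    weights = [m / total for m in mags]

    # First moment: the centre of gravity of the spectrum.
    centroid = sum(f * w for f, w in zip(freqs, weights))

    # Second central moment: variance, reported as standard deviation.
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, weights))
    deviation = math.sqrt(variance)
    if deviation == 0.0:
        # All energy in one bin: the higher moments are undefined.
        return centroid, 0.0, 0.0, 0.0

    # Third and fourth central moments, normalised by the deviation.
    skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs, weights)) / deviation ** 3
    kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs, weights)) / deviation ** 4
    return centroid, deviation, skewness, kurtosis
```

Calling this once per frame of a short-time spectrum gives the four running curves described above.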
After looking at analyses of speech using these descriptors, the centroid looked like an obvious choice for fricatives as well as t’s and strongly articulated plosives, while the kurtosis measure seemed to give good cues to stops like p, g, and k. The second moment, deviation, turned out very similar to the centroid, and the third moment, skewness, turned out very much like the kurtosis, just less dynamic. As a possible cheap alternative to the centroid I compared it to the simple measure of zero crossings (how many times the audio signal crosses zero per time unit), and though it was quite similar it lacked some of the fast t’s, so I settled on using centroid combined with kurtosis. By finding a suitable threshold (about 800 Hz for centroid and 400 for kurtosis) I could have these two measures act as a gate for the original speech signal, resulting in a gated signal containing only stops and fricatives. This was tested with a range of recordings of different speakers and seems to be working quite well.
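As a rough illustration of the two ideas in this paragraph, the sketch below shows a zero-crossing count and a frame-wise gate. The function names and the frame-based data layout are hypothetical. It also assumes that the gate opens when either measure rises above its threshold, since the centroid and the kurtosis pick up different consonants; the text above only gives the threshold values, not how they are combined, so treat that part as a guess.

```python
def zero_crossing_rate(samples, sample_rate):
    """Zero crossings per second for a block of audio samples."""
    if not samples:
        return 0.0
    # Count sign changes between neighbouring samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0.0) != (b < 0.0)
    )
    return crossings * sample_rate / len(samples)

def gate_frames(frames, centroids, kurtoses,
                centroid_thresh=800.0, kurtosis_thresh=400.0):
    """Keep a frame if either descriptor exceeds its threshold, else mute it.

    frames:    list of sample lists, one per analysis frame
    centroids: spectral centroid per frame, in Hz
    kurtoses:  spectral kurtosis per frame, in the analysis tool's own scale
    The OR combination is an assumption, not something stated in the text.
    """
    gated = []
    for frame, centroid, kurt in zip(frames, centroids, kurtoses):
        is_open = centroid > centroid_thresh or kurt > kurtosis_thresh
        gated.append(list(frame) if is_open else [0.0] * len(frame))
    return gated
```

A hard on/off gate like this can click at frame boundaries, so in practice a short fade at each opening and closing would probably be needed.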
Example clip: “an’at last the north wind gave up th’attempt”:
Top to bottom: raw audio; power spectrum; centroid; kurtosis; gated spectrum.
The moment measures are typically very noisy in the soft passages we regard as silent, as can be seen in the kurtosis above, but when using these measures directly as a gate that is not really a problem, since the gated sound is also perceived as almost silent. If I am to implement this in a peak detection process, though, I will probably try to eliminate ‘silent’ passages and add some smoothing to the envelopes.
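One possible way to do that clean-up, sketched here as an untested idea rather than something I have implemented, is to zero the descriptor wherever the frame's RMS level falls below a floor and then run it through a simple one-pole smoother. The names `rms` and `clean_envelope`, the silence floor of 0.01 and the smoothing coefficient of 0.5 are all arbitrary placeholders.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def clean_envelope(values, frames, silence_rms=0.01, smoothing=0.5):
    """Suppress a descriptor in near-silent frames, then smooth it.

    values:      one descriptor value per frame (e.g. kurtosis)
    frames:      the matching audio frames, used only to measure level
    silence_rms: frames quieter than this are treated as silent (placeholder)
    smoothing:   one-pole coefficient, 0 = no smoothing, near 1 = heavy
    """
    smoothed = []
    state = 0.0
    for value, frame in zip(values, frames):
        # Ignore the noisy descriptor when there is practically no signal.
        target = value if rms(frame) >= silence_rms else 0.0
        # One-pole low-pass: move part of the way towards the target.
        state = smoothing * state + (1.0 - smoothing) * target
        smoothed.append(state)
    return smoothed
```

The resulting envelope should be much easier to run a peak picker on than the raw descriptor, at the cost of slightly delayed onsets.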
Original audio example:
Gated sound, only fricatives and plosives:
Bunnell, T. H., Polikoff, J., & McNicholas, J. (2004). Spectral Moment vs. Bark Cepstral Analysis of Children’s Word-Initial Voiceless Stops. 8th International Conference on Spoken Language Processing, 73, 1999. Retrieved from https://www.isca-speech.org/archive/interspeech_2004/i04_1313.html
Forrest, K., Weismer, G., Milenkovic, P., & Dougall, R. N. (1988). Statistical analysis of word‐initial voiceless obstruents: Preliminary data. The Journal of the Acoustical Society of America, 84(1), 115–123. https://doi.org/10.1121/1.396977