DSP - Voice Activity Detection: what are my options?

Question

In a performance-sensitive environment, I have an audio stream. I need to classify each frame as speech or non-speech. For this purpose, only "clean" voice should be classified as speech. Voice with substantial background noise (maybe music) should be classified as non-speech.

What features/methods could you suggest?

How much is a frame? What else might there be other than speech? Distinguishing between human speech and synthesized speech, for instance, would be difficult. Distinguishing between human speech and a jackhammer would be easy. — endolith, Aug 05 '11 at 21:41
Unfortunately, synthesized speech is a possibility. But I do realize that trying to address this would probably be overkill, and I'm willing to make this sacrifice. As for frame size, I'm pretty sure I can control it. It works with 4000-sample frames at 8000 samples/sec. The 500 ms delay is fine, but I wouldn't want to make it any larger. – Michael Litvin, Aug 05 '11 at 23:34
Despite the mention of "DSP" (only in the title!) this seems to be a software question, and doesn't have anything to do with electronics. Voted to close. — stevenvh, Aug 06 '11 at 09:32
I had decent results with linear regression coefficients of the spectrum. – Michael Litvin, Aug 06 '11 at 09:11
Your answer is currently only a comment. Can you explain more about how you did this, what it means, and/or why it works? – Kortuk, Aug 06 '11 at 11:16
It's still a work in progress, so I don't have much more to say yet. I used MATLAB to calculate the slope of a line y=ax+b fitted to the FFT, and saw by eye that there was a significant difference between speech and non-speech. – Michael Litvin, Aug 06 '11 at 12:30
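
The slope-of-the-spectrum idea from the comments above can be sketched roughly as follows. This is only an illustration, not the commenter's MATLAB code: it assumes the line is fitted to the magnitude spectrum in dB against frequency in Hz (the thread does not say which scale was used), it uses a naive DFT so that it needs no libraries, and the names `dft_magnitudes` and `spectral_slope` are made up for this example. The thread gives no decision threshold, so none is suggested here.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the DFT bins 0..n/2 (naive O(n^2) DFT, for illustration)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def spectral_slope(frame, sample_rate=8000):
    """Least-squares slope a of y = a*x + b.

    Assumption: x is the bin frequency in Hz and y is the bin magnitude in dB.
    The result is therefore in dB per Hz.
    """
    n = len(frame)
    mags = dft_magnitudes(frame)
    xs = [k * sample_rate / n for k in range(len(mags))]
    ys = [20 * math.log10(m + 1e-12) for m in mags]  # small offset avoids log(0)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

A frame dominated by low frequencies gives a negative slope and one dominated by high frequencies gives a positive slope. The naive DFT would be far too slow for a 4000-sample frame; in practice an FFT routine would replace `dft_magnitudes`, possibly applied to shorter sub-frames.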

score 3 · Answer 1 · answered Aug 05 '11 at 21:52


I agree. I have researched this topic before and found it to be a very complex subject. Wikipedia lists some basic algorithms (http://en.wikipedia.org/wiki/Speech_recognition#Algorithms). I highly suggest you do lots of research. Here are some links for you to enjoy :) http://www.dsprelated.com/showmessage/83934/1.php and http://www.ee.columbia.edu/~dpwe/pubs/LeeE06-vad.pdf, and there are many more on Google.

answered by O_O

Great links! I really hope that I won't have to program and train an HMM though... thanks – Michael Litvin Aug 05 '11 at 23:48

score 1 · Answer 2 · answered Aug 07 '11 at 16:42

What I would do: first try to find the fundamental frequency. A speaking voice does not hold a fixed note, so the pitch tracking needs to respond quickly; a direct phase-locking method may work better than an FFT. Then feed this frequency into a comb filter to remove the fundamental and all its overtones. What remains is then, ideally, only pop and hiss noises, which are either quite low or quite high in frequency, so bandpassing the midrange should, for a clear and single voice signal, leave only a very weak residual signal. For music or other noises, on the other hand, you have a wide mixture of frequencies throughout the midrange, so comb filtering will not weaken the RMS level very much at all. A high level after the comb and bandpass filtering therefore indicates that the source was not clean voice.

I tried this with a simple SynthMaker program, and it is not really reliable yet but does in principle work.

[Figure: SynthMaker schematic for the clear-voice detection]

Result for speech alone:

[Figure: result of the SynthMaker schematic for the clear-voice detection, speech input]

The combfiltered signal is 6 dB weaker than the only-bandpassed one.

Result for music (speech+acoustic guitar, just to test):

[Figure: result of the SynthMaker schematic for the clear-voice detection, speech + acoustic guitar input]

Here, the combfiltered signal is actually louder (the filter is wrongly normalized).
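
For readers without SynthMaker, the comb-filter idea in this answer can be approximated in a few lines of Python. This is a rough sketch under several assumptions, not a port of the schematic: the pitch is estimated with a plain autocorrelation peak (swapped in for the phase-locking method the answer suggests), the comb is a simple feedforward notch `x[n] - x[n-T]`, the midrange bandpass stage is left out, and the search range of 60 to 400 Hz as well as all function names are hypothetical. The comb output is scaled by 1/sqrt(2) so that white noise passes at roughly unchanged RMS.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_period(frame, min_lag, max_lag):
    """Lag in samples with the largest autocorrelation in [min_lag, max_lag]."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def comb_residual_db(frame, sample_rate=8000, f_min=60.0, f_max=400.0):
    """Level after comb filtering, in dB relative to the unfiltered frame.

    Assumptions: pitch lies between f_min and f_max (hypothetical defaults),
    and the frame is longer than sample_rate / f_min samples.
    """
    period = estimate_period(frame,
                             int(sample_rate / f_max),
                             int(sample_rate / f_min))
    scale = 1 / math.sqrt(2)  # white noise keeps roughly the same RMS
    # Feedforward comb: nulls at multiples of sample_rate / period.
    residual = [scale * (frame[i] - frame[i - period])
                for i in range(period, len(frame))]
    reference = frame[period:]
    return 20 * math.log10((rms(residual) + 1e-12) / (rms(reference) + 1e-12))

# Synthetic test signals (not real speech or music):
# a 100 Hz harmonic tone and Gaussian white noise.
voiced = [sum(math.sin(2 * math.pi * 100 * h * i / 8000) / h
              for h in range(1, 6))
          for i in range(800)]
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]

voiced_db = comb_residual_db(voiced)
noise_db = comb_residual_db(noise)
print(voiced_db, noise_db)
```

The perfectly periodic tone is cancelled almost completely, while the noise comes out at close to 0 dB. Real speech is only approximately periodic, so the difference will be much smaller, as the 6 dB figure above suggests.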

score 0 · Answer 3 · answered Aug 05 '11 at 21:47

This is not a simple problem, and this is more of a guess than an answer.

One distinction that comes to mind is that pure voice will have very little amplitude at higher frequencies, say above 3 kHz. Unfortunately, things like a hard S sound (hissing) do occur in voice and can have components up to around 8 kHz. Sometimes music and other background noise will also be limited to frequencies below 3 kHz. So frequency distinction will help, but it isn't good enough on its own.

Lots of music will have a rhythmic beat, especially when looking at just the bass frequencies. That isn't true of all kinds of noise, though.

As I said, this is not a simple problem and will probably require substantial experimentation.
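
As a starting point for that experimentation, the frequency-distinction feature could be measured as the fraction of a frame's energy above some cutoff. The sketch below is only an illustration: the function name `band_energy_fraction` is hypothetical, the 3 kHz cutoff is taken from the text above as a rough guess, and a naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math

def band_energy_fraction(frame, sample_rate, cutoff_hz):
    """Fraction of spectral energy at or above cutoff_hz.

    Uses a naive O(n^2) DFT over bins 1..n/2 (DC is excluded).
    Returns 0.0 for an all-zero frame.
    """
    n = len(frame)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power = abs(acc) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0
```

For example, `band_energy_fraction(frame, 8000, 3000.0)` gives the share of energy between 3 and 4 kHz for a frame sampled at 8000 samples/sec. As the answer notes, this feature alone is unlikely to separate clean voice from everything else.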

I have only 4kHz of signal anyway, so I'll have to find something else. Thanks! — Michael Litvin, Aug 05 '11 at 23:46

score 0 · Answer 4 · answered Aug 05 '11 at 23:30


My (long ago) Master's thesis involved doing this as a minor part of the overall task, largely with hardware. I can dig it out and see what I concluded :-). As I recall, the nature of energy content around 400 Hz is a major indicator. This was for use on telephone circuits. I also had some papers from British Telecom on the subject, which I can provide references to and (just) possibly copies of.
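
The answer does not say how the energy around 400 Hz was evaluated, so the following is only one possible software reading of it: measuring what share of a frame's energy sits in the DFT bin nearest 400 Hz, using the Goertzel algorithm, which is cheap when only one frequency is of interest. The function names and the choice of a single bin are assumptions made for this sketch.

```python
import math

def goertzel_power(frame, sample_rate, target_hz):
    """Squared magnitude of the DFT bin nearest target_hz (Goertzel algorithm)."""
    n = len(frame)
    k = round(n * target_hz / sample_rate)  # nearest bin index
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def relative_energy_at(frame, sample_rate, target_hz):
    """Share of the frame's energy in the bin nearest target_hz (0.0 to 1.0).

    By Parseval's theorem, a real tone exactly on the bin yields about 1.0.
    """
    total = sum(x * x for x in frame)
    if total == 0:
        return 0.0
    return 2 * goertzel_power(frame, sample_rate, target_hz) / (len(frame) * total)
```

With 400-sample frames at 8000 samples/sec the bins are 20 Hz apart, so `relative_energy_at(frame, 8000, 400.0)` looks at a narrow band centred on 400 Hz. A wider band could be covered by summing a few neighbouring bins.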

answered by Russell McMahon

Wow, that would be great! Thx :) – Michael Litvin Aug 05 '11 at 23:41
