1

In a performance-sensitive environment, I have an audio stream, and I need to classify each frame as speech or non-speech. For this purpose, only "clean" voice should be classified as speech. Voice with substantial background noise (music, for example) should be classified as non-speech.

What features/methods could you suggest?

Michael Litvin
  • How much is a frame? What else might there be other than speech? Distinguishing between human speech and synthesized speech, for instance, would be difficult. Distinguishing between human speech and a jackhammer would be easy. – endolith Aug 05 '11 at 21:41
  • Unfortunately, synthesized speech is a possibility. But I realize that trying to address it would probably be overkill, and I'm willing to make this sacrifice. As for frame size, I'm pretty sure I can control it. It works with 4000-sample frames at 8000 samples/sec. The 500 ms delay is fine, but I wouldn't want to make it any larger. – Michael Litvin Aug 05 '11 at 23:34
  • Despite the mention of "DSP" (only in the title!) this seems to be a software question, and doesn't have anything to do with electronics. Voted to close. – stevenvh Aug 06 '11 at 09:32
  • I got fairly good results with linear regression coefficients of the spectrum. – Michael Litvin Aug 06 '11 at 09:11
  • Your answer is currently only a comment. Can you explain more about how you did this, what it means, and/or why it works? – Kortuk Aug 06 '11 at 11:16
  • It's still a work in progress, so I don't have much more to say yet. I used Matlab to fit a line y=ax+b to the FFT, and saw by eye that there was a significant difference between speech and non-speech. – Michael Litvin Aug 06 '11 at 12:30
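
For readers who want to try the line-fit idea from the comments above, here is a minimal sketch in Python rather than Matlab. It is not the asker's code. It assumes the line is fitted to the magnitude spectrum in dB against frequency in Hz (the comments do not say whether a log scale was used), and the function names `magnitude_spectrum` and `spectral_slope` are made up for illustration. The DFT is a naive O(N²) loop to keep the example dependency-free; a real implementation would use an FFT library.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed with a naive O(N^2) DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectral_slope(frame, sample_rate, floor=1e-12):
    """Least-squares fit of y = a*x + b, where x is frequency in Hz and
    y is magnitude in dB. Returns (a, b). The dB scale is an assumption."""
    mags = magnitude_spectrum(frame)
    n = len(frame)
    xs = [k * sample_rate / n for k in range(len(mags))]
    # Clamp tiny magnitudes so that log10 never sees zero.
    ys = [20.0 * math.log10(max(m, floor)) for m in mags]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b
```

The slope `a` (and possibly the intercept `b`) would then be one feature per frame. Any threshold separating speech from non-speech has to come from labelled recordings; none is implied here.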

4 Answers

3

I agree. I've researched this topic before and found that it is a very complex subject. Here are some basic algorithms: http://en.wikipedia.org/wiki/Speech_recognition#Algorithms. I highly suggest you do lots of research. Here are some links for you to enjoy :) http://www.dsprelated.com/showmessage/83934/1.php and http://www.ee.columbia.edu/~dpwe/pubs/LeeE06-vad.pdf, and there are many more on Google.

O_O
1

What I would do: first, try to find the fundamental frequency. A speaking voice does not hold a fixed note, so the pitch tracking has to respond quickly; a direct phase-locking method may work better than an FFT. Then feed this frequency into a comb filter to remove the fundamental and all of its overtones. Ideally, what remains is only pop and hiss noise, which is either quite low or quite high in frequency, so band-passing the midrange should leave only a very weak signal when the input is a clear, single voice.

Music and other noises, on the other hand, contain a wide mixture of frequencies throughout the midrange, so comb filtering will not reduce the RMS level much at all. A high level after the comb and band-pass filtering therefore indicates that the source was not clean voice.
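
To make the idea concrete, here is a rough Python sketch of the comb-filter part only. It is not the SynthMaker patch: it swaps the phase-locking pitch tracker for a plain autocorrelation peak search, it skips the midrange band-pass stage entirely, and the 60 to 400 Hz pitch range, the 1/√2 normalization and all function names are assumptions chosen for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_period(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Pitch period in samples: the lag with the largest (unnormalized)
    autocorrelation. The frame must be longer than sample_rate / f_min."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def comb_residual_ratio(frame, sample_rate):
    """RMS after comb filtering divided by RMS before (linear ratio).

    The feedforward comb y[n] = (x[n] - x[n - T]) / sqrt(2) has notches at
    every multiple of sample_rate / T. The 1/sqrt(2) factor makes
    uncorrelated noise come out at roughly the same level it went in.
    """
    period = estimate_period(frame, sample_rate)
    residual = [(frame[i] - frame[i - period]) / math.sqrt(2.0)
                for i in range(period, len(frame))]
    reference = rms(frame[period:])
    return rms(residual) / reference if reference > 0.0 else 0.0
```

A ratio near 0 means almost everything in the frame was harmonic at the estimated pitch; a ratio near 1 means the comb removed very little. Real speech drifts in pitch within a frame and contains unvoiced sounds, so expect values well above 0 even for clean voice.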


I tried this with a simple SynthMaker program:

[Image: SynthMaker schematic for the clear-voice detection]

It is not really reliable yet, but it does work in principle.

Result for speech alone:

[Image: result of the SynthMaker schematic for the clear-voice detection for speech]

The comb-filtered signal is 6 dB weaker than the band-passed-only one.

Result for music (speech+acoustic guitar, just to test):

[Image: result of the SynthMaker schematic for the clear-voice detection for speech+acoustic guitar]

Here, the comb-filtered signal is actually louder (the filter is wrongly normalized).

leftaroundabout
0

This is not a simple problem, and this is more of a guess than an answer.

One distinction that comes to mind is that pure voice will have very little amplitude at higher frequencies, say above 3 kHz. Unfortunately, things like a hard S sound (hissing) do occur in voice and can have components up to around 8 kHz. Sometimes music and other background noise will be limited to frequencies below 3 kHz too. So frequency distinction will help, but it isn't good enough on its own.
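
As a starting point for experimenting with this, here is a small Python sketch that measures what fraction of a frame's spectral energy lies above a cutoff. The function name `high_band_fraction` and the 3 kHz default are illustrative assumptions, and the naive DFT loop is only there to keep the example free of dependencies. Note that the sample rate must be well above twice the cutoff for the measure to mean much.

```python
import cmath
import math


def high_band_fraction(frame, sample_rate, cutoff_hz=3000.0):
    """Fraction of non-DC spectral energy above cutoff_hz (0.0 to 1.0)."""
    n = len(frame)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2 + 1):  # bin 0 (DC) is skipped on purpose
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power = abs(acc) ** 2
        total += power
        if k * sample_rate / n > cutoff_hz:
            high += power
    return high / total if total > 0.0 else 0.0
```

As noted above, sibilants will push this fraction up even in clean speech, so it is at best one feature among several.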

Lots of music will have a rhythmic beat, especially when looking at just the bass frequencies. That isn't true of all kinds of noise, though.

As I said, this is not a simple problem and will probably require substantial experimentation.

Olin Lathrop
0

My (long ago) Master's thesis involved doing this as a minor part of the overall task, largely with hardware. I can dig it out and see what I concluded :-). As I recall, the nature of the energy content around 400 Hz is a major indicator. This was for use on telephone circuits. I also had some papers from British Telecom on the subject, which I can provide references to and (just) possibly copies of.
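
The answer does not say how the 400 Hz energy was measured or used, so the following is only one cheap way to look at it in software: a Goertzel filter, which evaluates a single frequency without a full FFT. The function names and the normalization are assumptions for illustration, not the thesis method.

```python
import math


def goertzel_power(frame, sample_rate, target_hz):
    """Squared magnitude of the frame's spectrum at target_hz (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * target_hz / sample_rate)
    s1 = 0.0
    s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2 = s1
        s1 = s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def tone_fraction(frame, sample_rate, target_hz=400.0):
    """Share of the frame's energy explained by a sinusoid at target_hz.

    Close to 1.0 for a pure tone at target_hz spanning a whole number of
    cycles, close to 0.0 when there is little energy at that frequency.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    power = goertzel_power(frame, sample_rate, target_hz)
    return 2.0 * power / (len(frame) * energy)
```

A single Goertzel bin is narrow (about `sample_rate / len(frame)` Hz wide), so covering a band "around 400 Hz" would mean evaluating a few neighbouring frequencies and summing the results.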

Russell McMahon