On poor KWS performance: tweak the whole pipeline?

As @vjreddi says, “it shouldn’t be really bad for yes/no. It should definitely work.” But it doesn’t feel responsive to me. @RuiBGomes, whose results “were a little bit slow”, postulates: “Are those issues related to: - equipment sensitivity? - model underfitting for DLR and overfitting for U?” That got me thinking.

I decided ProcessLatestResults was sending results to RespondToCommand in a way that was driving me crazy, delaying and hiding data that I wanted to see. I played around and ended up putting

if (score > 140) {
  Serial.print(score);            // raw score (0-255)
  Serial.print(" ");              // separator so score and label don't run together
  Serial.println(found_command);
}

between them. That gives me a running stream of scores and labels on the serial monitor.

A lot happens between what we normally see reported to us as “Heard go (202) @7446ms” and the next report, maybe “Heard down (206) @124064ms” (though Serial.print might be messing with the error reporting). Still, a lot of predictions are being made all the time, and they are right.

Something else is happening. Every few seconds a bunch of “unknown”s show up. That may be some code ‘feature’; there is a 3-second constant in there somewhere.
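I haven’t traced the exact constant, but time-window logic of roughly this shape would produce that rhythm. This is a hypothetical sketch, not the actual micro_speech source; kSuppressionMs and IsNewCommand are names I made up for illustration:

#include <stdint.h>
#include <string.h>

// Hypothetical illustration: a result only counts as "new" when the
// label changes or when a fixed interval has passed since the last
// report. With a 3-second constant, repeated "unknown" results would
// surface in bursts every few seconds.
const int32_t kSuppressionMs = 3000;  // assumed value, for illustration

bool IsNewCommand(const char* current, const char* previous,
                  int32_t now_ms, int32_t last_report_ms) {
  if (strcmp(current, previous) != 0) return true;    // label changed
  return (now_ms - last_report_ms) > kSuppressionMs;  // window expired
}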

Each group of up, down, or go corresponds to an utterance of me saying that word. The model is working and making good predictions almost always.

there is something screwy about sound level

If you were to turn down the threshold on score, you would see that the model is making even more predictions, even when no one is speaking. I was inspired by @dizgotti, who suggested “that you have dug into the code a bit to find failed to detect”. @brian_plancher spoke “of adding volume requirements to the website” and referenced @petewarden’s work on sound level. Even though that was from a different topic, it got me considering the possibility that some normalizing of the sound level is going on somewhere in the code. Once you get down to the Nano 33 you end up with an interpreter that is working way too hard and is being throttled in what it reports.
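One way to check the raw input level, before any of the feature code touches it, would be something like this. It’s an unbenchmarked sketch using the standard Arduino PDM library for the Nano 33 BLE Sense, the same library the micro_speech audio provider uses:

#include <PDM.h>

// Print the RMS of each raw microphone buffer so the un-normalized
// input level is visible on the serial monitor/plotter.
static short sampleBuffer[512];
static volatile int samplesRead = 0;

void onPDMdata() {
  int bytesAvailable = PDM.available();
  PDM.read(sampleBuffer, bytesAvailable);
  samplesRead = bytesAvailable / 2;  // 16-bit samples
}

void setup() {
  Serial.begin(9600);
  while (!Serial) {}
  PDM.onReceive(onPDMdata);
  if (!PDM.begin(1, 16000)) {  // mono, 16 kHz, as in micro_speech
    Serial.println("Failed to start PDM!");
    while (true) {}
  }
}

void loop() {
  if (samplesRead) {
    long long sumSquares = 0;  // long long: 256 samples of +/-32767 overflow a 32-bit long
    for (int i = 0; i < samplesRead; i++) {
      sumSquares += (long long)sampleBuffer[i] * sampleBuffer[i];
    }
    Serial.println(sqrt((float)(sumSquares / samplesRead)));
    samplesRead = 0;
  }
}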

possible solutions

knobs, levels and tweaking

@H_Scholten suggests a solution. The “idea is to trigger the recording with the switch on the board and record a word.” Another possibility would be to measure the ambient sound level at startup and only classify samples above a certain level. Then you could rewrite a bit of the scoring and is_new_command code to improve responsiveness. Maybe add a knob for sensitivity control, like the squelch on a CB radio. Give our poor classifier a fighting chance.
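A minimal sketch of that squelch idea, under stated assumptions: the function names are mine (BufferRms, CalibrateAmbient, LoudEnoughToClassify), the knob is a potentiometer wired to A0, and the margin scaling is a guess that would need tuning:

#include <math.h>
#include <stdint.h>

const int kKnobPin = A0;          // potentiometer for sensitivity, CB-radio style
static float ambient_rms = 0.0f;  // baseline measured at startup

// RMS of one buffer of 16-bit audio samples.
float BufferRms(const int16_t* samples, int count) {
  long long sum_squares = 0;
  for (int i = 0; i < count; ++i) {
    sum_squares += (long long)samples[i] * samples[i];
  }
  return sqrtf((float)sum_squares / count);
}

// Call once at startup with a buffer captured while the room is quiet.
void CalibrateAmbient(const int16_t* samples, int count) {
  ambient_rms = BufferRms(samples, count);
}

// Only hand buffers to the interpreter when they rise above the
// ambient baseline by a knob-controlled margin.
bool LoudEnoughToClassify(const int16_t* samples, int count) {
  float margin = analogRead(kKnobPin) / 1023.0f * 500.0f;  // 0..500 RMS units, tune to taste
  return BufferRms(samples, count) > ambient_rms + margin;
}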

tweak the whole pipeline

@stephenbaileymagic says: “I believe that I will need to do a lot of tweaks to the whole pipeline to improve real world performance.” and he may be on to something. The MFCC/FFT features might be the only features we are using, and those spectrograms carry no information about amplitude. Perhaps we should redesign our features: take that speech_commands_v0.02 dataset and double it, reducing the amplitude on one half, labeling that half quiet and the other not_quiet. Add that feature to the classifier along with the MFCC features and retrain the whole lot.
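The core transform for that augmentation is tiny. Here is a sketch that assumes the clip’s 16-bit PCM samples are already in memory; the WAV I/O and the quiet/not_quiet label bookkeeping are left out:

#include <stddef.h>
#include <stdint.h>

// Make a quiet copy of a clip by scaling its 16-bit PCM samples.
// Running every clip through this with a gain of, say, 0.25f would
// double the dataset while keeping the word labels intact.
void MakeQuietCopy(const int16_t* in, int16_t* out, size_t count, float gain) {
  for (size_t i = 0; i < count; ++i) {
    out[i] = (int16_t)(in[i] * gain);
  }
}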

It may be overkill, but @stephenbaileymagic, @SecurityGuy and I have been pondering similar issues for our speaker recognition project. MFCC is great, but it doesn’t give us any information on the cadence of the voice or the variations in amplitude that might well help distinguish between speakers saying the same phrase.

Hope you all have a great weekend. Thanks for your posts; they are keeping my head in the game. (Sorry for cross-posting, but the class forum doesn’t seem to recognize @person.)


Just to give an update, as I don’t have any metrics I can point to yet:

@mcktimo’s suggestion about amplitude inspired me, so I spent the weekend taking a dive into the dataset itself, partly because it is the beginning of the pipeline and partly because I am more comfortable with audio tools than with the code at this point. After reviewing a few thousand waveforms, I can say that there seems to be a fairly random distribution of low- and high-amplitude samples in the dataset. I’m intentionally saying random, not balanced.
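To quantify that impression rather than eyeball it, a rough scan like the following could tally peak amplitudes across clips. It assumes the canonical 44-byte PCM header (a simplification; a real tool should walk the RIFF chunks) and runs on a desktop, not the board:

#include <cstdint>
#include <cstdio>

// Peak absolute sample value of one speech_commands-style WAV file.
int PeakAmplitude(const char* path) {
  FILE* f = fopen(path, "rb");
  if (!f) return -1;
  fseek(f, 44, SEEK_SET);  // skip the (assumed) 44-byte PCM header
  int peak = 0;
  int16_t sample;
  while (fread(&sample, sizeof sample, 1, f) == 1) {
    int mag = sample < 0 ? -(int)sample : sample;
    if (mag > peak) peak = mag;
  }
  fclose(f);
  return peak;
}

int main(int argc, char** argv) {
  // Pass any number of .wav paths; prints "path peak" per file so the
  // amplitude distribution can be histogrammed downstream.
  for (int i = 1; i < argc; ++i) {
    printf("%s %d\n", argv[i], PeakAmplitude(argv[i]));
  }
  return 0;
}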

I’ve made some changes to the dataset, and as soon as I get a chance I’m going to run them through to deployment to see what performance improvements, if any, they provide. I’ll let everyone know if I get anything good. :slight_smile:
