Adventures in Dataset Engineering

This post discusses the evolution of my earlier custom keyword spotting efforts and the results of my first dataset engineering attempts.

Previously, I was able to train a model to recognize my own keyword (“abracadabra”) and deployed it to the Nano 33 BLE Sense device. Once on device, the model was able to successfully recognize my keyword and respond, though the performance was not at a suitable level for any real application.

I achieved this by recording and processing 500 samples of myself saying the keyword in a variety of settings and with a variety of background noise. I then took 500 samples of each of the words in the official speech_commands dataset, combined them all together and trained using the train_micro_speech colab.

By default, this colab trains on the full speech_commands dataset using the words “yes” and “no” and produces a TFLite Model which has an accuracy of about 91%. The model I trained with my keyword and the subset of the dataset got an accuracy of 91% so I was happy about that.

In real world performance though, I thought I could do better. So I took the samples in my dataset, and created synthetic data by processing each sample through 3 different filters in the Audacity audio editing software. These filters were distortion, pitch shift and reverb and they were applied independently so my output was in the format sample1, sample1_distortion, sample1_reverb, sample1_pitch_shift, etc. In the end my dataset grew to 2000 samples per keyword.
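For anyone who would rather script this kind of augmentation than click through Audacity, here is a minimal Python sketch of the same idea. It is not what Audacity does internally: the three effects are crude NumPy stand-ins (hard clipping, resampling, and convolution with a synthetic decaying impulse response), and the function names, parameter values, and sine-tone stand-in clips are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed: 16 kHz mono, 1 s clips like speech_commands


def distort(audio, gain=8.0):
    """Hard-clip distortion: boost the signal, then clip it to [-1, 1]."""
    return np.clip(audio * gain, -1.0, 1.0)


def pitch_shift(audio, factor=1.12):
    """Naive pitch shift by resampling, then pad or trim to the original
    length. Unlike a real pitch shifter, this also changes the speed."""
    n = len(audio)
    src_positions = np.arange(0, n - 1, factor)
    shifted = np.interp(src_positions, np.arange(n), audio)
    out = np.zeros(n, dtype=audio.dtype)
    out[: len(shifted)] = shifted[:n]
    return out


def reverb(audio, decay_s=0.3, seed=0):
    """Simple reverb: convolve with exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    ir_len = int(decay_s * SAMPLE_RATE)
    t = np.arange(ir_len) / SAMPLE_RATE
    impulse = rng.standard_normal(ir_len) * np.exp(-6.0 * t / decay_s)
    impulse[0] = 1.0  # keep the dry signal as the first tap
    wet = np.convolve(audio, impulse)[: len(audio)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


def augment_dataset(clips):
    """Return the original clips plus one copy per effect, each effect
    applied independently to the original (never stacked)."""
    out = {}
    for name, audio in clips.items():
        out[name] = audio
        out[f"{name}_distortion"] = distort(audio)
        out[f"{name}_reverb"] = reverb(audio)
        out[f"{name}_pitch_shift"] = pitch_shift(audio)
    return out


# Stand-in "recordings": three sine tones instead of real keyword clips.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
clips = {
    f"sample{i}": 0.5 * np.sin(2 * np.pi * (200 + 50 * i) * t)
    for i in range(1, 4)
}
augmented = augment_dataset(clips)
print(len(clips), "->", len(augmented))  # 3 -> 12
```

Each original clip yields four entries, which is how 500 recordings per keyword become 2000.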

I also added some additional personally recorded samples to the background noise folder to include sounds which are typical in my space and hopefully cut down on false positives.

After training on my new dataset, the resulting model achieved 92% accuracy on my test data. That is only a modest gain, but accuracy didn’t drop, so I thought that was good.

However, once deployed to my device, the model performance fell off a cliff. The new model performed worse than the default model and worse than my previous model with fewer samples. I don’t have any metrics to quantify it; this is just my perception of the performance.

So overall this experience taught me a couple of important things:

  1. Bench performance is not always the best indicator of what will happen in real life.
  2. Manipulating your dataset can significantly change performance, even with all other variables held constant.

I know both of these facts were covered in the course, but to experience them first hand really sets them in my mind.

I really enjoyed this experiment, and plan to continue to work on it to see what I can achieve. I hope this post inspires others to experiment and share their results. I am happy to get more specific on any details if people are interested.


@stephenbaileymagic this rocks!

It is awesome that you went through it so thoroughly :+1:

Spot on.

We are thinking of working on a better model that can be integrated that is bigger … the course KWS model is dinky … just 16 KB … but we are hoping we can use something like DS-CNN that is better. We will look into it. Stay tuned for that. But no promises though.



Once on device, the model was able to successfully recognize my keyword and respond though the performance was not at a suitable level for any real application.

So the question in my mind is how you can go from 91% in a TFLite test to unacceptable on the device.

It could be that the C++ array we send to the device isn’t as good as the TFLite model. It would be good to have some code in Colab that actually ran that array through the C functions sent to the device, using our voice tests.

But I tend to doubt that. My tests show the Nano 33 is actually doing better than its behavior suggests. The problem seems to be in the post-processing scoring code. I want to get back to that, but first I think I should finish the rest of the class.

Another model would be fun to work with but I think there is more to do to get to the bottom of this one.

As you are finding out @mcktimo, there is a lot of nuance to this and no one size fits all. This is why KWS is not widely deployed in many different languages: tweaking it to get it right is challenging. This is also why, on devices like our smartphones, Google and others allow us to tweak the “operating point” (course 2 concept). In effect, what you are doing in the post-processing scoring is like tweaking that operating point.
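To make the operating point idea concrete, here is a toy Python sketch. The scores are made up for illustration, not measured from any model, and `error_rates` is a hypothetical helper: it counts how many keyword clips fall below a threshold (false rejects) and how many non-keyword clips land at or above it (false accepts).

```python
def error_rates(keyword_scores, other_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at one threshold."""
    false_rejects = sum(s < threshold for s in keyword_scores)
    false_accepts = sum(s >= threshold for s in other_scores)
    return (false_rejects / len(keyword_scores),
            false_accepts / len(other_scores))


# Hypothetical model scores for the keyword class.
keyword_scores = [0.95, 0.90, 0.85, 0.70, 0.60]  # clips that ARE the keyword
other_scores = [0.10, 0.20, 0.40, 0.65, 0.75]    # clips that are NOT

for threshold in (0.5, 0.7, 0.9):
    frr, far = error_rates(keyword_scores, other_scores, threshold)
    print(f"threshold={threshold}: false rejects={frr:.1f}, "
          f"false accepts={far:.1f}")
```

Raising the threshold trades false accepts for false rejects. The model is the same at every setting; only the operating point moves.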


Hi @mcktimo,

I think I mentioned before that I don’t have a lot of experience in the code, so if you are digging into it after the class, I’d love to hear what you find out.

I have a few thoughts on what is causing the performance difference that I plan on experimenting with. I am working from the assumption that the performance difference is somewhere in the input data. For example, the input to the model is coming from the onboard mic vs. the mic I used to record my training data.

I have also been running my model on the device and just watching what it does. I have noticed unexpected results in what causes false positives/negatives and unknowns. As they said in the Audio Analytic video that @vjreddi shared with us, the model learned something, it just didn’t learn what I thought I was teaching it.

I also think I should get through the class before I dig too much deeper, but I’m having fun trying to figure it all out. I’ll let you know what I discover. Thank you for keeping the conversation going :slight_smile:

I may be way off here since I’m a bit behind you in the course, but one thought came to mind…how long does it take to say your keyword?

If I understand the micro_speech sample code correctly, it assembles 49 x 20ms slices, so it’s running the inference on a sliding audio window of 980ms.

Playing with my stopwatch, I say “up” about three times per second (333-ish ms), and a two-syllable keyword in about 800ms.

However, if I try to pronounce “abracadabra” clearly it takes more than one second, so with a 980ms sliding window, the network is never going to receive a spectrogram of the entire word.
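The window arithmetic can be checked with a few lines of Python. This is only a sketch: the 30 ms slice duration is assumed here as the micro_speech default rather than taken from this thread. Counting 49 non-overlapping 20 ms slices gives the 980 ms quoted above; with 30 ms slices taken every 20 ms the span is 10 ms longer. Either way it is just under one second. The 1100 ms figure for “abracadabra” is a hypothetical stand-in for “more than one second”.

```python
def window_span_ms(slice_count=49, stride_ms=20, slice_duration_ms=30):
    """Length of audio covered by overlapping feature slices."""
    return (slice_count - 1) * stride_ms + slice_duration_ms


span = window_span_ms()
print(f"window span: {span} ms")  # window span: 990 ms

# The first two durations are the stopwatch figures from the post;
# the last one is a hypothetical value for "more than one second".
durations_ms = {"up": 333, "two-syllable keyword": 800, "abracadabra": 1100}
for word, duration in durations_ms.items():
    verdict = "fits" if duration <= span else "does not fit"
    print(f"{word}: {duration} ms {verdict} in one window")
```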

Maybe ML is resistant to this sort of error, but since our post-processing code considers multiple detections, could the window size be hurting accuracy? @vjreddi any advice?

Hi @SecurityGuy,

I don’t think you are way off. I was looking at the waveforms quite a bit when I was putting this together, and “abracadabra” is almost guaranteed to slide outside of the window when I say it. I just went through the section in the course that described the post-processing with multiple inferences last week, so I think there are surely some tweaks I need to make to optimize the response.

I have also thought that if I want to keep the one second window, I can train on just “abraca” or “abra”. After all, those are still both unlikely sounds to get spoken accidentally and “abracadabra” is really just theater. I am going to play with this some more after I finish the rest of the course to see if I can get it to pick out those parts of the word effectively.

I also had a question for you about authentication via the machine recognizing your individual voice. Would it be considered a (weak) form of two-factor authentication if the machine was trained not only on one’s voice but also on a specific word? Meaning that the person would not only have to know the correct word, but the machine would also have to recognize their voice saying that word? I don’t know if that question makes sense, but I was just wondering if you had any thoughts.

Thanks, hope you had a good weekend!

A couple of factors might aid in resilience when detecting a keyword that is longer in duration than the default frontend spectrogram. (The spectrogram length is also an adjustable parameter, and it may be worth retraining a model after changing it.)

  • The extraction tool is probably centering most of the keywords roughly around the same stressed syllable (if you’re using it to preprocess your training data), so I would assume the model is looking for the same 1s subword rather than different portions of the whole >1s word.
  • Part of the standard augmentation pipeline includes random time shifts with a default of up to 100ms, so the tiny_conv model will be used to seeing non-centered keywords. (However, it pads with background noise, so the training data will diverge a bit from what the model sees in practice.)
  • The default post-processing window looks like it’s expecting a minimum count of 3 detections to average across. If each additional stride/slice includes the next 20ms of audio, then, loosely speaking, I believe the model score needs to be above the threshold within roughly a 60ms period (to be clear, in each case looking back at the last 1s of audio) to count as a new detection. That should be covered by the random time shifts when training, though one could try changing the parameters to RecognizeCommands.
  • As long as the spectral features of the keyword are distinct enough, the translational invariance of convolutional filters should be able to pick up on them.

All this is to say, I have found it tricky to guess whether the default settings are hurting or aiding accuracy without experimenting with changing these settings. But they are good defaults and should have at least some in-built resiliency to detecting longer words without adjustment.
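As a rough illustration of the post-processing described above, here is a simplified Python sketch of score averaging with a minimum count, a threshold, and a suppression period. It is not the actual RecognizeCommands code: the class name, the parameter values, and the example scores are assumptions chosen for illustration.

```python
from collections import deque


class SimpleRecognizer:
    """Toy smoother: average recent scores, then apply a threshold."""

    def __init__(self, labels, average_window_ms=1000, threshold=0.8,
                 suppression_ms=1500, minimum_count=3):
        self.labels = labels
        self.average_window_ms = average_window_ms
        self.threshold = threshold
        self.suppression_ms = suppression_ms
        self.minimum_count = minimum_count
        self.history = deque()          # (timestamp_ms, scores) pairs
        self.last_detection_ms = None

    def process(self, scores, now_ms):
        """Feed one inference result; return a label or None."""
        self.history.append((now_ms, scores))
        # Drop results that have fallen out of the averaging window.
        while self.history[0][0] < now_ms - self.average_window_ms:
            self.history.popleft()
        # Too few results to trust an average yet.
        if len(self.history) < self.minimum_count:
            return None
        count = len(self.history)
        averages = [sum(s[i] for _, s in self.history) / count
                    for i in range(len(self.labels))]
        best = max(range(len(self.labels)), key=averages.__getitem__)
        if averages[best] < self.threshold:
            return None
        # Ignore repeat triggers that follow a detection too closely.
        if (self.last_detection_ms is not None
                and now_ms - self.last_detection_ms < self.suppression_ms):
            return None
        self.last_detection_ms = now_ms
        return self.labels[best]


labels = ["silence", "unknown", "abracadabra"]
keyword_frame = [0.05, 0.05, 0.90]  # hypothetical per-class scores
quiet_frame = [0.80, 0.10, 0.10]

# Sustained high scores: detected once three results have accumulated.
sustained = SimpleRecognizer(labels)
sustained_results = [sustained.process(keyword_frame, t)
                     for t in (0, 20, 40, 60)]
print(sustained_results)  # [None, None, 'abracadabra', None]

# A single high-scoring frame is averaged away and never triggers.
spike = SimpleRecognizer(labels)
spike_results = [spike.process(frame, t) for frame, t in
                 [(quiet_frame, 0), (keyword_frame, 20), (quiet_frame, 40)]]
print(spike_results)  # [None, None, None]
```

With results arriving every 20ms, the first detection cannot happen before the third result, which matches the rough 60ms figure above. Changing `minimum_count`, `threshold`, or the averaging window shifts how long the score has to stay high.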


For what it is worth, I’m quite happy with the results I achieved, although I did choose a simplistic use case. I wanted to detect a two-syllable word when I speak it in my office. (In other words one speaker, one physical location).

I recorded the single keyword 25 times each with three different microphones (a Rode Podmic on an audio interface, a Jabra 710 speakerphone, and a Jabra headset). My thought was that the three different microphones would pick up different amounts of background noise (the Rode being dynamic and directional is very clean, the Jabra 710 can clearly hear my aquarium and computer fans in the background, and the Jabra headset is in between).

A few samples got tossed out for being too quiet, but most were included in the training set.

I have seen a few false triggers from background noise so far, and distance to the mic is limited, but if I hold it within about 1m of my head it triggers more than 9 times out of 10.


@mazumder, that is a great tour through parts of the code. Are you a teacher? You have the clarity of a good teacher.


Thanks! I helped with the data engineering section of TinyMLX.


Having gone through much of the rest of the course, I have a new working theory on the cause of my unsatisfactory results. I have realized that while I made an effort to include a variety of samples with varying background noise and sound profiles, I did not include any samples made in my actual work environment.

I think I fell into the trap of not fully considering my use case when I made my dataset. I’m going to try again soon, but for now I feel a little foolish. I think I trained it to work everywhere but where I want to use it! :slight_smile:

Still chasing my magic word. My results have improved dramatically in the last week. Having finished the course, I took the suggestion to try out Edge Impulse. This allowed me to capture data in my office, using the mic on the Nano 33 BLE Sense. Combined with my previous data and the model improvements provided by the Edge Impulse defaults, this data has made a huge difference in terms of reliability and robustness.

I found it interesting to note that my model accuracy before including this data came back at 91% but the real world performance was awful, while my accuracy as measured by Edge Impulse was 88% but the performance is very strong. As @vjreddi has said, it is the full pipeline that counts, not just the model.