Hi,
This post discusses the evolution of my earlier custom keyword spotting efforts and the results of my first dataset engineering attempts.
Previously, I trained a model to recognize my own keyword (“abracadabra”) and deployed it to the Arduino Nano 33 BLE Sense. Once on device, the model was able to recognize my keyword and respond, though the performance was not at a level suitable for any real application.
I achieved this by recording and processing 500 samples of myself saying the keyword in a variety of settings and with a variety of background noise. I then took 500 samples of each of the words in the official speech_commands dataset, combined everything, and trained using the train_micro_speech colab.
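For anyone curious about how the data is laid out, the colab expects the speech_commands convention of one folder per word containing 1-second, 16 kHz mono WAV files. The snippet below is only a rough sketch of how a combined folder could be assembled, not the exact script I ran; the paths and the 500-sample count are placeholders you would adjust for your own setup.

```python
# Sketch: assemble a combined dataset in the speech_commands folder layout.
# Assumes "my_keyword_recordings" holds my own 1-second 16 kHz WAVs and
# "speech_commands" is the unpacked official dataset (paths are placeholders).
import os
import random
import shutil

SAMPLES_PER_WORD = 500
src_root = "speech_commands"
dst_root = "combined_dataset"

# Copy my own keyword recordings into their own word folder
shutil.copytree("my_keyword_recordings", os.path.join(dst_root, "abracadabra"))

# Take a random subset of every other word in the official dataset
for word in os.listdir(src_root):
    word_dir = os.path.join(src_root, word)
    if not os.path.isdir(word_dir) or word.startswith("_"):
        continue  # skip loose files and the _background_noise_ folder
    wavs = [f for f in os.listdir(word_dir) if f.endswith(".wav")]
    picked = random.sample(wavs, min(SAMPLES_PER_WORD, len(wavs)))
    os.makedirs(os.path.join(dst_root, word), exist_ok=True)
    for f in picked:
        shutil.copy(os.path.join(word_dir, f), os.path.join(dst_root, word, f))
```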
By default, this colab trains on the full speech_commands dataset using the words “yes” and “no” and produces a TFLite model with an accuracy of about 91%. The model I trained with my keyword and the subset of the dataset also reached 91% accuracy, so I was happy with that.
In real-world performance, though, I thought I could do better. So I took the samples in my dataset and created synthetic data by processing each sample through three different filters in the Audacity audio editing software. These filters were distortion, pitch shift, and reverb, and they were applied independently, so my output took the form sample1, sample1_distortion, sample1_reverb, sample1_pitch_shift, etc. In the end, my dataset grew to 2,000 samples per keyword.
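If anyone would rather script this step than click through Audacity, an equivalent augmentation pass could look roughly like the sketch below. This is an illustration, not the exact processing I ran: it assumes the librosa and soundfile packages, approximates distortion with soft clipping and reverb with a synthetic impulse response, and the folder names are placeholders.

```python
# Sketch: batch-augment each keyword WAV with distortion, pitch shift, and reverb,
# writing outputs named like sample1_distortion.wav, sample1_reverb.wav, etc.
import os
import glob
import numpy as np
import librosa
import soundfile as sf

SR = 16000  # speech_commands clips are 1-second, 16 kHz mono


def distort(y, gain=8.0):
    # Soft-clip the waveform to approximate a distortion effect
    return np.tanh(gain * y)


def pitch_shift(y, n_steps=2.0):
    # Shift pitch by n_steps semitones without changing the clip length
    return librosa.effects.pitch_shift(y, sr=SR, n_steps=n_steps)


def reverb(y, decay=0.4, ir_len=4000):
    # Convolve with a short decaying-noise impulse response as a simple reverb
    ir = np.random.randn(ir_len) * np.exp(-np.linspace(0, 8, ir_len)) * decay
    wet = np.convolve(y, ir)[: len(y)]
    mix = y + wet
    return mix / max(1.0, np.max(np.abs(mix)))  # avoid clipping on write


def augment_folder(in_dir, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(in_dir, "*.wav")):
        y, _ = librosa.load(path, sr=SR, mono=True)
        stem = os.path.splitext(os.path.basename(path))[0]
        sf.write(os.path.join(out_dir, f"{stem}.wav"), y, SR)
        sf.write(os.path.join(out_dir, f"{stem}_distortion.wav"), distort(y), SR)
        sf.write(os.path.join(out_dir, f"{stem}_pitch_shift.wav"), pitch_shift(y), SR)
        sf.write(os.path.join(out_dir, f"{stem}_reverb.wav"), reverb(y), SR)


augment_folder("dataset/abracadabra", "dataset_augmented/abracadabra")
```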
I also added some of my own recordings to the background noise folder to include sounds that are typical in my space and, hopefully, cut down on false positives.
After training on my new dataset, the resulting model achieved 92% accuracy on my test data. That is only a modest gain, but at least there was no loss, so I took it as a good sign.
However, once the model was deployed to my device, its performance fell off a cliff. The new model performed worse than the default model and worse than my previous model trained on fewer samples. I don’t have any metrics to quantify this; it is just my perception of the performance.
So overall this experience taught me a couple of important things:
- Bench performance is not always the best indicator of what will happen in real life.
- Manipulating your dataset can significantly change performance, even when all other variables are held the same.
I know both of these facts were covered in the course, but experiencing them first-hand really cements them in my mind.
I really enjoyed this experiment and plan to keep working on it to see what I can achieve. I hope this post inspires others to experiment and share their results. I am happy to go into more detail if people are interested.