Micro speech example from TensorFlow on an ESP32

Is there a good place to ask technical/implementation questions related to TinyML (e.g. something like the programming questions forum of Arduino, https://forum.arduino.cc/index.php?board=4.0) to get help troubleshooting TinyML-specific implementation issues?

Would it make sense for this Discourse to have a similar type of category?

As an example of the sort of question I was looking to ask:
I’ve been working on a modified version of the micro speech example from TensorFlow on an ESP32 and seem to be hitting an issue where the input tensor is not being initialized correctly. I’m essentially hitting a null pointer when trying to copy captured audio into input->data.int8. I don’t fully understand the internals, but I suspect the initialization is failing because of how the model is loaded. I’m not sure, though; it could also be down to space constraints or something else, since I’ve extended my setup to allow OTA updates, etc.
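Roughly, the failing step looks like this (a minimal sketch; g_audio_buffer is a stand-in for my capture buffer, not the actual variable name):

```cpp
TfLiteTensor* input = interpreter->input(0);
// This is where the null pointer shows up: input->data.int8 doesn't
// point at a valid buffer when the copy runs.
if (input == nullptr || input->data.int8 == nullptr) {
  return;  // bail out instead of crashing
}
memcpy(input->data.int8, g_audio_buffer, input->bytes);
```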

Hi @gameldar,

Awesome idea! Thanks for the suggestion. Someone else asked exactly the same question as well.

If you all think it makes sense, we should have it. So I created a new forum category:

And fyi, I relocated your post to the new category. Cheers!


@gameldar do you have a GitHub repo where we can take a look at the code?

I’ve worked out what the issue was! To understand the code better I typed it out rather than copy/pasting from the micro_speech example… unfortunately that meant I somehow missed the step of allocating the tensors after initializing the interpreter, so the data buffer was never allocated.
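For anyone else who hits this: the missing call sits between constructing the interpreter and fetching the input tensor. Sketched from the TF 2.4-era micro_speech example (model, micro_op_resolver, tensor_arena, kTensorArenaSize and error_reporter all come from the surrounding setup in main_functions.cc):

```cpp
static tflite::MicroInterpreter static_interpreter(
    model, micro_op_resolver, tensor_arena, kTensorArenaSize, error_reporter);
interpreter = &static_interpreter;

// The step I skipped: AllocateTensors() carves the tensor buffers out of
// the arena, so input->data.int8 only becomes valid after it succeeds.
if (interpreter->AllocateTensors() != kTfLiteOk) {
  TF_LITE_REPORT_ERROR(error_reporter, "AllocateTensors() failed");
  return;
}

TfLiteTensor* input = interpreter->input(0);
```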

So I’m making progress now, except I’ve hit an issue with my version of the model: I’ve somehow ended up with a CONV_2D operator instead of DEPTHWISE_CONV_2D. I think I need to look at how I trained the model (I thought I had done it the same way as the micro_speech example, but I may have mixed up a parameter to the train.py script).
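In case it helps with debugging the same thing, here’s a sketch of how the ops baked into the model can be listed straight from the flatbuffer (g_model is the byte-array name the example uses; note that builtin_code() glosses over models written with the older schema):

```cpp
#include <cstdint>
#include <cstdio>

#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model[];  // the converted model array

void PrintModelOps() {
  const tflite::Model* model = tflite::GetModel(g_model);
  const auto* op_codes = model->operator_codes();
  for (uint32_t i = 0; i < op_codes->size(); ++i) {
    const auto* op = op_codes->Get(i);
    // EnumNameBuiltinOperator maps the enum to e.g. "DEPTHWISE_CONV_2D".
    printf("op %u: %s (version %ld)\n", i,
           tflite::EnumNameBuiltinOperator(op->builtin_code()),
           static_cast<long>(op->version()));
  }
}
```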

Just to explain what I’m trying to do (apart from learning how this all works!): I’m working off the micro_speech example as a basis, but I’ve created a new model using recordings I made (from the ESP32) of my pool pump running at different speeds, with the intention of being able to detect remotely what speed the pump is running at. My long-term aim is to use an autoencoder for anomaly detection so I can tell if my pump is having issues, as well as report the running level, and to integrate it with Home Assistant so it can let me know if I’ve left the pump on. But I started with just the level detection, since the keyword spotting example matches my data format more closely than the heart rate anomaly detection one.


So I’ve managed to fix this issue… but only by doing the training in a Google Colab instead of running it locally. As far as I can see, the only difference is that I’m running TensorFlow 2 (2.4.1) locally versus TensorFlow 1 on the Colab. Does this sound like a plausible reason for the difference?

Cool. Thanks for sharing. So are you able to run the model on device now?

So just to confirm, you still have a DEPTHWISE_CONV_2D? And you are still using the TinyConv model, right? The one we discuss in Course 2 (not sure if you are taking it or not, but just using that as a reference).

Yes, the training I was running locally is the version from Course 2 (https://github.com/tinyMLx/colabs/blob/master/3-5-18-TrainingKeywordSpotting.ipynb). The working version is actually the one from the TensorFlow micro_speech example (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/micro_speech/train), but as far as I could see they are identical in practice.

So in both cases I have:

```python
PREPROCESS = 'micro'
WINDOW_STRIDE = 20
MODEL_ARCHITECTURE = 'tiny_conv'
QUANT_INPUT_MIN = 0.0
QUANT_INPUT_MAX = 26.0
QUANT_INPUT_RANGE = QUANT_INPUT_MAX - QUANT_INPUT_MIN

SAMPLE_RATE = 16000
CLIP_DURATION_MS = 1000
WINDOW_SIZE_MS = 30.0
FEATURE_BIN_COUNT = 40
BACKGROUND_FREQUENCY = 0.8
BACKGROUND_VOLUME_RANGE = 0.1
TIME_SHIFT_MS = 100.0
```
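One quick sanity check: those settings imply the same input tensor size as the stock micro_speech model, which can be mirrored on the C++ side (the constant names below are mine, but kFeatureSliceCount and kFeatureElementCount mirror micro_model_settings.h in the example):

```cpp
constexpr int kClipDurationMs = 1000;
constexpr int kWindowSizeMs = 30;
constexpr int kWindowStrideMs = 20;
constexpr int kFeatureBinCount = 40;
// 1 + (1000 - 30) / 20 = 49 feature slices of 40 bins each.
constexpr int kFeatureSliceCount =
    (kClipDurationMs - kWindowSizeMs) / kWindowStrideMs + 1;
constexpr int kFeatureElementCount = kFeatureSliceCount * kFeatureBinCount;
static_assert(kFeatureElementCount == 1960,
              "expected 49 * 40 = 1960 int8 input values");
```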

The version built with TensorFlow 2.4.1 results in the following error when trying to allocate the tensors:

Didn't find op for builtin opcode 'CONV_2D' version '3'. An older version of this builtin might be supported. Are you using an old TFLite binary with a newer model?

When I switched to using the version of the model from the micro_speech source, it got further. At the time I was printing out the layers and could see DEPTHWISE_CONV_2D when that version of the model was loaded, compared to a plain CONV_2D with my locally trained version.
So I uploaded my sound samples to a Colab, reran the training there, and the resulting model worked.
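My understanding from reading the example source is that micro_speech doesn’t link in every kernel; it registers only the handful of ops tiny_conv needs, so a model containing CONV_2D (or a newer version of an op than the linked kernels support) fails when tensors are allocated. From the TF 2.4-era example, roughly:

```cpp
// Only these four ops are registered; anything else in the model shows
// up as "Didn't find op for builtin opcode ...".
static tflite::MicroMutableOpResolver<4> micro_op_resolver(error_reporter);
micro_op_resolver.AddDepthwiseConv2D();
micro_op_resolver.AddFullyConnected();
micro_op_resolver.AddSoftmax();
micro_op_resolver.AddReshape();
// A CONV_2D model would also need micro_op_resolver.AddConv2D(), with a
// TFLM build recent enough to support that version of the op.
```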

I’ve now got the model running on my ESP32, but I haven’t managed to confirm it works yet (the pump is outside my kids’ bedrooms and it is late at night!). All I’m getting is the ‘None’ response (which I should probably roll into the silence examples from the speech commands set).

I’ve uploaded the working case to my GitHub, but it is ugly and isn’t a good example to follow at this point:

NOTE: you won’t be able to build this without having set up TensorFlow to work with PlatformIO.
