'open sesame' door latch: from KWS to SRE (speaker recognition)

Authentication by voice on a local device is worth exploring. Current keyword-spotting methods use FFT-based features. Maybe our code could be adapted to recognize a person by their voice in a small micro model.


  • where: for the door to my shop and the back door of my house
  • when: I am carrying something with both hands that I don’t want to put down
  • what: say ‘open sesame’ and have the latch release so I can push open the door

collect data:

  • It only needs to work for a few people and it is better if it doesn’t work for anyone else.
  • Maybe figure out a way to collect samples with a parallel system running at the door, recording the phrase and other talk and shipping it out to some more powerful server to collect and use for training.
  • Iterate models. Someday there will be enough data and the door will open.


  • Eventually it might work. Then it will need some intermediary layer. Maybe that layer won’t be suitable to work on the device.
  • Add person recognition layer, either visual or auditory.
  • Until that layer is compactly smart, it could do a long loop to the server. If the server doesn’t agree it could set off alarms.
  • long term goal: approval from @SecurityGuy


  • a $24, 12v door latch
  • a 120v,12v transformer
  • 3.3v relay

alternate approaches:

Perhaps it is possible to leverage an audio phrase database to train a person authentication model. Maybe forget ‘open sesame’ as a training phrase. Maybe use something really common, a bajillion recordings of ‘hello’ for instance. I would have to say hello a thousand times. Can the model learn which one is me?

If that works, how could I make it work on an uncommon word that will open my door? Maybe take all the audio data and reverse it in time, like playing the Beatles’ White Album backwards to hear ‘Paul is dead’. Instead of ‘hello’, ‘olleh’ opens my door. Train on that and build another model.
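A minimal sketch of the time-reversal idea, using only the Python standard library. It assumes the clips are mono, 16-bit PCM WAV files (the format I would expect from a typical speech dataset, but check yours); `reverse_samples` and `reverse_wav` are hypothetical helper names, not part of any existing tool.

```python
import array
import wave

def reverse_samples(samples):
    """Return a time-reversed copy of a sequence of 16-bit PCM samples."""
    return array.array("h", reversed(samples))

def reverse_wav(src_path, dst_path):
    """Write a time-reversed copy of a WAV file.

    Assumption: the input is mono, 16-bit PCM. Anything else is rejected
    rather than silently mangled.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("this sketch assumes mono, 16-bit PCM audio")
        frames = src.readframes(params.nframes)

    samples = array.array("h")
    samples.frombytes(frames)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(reverse_samples(samples).tobytes())
```

Running `reverse_wav` over every ‘hello’ clip, the dataset's and my own, would give an ‘olleh’ training set of the same size.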

request for team

This project seems right for a team. At the core it will require lots of experimentation and testing of models spanning the whole gamut of tinyML development as laid out by Vijay Janapa Reddi and the rest of the leaders and staff of the edX tinyML classes.

Team members could compare their selection of FFT variation, pre-processing, model architecture and inference implementation.

Team could compare and evaluate publicly available audio data or models that we could piggyback on.

Team members could generate and share voiceprint data and experiment on optimal phrase type for voice authentication.

Team could develop a hardware/software subsystem for collecting voice data and storing on a server accessible to the whole team.

Team could discuss, reach consensus or produce multiple iterations of an actual hardware/software implementation.

I am very excited and ready to go. Please consider working with me on this or some pivot of this project idea. What do you think?

references and discussion:

“Internet elders such as MIT’s David D. Clark argue that authentication – verification of the identity of a person or organization you are communicating with – should be part of the basic architecture of a new Internet…(Tim Berners Lee) proposed a simple authentication tool – a browser button labeled “Oh, yeah?” that would verify identities” Click "Oh yeah?" | MIT Technology Review

For the Internet of Things (IoT), authentication and ownership of data are key issues that must be faced now, before we have hundreds of microcontrollers in our homes sending data to unnamed servers and big tech, claiming the data as their own, invading our homes. Recently, in an Independent Activities Period (IAP) class at MIT, Tim Berners-Lee admitted to making a big mistake when he wrote HTML. There should have been authentication built in.

Recently I had to call Fidelity a lot since I was the executor of my mom’s will. I said a short phrase just a few times, over the phone and now I am authenticated as soon as they hear my voice. What can we do on a microcontroller?

door latch

Hi @mcktimo – thanks, I’d be happy to help out. Please do keep in mind that while I’ve been in cybersecurity for more than 25 years, I’m still new to ML.

From a security perspective, it is often helpful to consider identification, authentication, and authorization (IAA for short) as three discrete functions. At the risk of oversimplifying:

  • Identification – the user asserts an identity
  • Authentication – the user proves the asserted identity is theirs
  • Authorization – the decision on what to allow (or not allow) the user to do

Some systems allow two or more of these to be combined. For example, a proximity card reader usually only gets a serial number from the card, but based on that serial number the system makes both an identification and an authentication decision. The obvious implication is that if I steal your card, the system can’t tell the difference.

For a similar reason, banks put dollar limits on “tap” payments.

I’m not suggesting that using just voice to identify and authenticate an individual is necessarily bad, but depending on the value of assets involved, the threat landscape, and the accuracy of the system it may not provide an appropriate level of protection.

For example, I’d personally be very comfortable with it unlocking my interior office door, but maybe not the front door to my house.

On the other hand, if, hypothetically, a system used facial recognition to identify me, and then voice identification to authenticate me, it might well provide a higher level of security than the existing lock on my front door. (Of course there are a lot of variables at play here…type 1 and type 2 errors for both devices and how the threshold is set).
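To make the threshold trade-off concrete, here is a small sketch of how the two error types (rejecting the right person, accepting the wrong one) move as the acceptance threshold changes. The scores are made up for illustration, and `error_rates` is a hypothetical helper; it assumes the model outputs a similarity score where higher means "more likely the enrolled speaker".

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a threshold.

    A score >= threshold is treated as "accept".
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))

# Made-up scores, purely for illustration (not measured results).
genuine = [0.91, 0.84, 0.77, 0.62]    # enrolled speaker's attempts
impostor = [0.10, 0.35, 0.58, 0.70]   # everyone else's attempts

for threshold in (0.5, 0.65, 0.8):
    frr, far = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  FRR={frr:.2f}  FAR={far:.2f}")
```

Raising the threshold lowers false accepts at the cost of more false rejects, so a door lock would probably sit at the strict end and accept the occasional repeat of the phrase.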

Another thought might be to combine the voice authentication with something like BLE, where the physical presence of a phone plus the spoken words are used to trigger.

Also, I assume that you’re considering a solenoid-operated door latch with a fail-safe design (i.e. no impediment to egress in an emergency). I’m mostly a software guy, but please do some research on the electronic characteristics of solenoids and relays. They both give off nasty surges when the magnetic field is de-energized, so you might end up wanting some type of isolation (optoisolator?) to protect the microcontroller. People often use a diode across the relay to shunt the surge as well. I’m not sure if they are required (or included) for door latches.

Also, I recently started a YouTube channel and released a cybersecurity primer that you and others might find interesting: https://youtu.be/OSorBPde0JA


@SecurityGuy, it would be great to work with you. I am a bit of your opposite, a hacker who pays little attention to security. I could surely use a more measured approach.

I don’t expect we will develop an uncrackable sonogram model for identification and authentication but it would be fun to get a model to do a pretty good job.

What do you think the chances are for secure voice id/auth on a microcontroller? Are there any papers or articles on it in the security community?

Out of the box the ESP-EYE does some rudimentary TensorFlow face recognition. Maybe we could add that as a second layer (we could turn off its wifi :) and try using it for micro_speech as well, instead of the Nano 33.

The latch is pretty basic, replacing a normal door latch where the latch part swings open when 12 volts is applied. Normal apartment door stuff, the door still opens in an emergency as usual.

I’d be interested in working on this project. I am actually working on something with a similar front end right now. My word is “abracadabra” rather than “open sesame” but basically the same concept, though I was not working on a lock mechanism specifically.

By recording myself and building off of the speech_commands dataset I was able to get a deployed model to respond to my keyword, though the real world performance is still pretty rough. My current thought is that I did not use enough training data to get the performance I want (only 500 audio samples per word in my dataset rather than the 3500 in the standard speech_commands set). I’m planning to do some experimenting with synthetic data to see if I can improve the performance without having to record and process another 3000 (or more) samples of myself.
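One cheap form of synthetic data is augmenting the recordings already in hand. The sketch below is only an assumption about what such an experiment might look like, not the method actually used here: it applies a random time shift and adds low-level noise to a list of 16-bit PCM samples. The defaults assume 16 kHz audio (so 1600 samples is 100 ms), and all function names are hypothetical.

```python
import random

def time_shift(samples, max_shift, rng):
    """Shift the clip left or right by up to max_shift samples, zero-padded."""
    max_shift = min(max_shift, len(samples))
    shift = rng.randint(-max_shift, max_shift)
    if shift > 0:   # delay: pad the start, drop the tail
        return [0] * shift + list(samples[:len(samples) - shift])
    if shift < 0:   # advance: drop the start, pad the tail
        return list(samples[-shift:]) + [0] * (-shift)
    return list(samples)

def add_noise(samples, noise_amplitude, rng):
    """Add uniform noise and clip to the 16-bit PCM range."""
    noisy = []
    for s in samples:
        value = s + rng.randint(-noise_amplitude, noise_amplitude)
        noisy.append(max(-32768, min(32767, value)))
    return noisy

def augment(samples, copies, max_shift=1600, noise_amplitude=200, seed=0):
    """Return `copies` randomly shifted, noisy variants of one clip."""
    rng = random.Random(seed)
    return [add_noise(time_shift(samples, max_shift, rng), noise_amplitude, rng)
            for _ in range(copies)]

# Usage: turn one recording into six extra training variants.
clip = [1000] * 16000          # placeholder for one second of real audio
variants = augment(clip, copies=6)
```

With six variants per clip, 500 recordings would become 3500 training examples, though whether augmented copies help as much as fresh recordings is exactly what the experiment would have to show.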

I’ll let you know what I find out. Do you have ideas for how you would like contributions to be organized?

Thanks for suggesting this! :)

@stephenbaileymagic, I would love to work with you.

I wonder if we will have success without using lots of data. I think not.

What attracts me to this idea is that it is a different problem than speech recognition; in ours different people saying the same word should never both be recognized.

I was looking at the speech dataset and wondering whether we should forget about ‘opensesame’ or ‘abracadabra’ and maybe train on just one word with lots of different speakers + us saying that word.

After testing on that, maybe we could combine two word samples from the dataset and then train on that plus our own rendition of the two words.

I wanted to start with ‘backwards’ but that has only 1700 samples. Maybe best to start with a word that already has ~4000 samples.

For the first stage maybe we could organize by each trying one word. Then we could compare and keep the promising results. My hypothesis is that some words contain more individuating features than others.

We could then try network tweaks that might work better.

We might then mess with the model itself. Maybe besides the MFCC FFT, some pitch or cadence data might be good features to include.
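As a rough sketch of what a pitch feature could look like, here is a plain autocorrelation pitch estimator for a single voiced frame. This is a textbook technique rather than anything from the course code, the 60 to 400 Hz search range is my assumption for adult speech, and `estimate_pitch_hz` is a hypothetical name.

```python
import math

def estimate_pitch_hz(samples, sample_rate, min_hz=60.0, max_hz=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Returns None when no positive correlation peak is found (e.g. silence).
    """
    if len(samples) < 2:
        return None
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(samples) - 1)

    # Remove any DC offset so it does not dominate the correlation.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]

    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(centered[i] * centered[i + lag]
                    for i in range(len(centered) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Usage: a synthetic 200 Hz tone at 16 kHz stands in for a voiced frame.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(640)]
print(estimate_pitch_hz(tone, sample_rate))
```

A per-frame pitch track like this could be appended to the spectrogram features, though on the device it would add compute on top of the existing FFT front end.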

cc: @SecurityGuy

I think that sounds very sensible to start out with one word each and see how it goes. So I’m understanding that for right now we will each take one word from the speech dataset and try to train a model which can pick our voices correctly out of all the speakers of that word in the set.
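Here is a small sketch of how the labels for that first experiment might be set up. It relies on the speech_commands file naming, where the part before `_nohash_` identifies the speaker (e.g. `0a7c2a8d_nohash_0.wav`); the `me` prefix for our own recordings and the helper names are assumptions for illustration. Because one speaker is a tiny fraction of the set, it also computes inverse-frequency class weights.

```python
from collections import Counter

def speaker_id(filename):
    """Speaker prefix, e.g. '0a7c2a8d' from '0a7c2a8d_nohash_0.wav'."""
    return filename.split("_nohash_")[0]

def label_clips(filenames, target_speakers):
    """Label each clip 1 if spoken by a target speaker, else 0."""
    return [(name, 1 if speaker_id(name) in target_speakers else 0)
            for name in filenames]

def class_weights(labeled):
    """Inverse-frequency weights so the rare target class is not swamped."""
    counts = Counter(label for _, label in labeled)
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# Usage with made-up file names: one clip of "me", three of other speakers.
files = ["me_nohash_0.wav", "0a7c2a8d_nohash_0.wav",
         "0a7c2a8d_nohash_1.wav", "1b4c9b89_nohash_0.wav"]
labeled = label_clips(files, target_speakers={"me"})
weights = class_weights(labeled)
```

The weights dictionary is in the shape that many training APIs accept for per-class weighting, but how it plugs into the micro_speech training script is something we would still have to work out.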

Do you have a preference for what word you want? I think your hypothesis that some words will have more individuating features is very interesting. I’m looking forward to digging in to this :)

Hey @stephenbaileymagic and @SecurityGuy,

What kind of results are you getting in the tinyML 3 course? Mine are awful. I sent this to the course forum:

I kept trying to proceed with the course…but everything works so poorly I came back here

Tested the microphone… it works

Ran the built-in yes/no example for micro_speech, changing nothing. Tried lots of variations on where I kept the device, how close to mic, how loud etc.

The results are so bad that I think that putting a lot of work into this course is a waste. What is your experience?

Those results come from setting is_new_command to true. Otherwise I say yes/no and get no response most of the time. I need to get this straightened out before I invest a lot of time. Maybe I have a bad board.

This is my experience:

I have trained and deployed models for the micro_speech example code for yes/no, stop/go, and on/off. These models and the example code are all using defaults except as needed by the different words. Each of these models got about 90% accuracy on the test data as I was converting to TensorflowLite and TFLiteMicro.

In real world performance as deployed on device, I got similar results to yours at first. I spent about a week vocalizing “yes” and “no” in different ways and I can now get fairly reliable results by saying the words in the way I know the device understands.

I am getting improved results on the keywords “off” and “go” with continued practice. I am getting very little from “stop” and “on” so far.

These results are without me doing any optimizations to any of the provided examples or data. I believe that I will need to do a lot of tweaks to the whole pipeline to improve real world performance.

I haven’t gotten too far in to Course 3 yet so there may be optimizations included which I don’t know about yet. I intend to circle back after I finish the course and spend the time on customizing these models and examples to make them fit me and my circumstances better.

This is where I am at so far.

@mcktimo are you still having trouble with the Nano 33 BLE Sense???

Yes. Inside Colab the model is great; on the device, not so much. I might try playing speech data into the Nano’s mic.

I can find the colab to test a tflite model. I can’t find the colab where we took a tfmicro.cc model and tested that inside colab. Do you have a copy of that .ipynb ?

I’m a bit behind both of you on the course due to conflicting priorities. I hope to get time to work on the course over the weekend.

Hi @mcktimo
I don’t know if there is a colab for testing TFLiteMicro models specifically. This is the colab I have been using which automates training a speech model and converting all the way down to TFLiteMicro. It tests the TFLite model performance, but in looking at it I realized it doesn’t test the micro model after converting.

Train_Micro_Speech Colab