How can I use a Mozilla Common Voice Dataset?

I found that this project has datasets in many languages, mine (Greek) included. I downloaded the available dataset and, as I suspected, it contains voice files, but there seem to be no per-word labels, since each file is a phrase rather than a single word.
My question is:
How can I (if I can) use this dataset to train a model? Is a preprocessing procedure necessary? Is there a trick of some kind, or is it out of my reach for now?
There was also another dataset with single words like "yes", "no" and numbers for many languages, but not for mine…
Thank you in advance!

@gkapsid, Is the file name the same as the phrase?

Great question. There are ways for us to extract words from phrases using forced alignment. “Forced alignment is a technique to take an orthographic transcription of an audio file and generate a time-aligned version using a pronunciation dictionary to look up phones for words.” Check out the link below, and tomorrow @mazumder will post code that can help with this…

https://montreal-forced-aligner.readthedocs.io/en/latest/introduction.html

Hi @gkapsid

We have a preprint available that looks into the potential of training a wakeword model from Common Voice data here. It also links to our codebase, although the code is not yet documented or cleaned up.

Briefly, the section in the MFA docs on “aligning using only the dataset” describes how to do per-word alignment on sentence-aligned data (like Common Voice). This will generate per-word timestamps. Our code takes these timestamps and a list of keywords and trains a keyword spotting model from them (which currently is not deployable on an MCU). We use the textgrid library and pysox to directly extract keywords from the output of MFA. If you want to experiment with models that can run on the Arduino, you may find it easier to use these libraries directly and feed the extracted keyword clips into the existing Colab speech_commands pipeline.
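To make the extraction step concrete, here is a minimal sketch of the idea, not our actual code. It assumes MFA has written one TextGrid per clip with a word tier named "words" (the tier name may differ depending on MFA version and speaker setup), and that the textgrid and sox (pysox) packages are installed. The function names, the padding value and the file layout are all hypothetical.

```python
def keyword_clips(word_intervals, keywords, pad=0.05, total_duration=None):
    """Select (word, start, end) windows for the requested keywords.

    word_intervals: iterable of (word, start_sec, end_sec) tuples.
    pad: seconds of context added on each side (illustrative default).
    total_duration: if given, clip windows are capped at this length.
    """
    wanted = {k.lower() for k in keywords}
    clips = []
    for word, start, end in word_intervals:
        if word.lower() in wanted:
            clip_start = max(0.0, start - pad)
            clip_end = end + pad
            if total_duration is not None:
                clip_end = min(total_duration, clip_end)
            clips.append((word.lower(), clip_start, clip_end))
    return clips


def load_word_intervals(textgrid_path, tier_name="words"):
    """Read per-word timestamps from a TextGrid file.

    Assumption: the word tier is called "words"; adjust tier_name if not.
    """
    import textgrid  # third-party, imported lazily so the rest runs without it

    tg = textgrid.TextGrid.fromFile(textgrid_path)
    tier = tg.getFirst(tier_name)
    # Skip empty marks, which represent silence between words.
    return [(iv.mark, iv.minTime, iv.maxTime) for iv in tier.intervals if iv.mark]


def cut_clip(src_audio, dst_audio, start, end):
    """Write the [start, end] window of src_audio to dst_audio using pysox."""
    import sox  # third-party (pysox); needs the SoX binary on the system

    tfm = sox.Transformer()
    tfm.trim(start, end)
    tfm.build(src_audio, dst_audio)
```

A typical use would be to call load_word_intervals on each TextGrid, pass the result to keyword_clips with your keyword list, and call cut_clip once per returned window, writing each clip into a folder named after its keyword so it matches the speech_commands directory layout.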

@mcktimo No, unfortunately not… The filenames are e.g. “common_voice_el_20429242.mp3” (“el” stands for Greek here). There are a few TSV files (dev, test, train, validated, invalidated, other and reported) in which each filename is associated with the corresponding phrase.
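For anyone following along, here is a rough sketch of how the filename-to-phrase mapping could be read from one of those TSV files and turned into per-clip transcript files, which is the kind of input a forced aligner such as MFA works from. It assumes the TSV has "path" and "sentence" columns and that the aligner accepts a .lab text file with the same base name as each audio clip; the function names are hypothetical, so check both assumptions against your download and the MFA docs.

```python
import csv
from pathlib import Path


def read_transcripts(tsv_path):
    """Return {clip filename: sentence} from a Common Voice TSV file.

    Assumption: the file has "path" and "sentence" columns.
    """
    transcripts = {}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps literal quote characters inside sentences intact.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            transcripts[row["path"]] = row["sentence"]
    return transcripts


def write_lab_files(transcripts, out_dir):
    """Write one <clip name>.lab transcript per clip and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, sentence in transcripts.items():
        lab_path = out_dir / (Path(filename).stem + ".lab")
        lab_path.write_text(sentence + "\n", encoding="utf-8")
        written.append(lab_path)
    return written
```

Calling read_transcripts on validated.tsv and passing the result to write_lab_files, with the audio clips placed in the same output folder, would give a corpus directory where every clip sits next to its transcript.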

vjreddi and mazumder thank you very much!
I’ll have a look at the suggested resources and study them. I believe that language limitations may become an opportunity.
Maybe techniques like this will become the subject of a future course!
