Hello all,
I’m not sure whether this is the best place to post this, as it’s not yet a “project”, more of a “project idea”. I’m curious what others think about its feasibility, and the idea itself. It’s more of a subject for an open discussion.
Here it is:
Sometimes when one is driving a car in traffic, there is an emergency vehicle somewhere nearby and its siren can be heard. Especially in heavy traffic, it is often unclear where the sound is coming from. My idea would be:
Step 1) identify that there is a siren sound and display that information to the driver (the idea is that the device will hear and identify the signal before the driver does, so it acts as a “heads up”)
Step 2) identify the direction the sound is coming from and display that too: this would help the driver move out of the emergency vehicle’s path when it is coming from behind us, or avoid getting in its way when it is approaching the same intersection we are (i.e. stopping before entering the intersection, etc.)
Step 2) would require a microphone array and seems very complicated, so the focus is on identifying the sound signal itself first.
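For what it’s worth, the direction part is a known problem: with as few as two microphones one can estimate the time difference of arrival via cross-correlation, e.g. the GCC-PHAT method, and convert the delay into a bearing. A rough sketch, with a made-up sample rate and a simulated 5-sample inter-mic delay standing in for a real stereo capture:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of `sig` relative to `ref` (in seconds)
    using the GCC-PHAT cross-correlation method."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Simulated stereo capture: the right mic hears the source 5 samples later
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)               # broadband stand-in for a siren
mic_left = src
mic_right = np.concatenate((np.zeros(5), src[:-5]))
tau = gcc_phat(mic_right, mic_left, fs)     # ~3.1e-4 s, i.e. 5 samples
# bearing (two-mic case): angle = arcsin(tau * 343.0 / mic_spacing)
```

Note that two mics only resolve a left/right angle (with a front/back ambiguity); a proper array would be needed for a full 360° bearing, which is indeed the complicated part.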
It seems relatively difficult, for multiple reasons:
- a) various countries and various emergency services may have different signals, and it’s not easy to find out the specs.
- b) the siren sound is by its very nature quite variable due to the Doppler effect: the perceived pitch is shifted up while the source approaches and drops as it passes and moves away. With a siren in traffic this can be heard very clearly.
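To get a feel for how big that effect is: for a moving source and a stationary listener, the observed frequency is f·c/(c − v) while approaching and f·c/(c + v) while receding. A quick back-of-the-envelope check (the speed is just an example) shows a detector has to tolerate roughly ±6% frequency scaling at typical road speeds:

```python
C = 343.0  # speed of sound in air at ~20 °C, m/s

def doppler(f_source, v_source):
    """Observed frequency for a stationary listener.
    v_source > 0 means the source moves toward the listener."""
    return f_source * C / (C - v_source)

v = 70 / 3.6                       # 70 km/h in m/s
print(round(doppler(1000.0, v)))   # approaching: 1060 Hz
print(round(doppler(1000.0, -v)))  # receding: 946 Hz
```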
As for point a), I was able to find some information that might be valid across countries: there are three common signals, called “hi-lo”, “wail” and “yelp”. “Hi-lo” seems to alternate between two tones of 950 and 1150 Hz, i.e. the frequency is modulated by a square wave.
“Yelp” sweeps between 500 and 2000 Hz, frequency modulated by a low-frequency triangular wave, and “wail” sweeps from 1800 Hz down to 600 Hz, driven by a sawtooth LFO. At least that’s what I got from that description. The modulation rates are low; the longest cycle seems to be the “wail”, which takes on the order of 8 seconds to go all the way down before starting over, while the “yelp” repeats several times per second.
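Synthesizing these three patterns is plain FM synthesis: build an instantaneous-frequency curve and integrate it into a phase. A sketch using the parameters above (the exact sweep rates are my guesses, not from any spec):

```python
import numpy as np

FS = 16000  # sample rate, Hz

def _fm(inst_freq):
    """Turn an instantaneous-frequency curve into audio (phase integration)."""
    return np.sin(2 * np.pi * np.cumsum(inst_freq) / FS)

def hi_lo(seconds, rate=1.0):
    """Square-wave FM: alternate between 950 and 1150 Hz."""
    t = np.arange(int(seconds * FS)) / FS
    return _fm(np.where((t * rate) % 1.0 < 0.5, 1150.0, 950.0))

def yelp(seconds, rate=3.0):
    """Fast triangular sweep between 500 and 2000 Hz."""
    t = np.arange(int(seconds * FS)) / FS
    tri = 2.0 * np.abs((t * rate) % 1.0 - 0.5)      # 1 -> 0 -> 1 per cycle
    return _fm(500.0 + 1500.0 * (1.0 - tri))

def wail(seconds, period=8.0):
    """Slow sawtooth sweep from 1800 Hz down to 600 Hz."""
    t = np.arange(int(seconds * FS)) / FS
    return _fm(1800.0 - 1200.0 * ((t / period) % 1.0))

sig = wail(8.0)    # one full wail cycle, ready to mix with traffic noise
```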
The ML model would need to identify this kind of wavy pattern when it appears in the overall spectrum captured by the microphone, instead of looking for specific fixed frequencies. The sweeps might create a distinct pattern in the spectrogram, but I haven’t verified that yet.
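Whether the sweep shows up as a clean ridge in the spectrogram can be checked directly with a synthetic sweep (the window and hop sizes here are arbitrary starting points, not tuned values):

```python
import numpy as np
from scipy import signal

FS = 16000
t = np.arange(4 * FS) / FS
# Synthetic "wail": sawtooth sweep from 1800 Hz down to 600 Hz every 2 s
inst_freq = 1800.0 - 1200.0 * ((t / 2.0) % 1.0)
x = np.sin(2 * np.pi * np.cumsum(inst_freq) / FS)

# Longer windows give finer frequency resolution; the hop sets time resolution
f, frames, Sxx = signal.spectrogram(x, fs=FS, nperseg=1024, noverlap=768)

# The sweep appears as a moving ridge: strongest frequency bin per frame
ridge = f[np.argmax(Sxx, axis=0)]
print(ridge.min(), ridge.max())    # roughly spans 600 to 1800 Hz
```

Tracking that ridge over a few seconds (does it oscillate with a siren-like period?) could even serve as a non-ML baseline to compare a trained model against.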
The dataset: I believe the siren part would need to be synthesized. Collecting a vast number of traffic sound samples is one thing; getting many samples of a siren approaching and receding from different directions is quite another. But the siren could be synthesized programmatically or electronically and then mixed with the traffic noise samples.
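The mixing step itself is simple: scale the synthetic siren to a target signal-to-noise ratio relative to the traffic recording, then sum. A sketch with placeholder signals (the SNR range is my guess at what training would need):

```python
import numpy as np

def mix_at_snr(siren, traffic, snr_db):
    """Scale `siren` so its power is `snr_db` dB relative to `traffic`,
    sum the two, and peak-normalize the result."""
    gain = np.sqrt(np.mean(traffic ** 2) / np.mean(siren ** 2)
                   * 10.0 ** (snr_db / 10.0))
    mixed = traffic + gain * siren
    return mixed / np.max(np.abs(mixed))    # avoid clipping

rng = np.random.default_rng(1)
traffic = rng.standard_normal(16000)   # stand-in for a real traffic recording
siren = np.sin(2 * np.pi * 1000.0 * np.arange(16000) / 16000)
dataset = [mix_at_snr(siren, traffic, snr) for snr in (-10, -5, 0, 5)]
```

Sweeping the SNR (including well below 0 dB, where the siren is buried in traffic noise) is what would make the trained model useful as an early warning rather than a confirmation of what the driver already hears.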
I’m not sure how feasible that is, or how useful: given the long time window needed to capture a full siren cycle, the warning might come too late to help. And the model could turn out either inaccurate, or too big to run on a small in-car device.
What do you all think?