Latest front end in speech-to-text?

I’ve read up on old threads about speech-to-text, but nothing really fits what I’m looking for.

I would like a device that triggers on a hotword, samples audio until the talker stops speaking, then sends that audio file to openHAB, where it is converted to text and passed to a natural language processor. It doesn’t HAVE to be offline, but that would be nice.

My biggest wish is to have localization to establish context for the command. I imagine the hotword message could include something like an RSSI value, and the device reporting the strongest RSSI would provide the context for the command. Ideally, I would have ten or so audio devices plugged in at various locations around the house: one in the kitchen and one in the dining room, for example, which are not separated by a wall. A request to turn on the fan would then know whether to turn on the kitchen fan or the dining room fan based on the RSSI.
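
To make the idea concrete, here is a rough Java sketch of what I mean, with purely hypothetical names (this isn’t from any existing project): every device that hears the hotword reports a strength value, and the strongest report wins as the room context.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ContextResolver {

    /** A hotword detection reported by one device; all names are made up. */
    public record HotwordReport(String deviceId, String room, int rssi) { }

    /** Picks the room of the device that heard the hotword the strongest. */
    public Optional<String> resolveRoom(List<HotwordReport> reports) {
        return reports.stream()
                .max(Comparator.comparingInt(HotwordReport::rssi))
                .map(HotwordReport::room);
    }
}
```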

I’ve found the Gravity modules and the modules that are trained on a small set of commands. I’d like something more generic, where the audio is passed to a real computer for text processing. I think this would work well with my semantic model.

I already have six or so Home Minis, and I like them, but the last time I tried, integrating them was difficult, and I don’t think they have an RSSI concept.

It looks like the ESP32-S3 with the audio front end and voice framework gets me most of the way there. I would need to build on that to add the RSSI and transmit the audio stream. A small box that plugs into an outlet, under $50 or so, would be ideal, since this gets expensive when adding multiple devices throughout the house.

So, my question is:
Are there any projects or products out there similar to this that let me determine location when multiple devices hear the hotword?

That functionality is mostly covered by the core in the org.openhab.core.voice bundle. There you can find the interfaces for the different voice services (keyword spotting, speech-to-text, text-to-speech, and the human language interpreter), plus the DialogProcessor class that connects an AudioSink and AudioSource implementation to those services and orchestrates the voice command execution. So the only things you need to implement are the audio sink and source.
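
For example, a minimal sketch of an AudioSource for a remote speaker could look something like this (method names are from my memory of org.openhab.core.audio, so double-check against the current interfaces; the openRemoteStream helper is just a placeholder for the real transport):

```java
import java.util.Locale;
import java.util.Set;

import org.openhab.core.audio.AudioException;
import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioSource;
import org.openhab.core.audio.AudioStream;

/**
 * Sketch of an AudioSource that exposes audio received from a remote
 * speaker device. The actual transport (e.g. a websocket) is hidden
 * behind the placeholder openRemoteStream() method.
 */
public class RemoteSpeakerAudioSource implements AudioSource {

    // 16 kHz, 16-bit, little-endian PCM, which is typical for voice pipelines
    private static final AudioFormat PCM_16KHZ = new AudioFormat(
            AudioFormat.CONTAINER_WAVE, AudioFormat.CODEC_PCM_SIGNED,
            false, 16, 16 * 16000, 16000L);

    @Override
    public String getId() {
        return "remotespeaker";
    }

    @Override
    public String getLabel(Locale locale) {
        return "Remote Speaker Microphone";
    }

    @Override
    public Set<AudioFormat> getSupportedFormats() {
        return Set.of(PCM_16KHZ);
    }

    @Override
    public AudioStream getInputStream(AudioFormat format) throws AudioException {
        // Wrap the bytes arriving over the network connection into an
        // AudioStream of the requested format.
        return openRemoteStream(format);
    }

    private AudioStream openRemoteStream(AudioFormat format) throws AudioException {
        throw new AudioException("transport not implemented in this sketch");
    }
}
```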

That is more or less what I did in the habspeaker add-on, where I implemented a sink and a source over a websocket connection. If you are going to write an audio client for an embedded device and a websocket connection seems like a good option to you, maybe we can integrate it into the same add-on or share some codebase. Let me know if that is aligned with your plan.

If I understand your implementation correctly, habspeaker runs in the browser and triggers either on the hotword or on a button click. Also, my understanding is that the audio stream is not persistent.

I was thinking of a device that triggers on the hotword, so that part would need to be at the edge, and then sends the audio file to OH for processing. Or were you considering that the edge device sends an audio stream to OH 100% of the time, and the STT services there trigger on the hotword, and so on? If a persistent stream is sent, how does that scale on the processor side? I suppose rudimentary voice activity detection could take a big load off the voice processor by only sending the stream when a voice is detected.
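
To sketch what I mean by that last part (in Java just for illustration, not what would actually run on the device): forward a frame of PCM samples only when its RMS energy is above a threshold. A proper VAD would do much better than this; it’s only to show how a cheap check could keep silent audio off the network.

```java
/**
 * Very rough energy-based "voice present" gate: forward a frame of
 * 16-bit PCM samples only when its RMS level exceeds a threshold.
 */
public class EnergyGate {

    private final double rmsThreshold;

    public EnergyGate(double rmsThreshold) {
        this.rmsThreshold = rmsThreshold;
    }

    /** Returns true if the frame is loud enough to be worth streaming. */
    public boolean shouldForward(short[] frame) {
        double sumSquares = 0;
        for (short s : frame) {
            sumSquares += (double) s * s;
        }
        double rms = Math.sqrt(sumSquares / frame.length);
        return rms >= rmsThreshold;
    }
}
```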

I’m guessing that you’ve seen Willow - Open Source Echo/Google Home Quality Speech Hardware for $50


I have not!! Going to check this out, but it looks like what I need. Thank you.

It has different modes depending on the configuration you apply to the speaker. You can choose to run keyword spotting on the server (in that case the mic audio is streamed continuously), run the keyword spotting in the browser, or run without keyword spotting and just trigger the dialog on a click, which is the default.

The DialogProcessor in openHAB is designed to run the keyword spotting on the server, streaming the audio continuously from the microphone; I did a little workaround to allow the other modes in habspeaker. I also think continuous streaming is not the best idea if you want many speakers, but at the same time it makes it easy to integrate the “dialog execution” on multiple devices. I think the ideal would be to have both options properly supported so people can use the one that better fits their needs or their device.

Adding an AudioSource implementation which creates the AudioStream from a file is possible. Personally, I think using realtime audio sinks/sources instead of relying on files is a better idea, not just to solve this specific problem but because you can use them to implement calls or music streaming more easily, so I find them more interesting.
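
If you do go the file route, a minimal sketch could just wrap the file in an AudioStream, something like the following (the core may already ship a file-backed AudioStream helper, so check what exists before rolling your own):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.openhab.core.audio.AudioFormat;
import org.openhab.core.audio.AudioStream;

/**
 * Sketch of wrapping a recorded audio file in an AudioStream so it can
 * be handed to the voice services.
 */
public class RecordedFileAudioStream extends AudioStream {

    private final InputStream delegate;
    private final AudioFormat format;

    public RecordedFileAudioStream(File wavFile, AudioFormat format) throws IOException {
        this.delegate = new FileInputStream(wavFile);
        this.format = format;
    }

    @Override
    public AudioFormat getFormat() {
        return format;
    }

    @Override
    public int read() throws IOException {
        return delegate.read();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```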

Yeah, “file” was a bad choice of words. Websocket or stream was my intent.