Speech-to-Text via multiple audio streams

rakkur · March 26, 2021, 2:25pm

My dream here is for a way to listen to many audio streams that are constantly on and perpetually listening throughout the house. I’d like to issue a voice directive to the house from wherever I am, have it acknowledge my command, and execute it via rules.

I’ve searched through the community and documentation but to no avail. I’ve found on the audio & voice page:

openHAB has a very modular architecture that enables many different use cases. At its core, there is the notion of an audio stream. Audio streams are provided by audio sources and consumed by audio sinks.

Which is basically what I’m looking for, but the implementation seems to be lacking currently. On the Multimedia page, I found:

Although there are already interfaces defined in openHAB for speech-to-text, up to now there is no add-on available for this functionality. So the only choice that is available right now is to use the Android voice recognition feature that is built into the openHAB Android app.

I don’t really see much else of it in the community forum besides how to implement a listener via the Android app.

I’m a software developer with 10+ years of experience and would love to contribute to this if it’s something we’re exploring, but do we have any knowledge of where we are with this concept?

Thank you all for this exceptional community and software; we have loved building our smart home over the past year with it!

Andrew_Rowe · March 27, 2021, 2:18pm

Hi Eric and welcome to the OpenHAB community!!!
I’m not 100% sure I understand exactly what it is that you are asking. I think you are asking several different questions and also offering to pitch in, which is appreciated.
One question

Most of us use Alexa (Amazon Echo) to accomplish this. I think Google Home has the same functionality, but I don’t own one. The main issue most people have with this approach is that it requires an internet connection and runs in the cloud (some have security concerns). I think there is a non-cloud alternative you install on your own system (Mycroft?), but I know nothing about it either.
Anyhow, for me ‘Alexa, play Pandora’ works.
Can be automated from rules as well.

Next question

Basically, audio streams can come from anywhere. They can be something you stream off the internet, or something you stream yourself internally, such as from a Squeezebox server (free software).
An audio sink is anything that can play sound. OpenHAB recognizes anything that can play audio as an audio sink. It can be a device (using a binding) such as a Sonos speaker. It can be the system audio on the computer that OpenHAB runs on.

Personally, I’ve always been a little disappointed with the audio stuff, although I’ve eventually figured out a way to do what I want. I use the PulseAudio binding, which allows me to control the volume of my system audio and mute it. But it was hard to get working, and there is not much documentation or many threads on the forum about it or about audio stuff in general.
Not too long ago, someone posted about creating an audio binding which would make using audio much easier, and he kind of got shot down and told the application already does everything he was talking about. I was like no, no, it’s a good idea.
Anyhow, welcome, and let’s discuss this.

denominator · March 27, 2021, 10:25pm

I use Google Home to control my devices.

I used to use https://voxcommando.com/

rakkur · March 28, 2021, 10:59pm

Thank you Andrew! This is great information.

I got the idea for this because I was setting up a bunch of internal cameras for our house; each one has a video/audio feed coming from it to my internal server. Why not hijack them as microphone inputs?

I would prefer not to go the Alexa/Google Home route for security reasons; for me, the audio sources should be camera feeds or other local listening devices such as computers, Pis, etc.

I will look into squeezebox/pulse. I’m running a Windows box for my setup, too, which may make things more difficult. Pulse is not really actively supported; it looks like Jack is a comparable audio stream app for Windows machines.

Now that I know that this is possible (if frustrating), I will try some things and post as I proceed! I may also look into developing an add-on if we think it might be useful for tying some of these streams from various sources together.

rakkur · March 28, 2021, 10:59pm

Thank you! I’ll check it out!

Andrew_Rowe · March 29, 2021, 1:20am

Eric, yes, that’s the spirit, that is what is cool about OpenHAB, your imagination is the only limit

whoa… cool idea

OK. Understand that Squeezebox is an audio server. It serves streams to clients (there are lots of clients). What it streams is… well… whatever: stuff from the web, stuff you have on a drive somewhere… whatever.

Pulse is a binding to control the audio hardware on a PC, and yeah, it’s mostly relevant to Linux, but I’m sure there is a Windoze equivalent.

so… have fun, your imagination is the only limit here

and yeah, I’d really like to see a binding or even better core having better audio capabilities.
Sound is a great way to signal that something has occurred. Anything from a doorbell to a text-message ‘ding’ gets your attention under the right circumstances, and it should be integral to home automation.

Maximo · March 29, 2021, 9:41am

This looks like an interesting concept. Not an easy one to implement, but should be a fun project to watch unfold.
I can’t remember the name of it, but there is an open source project that details how to build a voice assistant that’s totally local. If I can find it again I’ll post a link. I suspect, though, that you’ll need an instance per audio source, which means you’ll need to go down either the Docker route or the multiple-virtual-machines route.

Good luck and keep us all posted.

Maximo · March 30, 2021, 12:49pm

I think this was the Voice Assistant I looked at, where you can have it all running locally.

Years ago I remember watching a YouTube video where a guy had built a similar system where he had microphones in various rooms with speakers. He linked it all together with a piece of hardware that allowed him to identify where the voice was coming from and therefore reply in the same room.

Again, I’ve tried to find the video but can’t for the life of me remember the guy’s name. If it comes to me, I’ll post the link to his video.

rakkur · March 30, 2021, 2:01pm

Thank you, Maximo! I’ll be working on this a bit more long term; as you pointed out, it’s quite complicated. There’s a processing efficiency involved where you don’t want to be parsing blank/empty space or extraneous noise, either. There’s apparently a lot of history with this type of project in other spaces.
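
One common way to avoid parsing silence is a simple energy gate in front of the recognizer. The sketch below is only an illustration, not something from this thread: it assumes 16-bit little-endian mono PCM frames, and the names `rms` and `speech_frames`, along with the threshold and hangover values, are hypothetical and would need tuning per microphone.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square level of a frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def speech_frames(frames, threshold=500.0, hangover=10):
    """Yield only frames that are loud enough to be worth recognizing.

    A frame passes if its RMS level is at or above `threshold`. After a loud
    frame, up to `hangover` quieter frames are also passed so that the tail
    of a word is not clipped. Both values are assumptions to be tuned.
    """
    remaining = 0
    for frame in frames:
        if rms(frame) >= threshold:
            remaining = hangover
            yield frame
        elif remaining > 0:
            remaining -= 1
            yield frame
```

Each camera or microphone stream could be wrapped in its own `speech_frames` generator, so only the gated frames ever reach the wake-word or speech-to-text stage. A fixed energy threshold will let loud non-speech noise through; it only removes the quiet stretches.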

I’ll post my progress here!

dalgwen · April 2, 2021, 8:42am

Hello all.
I have the same objectives and constraints (I try to have an “all-local” system, or at least one with good privacy).

I can talk about several possibilities to reach this, from the simplest to the hardest:

You can run Mycroft on several Raspberry Pis in your house. On the Mycroft device, there is a skill you can install which can control your openHAB. openHAB has a documentation page for this. If you are skilled with Python, it is also possible to create your own skill to recognize your own phrasing.
Another possibility, still with Mycroft: you could use Mycroft only for wake word + speech to text, and use it to send the raw text to an openHAB item. Then you can write your own logic, on the openHAB side, to handle intent. Or pass it to HABot, which can understand natural language, if I’m correct. This should already be possible with some code in a custom Mycroft skill (to send text to an openHAB item). For reference, I’m also working right now on a Mycroft binding which will hopefully make this easy (by connecting to the Mycroft message bus). The goal is sending text from openHAB to Mycroft to say something (“text to speech”), and receiving all text that Mycroft has recognized (“speech to text”).
Third possibility, which involves heavy coding on the openHAB side: massively extending the PulseAudio binding to allow it to grab audio from remote computers and run voice recognition locally on openHAB. I’m not sure it is a good idea though (huge network bandwidth, especially if you want to do the wake word recognition on the server side).
For reference, I just made my first contribution to openHAB by allowing the PulseAudio binding to SEND (and not receive) audio to other PulseAudio-enabled “satellites” (like a standard Raspberry Pi).
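
As a rough illustration of the second possibility, recognized text can be pushed to a String item through the openHAB REST API, which accepts a plain-text command via `POST /rest/items/{itemname}`. This is a minimal sketch under assumptions, not the Mycroft skill or binding described above: the host `openhab.local:8080`, the item name `VoiceCommand`, and the helper names are hypothetical, and it assumes the REST API is reachable without an API token.

```python
import urllib.parse
import urllib.request


def build_command_request(base_url: str, item: str, text: str) -> urllib.request.Request:
    """Build a POST that sends `text` as a command to an openHAB item."""
    url = f"{base_url.rstrip('/')}/rest/items/{urllib.parse.quote(item)}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def send_command(base_url: str, item: str, text: str, timeout: float = 5.0) -> int:
    """Send the command and return the HTTP status code."""
    request = build_command_request(base_url, item, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical host and item; replace with your own setup.
    send_command("http://openhab.local:8080", "VoiceCommand", "turn on the kitchen light")
```

A rule triggered when the item receives a command can then interpret the text, which keeps the intent handling on the openHAB side as suggested above.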

Note: Mycroft is not completely local. Speech to text is cloud based (but anonymised). It should be possible to run a local STT like the one Mozilla is working on, but it’s a lot of work and requires horsepower. But it’s a goal and should be available in the future.