Integrate a Snips voice assistant with openHAB - Walkthrough

If you want to try a 100% open-source voice control solution that supports openHAB and does not require interaction with any external server, SEPIA Open Assistant might be interesting as well :slight_smile:

For your information, I will start working on a way to integrate Snips (or at least a part of the Snips features) into openHAB.

Excellent! :+1:

I explained my intentions in the Git issue.
I would appreciate help with defining the set of intents we should provide by default in openHAB.
Once that is done, I will need further help to set up and publish this set of intents in different languages in the Snips Console. Of course I will provide the French one.

Very bad news: with Snips now bought by Sonos, this will be the end of Snips as an open platform. They plan to close the Snips Console at the end of January. We have until that date to deploy a voice assistant from the console to our hardware; after that there will be no way to update or enhance the voice assistant.

By the way, as my work on the openHAB side is almost finished, I will complete it and publish it. And I will try to publish something on the Snips Console before they close it, at least for French users.

But the Snips solution has unfortunately no future for us.

Indeed. It seems like I have wasted a lot of time getting this integrated into my openHAB setup. What a pity!

Is there any equivalent successor in sight? SEPIA Open Assistant doesn’t make a big impression on me yet.


Please sign this petition if you want Snips to stay open for the community.

In the meantime I think it's worth having a look at Rhasspy, which seems like a worthy replacement for Snips.

SEPIA:

  • has a cross-platform client that works in any browser, on Android and (kind of) on iOS, that can be customized (25+ skins) and that can work similarly to a smart speaker when in always-on mode, using the integrated hotword detection (Porcupine)
  • has a headless prototype of the same client for people who don’t want to use a display
  • can use native speech recognition or its own STT server (Kaldi)
  • can use native, cloud and platform-based text-to-speech (Google, Windows, Apple, eSpeak, etc.)
  • has a central server that runs on any RPi (>=1GB RAM) and transfers data between clients and users via Websocket connection
  • has a multi-turn dialog module and a customizable NLU module (including a Python interface for stuff like Rasa) that understands over a dozen user intents out-of-the-box in English and German
  • has a service for each of the 12+ intents (radio, smart home, to-do lists, alarm, news, weather, navigation, …)
  • has a teach-ui to create your own commands on the fly and an SDK to create custom services in Java that every user can install individually via a website
  • can be accessed from home or when traveling
  • has extensions like the Mesh-Node server that securely connects to the main server and is integrated in the teach-ui and SDK
  • supports openHAB and FHEM
  • … and much more … :smiley:

@miba what is required to impress you? I can work on that :face_with_hand_over_mouth:

@sepia-assistant a few questions:
Does it have the tools built in for offline training of the statistical language model, dictionary and acoustic model that you need for Pocketsphinx or Kaldi? A choice of different hotword services with custom wake words for offline use, for example an integration of the Snowboy training service API to create wake words from the interface? Which VAD do you use, and how do you handle multiple voice input sites / satellite sites? Do you have an open API over MQTT or REST to do your own intent parsing in addition to the Python interface? Just a few thoughts off the top of my head; basics that are needed for an offline speech assistant alternative.
Edit: I don't see any easy way to set up a multi-site voice setup with hotword coalescing or similar, and only a fairly limited way to train your own complex intents with easy slot value extraction. Is this right? Would you have to do all this yourself on top of SEPIA? Do you use the general German model for offline speech recognition or do you train a custom one based on the intents?
Best regards, Johannes

I would second having a look at Rhasspy, as it ties together many nice open-source projects in a good way, or you can use the services you find there to build something of your own.
You can just install the Docker image to try it out.
https://rhasspy.readthedocs.io/en/latest/

I am sorry, Florian - I did not mean to sound so negative!

Please correct me if I’m wrong but as far as I can tell there are no regular contributors to this project other than you. A community apparently has not yet emerged. That can and should change, of course, and I wish you every success! I will definitely observe the project.

@JGKK thanks for your questions, here are some answers:

Does it have the tools built in for offline training of the statistical language model, dictionary and acoustic model that you need for Pocketsphinx or Kaldi?

The STT server has an endpoint to adapt the language model, and I've recently added an endpoint to the SEPIA server to export all custom commands as well. Automatic conversion of missing words to phonemes for the dictionary is not yet built in, but it is at the top of the priority list. Actually I spoke to Michael from Rhasspy yesterday and he mentioned that he's using Zamia (Kaldi) as well, so it might be that models trained with Rhasspy are 100% compatible with SEPIA :slight_smile: . I will check this out soon. About acoustic model training: obviously there are recipes for Kaldi, but I would not recommend any "normal" user to do this, since Peter from Zamia trains them on strong graphics cards and it usually takes over a week :grimacing: . In my experience it is usually not required if your LM has a size typical for Snips or Rhasspy.
A word about Pocketsphinx: I worked with it intensively when I built the ILA voice assistant, including all the above-mentioned things (AM, LM and dictionary adaptation), but ultimately gave up because the technology stack is not up to date and the WER was usually way too high. Because of that I've decided not to support it in SEPIA.

A choice of different hotword services with custom wake words for offline use, for example an integration of the Snowboy training service API to create wake words from the interface?

Porcupine is the only service that is deeply integrated into the client because of its browser support, but … you can use any hotword detection or any remote trigger you want via SEPIA's remote action endpoint, as demonstrated in this little video (there is a Python library to help with the integration). There is no web interface to build your own hotword, though.
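
To illustrate, a rough sketch of such a remote trigger from Python could look like the following. The endpoint path, port and field names here are placeholders for illustration only; the Python library and the SEPIA docs define the real interface.

```python
# Hypothetical sketch: forward a custom wake-word trigger to SEPIA's remote action endpoint.
# Endpoint path, port and payload fields are assumptions; see the SEPIA docs / Python library.
import requests

SEPIA_ASSIST = "http://localhost:20721/assist"   # assumed assist-server address, adjust to your setup

def send_wakeword_trigger(device_id, auth_token):
    payload = {
        "type": "hotkey",               # assumed action type for a "mic on" remote trigger
        "action": "mic",
        "targetDeviceId": device_id,
        "client": "python_trigger_v1",  # hypothetical client tag
        "KEY": auth_token,              # SEPIA normally authenticates requests with a user key/token
    }
    r = requests.post(f"{SEPIA_ASSIST}/remote-action", data=payload, timeout=5)
    r.raise_for_status()
    return r.json()
```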

Which VAD do you use, and how do you handle multiple voice input sites / satellite sites?

The client has support for the Mozilla VAD library by Kelly Davis, but VAD is usually handled on the STT server, which has support for WebRTC VAD. Currently I'm limiting input to 4 s though. What exactly do you mean by "voice input sites / satellite sites"? Different clients that get activated at the same time by a user speaking the hotword? Each hotword trigger can target a specific device ID and user ID. If there are two devices with the same ID and the same user logged in, the last active device is triggered (the WebSocket server keeps track of the activation state).
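
For reference, WebRTC VAD frame classification is quite simple; a minimal sketch (not SEPIA's actual server code) using the webrtcvad Python package:

```python
# Minimal WebRTC VAD sketch: classify 30 ms frames of 16 kHz, 16-bit mono PCM.
import webrtcvad

vad = webrtcvad.Vad(2)                                   # aggressiveness 0 (lenient) to 3 (strict)
SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = int(SAMPLE_RATE * FRAME_MS / 1000) * 2     # 2 bytes per 16-bit sample

def speech_flags(pcm: bytes):
    """Yield True/False per frame; a run of False frames can end capture before the 4 s limit."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```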

Do you have an open API over MQTT or REST to do your own intent parsing in addition to the Python interface?

Yes. For example there are REST APIs for intent recognition (interpret) and dialog management (answer), and in theory you can access the same APIs via the WebSocket server. There is no official support for the MQTT protocol yet. The Python interface is one of many modules that can supply the 'interpret' endpoint with results.
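
A rough sketch of calling the interpret endpoint from Python; the path, parameter names and response shape are simplified placeholders, the server docs define the real contract:

```python
# Hypothetical call to the NLU ("interpret") REST endpoint; names are illustrative only.
import requests

def interpret(text, lang="en", token=None):
    params = {"text": text, "lang": lang, "KEY": token}
    r = requests.get("http://localhost:20721/assist/interpret", params=params, timeout=5)
    r.raise_for_status()
    return r.json()   # expected: recognized intent/command plus extracted parameters (slots)
```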

[…] only a fairly limited way to train your own complex intents with easy slot value extraction. Is this right?

With the Teach-UI inside the app you can define custom sentences for intents that already exist. With the Java SDK you can define arbitrarily complex (or simple) services that can use SEPIA's existing "slots" or define your own, including the questions SEPIA should ask you if parameters are missing to fulfill an intent. Parameters/slots that you can use out of the box for services like smart home are for example: DeviceType (lights, heater, shutter, sensor, etc.), Action (on, off, toggle, set, show, …), DeviceValue (70%, 20°C, 11, etc.), TimeDate (tomorrow at 8 a.m., …), Room (living room, office, hallway, …) and so on.
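
To make the slot idea concrete, the utterance "Set the living-room lights to 70% tomorrow at 8 a.m." could roughly break down into parameters like this (illustration only, not SEPIA's internal format):

```python
# Illustration of slot extraction for: "Set the living-room lights to 70% tomorrow at 8 a.m."
extracted_slots = {
    "DeviceType": "lights",
    "Action": "set",
    "DeviceValue": "70%",
    "TimeDate": "tomorrow 08:00",
    "Room": "living-room",
}
```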

Do you use the general German model for offline speech recognition or do you train a custom one based on the intents?

“General” is the default model of the STT server. When you define your own LM you can use the SEPIA control HUB to switch between the models.

@miba:

Please correct me if I’m wrong but as far as I can tell there are no regular contributors to this project other than you. A community apparently has not yet emerged. That can and should change, of course, and I wish you every success! I will definitely observe the project.

Yes, that is unfortunately true; it seems I'm pretty bad at marketing :stuck_out_tongue_closed_eyes::frowning_face:. When I uploaded the source code to GitHub around 1.5 years ago the project was already pretty big, since I had been working on it with a small start-up (I was the only programmer for most of the time) and it was meant to be a replacement for Siri and Google Assistant that gives you back control over your data. The start-up and I went our separate ways and I decided to make my code 100% open-source, and since then I've been working on breaking everything down into smaller parts and rewriting things for developers (besides adding new features :sweat_smile:).
Because of this history SEPIA has always been more focused on the app and end users compared to Rhasspy or Snips. It is kind of similar to Mycroft, at least in its goal to offer a voice assistant that works out of the box with minimal configuration yet gives developers the tools to improve it and build their own services. The same goes for the openHAB integration: install SEPIA, add your openHAB server, control your devices.

I’ve decided to contact Michael from Rhasspy to start a discussion about how we could bring together the best of both worlds. Rhasspy basically covers SEPIA’s interpret and STT modules and offers a nice web-interface to manage the things I discussed above. From what I’ve seen so far both projects might actually work very well together … let’s see :slight_smile:.

(Sorry for the wall of text :sweat_smile:)

Thanks for the thorough answer :slightly_smiling_face: I will keep an eye on your project. For now I'll stay with my own solution based on Node-RED :see_no_evil:
I actually have had very good experiences using pocketsphinx-python with a custom 3-gram LM and a custom dictionary when working with a small vocabulary, as in a smart-home-focused environment (a few hundred words of vocabulary and maybe a few thousand sentences that the language model is based on). And in my case it gave me better performance than Kaldi did when running everything on one Raspberry Pi.
In quiet environments I have an STT hit rate of well over 95 percent.
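
For anyone who wants to reproduce that kind of setup, a minimal sketch using the classic pocketsphinx-python bindings (the paths are placeholders for your own acoustic model, 3-gram LM and pruned dictionary):

```python
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/acoustic-model')    # e.g. the German or en-us model
config.set_string('-lm', '/path/to/custom-3gram.lm')    # small domain language model
config.set_string('-dict', '/path/to/custom.dict')      # pronunciation dictionary for your vocabulary
decoder = Decoder(config)

with open('command.raw', 'rb') as f:                     # 16 kHz, 16-bit mono PCM
    decoder.start_utt()
    decoder.process_raw(f.read(), False, True)
    decoder.end_utt()

if decoder.hyp():
    print(decoder.hyp().hypstr)
```
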
With satellites and hotword coalescing I mean a setup where you have several voice input sites throughout your home but some of them overlap, so you only want the closest one to trigger if two of them heard the wake word. I implemented a timestamp-based approach, but you have to include a custom offset for the delay certain microphones have.
Johannes

Thanks for the thorough answer :slightly_smiling_face: I will keep an eye on your project. For now I'll stay with my own solution based on Node-RED :see_no_evil:

:smiley: From all I've seen so far Node-RED looks pretty cool. It's been on my to-do list to check out possible synergies with SEPIA for a long time now ^_^. I like the idea of building and connecting things with graphical interfaces like this, or Blockly for example. Actually I'm planning to do something similar for the SEPIA SDK so people don't have to write code in Java :slight_smile: .

I actually have had very good experiences using pocketsphinx-python with a custom 3-gram LM and a custom dictionary when working with a small vocabulary, as in a smart-home-focused environment (a few hundred words of vocabulary and maybe a few thousand sentences that the language model is based on).

I always had the feeling that it is super sensitive to the hardware you use and to ambient noise. What kind of microphone do you use? It's been around 2-3 years since I last used it, so maybe I should give it another shot.

where you have several voice input sites throughout your home but some of them overlap, so you only want the closest one to trigger if two of them heard the wake word. I implemented a timestamp-based approach, but you have to include a custom offset for the delay certain microphones have.

OK, then I understood you right. I haven't focused on this issue yet. By default SEPIA behaves as described above when you separate the remote trigger from the client (same device ID + same user ID = last active client gets the trigger). When both clients use the client-side wake-word trigger, it might be possible in SEPIA to filter identical messages with very close timestamps inside the server. I will put this on my to-do list :smiley:
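
Just to sketch the idea (nothing implemented yet, names and values are made up): collect triggers that arrive within a short window, subtract a per-microphone delay offset, and keep the earliest one:

```python
import time
from collections import defaultdict

COALESCE_WINDOW_S = 0.5
MIC_OFFSET_S = {"kitchen": 0.00, "livingroom": 0.12}   # measured per-device delays (example values)

pending = defaultdict(list)                             # user_id -> [(adjusted_ts, device_id), ...]

def on_wakeword(user_id, device_id, ts=None):
    ts = ts if ts is not None else time.time()
    pending[user_id].append((ts - MIC_OFFSET_S.get(device_id, 0.0), device_id))

def pick_winner(user_id):
    """Call this COALESCE_WINDOW_S after the first trigger; returns the device to activate."""
    triggers = pending.pop(user_id, [])
    return min(triggers)[1] if triggers else None
```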

Wow! That sounds really, really wonderful. Thanks for all that you've been doing. Really appreciated.

I use different models from ReSpeaker. I have had good experiences with their USB mic array, which is a new product that has some built-in audio front-end processing tailored to voice applications. But their 2-mic and 4-mic Raspberry Pi HATs work great as well.

It seems like this thread has nothing to do with Snips and its integration into openHAB anymore. Maybe it makes sense to further discuss SEPIA in its corresponding thread to better gather relevant details there. :wink:

Hm, I've been using the 2-mic HAT from ReSpeaker from time to time and got mixed results. But anyway, I'll do some experiments again soon and will keep that in mind :slight_smile: I did not know they have a USB mic array now :drooling_face::grin:

Maybe it makes sense to further discuss SEPIA in its corresponding thread to better gather relevant details there.

Agreed :sweat_smile:

Hi Johannes, one question just came up regarding MQTT: do you know if there is a convention used for the format of voice-related messages?
As far as I understand, Snips is using Hermes.
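
If I read it correctly, Hermes is basically JSON messages on well-known MQTT topics such as hermes/hotword/<wakewordId>/detected, hermes/asr/textCaptured and hermes/intent/<intentName>, so a minimal listener with paho-mqtt (1.x API) would look roughly like this:

```python
import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload.decode("utf-8"))
    # intent messages carry "input" and "siteId"; textCaptured carries "text"
    print(msg.topic, payload.get("siteId"), payload.get("input") or payload.get("text"))

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883)
client.subscribe("hermes/intent/#")
client.subscribe("hermes/asr/textCaptured")
client.loop_forever()
```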

Hello sepia-assistant,

over the last few days I tried out SEPIA as well as Rhasspy. I really like both projects and I would love to see the best parts of both merged, as you suggested (my favourite: Rhasspy plus the assistant development tools).
What I am wondering about is the difference in performance with Kaldi STT. I tried both on a Pi 4 with 4 GB. While SEPIA is amazingly fast and recognizes the spoken words almost instantly, Rhasspy took almost 4 seconds for a short sentence. Why is that, even though both of you are using the same basis (Zamia / the same acoustic model?)? I tried moving the model from SEPIA over to Rhasspy, but with no change in speed.

Greetings