@JGKK thanks for your questions! Here are some answers:
Has it got the tools built in for offline training of the statistical language model, dictionary and acoustic model that you need for Pocketsphinx or Kaldi?
The STT server has an endpoint to adapt the language model, and I’ve recently added an endpoint to the SEPIA server to export all custom commands as well. Automatic conversion of missing words to phonemes for the dictionary is not built in yet, but it is at the top of the priority list. Actually, I spoke to Michael from Rhasspy yesterday and he mentioned that he’s using Zamia (Kaldi) as well, so it might be that models trained with Rhasspy are 100% compatible with SEPIA. I will check this out soon. About acoustic model training: obviously there are recipes for Kaldi, but I would not recommend that any “normal” user do this, since Peter from Zamia trains them on powerful graphics cards and it usually takes over a week. In my experience it is usually not required if your LM is of a size typical for Snips or Rhasspy.
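Just to sketch what LM adaptation could look like from a script: you post your custom sentences to the STT server and let it rebuild the model. The server address, endpoint path, and payload fields below are simplified placeholders for illustration, not the exact API; check the SEPIA STT server documentation for the real format.

```python
# Sketch: send custom sentences to the STT server to adapt the LM.
# Address, endpoint path, and field names are placeholders - see the
# SEPIA STT server docs for the actual API.
import requests

STT_SERVER = "http://localhost:20741"  # placeholder address/port

sentences = [
    "turn on the lights in the living room",
    "set the heater to twenty degrees",
]

resp = requests.post(
    f"{STT_SERVER}/lm/adapt",          # placeholder path
    json={"sentences": sentences, "model": "custom"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```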
A word about Pocketsphinx: I worked with it intensively when I built the ILA voice assistant, including all the things mentioned above (AM, LM and dictionary adaptation), but ultimately gave up because the technology stack is not up to date and the WER (word error rate) was usually way too high. Because of that I’ve decided not to support it in SEPIA.
A choice of different hotword services with custom wake words for offline use, for example an integration of Snowboy’s training service API to create wake words from the interface?
Porcupine is the only service that is integrated deeply into the client because of its browser support, but you can use any hotword detection engine or any remote trigger you want via SEPIA’s remote action endpoint, as demonstrated in this little video (there is a Python library to help with the integration). There is no web interface to build your own hotword though.
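To sketch the idea: a custom hotword engine only needs to make one HTTP call to wake up a client. The port, endpoint path, and field names below are simplified placeholders; the Python library wraps the actual API, so check its docs for the exact format.

```python
# Sketch: wake up a specific SEPIA client from your own hotword engine
# via the remote action endpoint. Endpoint path, port, and field names
# are placeholders - see the SEPIA docs / Python library for the
# exact format.
import requests

ASSIST_SERVER = "http://localhost:20721"  # placeholder address/port

requests.post(
    f"{ASSIST_SERVER}/remote-action",     # placeholder path
    json={
        "KEY": "user-id;auth-token",      # placeholder credentials
        "type": "hotkey",                 # act as if the mic button was pressed
        "targetDeviceId": "speaker1",     # which client should wake up
    },
    timeout=5,
)
```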
Which VAD do you use and how do you handle multiple voice input sides/ satellite sites?
The client has support for the Mozilla VAD library by Kelly Davis, but VAD is usually handled on the STT server, which has support for WebRTC VAD. Currently I’m limiting input to 4 s though. What exactly do you mean by “voice input sides/ satellite sites”? Different clients that get activated at the same time by a user speaking the hotword? Each hotword trigger can target a specific device ID and user ID. If there are two devices with the same ID and the same user logged in, the last active device is triggered (the WebSocket server keeps track of the activation state).
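In case anyone wants to experiment with the same VAD locally: WebRTC VAD is available as a small Python package (py-webrtcvad). This is a generic usage sketch to show the kind of per-frame check involved, not SEPIA’s actual server code:

```python
# Generic WebRTC VAD usage sketch (py-webrtcvad) - not SEPIA's code.
import webrtcvad

vad = webrtcvad.Vad(2)           # aggressiveness 0 (lenient) to 3 (strict)
SAMPLE_RATE = 16000              # WebRTC VAD accepts 8/16/32/48 kHz
FRAME_MS = 30                    # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def speech_frames(pcm: bytes):
    """Yield True/False per frame: does the frame contain speech?"""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
```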
Do you have an open API over MQTT or REST to do your own intent parsing, in addition to the Python interface?
Yes. For example, there are REST APIs for intent recognition (interpret) and dialog management (answer), and in theory you can access the same APIs via the WebSocket server. There is no official support for the MQTT protocol yet. The Python interface is one of several modules that can supply the ‘interpret’ endpoint with results.
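As a rough sketch, calling the interpret endpoint from your own code could look like this (server address, path, and field names are simplified placeholders; see the API docs for the real format):

```python
# Sketch: ask the assist server to run intent recognition on a sentence.
# Address, path, and field names are placeholders.
import requests

ASSIST_SERVER = "http://localhost:20721"  # placeholder address/port

def interpret(text, lang="en"):
    resp = requests.post(
        f"{ASSIST_SERVER}/interpret",     # placeholder path
        json={"text": text, "lang": lang, "KEY": "user-id;auth-token"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # expected: recognized command plus extracted slots

print(interpret("turn on the lights in the kitchen"))
```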
[…] a fairly limited way to train your own complex intents with easy slot value extraction, is this right?
With the Teach-UI inside the app you can define custom sentences for intents that already exist. With the Java SDK you can define arbitrarily complex (or simple) services that can use SEPIA’s existing “slots” or define your own, including questions SEPIA should ask you if parameters are missing to fulfill an intent. Existing parameters/slots that you can use out-of-the-box for services like smart home are, for example: DeviceType (lights, heater, shutter, sensor, etc.), Action (on, off, toggle, set, show, …), DeviceValue (70%, 20°C, 11, etc.), TimeDate (tomorrow at 8 a.m., …), Room (living-room, office, hallway, …) and more.
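To make the slot idea concrete, the interpreted result for “set the living-room lights to 70%” could carry values roughly like this (the field names here are illustrative, not SEPIA’s exact wire format):

```python
# Illustrative shape of an interpreted smart-home command using the
# slots listed above - field names are examples, not SEPIA's format.
result = {
    "command": "smartdevice",
    "parameters": {
        "DeviceType": "lights",     # what to control
        "Action": "set",            # what to do
        "DeviceValue": "70%",       # target value
        "Room": "living-room",      # where
    },
}
```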
Do you use the general German model for offline speech recognition or do you train a custom one based on the intents?
“General” is the default model of the STT server. When you define your own LM, you can use the SEPIA Control HUB to switch between the models.
@miba:
Please correct me if I’m wrong, but as far as I can tell there are no regular contributors to this project other than you. A community apparently has not yet emerged. That can and should change, of course, and I wish you every success! I will definitely keep an eye on the project.
Yes, that is unfortunately true; it seems I’m pretty bad at marketing. When I uploaded the source code to GitHub around 1.5 years ago, the project was already pretty big, since I had been working on it with a small start-up (I was the only programmer for most of the time) and it was meant to be a replacement for Siri and Google Assistant that gives you back control over your data. The start-up and I went our separate ways and I decided to make my code 100% open-source. Since then I’ve been working on breaking everything down into smaller parts and rewriting things for developers (besides adding new features).
Because of this history SEPIA has always been more focused on the app and end-users compared to Rhasspy or Snips. It is similar to Mycroft, at least in its goal to offer a voice assistant that works out-of-the-box with minimal configuration yet gives developers the tools to improve it and build their own services. The same is true for the openHAB integration: install SEPIA, add your openHAB server, control your devices.
I’ve decided to contact Michael from Rhasspy to start a discussion about how we could bring together the best of both worlds. Rhasspy basically covers SEPIA’s interpret and STT modules and offers a nice web interface to manage the things I discussed above. From what I’ve seen so far both projects might actually work very well together … let’s see.
(Sorry for the wall of text!)