Calling for help beta-testing a CMU Sphinx speech recognition add-on

Hi!

A few months ago I experimented with CMU Sphinx as a purely local speech-to-text add-on for openHAB (https://github.com/openhab/openhab2-addons/pull/2220), that is, without any cloud service dependency (and the privacy concerns that come with it). CMU Sphinx can also recognize languages not supported by e.g. Alexa. Development has stalled for the last 3 months, and there is still much work to do to simplify the configuration, but I thought it would be awesome to get other users’ feedback first in order to assess whether it’s worth pursuing further.

Therefore, I’m creating this thread to raise awareness and try to rally people interested in seeing it in future openHAB distributions!

So, anyone interested?

You will need:

  • some knowledge of openHAB;
  • a working, decent microphone - preferably a far-field microphone that can pick up your voice from across the room. You can also buy a PlayStation Eye USB microphone for less than $10/10€ on Amazon. Try recording your voice with a recorder program before attempting to use it with openHAB, to make sure the source, volume, etc. are set properly (see the example commands after this list);
    IMPORTANT NOTE: the microphone should be able to record at 16 kHz/16-bit/mono, otherwise it won’t work - on Linux (RPi etc.) you might have to tweak /etc/asound.conf for ALSA, or the equivalent PulseAudio configuration if you’re using PulseAudio; this helped me with the PlayStation Eye. YMMV.
  • Most of all, a hacking spirit and some free time/patience/perseverance :slight_smile:
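
For example, to check the recording on Linux with ALSA’s command-line tools (a quick sketch; the device plughw:1,0 is an assumption, list your devices with arecord -l first):

arecord -l
arecord -D plughw:1,0 -f S16_LE -r 16000 -c 1 -d 5 test.wav
aplay test.wav

If the default capture device isn’t suitable, a minimal /etc/asound.conf along these lines can point the default ALSA devices at the right hardware (again a sketch; adjust the card numbers to your system):

pcm.!default {
    type asym
    capture.pcm "mic"
    playback.pcm "speaker"
}
pcm.mic {
    type plug
    slave { pcm "hw:1,0" }
}
pcm.speaker {
    type plug
    slave { pcm "hw:0,0" }
}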

Here’s how to get started:

(please also have a look at the README and the voice & audio concepts in the openHAB docs)

  1. First, download org.openhab.voice.cmusphinx-2.2.0-SNAPSHOT.jar

  2. Drop it into your openHAB distribution’s addons folder to load it

  3. Download an acoustic model and a dictionary for your language from CMU Sphinx on SourceForge:

    a. for US English:

    b. for German:

    c. for French:

    d. for other languages:
    Download resources as available from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
    :warning: NOTE: you won’t be able to use Eclipse SmartHome’s standard human language interpreter (the ‘Built-in Interpreter’) since it only supports English, German and French at the moment. Feel free to submit a PR to ESH to add support for your language. You will however be able to use the ‘Rule-based interpreter’ which sends recognized speech to an item.

  4. Create a directory, for instance /opt/openhab/conf/stt, extract the acoustic model into a sub-directory and also place the dictionary file there.
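
    For example, assuming the French PTM acoustic model and dictionary (the archive and file names below are only an illustration, use whatever you actually downloaded):

mkdir -p /opt/openhab/conf/stt/grammar        # the grammar sub-directory is used in step 5
cd /opt/openhab/conf/stt
tar xzf ~/Downloads/cmusphinx-fr-ptm-5.2.tar.gz    # extracts to /opt/openhab/conf/stt/cmusphinx-fr-ptm-5.2
cp ~/Downloads/fr.dict /opt/openhab/conf/stt/

    These paths match the ones used in the Paper UI configuration examples further down.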

  5. You will need to create a grammar file by hand (for now), in JSGF format (https://www.w3.org/TR/jsgf/), which describes the sentences CMU Sphinx will be able to recognize. If you’re using the built-in interpreter, the sentences must be those supported by the interpreter. Your grammar should also include a keyword (or hotword, or “magic word”) of your choosing, which you will configure later; you will need to speak this word before giving an actual command. Note: it doesn’t need to be a single word, expressions like “hey openhab” can work too.

    Create a directory in your filesystem (for instance /opt/openhab/conf/stt/grammar) and write a grammar file named commands.gram (the extension is important) in it.
    Here are some examples:

  • English:
#JSGF V1.0;

grammar commands;

<location> = living room | kitchen | bedroom | corridor | bathroom | garage;
<thing> = lights | heating | fan | blinds;
<item> = <location> <thing>; 

<onoff> = on | off;
<turn> = turn | switch;
<put> = put | bring;
<increase> = increase | brighten | harden | enhance;
<decrease> = decrease | dim | lower | soften;
<color> = white | pink | yellow | orange | purple | red | green | blue;

<switchcmd> = <turn> [the] <item> <onoff>;
<increasecmd> = <increase> the <item>;
<decreasecmd> = <decrease> the <item>;
<upcmd> = <put> the <item> up;
<downcmd> = <put> the <item> down;
<colorcmd> = [set] [the] color [of] the <item> [to] <color>; 

<keyword> = openhab;

public <command> = <keyword> | <switchcmd> | <increasecmd> | <decreasecmd> | <upcmd> | <downcmd> | <colorcmd>;
  • German (untested, please correct if necessary):
#JSGF V1.0;

grammar commands;

<location> = küche | büro | schlafzimmer | badezimmer | garage;
<thing> = beleuchtung | heizung | ventilator | rollläden;
<item> = <location> <thing>; 

<einaus> = ein | aus;
<schalte> = schalt | schalte;
<mache> = mach | mache;
<mehr> = heller | mehr;
<weniger> = dunkler | weniger;
<farbe> = weiss | pink | gelb | orange | lila | rot | grün | blau;
<dendiedas> = den | die | das;

<switchcmd> = <schalte> [<dendiedas>] <item> <einaus>;
<increasecmd> = [<schalte> | <mache>] <dendiedas> <item> <mehr>;
<decreasecmd> = [<schalte> | <mache>] <dendiedas> <item> <weniger>;
<upcmd> = <mache> <dendiedas> <item> hoch;
<downcmd> = <mache> <dendiedas> <item> runter;
<colorcmd> = [<schalte>] [<dendiedas>] <item> [auf] <farbe>; 

<keyword> = openhab;

public <command> = <keyword> | <switchcmd> | <increasecmd> | <decreasecmd> | <upcmd> | <downcmd> | <colorcmd>;
  • French:
#JSGF V1.0;

grammar commands;

<command> = allumer | éteindre | activer | stopper | désactiver | couper | augmenter | diminuer | monter | descendre;
<lela> = le | la | les | l;
<poursurde> = pour | sur | du | de;

<color> = blanc | rose | jaune | orange | violet | rouge | vert | bleu;

<item> = bureau | salon | table | chambre | cuisine | volet;

<keyword> = maison;

public <order> = <keyword> | <command> [<lela>] <item> | couleur <color> [<poursurde>] [<lela>] <item>;
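
To illustrate, the English grammar above accepts sentences like these (the keyword alone, or a full command):

openhab
turn the kitchen lights on
put the living room blinds down
set the color of the bedroom lights to blue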
  6. You can use a word that is not in the dictionary (for example “openhab”), but then you have to add it: open your dictionary file with a text editor and add a line for it. Look for words with a similar pronunciation and try to derive one for your word.
    For example, in German “opensuse” is Q OOH P AX N Z UU Z AX and “haben” is HH AAH B AX N, so for “openhab” you would add (please confirm this):
openhab Q OOH P AX N HH AAH B

In French it would be, similarly:

openhab oo pp ee nn aa bb
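
    To find existing words with a similar spelling or pronunciation to copy from, grepping the dictionary works well enough (a sketch; adjust the dictionary path to the one you downloaded):

grep -i "^open" /opt/openhab/conf/stt/fr.dict
grep -i "^hab" /opt/openhab/conf/stt/fr.dict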
  7. Go to Paper UI to configure several things:

    a. in Add-ons > Voice, install a text-to-speech engine compatible with your system and language (and configure it) - MANDATORY;

    b. In Configuration > System:

    • under Audio, set Default Source to System Microphone and Default Sink to System Speaker

      Don’t forget to save!

    • under Regional Settings, set Language to e.g. en/de/fr and Country / Region to e.g. US/DE/FR

      Don’t forget to save!

    • under Voice:

      • Default Text-to-Speech: set to the TTS engine you installed above;
      • Default Speech-to-Text: set to CMU Sphinx;
      • Default Voice: configure according to your TTS engine - the voice MUST match the language/region you chose;
      • Default Human Language Interpreter: you can use either one for English/German/French; for other languages you are limited to the Rule-based interpreter (see above) - in that case, the item receiving the commands has to be configured in Configuration > Services > Voice > Rule Voice Interpreter;
      • Default Keyword Spotter: set to CMU Sphinx;
      • Magic Word: set to the keyword you defined in your grammar;
      • Listening Switch: you can specify a Switch item here which will be switched on and off while the system is listening for a command after spotting the magic word. For example, you could map it directly to a lightbulb or write rules to play a sound (see the sketch after this step).

      Don’t forget to save!

    c. in Configuration > Services > Voice > CMU Sphinx Speech-to-Text:

    • Locale: set to e.g. en-US or de-DE or fr-FR

    • Acoustic model path: set to the path to the directory containing the acoustic model files, e.g. /opt/openhab/conf/stt/cmusphinx-fr-ptm-5.2

    • Dictionary file path: set to the path to the file containing your dictionary, e.g. /opt/openhab/conf/stt/fr.dict

    • Language model path: leave blank (important!)

    • Grammar path: set to the path to the directory containing the grammar files, e.g. /opt/openhab/conf/stt/grammar

    • Grammar name: set to the name of the file (without the extension) containing the grammar, e.g. commands

      Leave Start listening off for now and Save.
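
    As a side note, here is a minimal sketch of the Listening Switch feedback mentioned above (the item name, file names and sound file are made up for illustration; any sound present in your conf/sounds folder will do):

// voice.items
Switch Voice_Listening "Listening for a command"

// voice.rules
rule "Voice listening feedback"
when
    Item Voice_Listening changed to ON
then
    // play an audible cue when the magic word has been spotted
    playSound("doorbell.mp3")
end

    Select Voice_Listening as the Listening Switch in the settings above.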

  8. Check your openHAB logs and look for a line like:

[INFO ] [usphinx.internal.CMUSphinxSTTService] - CMU Sphinx speech recognizer initialized

Otherwise, try to fix the errors you encounter.

  9. Now you can go back to Configuration > Services > Voice > CMU Sphinx Speech-to-Text in Paper UI and turn on Start listening:
    You will hopefully see this log line appearing:
[INFO ] [cmusphinx.internal.CMUSphinxRunnable] - CMU Sphinx: StreamSpeechRecognizer recognition started
  10. You may now speak your keyword, and if it’s recognized, you will see:
[INFO ] [cmusphinx.internal.CMUSphinxRunnable] - Keyword recognized: hey openhab, speak command now

(the “listening switch” you configured will also turn on)

Then speak a command from your grammar; if it’s recognized, you’ll see it in the logs as well:

[INFO ] [cmusphinx.internal.CMUSphinxRunnable] - Command recognized: couleur bleu pour le bureau

If the text-to-speech engine is properly configured, the voice will say either ‘Ok.’ or the error encountered. You will also see it in the logs.

If you run into trouble, you can lower the log threshold: in the openHAB Console, type:

log:set DEBUG org.openhab.voice.cmusphinx

Every recognized sentence, valid or not, will appear in the log at the DEBUG level along with other messages.
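
To reduce the verbosity again afterwards, set the level back:

log:set INFO org.openhab.voice.cmusphinx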

Thanks for giving it a try!

Great idea! I definitely would like to see Russian supported at some point. I will start with English anyway. How many resources does it consume? Can I run it on a Pi or do I need a more powerful PC?

I’ve run it on my RPi clone; it takes time to initialize (10-15 sec) but sort of works.
The main gripe with it, and why I still don’t run it in “production”, is false positives - it is too sensitive and will sometimes pick up the keyword (e.g. while watching a movie or during a normal conversation) and even mistakenly perform stuff at random. This is annoying :slight_smile: I haven’t been able to adjust the sensitivity yet - there is no confidence score reported by Sphinx4 when using a grammar.

Sounds like a general problem, not related to the Pi. If all goes well I will be on vacation next week; I will take the Pi and the mic with me and give it a try.

Quick question: is it possible to set up Sphinx to recognize any text and not just predefined word sequences? I wonder if it would work if the grammar were replaced with an intent recognition tool, like OpenNLP. Check https://github.com/mlehman/nlp-intent-toolkit

That’s what the “language model” is supposed to do (natural language recognition). However, the performance with it vs. a grammar is quite poor; it eats gigabytes of memory and won’t work at all on RPis…

Based on my experience with Julius, false recognition of the keyword happens sometimes.
For me this has never resulted in a random operation, though. I wrote my own DM (dialog manager) and I time out the keyword if something hasn’t been fully recognized within 5 seconds. Additionally, I did not use a pre-compiled acoustic model but trained my own; this greatly reduced errors since it is basically trained to understand just me…
I don’t recall how it works with PocketSphinx/Sphinx, but with Julius it will always return the best match for what it heard. It doesn’t matter if the phrase it heard is way different, just that it is the closest to what it knows.

Damn, I couldn’t find a PlayStation Eye in my city to buy before I leave for vacation. Any other mic you can suggest?

Thanks for your feedback. I implemented a 10-second timeout as well but still got some unwanted behavior (it remains pretty rare, but still). Training the acoustic model with your own voice should indeed help.

Maybe a different engine for keyword recognition, like Snowboy, would give better results.

I believe that’s exactly the problem.

I bought mine on Amazon, but there aren’t many good, affordable far-field microphones for DIY…

Maybe this: https://www.amazon.com/Andrea-Communications-C1-1021450-100-USB-SA-Microphone/dp/B003VW5Q08/
This looks promising too but is out of stock: https://www.seeedstudio.com/ReSpeaker-Mic-Array-Far-field-w%2F-7-PDM-Microphones-p-2719.html
This is being crowdfunded: https://www.kickstarter.com/projects/1101048494/matrix-voice-open-source-voice-platform-for-all

It would seem that Snowboy is pretty limited in what it can do since it is a hotword detection system. It can be defined with multiple hotwords for different functions but doesn’t seem to have a grammar definition mechanism.
For me, this just wouldn’t work out. I have quite a few things set up with my voice recognition setup and it would be impossible to remember a whole bunch of hotwords (keywords) for every single function.

For someone who wants offline voice recognition I just don’t see that there are any better solutions than Sphinx or Julius.

It would seem that there are several mics (some with 7 detectors, like the ones you mentioned), but they all seem to rely on online resources to do their job. Not sure how suitable/adaptable they would be for Sphinx.

Kenny

Sure, I meant keeping Sphinx for the actual command recognition but use Snowboy for the hotword detection only - ESH/openHAB makes a distinction between those and allows different engines to be specified, though I’m not sure how well it would play out since they share the same input audio stream.

Ahh, okay I get it.
Thanks.

I researched a bit more and found this product:

https://www.minidsp.com/products/usb-audio-interface/uma-8-microphone-array

It looks promising too and falls within my “tinkering around” budget, so I ordered one; hopefully it’ll do the job as a whole-room DIY solution. I’ll post results when I get some.

Hey nice find. That looks really nice.
Almost tempted to go ahead and purchase one too…