Hi openHAB community, in this post I'll leave a summary of how you can set up a speaker for dialog processing using a Raspberry PI.
The openHAB core contains a dialog processor implementation that can interact with the different audio and voice services; I will leave here a combination that I have already tested.
Sections:
- OpenHAB Dialog Processing with the PulseAudioBinding.
  - What you need.
  - Flashing the raspberry pi
  - Connecting the Raspberry PI with PulseAudio.
    - Echo Canceling
  - Configure the required voice services.
    - Keyword Spotter
    - Text to speech
    - Speech to text
    - Human language interpreter
  - Generating a Rustpotter wakeword
  - Configuring Rustpotter
    - Performance notes.
  - Start dialog processing.
- Span section: Some voice services under development.
  - Whisper
  - HABSpeaker
  - ActionTemplateInterpreter+
- Comment
OpenHAB Dialog Processing with the PulseAudioBinding.
What you need.
- A Raspberry PI with a speaker/microphone attached, running Raspbian, connected to your network (use a static IP).
- A server running openHAB v4.0.0 (on the same network).
Flashing the raspberry pi
This is probably already known by most people, but just in case I will describe what I did.
I flashed the Raspberry PI OS Lite (32-bit) image to an sd-card using the Raspberry PI Imager.
Before writing the card I used the options menu to define my username and password, enable ssh access, and configure my wifi credentials.
After flashing the card I inserted it again and edited the cmdline.txt file to include ip=192.168.1.206 at the end of its only line, to assign a static IP and avoid having to search for the device.
I inserted the sd-card into the board, connected it to power, and after a couple of minutes it was available on the network (my sd-card is old, which probably explains the wait) and I could access it over ssh.
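For reference, this is roughly how that edit can be done from a terminal; the mount point /media/$USER/bootfs is just an example, adjust it and the IP to your setup (the file must stay as a single line):
# Append the static ip option to the end of the kernel command line on the sd-card boot partition
$ sudo sed -i '1 s/$/ ip=192.168.1.206/' /media/$USER/bootfs/cmdline.txt
# Check the result
$ cat /media/$USER/bootfs/cmdline.txt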
Connecting the Raspberry PI with PulseAudio.
Connect to your Raspberry terminal and follow these steps to create a systemd unit that runs pulseaudio in system mode.
First run:
# Install pulseaudio
$ sudo apt update
$ sudo apt install pulseaudio
# Stop and disable pulse audio user services (as we are going to run pulseaudio in system mode)
$ sudo systemctl --global disable pulseaudio.service pulseaudio.socket
$ sudo systemctl stop pulseaudio.service pulseaudio.socket
# Enable the pulseaudio tcp protocol module for the system instance, needed by the binding. (I have used port 4712, you can change it to any other free one)
# This command adds 'load-module module-cli-protocol-tcp port=4712' at the end of the file, you can also do it manually.
$ echo 'load-module module-cli-protocol-tcp port=4712' | sudo tee -a /etc/pulse/system.pa
Then create this systemd unit at /etc/systemd/system/pulseaudio.service
:
[Unit]
Description=Pulseaudio sound server
After=avahi-daemon.service network.target
[Service]
ExecStart=/usr/bin/pulseaudio --system --disallow-exit
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
I have used nano:
$ sudo nano /etc/systemd/system/pulseaudio.service
# paste the file content and press 'Ctrl + x' and confirm typing 'y' and enter.
To enable the service run:
# Reload the unit definitions
$ sudo systemctl daemon-reload
# Make the service start after reboots
$ sudo systemctl enable pulseaudio
# Start the service
$ sudo systemctl start pulseaudio
# You can check the service status like this, should be running
$ sudo systemctl status pulseaudio
Assign the following group memberships:
# Add the pulse user to the audio group (recommended, though it did not seem to be needed in my case)
sudo usermod -aG audio pulse
# Add your user to the pulse-access group (needed to use rustpotter-cli without root after enabling system mode in pulseaudio)
sudo usermod -a -G pulse-access $USER
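Optionally, before moving to the openHAB side, you can do a quick sanity check from the Raspberry terminal that pulseaudio is running and listening on the cli protocol port (4712 in this example):
# The service should be reported as active
$ systemctl is-active pulseaudio
# The tcp port should be listening
$ ss -tln | grep 4712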
Go to your openHAB Main UI and:
- Install the pulseaudio binding from the list of official bindings.
- Manually add a new thing using the pulseaudio binding of type "pulseaudio:bridge" and configure the "Hostname" to be your Raspberry PI IP.
- After adding the thing and checking it appears online, go to add a new thing using the pulseaudio binding again, but this time run a scan; your available sinks and sources should be detected, add them (you can skip the ones ending in ".monitor" if present).
- Go to your sink thing and enable the configuration option "Create an Audio Sink with simple-protocol-tcp".
- Go to your source thing and enable the configuration option "Create an Audio Source with simple-protocol-tcp".
- A recommendation: if you have a microphone that captures too much noise, use the source thing "volume" channel to set the value to 80 or similar; a value of 100 can cause too much distortion in the captured audio, making the Speech-to-Text results worse.
The audio sink and source of your Raspberry PI should now be available in your openHAB server; you can check it at Settings/System Services/Audio, they should be displayed there using their thing labels.
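You can also check it from the openHAB console; the same commands are used later to get the sink and source ids for the dialog registration:
# List the audio sinks and sources known to openHAB
$ audio:sinks
$ audio:sources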
Echo Canceling
If you are not very familiar with pulseaudio, I recommend testing the full guide without this step and trying to apply it later.
You can enable the pulseaudio module-echo-cancel module. You can find its documentation by searching for the module name.
To do so, edit the /etc/pulse/system.pa file and add this at the end:
.ifexists module-echo-cancel.so
load-module module-echo-cancel aec_method=webrtc aec_args="analog_gain_control=0 digital_gain_control=1" source_name=echocancelsource sink_name=echocancelsink rate=16000 use_volume_sharing=true
set-default-source echocancelsource
set-default-sink echocancelsink
.endif
Then restart the pulseaudio service by running:
sudo systemctl restart pulseaudio
You should be able to detect the echocancelsource and echocancelsink things when running a scan with the pulseaudio binding using the OpenHAB Main UI.
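If you want to double check it on the Raspberry PI side, you can list the pulseaudio sources and sinks from its terminal (this assumes your user is in the pulse-access group as configured before):
# The echocancelsource and echocancelsink should appear in these lists
$ pactl list short sources
$ pactl list short sinks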
Configure the required voice services.
You need to have at least one voice service of each type configured in your server; I recommend setting them as the default ones in "Settings/System Services/Voice".
Keyword Spotter
The function of this service is to detect your wakeword in a live audio stream.
In this example I have used rustpotter.
A section about the wakeword file creation using the rustpotter-cli follows this one.
Text to speech
The dialog processor uses this service to convert the interpreter text response into audio.
In this example I have used VoiceRSS, as at the moment of writing there is no good local TTS for Spanish in openHAB that I like. It's quick to configure, and gives one hour per month of free usage.
Recently a TTS cache has been added to the voice system, so the cost of using a cloud TTS has been reduced (as well as the computational cost in case of a local one), since most responses are the same (confirmations or errors telling you that the command cannot be processed and why).
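Once the TTS service and the default sink are configured you can run a quick test from the openHAB console; the 'voice say' command should synthesize the given text with your default TTS and play it through your default sink:
# From the openHAB console
$ voice say "Hello from openHAB"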
Speech to text
The function of this service is to transcribe your voice to text so it can be processed by the configured human language interpreter implementations.
I recommend using Vosk for local recognition or Google for cloud recognition.
If you choose to use Vosk, please be careful about having enough memory.
As advised on its model page, you cannot estimate the memory consumption of a Vosk model from its size.
I recommend trying the small mobile models first, or other smaller models, without enabling the preloadModel option, to confirm they do not overload the system.
This is what you need to do to set up a Vosk model downloaded from its page:
# Assuming you are in your openHAB userdata folder and you have write permissions
# Go to your vosk folder (created by the addon on installation)
$ cd vosk
# Download the model
$ curl -OL https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip
# Unzip the folder with the model files
$ unzip vosk-model-small-es-0.42.zip
# Rename the folder to model
$ mv vosk-model-small-es-0.42 model
# List model folder content to check.
$ ls model/
am conf graph ivector README
It's recommended to use the preloadModel option for a good experience, but as suggested, better to try a run without it if you are not certain about having enough memory. On Debian you can check your system memory from the terminal with the free -h command.
Human language interpreter
The function of this service is to interpret text commands and generate a response.
In this example I used the standard language interpreter (id: system), which is available in openHAB by default and provides a set of expressions for English, French, German, Italian and Spanish for some item types.
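You can also test the interpreter on its own from the openHAB console by sending it a text command; the item label used below is just an example, use one matching your own items:
# Interpret a text command with the default human language interpreter
$ voice interpret turn on the kitchen light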
Generating a Rustpotter wakeword.
Rustpotter works better using a wakeword generated from records taken on the same device; for that you can use its cli.
Connect to your Raspberry PI terminal and run the following commands to download the rustpotter-cli:
# Download the latest rustpotter-cli release for your arch https://github.com/GiviMAD/rustpotter-cli/releases.
# This command prints your arch, just in case you don't remember it.
$ uname -m
# Here I used the armv7l binary
$ curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v2.0.6/rustpotter-cli_debian_armv7l
# Make executable
$ chmod +x rustpotter-cli_debian_armv7l
# Check simple execution
$ ./rustpotter-cli_debian_armv7l --version
# Install
$ sudo mv ./rustpotter-cli_debian_armv7l /usr/local/bin/rustpotter-cli
# Print help
$ rustpotter-cli -h
You can use something like the following to record multiple samples in a row and build the wakeword file with them.
I recorded mine at about 2 meters from the device.
# Wakeword name and filename
WAKEWORD="ok casa"
WAKEWORD_FILENAME="${WAKEWORD// /_}"
# record 10 audios in a row, press 'ctrl + c' to end each, it will start recording the next one after 1 second
for i in {0..9}; do (rustpotter-cli record $WAKEWORD_FILENAME$i.wav && sleep 1); done
# play 10 audios in a row for verification
for i in {0..9}; do (aplay $WAKEWORD_FILENAME$i.wav); done
# apply gain normalization and a band-pass filter to the records
for i in {0..9}; do (rustpotter-cli filter -g --gain-ref 0.002 -b --low-cutoff 500 --high-cutoff 1500 $WAKEWORD_FILENAME$i.wav); done
# build the wakeword file from the filtered records
rustpotter-cli build-model --model-name "$WAKEWORD" --model-path $WAKEWORD_FILENAME.rpw $WAKEWORD_FILENAME*-gain0.002-*.wav
# run spot to test the generated wakeword
rustpotter-cli spot $WAKEWORD_FILENAME.rpw
Configuring Rustpotter
There is no easy way to find the correct configuration parameters.
I recommend spending some time playing with the different configuration parameters.
These are the ones I ended up using:
rustpotter-cli spot -a 0.3 -t 0.52 -s p95 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw
Note that the max CPU usage of rustpotter will increase with the number of records used to create the wakeword file; I tend to use around 6.
The relevant options of the cli are:
- "-d" enables the debug mode of the cli so you can see the partial detections.
- "-g" enables gain normalization. To debug the gain normalization you can use "--debug-gain".
- "--gain-ref" changes the gain normalization reference (the default value is printed at the beginning when debugging the gain normalization, it depends on the wakeword).
- "-a" configures the "averaged threshold", recommended as it reduces the CPU usage.
- "-m 6" requires at least 6 frames of positive scoring (compared against the detection "counter" field).
- "-t" changes the threshold.
- "-s" sets the comparison strategy used.
- "-b --low-cutoff 150 --high-cutoff 1200" is the band-pass configuration; there is no easy way to figure out the correct configuration for this, but it helps to attenuate background noises.
The comparison strategy p90 with 8 records means that the score will be an interpolation of the two scores around the 90th percentile position (90% of 8 = 7.2). It's useful to reduce the number of false positives.
You can also use the test-model command to spot against a file; it allows you to use a record to test your options.
rustpotter-cli test-model -a 0.3 -t 0.52 -s p90 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw ok_casa1.wav
Move the generated rpw file to your openHAB server, into the 'rustpotter' folder inside your openHAB userdata folder.
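For example, you can copy it over ssh from the Raspberry PI; the user, host and path below are placeholders, adjust them to your installation (for a Linux package install the userdata folder is usually /var/lib/openhab):
# Copy the wakeword file into the rustpotter folder of the openHAB server
$ scp ok_casa.rpw openhab@your-openhab-server:/var/lib/openhab/rustpotter/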
Configure the Rustpotter options in the openHAB MainUI to match the ones you used on the cli (Settings/Other Services/Rustpotter).
Performance notes.
Rustpotter works well when it's correctly configured and there is no noise, and a semi-good performance in the presence of low noise can be achieved with the correct configuration of the audio filters. Enabling the echo-cancel module on pulseaudio also seems to improve the results.
It's still far from things like the Amazon/Google devices.
I'm its developer (the only one for now); its first version was released in January last year, the second one this January, and I'll try to release another iteration next January if I get any improvement.
If you like the project please consider giving a star to the library on GitHub.
It's by far my most starred project :D.
If you want to comment here on what configuration works better for you, or what major problems you encounter with it, it would be great to have some feedback.
Start dialog processing.
OpenHAB includes a dialog processor implementation.
It connects an audio sink and source to the voice services (Keyword Spotting, Speech-to-Text, Human Language Interpreters and Text-to-Speech).
It allows basic voice command execution (wakeword detection → voice transcription → command interpretation → response synthesis).
In openHAB 4.0.0 the "register dialog" functionality is available; it allows you to describe the desired dialog services and configuration and leaves the openHAB core in charge of keeping that dialog running whenever possible (whenever the configured services are available).
OpenHAB persists the dialog registrations so the dialogs are restored after a system restart.
I'll try to make a PR for the pulseaudio binding to take advantage of this and integrate the dialog registration into the bridge thing configuration, but for now you need to use the console to create the dialog registration.
You need to access the openHAB console (openhab-cli console), and then you can:
# List available audio sinks to get your sink id
$ audio:sinks
# List available audio sources to get your source id
$ audio:sources
# Register a dialog; adjust the options in the command.
# Leave options unset to use the defaults you have configured.
# Note that you can fallback to other interpreters using a comma separated list of ids.
$ voice registerdialog --source pulseaudio:source:f0516c6bd7:echocancelsource --sink pulseaudio:sink:f0516c6bd7:echocancelsink --hlis actiontemplatehli,system --keyword "ok casa"
# List dialog registrations
$ voice:dialogregs
# List started dialogs, note this command displays the actual services in use
$ voice:dialogs
You can verify using the 'voice:dialogs' command that the dialog disappears from the list if you disable your pulseaudio bridge thing (the sink and source will become unavailable) and appears again after some seconds when you re-enable the thing.
If your dialog does not appear, you can enable the debug logs of the voice bundle to check what the problem is:
$ log:set DEBUG org.openhab.core.voice
Span section: Some voice services under development.
Whisper
Thanks to the existence of the whisper.cpp project you can use the Whisper speech recognition capabilities in your openHAB server.
I have an open PR for this add-on; if you can give it a try before it's merged it would be great, in case some bug exists. I'm already using it on my server and it seems to work ok.
The current add-on version runs only on CPU. I found the performance to be great on recent x86_64 CPUs. On arm the performance is not so great, as f16 operations are only supported by modern CPUs, which is not the case for the Raspberry PI 4.
You can find the addon PR here, and the readme here
Addon download:
- Github jar download (mirror)
After installation you should place a model in the <OPENHAB_USERDATA>/whisperstt
folder.
cd whisperstt
curl -OL https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
Now you should configure the modelName option to tiny (Main UI - Settings/Other Services/Whisper), and it should be ready to use.
It’s recommended to use the preloadModel option for faster execution if you have enough ram.
HABSpeaker
HABSpeaker can be installed through the marketplace; it consists of a web interface that allows you to use the voice system. It's unfinished, but the current version (beta 24) seems to work correctly, and I hope I can complete this project within this year.
Due to browser security measures it needs to be accessed over https, and it's currently not usable over the openHAB Cloud Connector because it does not support websocket connections (I hope I can add that in the future, I need to ask the project maintainers if it's ok). So the quickest way to give it a try right now is to temporarily disable the browser security for your server url.
This can be done on Chrome following these steps:
- Open Chrome and enter chrome://flags/ as the url, it will display a list of options.
- Filter the list of options by 'unsafely-treat-insecure-origin-as-secure'.
- Enter your openHAB url in the option text field, like "http://192.168.1.99:8080".
- Toggle the option to enable it, the browser will ask you to relaunch.
After that, if you have the addon installed and the default voice services configured, you should already be able to use it over http at the url http://<your openhab ip>:8080/habspeaker/. It needs an initial click to set up the audio and start the connection; keyword spotting is disabled by default, so after a successful connection the dialog can be triggered by another click on the button (the circle in the center).
If you want to report any major problem, please use the marketplace thread.
ActionTemplateInterpreter+
This interpreter is also available through the marketplace. Its intention is to allow you to define your own custom voice commands. As the dialog processor accepts multiple interpreters to be chained, which run one after another until one gets a successful response, what I do is prepend this interpreter to the standard one, so I can have a couple of custom phrases added to the set that the standard interpreter gives me.
Probably creating this project wasn't the best idea; the standard interpreter (and the underlying rule interpreter exposed by the core) is not too different from this one, and is superior in several points (apart from the code quality and not having an external dependency on the openNLP library like this one), so I think this interpreter will probably disappear in the future when some other method for adding custom phrases based on those resources appears.
I just wanted to let you know it exists in case you want to experiment adding custom voice functionalities.
I'm using it to have a couple of commands to manage my TVs. Basically I have created a couple of virtual items that trigger rules, and from there I implement whatever I want using the other items that I have linked to the TVs. Then I have created a couple of voice commands in the interpreter that write to those items.
Things I have tried:
- Changing my tv channel by name.
- Playing media by name, using the dynamic label functionality to capture only the title from the voice command.
- Moving the media state from one tv to another.
If you want to report a major problem with it, or share what custom commands you have added, you are welcome to do it on the marketplace thread.
Comment
Hope this guide is interesting to some of you, let me know if you think some point should be improved.
I still find the voice control functionality not comparable to the commercial alternatives, but I think it has improved in several points over the last year, and with the new tools and all the information appearing in the AI field I think it will continue to improve.
Best regards!