Hi openHAB community, in this post I’ll leave a summary of how you can set up a speaker for dialog processing using a Raspberry PI.
Here you can find a small illustrative video of the voice control capabilities in Spanish. As speaker I’m using a Jabra Speaker2 40 connected to a Raspberry PI Zero 2 W, and as openHAB server an Orange PI 5 (I expect the performance to be similar using a Raspberry PI 5) with the add-ons for pulseaudio, rustpotter, voicerss, and whisper.cpp with a small model. Note that I’m using openHAB 4.1.0 milestone 2 to get the latest rustpotter version and the code from this core PR, which allows using the item command description in the voice commands.
Sections:
- OpenHAB Dialog Processing with the PulseAudioBinding.
  - What you need.
  - Flashing the raspberry pi
  - Connecting the Raspberry PI with PulseAudio.
  - Configure the required voice services.
    - Keyword Spotter
    - Text to speech
    - Speech to text
    - Human language interpreter
  - Generating a Rustpotter wakeword (notes about v3 at the end)
    - Configuring Rustpotter
    - Performance notes.
  - Start dialog processing.
- Spam section; Some voice services under development.
  - Custom rules using the standard interpreter
  - Whisper
  - HABSpeaker
  - Rustpotter v3
  - Piper
- Extra: Script that I use to control Jellyfin Android TV devices
- Comment
OpenHAB Dialog Processing with the PulseAudioBinding.
What you need.
- A Raspberry PI with a speaker/microphone attached, running Raspberry PI OS (Raspbian) and connected to your network (use a static IP).
- A server running openHAB v4.0.0 (on the same network).
Flashing the raspberry pi
This is probably already known to most people, but just in case I will describe what I did.
I have flashed the Raspberry PI OS Lite (32-bit) image to an SD card using the Raspberry PI Imager.
Before writing the card I used the options menu to define my username and password, enable SSH access, and configure my Wi-Fi credentials.
After flashing the card I inserted it again and edited the cmdline.txt file to include ip=192.168.1.206 at the end of its only line, to assign a static IP and avoid having to search for the device on the network.
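For reference, this is roughly how that edit looks from a Linux machine with the SD card’s boot partition mounted; the mount point and the address are examples, adjust them to your setup:
# append the static ip to the single line of cmdline.txt (boot partition mounted at /media/$USER/bootfs here)
$ sudo sed -i 's/$/ ip=192.168.1.206/' /media/$USER/bootfs/cmdline.txt
# check the result
$ cat /media/$USER/bootfs/cmdline.txt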
I then inserted the SD card into the board and connected it to the power, and after a couple of minutes it was available on the network (my SD card is old, which probably made it take longer) and I could access it over SSH.
Connecting the Raspberry PI with PulseAudio.
Connect to your Raspberry PI terminal and follow these steps to create a systemd unit that runs pulseaudio in system mode.
First run:
# Install pulseaudio
$ sudo apt update
$ sudo apt install pulseaudio
# Stop and disable pulse audio user services (as we are going to run pulseaudio in system mode)
$ sudo systemctl --global disable pulseaudio.service pulseaudio.socket
$ sudo systemctl stop pulseaudio.service pulseaudio.socket
# Enable the pulseaudio tcp protocol module for the system instance, needed by the binding. (I have used port 4712, you can change it to any other free one)
# This command adds 'load-module module-cli-protocol-tcp port=4712' at the end of the file, you can also do it manually.
$ echo 'load-module module-cli-protocol-tcp port=4712' | sudo tee -a /etc/pulse/system.pa
Then create this systemd unit at /etc/systemd/system/pulseaudio.service:
[Unit]
Description=Pulseaudio sound server
After=avahi-daemon.service network.target
[Service]
ExecStart=/usr/bin/pulseaudio --system --disallow-exit
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
I have used nano:
$ sudo nano /etc/systemd/system/pulseaudio.service
# paste the file content, press 'Ctrl + x', then confirm by typing 'y' and pressing enter.
To enable the service run:
# Reload the unit definitions
$ sudo systemctl daemon-reload
# Make the service start after reboots
$ sudo systemctl enable pulseaudio
# Start the service
$ sudo systemctl start pulseaudio
# You can check the service status like this, should be running
$ sudo systemctl status pulseaudio
Then assign these group memberships:
# Add the audio group to the pulse user (recommended, seems not needed in my case)
sudo usermod -aG audio pulse
# Add your user to the pulse-audio allowed group (needed to use rustpotter-cli without root after enabling system mode in pulseaudio)
sudo usermod -a -G pulse-access $USER
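Before moving to openHAB, you can optionally check from your server (or any other machine on the network) that the TCP module is answering; this assumes netcat is installed, but any TCP client works:
# should report the connection to port 4712 as succeeded
$ nc -vz 192.168.1.206 4712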
Go to your openHAB Main UI and:
- Install pulseaudio from the list of official bindings.
- Manually add a new thing using the pulseaudio binding of type “pulseaudio:bridge” and configure the “Hostname” to be your Raspberry PI IP.
- After adding the thing and checking that it appears online, go to add a new thing using the pulseaudio binding again, but this time run a scan; your available sinks and sources should be detected, add them (you can skip the ones ending in “.monitor”, if present).
- Go to your sink thing and enable the configuration option “Create an Audio Sink with simple-protocol-tcp”.
- Go to your source thing and enable the configuration option “Create an Audio Source with simple-protocol-tcp”.
- A recommendation: if you have a microphone that captures too much noise, use the source thing “volume” channel to set the value to 80 or similar; a value of 100 can cause too much distortion on the captured audio, making the Speech-to-Text results worse.
The audio sink and source of your Raspberry PI should now be available in your openHAB server; you can check it at Settings/System Services/Audio, where they should be displayed using their thing labels.
Configure the required voice services.
You need at least one voice service of each type configured in your server, and I recommend setting them as the default ones in “Settings/System Services/Voice”.
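If you prefer file-based configuration, the same defaults can be set in conf/services/runtime.cfg; if I recall correctly the parameter names below match the commented examples shipped in that file, but the service ids are placeholders, check the exact ids of the add-ons you installed (for example in the Voice settings page):
# default voice services (ids are examples, adjust to your installed add-ons)
org.openhab.voice:defaultKS=<keyword-spotter-id>
org.openhab.voice:defaultSTT=<stt-id>
org.openhab.voice:defaultTTS=<tts-id>
org.openhab.voice:defaultHLI=system
org.openhab.voice:keyword=ok casa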
Keyword Spotter
The function of this service is to detect your wakeword in a live audio stream.
In this example I have used rustpotter.
A section about the wakeword file creation using the rustpotter-cli follows this one.
Text to speech
The dialog processor uses this service to convert the interpreter’s text response into audio.
In this example I have used VoiceRSS, as at the moment of writing there is no good local TTS for Spanish in openHAB that I like. It’s quick to configure, and gives one hour per month of free usage.
Recently a TTS cache has been added to the voice system, so the cost of using a cloud TTS has been reduced (as well as the computational cost in case of a local one), since most responses are the same (confirmations, or errors telling you that the command can not be processed and why).
Speech to text
The function of this service is to transcribe your voice to text so it can be processed by the different human language interpreter implementations configured.
I recommend using Vosk for local recognition or Google for cloud recognition.
If you choose to use Vosk please be careful about having enough memory.
As advised on its models page, you can not estimate the memory consumption of a Vosk model from its size.
I recommend trying the small mobile models first, or other smaller models, without enabling the preloadModel option, to confirm they do not exhaust your system’s memory.
This is what you need to do to set up a Vosk model downloaded from its page:
# Assuming you are in your openHAB userdata folder and you have write permissions
# Go to your vosk folder (created by the addon on installation)
$ cd vosk
# Download the model
$ curl -OL https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip
# Unzip the folder with the model files
$ unzip vosk-model-small-es-0.42.zip
# Rename the folder to model
$ mv vosk-model-small-es-0.42 model
# List model folder content to check.
$ ls model/
am conf graph ivector README
It’s recommended to use the preloadModel option for a good experience, but as suggested, better try a run without it if you are not certain about having enough memory. On debian you can check your system memory from the terminal with the free -h command.
Human language interpreter
The function of this service is to interpret text commands and generate a response.
In this example I used the standard language interpreter (id: system), which is available in openHAB by default and provides a set of expressions for English, French, German, Italian and Spanish for some item types.
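If you want to check which expressions work for your items before involving any audio, the console can (if I recall correctly) interpret plain text directly; the command below is based on that assumption, and the phrase is just an example for a hypothetical item:
# interpret a text command with the default human language interpreter
$ voice interpret turn on the kitchen light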
Generating a Rustpotter wakeword reference. (Notes about v3 at the end).
Rustpotter works better using a wakeword file generated with records taken on the same device; for that you can use its CLI.
Connect to your Raspberry PI terminal and run the following commands to download the rustpotter-cli.
# Download the latest rustpotter-cli release for your arch https://github.com/GiviMAD/rustpotter-cli/releases.
# This command prints your arch, just in case you don't remember.
$ uname -m
# Here I used the armv7l binary
$ curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v2.0.6/rustpotter-cli_debian_armv7l
# Make executable
$ chmod +x rustpotter-cli_debian_armv7l
# Check simple execution
$ ./rustpotter-cli_debian_armv7l --version
# Install
$ sudo mv ./rustpotter-cli_debian_armv7l /usr/local/bin/rustpotter-cli
# Print help
$ rustpotter-cli -h
You can use something like the following to record multiple samples in a row and build the wakeword file with them.
I recorded them at about 2 meters from the speaker.
# Wakeword name and filename
WAKEWORD="ok casa"
WAKEWORD_FILENAME="${WAKEWORD// /_}"
# record 10 audios in a row, press 'ctrl + c' to end each, it will start recording the next one after 1 second
for i in {0..9}; do (rustpotter-cli record $WAKEWORD_FILENAME$i.wav && sleep 1); done
# play 10 audios in a row for verification
for i in {0..9}; do (aplay $WAKEWORD_FILENAME$i.wav); done
# apply gain normalization and a band-pass filter to the records (writes new filtered wav files)
for i in {0..9}; do (rustpotter-cli filter -g --gain-ref 0.002 -b --low-cutoff 500 --high-cutoff 1500 $WAKEWORD_FILENAME$i.wav); done
# build the wakeword file from the filtered records
rustpotter-cli build-model --model-name "$WAKEWORD" --model-path $WAKEWORD_FILENAME.rpw $WAKEWORD_FILENAME*-gain0.002-*.wav
# run spot to test the generated wakeword file
rustpotter-cli spot $WAKEWORD_FILENAME.rpw
Configuring Rustpotter
There is no easy way to find the correct configuration parameters, so I recommend spending some time playing with them.
This is what I ended up using:
rustpotter-cli spot -a 0.3 -t 0.52 -s p95 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw
Note that the maximum CPU usage of rustpotter increases with the number of records used to create the wakeword file; I tend to use around 6.
The relevant options of the CLI are:
- You can enable the debug mode of the CLI by passing the “-d” parameter so you can see the partial detections.
- “-g” enables gain normalization. To debug the gain normalization you can use “--debug-gain”.
- “--gain-ref” changes the gain normalization reference (the default value is printed at the beginning when debugging the gain normalization, it depends on the wakeword).
- “-a” configures the “averaged threshold”, recommended as it reduces the CPU usage.
- “-m 6” requires at least 6 frames of positive scoring (compared against the detection “counter” field).
- “-t” changes the threshold.
- “-s” sets the comparison strategy used.
- “-b --low-cutoff 150 --high-cutoff 1200” sets the band-pass configuration; there is no easy way to figure out the correct configuration for this, but it helps to attenuate background noises.
The comparison strategy p90 with 8 records means that the score will be an interpolation of the 2 highest scores nearest to the lower one (90% of 8 = 7.2). It’s useful to reduce the number of false positives.
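While tuning, you can combine the debug flags described above with your current options to watch the partial detections and the gain levels (reusing some of the values from the example above):
# run spot in debug mode, printing partial detections and gain normalization info
$ rustpotter-cli spot -d -g --debug-gain -t 0.52 -s p90 -m 8 ok_casa.rpw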
You can also use the test-model command to spot against a file, which allows you to use a record to test your options.
rustpotter-cli test-model -a 0.3 -t 0.52 -s p90 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw ok_casa1.wav
Move the generated rpw file to your openHAB server, into the ‘<OPENHAB_USERDATA>/rustpotter/’ folder.
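For example, from the Raspberry PI you can copy it over SSH; the user, host and userdata path below are just examples, adjust them to your installation:
# copy the wakeword file to the rustpotter folder on the openHAB server
$ scp ok_casa.rpw openhabian@192.168.1.99:/var/lib/openhab/rustpotter/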
Configure the Rustpotter options in the openHAB MainUI to match the ones you used on the cli (Settings/Other Services/Rustpotter).
Performance notes.
Rustpotter works well when it’s correctly configured and there is no noise, and a semi-good performance in the presence of low noise can be achieved with the correct configuration of the audio filters. Enabling the echo-cancel module on pulseaudio also seems to improve the results.
It’s still far from other things like the Amazon/Google devices.
I’m its developer (the only one for now); its first version was released in January last year, the second one this January, and I’ll try to release another iteration next year if I get any improvement.
If you like the project please consider giving a star to the library on GitHub.
It’s by far my most starred project :D.
If you comment here what configuration works better for you or what major problems you encounter with it, it would be great to have some feedback.
Start dialog processing.
OpenHAB includes a dialog processor implementation.
It connects an audio sink and source to the voice services (Keyword Spotting, Speech-to-Text, Human Language Interpreters and Text-to-Speech).
It allows basic voice command execution (wakeword detection → voice transcription → command interpretation → response synthesis).
In the openHAB 4.0.0 version the “register dialog” functionality is available; it allows you to describe the desired dialog services and configuration and leaves the openHAB core in charge of keeping that dialog running whenever possible (i.e. whenever the configured services are available).
OpenHAB persists the dialog registrations so the dialogs are restored after a system restart.
I’ll try to make a PR for the pulseaudio binding to take advantage of this and integrate the dialog registration into the bridge thing configuration, but for now you need to use the console to create the dialog registration.
You need to access the openHAB console (for example via ‘openhab-cli console’), and then you can:
# List available audio sinks to get your sink id
$ audio:sinks
# List available audio sources to get your source id
$ audio:sources
# Register dialog; adjust the options in the command.
# Leave options unset to use the defaults you have configured.
# Note that you can fall back to other interpreters using a comma separated list of ids.
$ voice registerdialog --source <SOURCE_ID> --sink <SINK_ID> --keyword "ok casa"
# List dialog registrations
$ voice:dialogregs
# List started dialogs, note this command displays the actual services in use
$ voice:dialogs
You can verify, using the ‘voice:dialogs’ command, that the dialog disappears from the list if you disable your pulseaudio bridge thing (the sink and source will become unavailable) and appears again after some seconds when you re-enable the thing.
If your dialog does not appear you can enable the debug logs of the Voice bundle to check what the problem is.
$ log:set DEBUG org.openhab.core.voice
Spam section; Some voice services under development.
Custom rules using the standard interpreter (for the 4.1.0 version)
You can find in this PR’s first comment a write-up of the standard interpreter features on v4.1.0; I’ve tried to summarize all that I know there. It’s available on the 4.1.0 snapshots since December 6th.
Whisper (PR for v4.1.0)
Thanks to the existence of the whisper.cpp project you can use the whisper speech recognition capabilities in your openHAB server.
The limited grammar feature recently added to whisper.cpp is integrated in the add-on and allows getting pretty good results using the tiny/base models, which are much faster than the other models.
I have an open PR for this add-on; if you can give it a try before it’s merged, in case some bug exists, it would be great. I’m already using it on my installation and it seems to work great.
The library built into the add-on runs only on CPU; you can build your own library if you need it. I found the performance to be great on recent x86_64 CPUs. On ARM the performance is not so great, as f16 operations are only supported by modern CPUs, which is not the case for the Raspberry PI 4.
You can find the readme here.
Add-on download: please download it from the PR or build it yourself.
After installation you should place a model in the <OPENHAB_USERDATA>/whisperstt folder.
$ cd whisperstt
# download the tiny model (note the /resolve/ path, which serves the actual file instead of the HTML page)
$ curl -OL https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
Now you should configure the modelName option to tiny (Main UI - Settings/Other Services/Whisper), and it should be ready to use.
It’s recommended to use the preloadModel option for faster execution if you have enough RAM.
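In case you prefer file-based configuration over the Main UI, something like the following lines (for example in conf/services/runtime.cfg) should work; note that the pid is my assumption based on the add-on id, so check the add-on documentation if it differs:
# assumed pid for the whisper add-on configuration
org.openhab.voice.whisperstt:modelName=tiny
org.openhab.voice.whisperstt:preloadModel=true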
HABSpeaker (v4.1.0)
HABSpeaker can be installed through the marketplace; it consists of a web interface that allows you to use the voice system. It’s unfinished, but the current version (beta 24) seems to work correctly, and I hope I can complete this project within this year.
Due to browser security measures, it requires to be accessed over HTTPS, and it’s currently not usable through the openHAB Cloud Connector because it does not support WebSocket connections (I hope I can add that in the future, I need to ask the project maintainers if it’s ok). So the quickest way to give it a try right now is to temporarily disable the browser security for your server URL.
This can be done on Chrome following these steps:
- Open Chrome and enter chrome://flags/ as the URL, it will display a list of options.
- Filter the list of options by ‘unsafely-treat-insecure-origin-as-secure’.
- Enter your openHAB URL in the option text field, like “http://192.168.1.99:8080”.
- Toggle the option to enable it, the browser will ask you to relaunch.
After that, if you have the add-on installed and the default voice services configured, you should already be able to use it over HTTP at the URL http://<your openhab ip>:8080/habspeaker/. It needs an initial click to set up the audio and start the connection; keyword spotting is disabled by default, so after a successful connection the dialog can be triggered by another click on the button (the circle in the center).
If you want to report any problem, please use the marketplace thread.
Rustpotter v3 (v4.1.0)
I’ve created another version of the library sooner than expected, because I found a great machine learning library that allows me to easily implement something I had read about in many articles: training “basic” linear layer networks (which are commonly used for image classification) with the audio MFCCs, which is what I was already using in rustpotter. The results seem very promising, with the drawback that you need to take a lot more records to create compelling training and testing sets.
So in v3 the major change is that you can generate two kinds of wakeword files: the “wakeword references”, which offer the previous detection-by-comparison functionality, and the “wakeword models”, which offer detection by classification.
Both kinds of wakewords can be generated using the rustpotter CLI, and a new “record on partial spot” option has been added that helps with the dataset creation and with detecting library/audio problems.
Files generated with v2 work in v3, but not the other way around.
There is already an open PR to make this version available to openHAB here. Rustpotter v3 is available since openHAB 4.1.0 milestone 2.
The task of creating/sharing trained models that work well across different devices with different persons seems a little overwhelming for me at this moment, so I have it on my TODO list. I’m not sure I can distribute something that works as well as a model trained by the final user/users on the target mic/mics, but I will try to distribute something that can be used as a base.
If you make some tests training your own wakeword, you can ping me with the results here or by PM. I recommend creating a wakeword reference file first and using it with the record option to collect the training/testing set, as it emits records of the same duration.
Piper (PR for v4.1.0)
There is a pending review PR for integrating Piper as a Text-to-Speech service.
In my experience Piper medium models for Spanish work great, pretty fast, using around 100MB of RAM.
Extra: Script that I use to control Jellyfin Android TV devices
I’m sharing the JavaScript rule that I’m using to play media from Jellyfin on my Android TV devices. Do not take it as something ready or easy to use; I just want to share it in case someone can take advantage of it to do something cooler. I used the androiddebugbridge, androidtv and jellyfin bindings there.
It may have something wrong, it’s not something I have worked on too much. The major problem I have is that sometimes it takes too much time to start playing when the device is offline, or it ends up not playing the title; I don’t know if it’s related to the adb connection or to the Jellyfin client itself. But most times it works fine for me.
Note that it doesn’t fetch the content data from the Jellyfin server, as I have some files with my media info on the file system.
example_android_jellifin.txt (8.1 KB)
As the androiddebugbridge binding supports sending intents, it should be possible to do something similar integrating personal and commercial media services. It’s something I couldn’t find time to look at.
Comment
Hope this guide is interesting to some of you, let me know if you think some point should be improved.
Best regards!
Change log:
- Removed the noise cancellation part as it was giving me bad results with the STT services, and added notes about rustpotter v3.
- 6 Dec 2023: Added Piper info, linked to the standard interpreter summary, removed the actiontemplatehli info, added the example media control script, and minor changes.