Dialog Processing with the PulseAudioBinding

Hi openHAB community, in this post I'll leave a summary of how you can set up a speaker for dialog processing using a Raspberry Pi.

The openHAB core contains a dialog processor implementation that can interact with the different audio and voice services. I will describe here a combination that I have already tested.

Sections:

  • OpenHAB Dialog Processing with the PulseAudioBinding.

    • What you need.

    • Flashing the raspberry pi

    • Connecting the Raspberry PI with PulseAudio.

      • Echo Canceling

    • Configure the required voice services.

      • Keyword Spotter

      • Text to speech

      • Speech to text

      • Human language interpreter

    • Generating a Rustpotter wakeword

      • Configuring Rustpotter

      • Performance notes.

    • Start dialog processing.

  • Spam section: Some voice services under development.

    • Whisper

    • HABSpeaker

    • ActionTemplateInterpreter+

  • Comment

OpenHAB Dialog Processing with the PulseAudioBinding.

What you need.

  • A Raspberry Pi with a speaker/microphone attached, running Raspbian, connected to your network (use a static IP).

  • A server running OpenHAB v4.0.0 (on the same network).

Flashing the raspberry pi

This is probably already known to most people, but just in case I will describe what I did.

I flashed the Raspberry Pi OS Lite (32-bit) image to an SD card using the Raspberry Pi Imager.

Before writing the card I used the options menu to define my username and password, enable SSH access, and configure my Wi-Fi credentials.

After flashing the card I inserted it again and edited the cmdline.txt file to include ip=192.168.1.206 at the end of its only line, to assign a static IP and avoid having to search for the device.
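For reference, the edit looks roughly like this (a sketch, assuming the boot partition of the SD card is mounted at /media/$USER/bootfs; the mount point depends on your OS, and you can just as well open the file in any text editor, as long as everything stays on a single line):


# Append the ip= parameter to the end of the single line in cmdline.txt

$ sudo sed -i '1s/$/ ip=192.168.1.206/' /media/$USER/bootfs/cmdline.txt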

I inserted the SD card into the board and connected it to power, and after a couple of minutes it was available on the network (my SD card is old, which probably explains why it took a while) and I could access it over SSH.

Connecting the Raspberry PI with PulseAudio.

Connect to your Raspberry Pi terminal and follow these steps to create a systemd unit that runs PulseAudio in system mode.

First run:


# Install pulseaudio

$ sudo apt update

$ sudo apt install pulseaudio

# Stop and disable pulse audio user services (as we are going to run pulseaudio in system mode)

$ sudo systemctl --global disable pulseaudio.service pulseaudio.socket

$ sudo systemctl stop pulseaudio.service pulseaudio.socket

# Enable the pulseaudio cli protocol tcp module for the system instance, needed by the binding. (I have used port 4712, you can change it to any other free port)

# This command appends 'load-module module-cli-protocol-tcp port=4712' to the end of the file, you can also do it manually.

$ echo 'load-module module-cli-protocol-tcp port=4712' | sudo tee -a /etc/pulse/system.pa
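# Optional: confirm the line was appended at the end of the file

$ tail -n 1 /etc/pulse/system.pa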

Then create this systemd unit at /etc/systemd/system/pulseaudio.service:


[Unit]

Description=Pulseaudio sound server

After=avahi-daemon.service network.target

[Service]

ExecStart=/usr/bin/pulseaudio --system --disallow-exit

ExecReload=/bin/kill -HUP $MAINPID

[Install]

WantedBy=multi-user.target

I have used nano:


$ sudo nano /etc/systemd/system/pulseaudio.service

# paste the file content, press 'Ctrl + x', then confirm by typing 'y' and pressing enter.

To enable the service run:


# Reload the unit definitions

$ sudo systemctl daemon-reload

# Make the service start after reboots

$ sudo systemctl enable pulseaudio

# Start the service

$ sudo systemctl start pulseaudio

# You can check the service status like this, it should be running

$ sudo systemctl status pulseaudio
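# Optionally, confirm the cli protocol module is listening on the configured TCP port (ss ships with Raspberry Pi OS; netstat -tln works too)

$ ss -tln | grep 4712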

Assign these groups to the following users.


# Add the audio group to the pulse user (recommended, seems not needed in my case)

sudo usermod -aG audio pulse

# Add your user to the pulse-audio allowed group (needed to use rustpotter-cli without root after enabling system mode in pulseaudio)

sudo usermod -a -G pulse-access $USER
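# Verify the group assignments (you may need to log out and back in for them to apply to your session)

$ groups pulse

$ groups $USER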

Go to your openHAB Main UI and:

  • Install pulseaudio from the list of official bindings.

  • Manually add a new thing of type "pulseaudio:bridge" using the pulseaudio binding, and configure the "Hostname" to be your Raspberry Pi IP.

  • After adding the thing and checking that it appears online, go to add a new thing using the pulseaudio binding again, but this time run a scan; your available sinks and sources should be detected, so add them (you can skip the ones ending in "monitor" if present).

  • Go to your sink thing and enable the configuration option “Create an Audio Sink with simple-protocol-tcp”.

  • Go to your source thing and enable the configuration option “Create an Audio Source with simple-protocol-tcp”.

  • A recommendation: if you have a microphone that captures too much noise, use the source thing "volume" channel to set the value to 80 or similar; a value of 100 can cause too much distortion in the captured audio, making the Speech-to-Text results worse.

The audio sink and source of your Raspberry Pi should now be available in your openHAB server; you can check it at Settings/System Services/Audio, where they should be displayed using their thing labels.
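You can also double check them from the openHAB console (the same audio:sinks and audio:sources commands are used again later in this guide to get the sink and source ids):


$ audio:sinks

$ audio:sources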

Echo Canceling

If you are not very familiar with PulseAudio, I recommend testing the full guide without this step and trying to apply it later.

You can enable the PulseAudio module-echo-cancel module. You can find its documentation in the PulseAudio modules documentation by searching for its name.

To do so edit the /etc/pulse/system.pa file and add this at the end:


.ifexists module-echo-cancel.so

load-module module-echo-cancel aec_method=webrtc aec_args="analog_gain_control=0 digital_gain_control=1" source_name=echocancelsource sink_name=echocancelsink rate=16000 use_volume_sharing=true

set-default-source echocancelsource

set-default-sink echocancelsink

.endif

Then restart the pulseaudio service by running:


sudo systemctl restart pulseaudio

You should be able to detect the echocancelsource and echocancelsink things when running a scan with the pulseaudio binding in the openHAB Main UI.
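You can also check directly on the Raspberry Pi that the module created the virtual devices (assuming pactl is available, which usually comes with the pulseaudio packages, and that your user is in the pulse-access group as set up earlier):


$ pactl list short sinks | grep echocancel

$ pactl list short sources | grep echocancel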

Configure the required voice services.

You need to configure at least one voice service of each type in your server; I then recommend setting them as the defaults in "Settings/System Services/Voice".

Keyword Spotter

The function of this service is to detect your wakeword in a live audio stream.

In this example I have used rustpotter.

A section about creating the wakeword file using the rustpotter-cli follows below.

Text to speech

The dialog processor uses this service to convert the interpreter's text response into audio.

In this example I have used VoiceRSS, as at the moment of writing there is no good local TTS for Spanish in openHAB that I like. It's quick to configure and gives one hour per month of free usage.

Recently a TTS cache has been added to the voice system, so the cost of using a cloud TTS has been reduced (as well as the computational cost in the case of a local one), since most responses are the same (confirmations or errors telling you that the command could not be processed and why).

Speech to text

The function of this service is to transcribe your voice to text so it can be processed by the different human language interpreter implementations configured.

I recommend you to use Vosk for local recognition or Google for cloud recognition.

If you choose to use Vosk, please be careful about having enough memory.

As advised on its model page, you cannot estimate the memory consumption of a Vosk model from its size.

I recommend trying the small mobile models first, or other smaller models, without enabling the preloadModel option, to confirm they do not overload the system.

This is what you need to do to set up a Vosk model downloaded from its page:


# Assuming you are in your openHAB userdata folder and you have write permissions

# Go to your vosk folder (created by the addon on installation)

$ cd vosk

# Download the model

$ curl -OL https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip

# Unzip the folder with the model files

$ unzip vosk-model-small-es-0.42.zip

# Rename the folder to model

$ mv vosk-model-small-es-0.42 model

# List model folder content to check.

$ ls model/

am conf graph ivector README

It's recommended to use the preloadModel option for a good experience, but as suggested, better to try a run without it if you are not certain about having enough memory. On Debian you can check your system memory from the terminal with the free -h command.

Human language interpreter

The function of this service is to interpret text commands and generate a response.

In this example I used the standard language interpreter (id: system), which is available in OpenHAB by default and provides you with a set of expressions for English, French, German, Italian and Spanish for some item types.

Generating a Rustpotter wakeword.

Rustpotter works better using a wakeword generated from recordings taken on the same device; for that you can use its CLI.

Connect to your Raspberry PI terminal and run the following commands to download the rustpotter-cli.


# Download the latest rustpotter-cli release for your arch https://github.com/GiviMAD/rustpotter-cli/releases.

# This command prints your arch, just in case you don't remember.

$ uname -m

# Here I used the armv7l binary

$ curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v2.0.6/rustpotter-cli_debian_armv7l

# Make executable

$ chmod +x rustpotter-cli_debian_armv7l

# Check simple execution

$ ./rustpotter-cli_debian_armv7l --version

# Install

$ sudo mv ./rustpotter-cli_debian_armv7l /usr/local/bin/rustpotter-cli

# Print help

$ rustpotter-cli -h

You can use something like this to record multiple samples in a row and build the wakeword file with them.

I recorded them from about 2 meters away.


# Wakeword name and filename

WAKEWORD="ok casa"

WAKEWORD_FILENAME="${WAKEWORD// /_}"

# record 10 audios in a row, press 'ctrl + c' to end each, it will start recording the next one after 1 second

for i in {0..9}; do (rustpotter-cli record $WAKEWORD_FILENAME$i.wav && sleep 1); done

# play 10 audios in a row for verification

for i in {0..9}; do (aplay $WAKEWORD_FILENAME$i.wav); done

# apply gain normalization and a band-pass filter to the recordings (new filtered wav files are generated, which are used to build the wakeword below)

for i in {0..9}; do (rustpotter-cli filter -g --gain-ref 0.002 -b --low-cutoff 500 --high-cutoff 1500 $WAKEWORD_FILENAME$i.wav); done

# build model

rustpotter-cli build-model --model-name "$WAKEWORD" --model-path $WAKEWORD_FILENAME.rpw $WAKEWORD_FILENAME*-gain0.002-*.wav

# run spot to test the wakeword against the live audio

rustpotter-cli spot $WAKEWORD_FILENAME.rpw

Configuring Rustpotter

There is no easy way to find the correct configuration parameters.

I recommend spending some time playing with the different configuration parameters.

These are the ones I ended up using:


rustpotter-cli spot -a 0.3 -t 0.52 -s p95 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw

Note that the max CPU usage of Rustpotter will increase with the number of recordings used to create the wakeword file; I tend to use around 6.

The relevant options of the cli are:

  • You can enable the debug mode of the cli by passing the "-d" parameter so you can see the partial detections.

  • "-g" enables gain normalization. To debug the gain normalization you can use "--debug-gain".

  • "--gain-ref" changes the gain normalization reference (the default value is printed at the beginning when debugging the gain normalization, it depends on the wakeword).

  • "-a" configures the "averaged threshold", recommended as it reduces the CPU usage.

  • "-m 6" requires at least 6 frames of positive scoring (compared against the detection "counter" field).

  • "-t" changes the threshold.

  • "-s" sets the comparison strategy used.

  • "-b --low-cutoff 150 --high-cutoff 1200" sets the band-pass configuration; there is no easy way to figure out the correct values for it, but it helps to attenuate background noise.

With the comparison strategy p90 and 8 records, the score will be an interpolation between the two highest scores, the ones around the 90th percentile position (90% of 8 = 7.2). It's useful for reducing the number of false positives.

You can also use the test-model command to spot against a file, which allows you to use a recording to test your options.


rustpotter-cli test-model -a 0.3 -t 0.52 -s p90 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw ok_casa1.wav

Move the generated rpw file to your openHAB server, into the '<OPENHAB_USERDATA>/rustpotter/' folder.

Configure the Rustpotter options in the openHAB MainUI to match the ones you used on the cli (Settings/Other Services/Rustpotter).

Performance notes.

Rustpotter works well when it's correctly configured and there is no noise, and semi-good performance in the presence of low background noise can be achieved with the correct configuration of the audio filters. Enabling the echo cancel module on PulseAudio also seems to improve the results.

It’s still far from other things like the Amazon/Google devices.

I'm its developer (the only one for now); its first version was released in January last year, the second one this January, and I'll try to release another iteration next year if I get any improvement.

If you like the project, please consider giving a star to the library on GitHub.

It's by far my most starred project :D.

If you want to comment here on what configuration works best for you, or what major problems you encounter with it, it would be great to have some feedback.

Start dialog processing.

OpenHAB includes a dialog processor implementation.

It connects an audio sink and source to the voice services (Keyword Spotting, Speech-to-Text, Human Language Interpreters and Text-to-Speech).

It allows basic voice command execution (wakeword detection → voice transcription → command interpretation → response synthesis).

In openHAB 4.0.0 the "register dialog" functionality is available; it allows you to describe the desired dialog services and configuration, and leaves the openHAB core in charge of keeping that dialog running whenever possible (i.e. whenever the configured services are available).

openHAB persists the dialog registrations, so the dialogs are restored after a system restart.

I'll try to make a PR for the pulseaudio binding to take advantage of this and integrate the dialog registration into the bridge thing configuration, but for now you need to use the console to create the dialog registration.

You need to access the openhab-cli, and then you can:
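If you have not used the console before, this is roughly how to open it (assuming a package-based installation; the default console password is "habopen" unless you changed it):


$ openhab-cli console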


# List available audio sinks to get your sink id

$ audio:sinks

# List available audio sources to get your source id

$ audio:sources

# Register a dialog; adjust the options in the command to your setup.

# Leave options unset to use the defaults you have configured.

# Note that you can fall back to other interpreters using a comma-separated list of ids.

$ voice registerdialog --source pulseaudio:source:f0516c6bd7:echocancelsource --sink pulseaudio:sink:f0516c6bd7:echocancelsink --hlis actiontemplatehli,system --keyword "ok casa"

# List dialog registrations

$ voice:dialogregs

# List started dialogs, note this command displays the actual services in use

$ voice:dialogs

Using the 'voice:dialogs' command, you can verify that the dialog disappears from the list if you disable your pulseaudio bridge thing (the sink and source will become unavailable) and appears again after some seconds when you re-enable the thing.

If your dialog does not appear, you can enable the debug logs of the voice bundle to check what the problem is.


$ log:set DEBUG org.openhab.core.voice
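# Optionally, follow the log output live from the same console (press Ctrl+C to stop tailing)

$ log:tail org.openhab.core.voice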

Spam section: Some voice services under development.

Whisper

Thanks to the whisper.cpp project, you can use the Whisper speech recognition capabilities in your openHAB server.

I have an open PR for this add-on; if you can give it a try before it's merged, in case some bugs exist, that would be great. I'm already using it on my server and it seems to work OK.

The current add-on version runs only on CPU. I found the performance to be great on recent x86_64 CPUs. On ARM the performance is not so great, as f16 operations are only supported on modern CPUs, which is not the case for the Raspberry Pi 4.

You can find the addon PR here, and the readme here

Addon download:

After installation you should place a model in the <OPENHAB_USERDATA>/whisperstt folder.


cd whisperstt

curl -OL https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin
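# Optional sanity check: the tiny model is around 75 MB; a file of only a few KB usually means the download fetched an HTML page instead of the model binary

$ ls -lh ggml-tiny.bin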

Now you should configure the modelName option to tiny (Main UI - Settings/Other Services/Whisper), and it should be ready to use.

It’s recommended to use the preloadModel option for faster execution if you have enough ram.

HABSpeaker

HABSpeaker can be installed through the marketplace; it consists of a web interface that allows you to use the voice system. It's unfinished, but the current version (beta 24) seems to work correctly. I hope I can complete this project within this year.

Due to browser security measures, it needs to be accessed over HTTPS, and it's currently not usable over the openHAB Cloud Connector because it does not support WebSocket connections (I hope I can add that in the future, I need to ask the project maintainers if it's OK). So the quickest way to give it a try right now is to temporarily disable the browser security for your server URL.

This can be done on Chrome by following these steps:

  • Open Chrome and enter chrome://flags/ as the URL; it will display a list of options.

  • Filter the list of options by ‘unsafely-treat-insecure-origin-as-secure’.

  • Enter your openHAB URL in the option text field, like "http://192.168.1.99:8080".

  • Toggle the option to enabled; the browser will ask you to relaunch.

After that, if you have the addon installed and the default voice services configured, you should already be able to use it over HTTP at the URL http://<your openhab ip>:8080/habspeaker/. It needs an initial click to set up the audio and start the connection. Keyword spotting is disabled by default; after a successful connection the dialog can be triggered by another click on the button (the circle in the center).

If you want to report any major problem, please use the marketplace thread.

ActionTemplateInterpreter+

This interpreter is also available through the marketplace. Its intention is to allow you to define your own custom voice commands. As the dialog processor accepts multiple interpreters to be chained, which run one after another until one returns a successful response, what I do is prepend this interpreter to the standard one, so I can have a couple of custom phrases added to the set that the standard interpreter gives me.

Creating this project probably wasn't the best idea: the standard interpreter (and the underlying rule interpreter, which is exposed by the core) is not too different from this one, and superior in several points (apart from the code quality and not having an external dependency on the openNLP library like this one), so I think this interpreter will probably disappear in the future when some other method for adding custom phrases appears, based on those resources.

I just wanted to let you know it exists in case you want to experiment adding custom voice functionalities.

I'm using it to have a couple of commands to manage my TVs. Basically I have created a couple of virtual items that trigger rules, and from there I implement whatever I want using the other items that I have linked to the TVs. Then I have created a couple of voice commands in the interpreter that write to those items.

Things I have tried:

  • Changing my tv channel by name.

  • Play media by name using the dynamic label functionality to capture only the title from the voice command.

  • Move media state from one tv to another.

If you want to report a major problem with it or share what custom commands you have added, feel welcome to do it on the marketplace thread.

Comment

Hope this guide is interesting to some of you, let me know if you think some point should be improved.

I still find the voice control functionality not comparable to the commercial alternatives, but I think it has improved in several points over the last year, and with the new tools and all the information that is appearing in the AI field, I think it will continue to improve.

Best regards!


Hello,

As usual, thanks for your effort to make this accessible for everyone!

Don't you think it could be a core/Main UI feature?
Amongst the many projects I hope to have time for, I have always wanted to try to make/propose a new dialog/AI/whatever section in the left openHAB menu panel, with the same look as the thing/item one.
It would have a list and details page for the registered dialogs, with the same kind of openHAB "theme" (edit, remove, + button in the bottom right to add a new one, etc.).
Dialog registering is probably the last main piece that can't currently be done through the GUI alone.
I'm more of a backend type of dev, so I'm not very comfortable with the idea of taking this on by myself, but I will certainly try to put some effort into this some day.

And for your idea of the pulseaudio binding being able to register a dialog:
This new "dialog registration" section could have a kind of "inbox" like the one we have for things? And bindings, like pulseaudio, could add a "ready to use" entry to it.
And another example: I'm currently working (very early beginning) on a Java project that would have the functionality of an intercom, to make calls between rooms in a house (hopefully SIP compliant).
I plan (very long term) to make a corresponding binding that would auto discover these satellites and register a sink/source for each one. And also maybe a dialog registration. This kind of project would greatly benefit from this kind of "inbox" dialog registration capability.
(I also want to offload the wake word responsibility to this "satellite", but that's another subject)

Sorry for going slightly off topic, I didn't want to transform this into a development thread but I got carried away by the prospect :sweat_smile:


Glad you commented on that. When I introduced the register dialog feature I did it thinking of something like that, but as we use the source id as the dialog processor id, I now think that having the dialog configuration integrated into the thing that registers the source, or its bridge thing, also makes sense and can be done using the current UI/Core features, even if you have to show all the sinks there to allow dialogs with audio services from different bindings.

As a summary, the dialog registration is also identified by the source id. I think the dialog registration can be integrated into the bindings by adding the dialog options to the thing config, along with a registerDialog boolean that instructs the binding to use the voiceManager to register/unregister the dialog on modification events. This way there is no difference whether you use the console or the thing config to create the dialog registration. I think this can generate some collisions if people mix the use of the console and the binding config, but I think the console commands are something most people will not use, and they will understand that touching the registrations that are managed by the bindings can generate state problems.

I think it's a good first way to go: it does not require any UI enhancements, it does not require too much code in the separate bindings (it just requires having all the audio options added and the calls to the voice manager), and as the dialog registration is persisted on its own it will not require any data migration in order to be migrated to a centralized panel in the future, just having that panel available and removing the code from the bindings (which for now I think is only the pulseaudio binding, so not much work).

But I will open an issue about it prior to doing anything related, and ping you there. Ping me if you end up opening a related discussion.

Sounds amazing. I have thought a couple of times about trying to implement that in the pulseaudio binding, but the source generates a raw audio stream and the sink expects a stream with a header; also the core does not provide anything to represent that communication, so I see a lot of work to do to have that correctly integrated, but it's something I would love to see in the future.

About that, I'm currently working on the next version of the library, and probably the last one; I finally got it to perform better by using a classification network to make the detections. Really happy to give that project a good conclusion :slight_smile: .

The process of training a model requires some time, but the result outperforms the previous detection method (which is still available) by a lot.

The v3 betas of the cli have the train command available. Running "… train --help" displays extended instructions. In case you end up trying it, let me know how it goes or if you need more indications; I still need to write a guide in the readme or something.

Another off-topic: I thought I had found a bug in the pulseaudio binding, as the source volume changes sometimes and I believed it was caused by the binding, but I still needed to look at it properly. In the end it was caused by the echo cancellation module, not by the binding.

I didn't try rustpotter recently and was always a little afraid of the "training" part.
Especially because I have a family with kids and can't imagine putting them in front of the mic :sweat_smile:

By the way, do you know how well rustpotter behaves with voices from different persons? I read that CPU usage is dependent on the number of samples, which seems not very convenient for a "universal" keyword that can be used not only by the family, but also friends, etc.? (but I may be totally wrong and would be glad to be)

I think I have a volume change bug with pulseaudio and was glad to see that I was not alone, but reading you again I'm a little disappointed: I don't use the echo cancellation module (yet), so it doesn't seem to be the same issue. I will need to take a look.
Is the bug in the pulseaudio module? Is it a core bug or a bug related to your particular usage?

In the latest version I'm adding an option that allows generating usage records. I made it for debugging purposes, but maybe it helps with the task of recording other people's usage.

In version 3 there will be two kinds of wakewords (detection methods):

  • The one that was already present in v2, which I have renamed to WakewordRef (wakeword reference), as it just compares the live audio with a reference extracted from the provided audios; these are generated using the rustpotter-cli build command.
  • The new option is to create a classification network to do the detection, by providing tagged and untagged audios as training and testing data (tagged audios means here that they have a [label] as part of their name; untagged data will be considered under the tag "none" and will not produce a detection). These should be a better option for creating "universal" files because the amount of training data does not increase the execution time, and as you can train them to detect different labels they can be used for detecting different speakers and things like that, though I haven't tried any of that yet. These are generated using the rustpotter-cli train command.

I still need to do more testing to reach clearer conclusions, but for me, after setting the threshold and min scores options to appropriate values, the new detection method works quite a bit better than the previous one (fewer false or missed detections).

For me it was caused by the use of the webrtc cancellation; once I added aec_args="analog_gain_control=0 digital_gain_control=1" use_volume_sharing=true and restarted pulse it was fixed.