Dialog Processing with the PulseAudioBinding

Hi openHAB community, in this post I'll leave a summary of how you can set up a speaker for dialog processing using a Raspberry Pi.

Here you can find a small illustrative video of the voice control capabilities in Spanish. As the speaker I'm using a Jabra Speaker2 40 connected to a Raspberry Pi Zero 2 W, and as openHAB server an Orange Pi 5 (I expect the performance to be similar using a Raspberry Pi 5) with the add-ons for pulseaudio, rustpotter, voicerss, and whisper.cpp with a small model. Note that I'm using openHAB 4.1.0 milestone 2 to get the latest rustpotter version and the code from this core PR, which allows using the item command description in the voice commands.

Sections:

  • OpenHAB Dialog Processing with the PulseAudioBinding.

    • What you need.

    • Flashing the raspberry pi

    • Connecting the Raspberry PI with PulseAudio.

    • Configure the required voice services.

      • Keyword Spotter

      • Text to speech

      • Speech to text

      • Human language interpreter

    • Generating a Rustpotter wakeword (notes about v3 at the end)

      • Configuring Rustpotter

      • Performance notes.

    • Start dialog processing.

  • Span section; Some voice services under development.

    • Custom rules using the standard interpreter

    • Whisper

    • HABSpeaker

    • Rustpotter v3

    • Piper

    • Extra: Script that I use to control Jellyfin Android TV devices

  • Comment

OpenHAB Dialog Processing with the PulseAudioBinding.

What you need.

  • A Raspberry Pi with a speaker/microphone attached, running Raspberry Pi OS and connected to your network (use a static IP).

  • A server running OpenHAB v4.0.0 (on the same network).

Flashing the raspberry pi

This is probably already known to most people, but just in case I will describe what I did.

I flashed the Raspberry Pi OS Lite (32-bit) image to an SD card using Raspberry Pi Imager.

Before writing the card I used the options menu to define my username and password, enable SSH access, and configure my WiFi credentials.

After flashing the card I inserted it again and edited the cmdline.txt file to include ip=192.168.1.206 at the end of its only line, to assign a static IP and avoid having to search for the device on the network.
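For illustration, this is roughly how the end of that single line could look (the rest of the content will differ per image and is shown here only as a placeholder, the trailing ip= part is the only addition):

console=serial0,115200 console=tty1 root=PARTUUID=xxxxxxxx-02 rootfstype=ext4 fsck.repair=yes rootwait ip=192.168.1.206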

I inserted the SD card into the board and connected it to power; after a couple of minutes it was available on the network (my SD card is old, so it probably took some extra time due to that) and I could access it over SSH.

Connecting the Raspberry PI with PulseAudio.

Connect to your Raspberry Pi terminal and follow these steps to create a systemd unit that runs pulseaudio in system mode.

First run:


# Install pulseaudio

$ sudo apt update

$ sudo apt install pulseaudio

# Stop and disable pulse audio user services (as we are going to run pulseaudio in system mode)

$ sudo systemctl --global disable pulseaudio.service pulseaudio.socket

$ sudo systemctl stop pulseaudio.service pulseaudio.socket

# Enable the pulseaudio cli protocol tcp module system-wide, needed by the binding (I have used port 4712, you can change it to any other free one)

# This command adds the line 'load-module module-cli-protocol-tcp port=4712' at the end of the file, you can also do it manually.

$ echo 'load-module module-cli-protocol-tcp port=4712' | sudo tee -a /etc/pulse/system.pa

Then create this systemd unit at /etc/systemd/system/pulseaudio.service:


[Unit]

Description=Pulseaudio sound server

After=avahi-daemon.service network.target

[Service]

ExecStart=/usr/bin/pulseaudio --system --disallow-exit

ExecReload=/bin/kill -HUP $MAINPID

[Install]

WantedBy=multi-user.target

I have used nano:


$ sudo nano /etc/systemd/system/pulseaudio.service

# paste the file content, press 'Ctrl + x', confirm by typing 'y' and press enter.

To enable the service run:


# Reload the unit definitions

$ sudo systemctl daemon-reload

# Make the service start after reboots

$ sudo systemctl enable pulseaudio

# Start the service

$ sudo systemctl start pulseaudio

# You can check the service status like this, should be running

$ sudo systemctl status pulseaudio

Then assign the following users to these groups:


# Add the pulse user to the audio group (recommended, seems not to be needed in my case)

sudo usermod -aG audio pulse

# Add your user to the pulse-access group (needed to use rustpotter-cli without root after enabling system mode in pulseaudio)

sudo usermod -a -G pulse-access $USER
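Before moving to openHAB, you can double-check that the cli protocol tcp module is listening on the chosen port (ss should be available by default on Raspberry Pi OS):

# Should show a LISTEN entry for port 4712
$ ss -tln | grep 4712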

Go to your openHAB Main UI and:

  • Install pulseaudio from the list of official bindings.

  • Manually add a new thing using the pulseaudio binding of type "pulseaudio:bridge" and configure the "Hostname" to be your Raspberry Pi IP.

  • After adding the thing and checking it appears online, go to add a new thing using the pulseaudio binding again, but this time run a scan; your available sinks and sources should be detected. Add them (you can skip the ones ending in "monitor" if present).

  • Go to your sink thing and enable the configuration option “Create an Audio Sink with simple-protocol-tcp”.

  • Go to your source thing and enable the configuration option “Create an Audio Source with simple-protocol-tcp”.

  • A recommendation: if you have a microphone that captures too much noise, use the source thing "volume" channel to set the value to 80 or similar; a value of 100 can cause too much distortion in the captured audio, making the Speech-to-Text results worse.

The audio sink and source of your Raspberry Pi should now be available in your openHAB server; you can check it at Settings/System Services/Audio, where they should be displayed using their thing labels.

Configure the required voice services.

You need to have at least one of each voice service type configured in your server, and I recommend setting them as the defaults in "Settings/System Services/Voice".
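If you prefer file-based configuration, the same defaults can be set in services/runtime.cfg; as a sketch (the service ids below are the ones I would expect for the add-ons used in this post, double-check them against the add-ons you actually installed):

# Default voice services (ids are examples, adjust to your add-ons)
org.openhab.voice:defaultKS=rustpotterks
org.openhab.voice:defaultSTT=voskstt
org.openhab.voice:defaultTTS=voicerss
org.openhab.voice:defaultHLI=system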

Keyword Spotter

The function of this service is to detect your wakeword in a live audio stream.

In this example I have used rustpotter.

A section about the wakeword file creation using the rustpotter-cli follows this one.

Text to speech

The dialog processor uses this service to convert the interpreter's text response into audio.

In this example I have used VoiceRSS as at the moment of writing there is no good local TTS for Spanish in openHAB that I like. It’s quick to configure, and gives one hour per month of free usage.

Recently a TTS cache has been added to the voice system, so the cost of using a cloud TTS has been reduced (as well as the computational cost in case of a local one), as most responses are the same (confirmations or errors telling you that the command could not be processed and why).

Speech to text

The function of this service is to transcribe your voice to text so it can be processed by the different human language interpreter implementations configured.

I recommend you to use Vosk for local recognition or Google for cloud recognition.

If you choose to use Vosk please be careful about having enough memory.

As advised on its models page, you cannot estimate the memory consumption of a Vosk model from its size.

I recommend you first try the small mobile models, or other smaller models, without enabling the preloadModel option, to confirm they do not overload the system.

This is what you need to do to set up a Vosk model downloaded from its page:


# Assuming you are in your openHAB userdata folder and you have write permissions

# Go to your vosk folder (created by the addon on installation)

$ cd vosk

# Download the model

$ curl -OL https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip

# Unzip the folder with the model files

$ unzip vosk-model-small-es-0.42.zip

# Rename the folder to model

$ mv vosk-model-small-es-0.42 model

# List model folder content to check.

$ ls model/

am conf graph ivector README

It's recommended to use the preloadModel option for a good experience, but as suggested, better to try a run without it if you are not certain about having enough memory. On Debian you can check your system memory from the terminal with the free -h command.

Human language interpreter

The function of this service is to interpret text commands and generate a response.

In this example I used the standard language interpreter (id: system), which is available in openHAB by default and provides a set of expressions for English, French, German, Italian and Spanish for some item types.
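You can try the interpreter on its own from the openHAB console (more on accessing the console in the "Start dialog processing" section below), without any audio involved; for example, with items labeled like the ones that show up in the example logs later in this post (adjust the phrase to your items and language):

# Interprets the text as if it were a transcribed voice command
$ voice interpret apagar la lámpara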

Generating a Rustpotter wakeword reference. (Notes about v3 at the end).

Rustpotter works better using a wakeword generated from records taken on the same device; for that you can use its CLI.

Connect to your Raspberry PI terminal and run the following commands to download the rustpotter-cli.


# Download the latest rustpotter-cli release for your arch https://github.com/GiviMAD/rustpotter-cli/releases.

# This command prints your arch, just in case you don't remember.

$ uname -m

# Here I used the armv7l binary

$ curl -OL https://github.com/GiviMAD/rustpotter-cli/releases/download/v2.0.6/rustpotter-cli_debian_armv7l

# Make executable

$ chmod +x rustpotter-cli_debian_armv7l

# Check simple execution

$ ./rustpotter-cli_debian_armv7l --version

# Install

$ sudo mv ./rustpotter-cli_debian_armv7l /usr/local/bin/rustpotter-cli

# Print help

$ rustpotter-cli -h

You can use something like this to record multiple samples in a row and build the wakeword file with them.

I recorded them at about 2 meters from the speaker.


# Wakeword name and filename

WAKEWORD="ok casa"

WAKEWORD_FILENAME="${WAKEWORD// /_}"

# record 10 audios in a row, press 'ctrl + c' to end each, it will start recording the next one after 1 second

for i in {0..9}; do (rustpotter-cli record $WAKEWORD_FILENAME$i.wav && sleep 1); done

# play 10 audios in a row for verification

for i in {0..9}; do (aplay $WAKEWORD_FILENAME$i.wav); done
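
# Apply gain normalization and a band-pass filter to the records (this writes new filtered wav files next to the originals)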

for i in {0..9}; do (rustpotter-cli filter -g --gain-ref 0.002 -b --low-cutoff 500 --high-cutoff 1500 $WAKEWORD_FILENAME$i.wav); done

# build model

rustpotter-cli build-model --model-name "$WAKEWORD" --model-path $WAKEWORD_FILENAME.rpw $WAKEWORD_FILENAME*-gain0.002-*.wav

# run spot to test the generated wakeword file

rustpotter-cli spot $WAKEWORD_FILENAME.rpw

Configuring Rustpotter

There is no easy way to find the correct configuration parameters, so I recommend you spend some time playing with them.

These are the ones I ended up using:


rustpotter-cli spot -a 0.3 -t 0.52 -s p95 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw

Note that the max CPU usage of rustpotter increases with the number of records used to create the wakeword file; I tend to use around 6.

The relevant options of the cli are:

  • You can enable the debug mode of the cli by passing the “-d” parameter so you can see the partial detections.

  • "-g" enables gain normalization. To debug the gain normalization you can use "--debug-gain".

  • "--gain-ref" changes the gain normalization reference (the default value is printed at the beginning when debugging the gain normalization; it depends on the wakeword).

  • "-a" configures the "averaged threshold", recommended as it reduces the CPU usage.

  • "-m 6" requires at least 6 frames of positive scoring (compared against the detection "counter" field).

  • "-t" changes the threshold.

  • "-s" sets the comparison strategy used.

  • "-b --low-cutoff 150 --high-cutoff 1200" sets the band-pass configuration; there is no easy way to figure out the correct values for this, but it helps to attenuate background noise.

Using the p90 comparison strategy with 8 records means the score will be an interpolation of the two scores around the 90th percentile position (90% of 8 = 7.2). It's useful to reduce the number of false positives.

You can also use the test-model command to spot against a file, which allows you to use a record to test your options.


rustpotter-cli test-model -a 0.3 -t 0.52 -s p90 -m 8 -g --gain-ref 0.019 -b --low-cutoff 1000 --high-cutoff 1500 ok_casa.rpw ok_casa1.wav

Move the generated rpw file to your openHAB server, into the '<OPENHAB_USERDATA>/rustpotter/' folder.
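For example, from the Raspberry Pi (adjust the user, host and userdata path to your installation):

# Copy the wakeword file to the rustpotter folder inside the openHAB userdata
$ scp ok_casa.rpw <user>@<openhab-server>:<OPENHAB_USERDATA>/rustpotter/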

Configure the Rustpotter options in the openHAB MainUI to match the ones you used on the cli (Settings/Other Services/Rustpotter).

Performance notes.

Rustpotter works well when it's correctly configured and there is no noise, and semi-good performance in the presence of low noise can be achieved with the correct configuration of the audio filters. Enabling the echo cancel module on pulseaudio also seems to improve the results.

It’s still far from other things like the Amazon/Google devices.

I'm its developer (the only one for now). Its first version was released in January last year, the second one this January, and I'll try to release another iteration next January if I achieve any improvements.

If you like the project please consider giving a star to the library on GitHub.

It's by far my most starred project :D.

It would be great to have some feedback, so feel free to comment here about what configuration works best for you or what major problems you encounter with it.

Start dialog processing.

OpenHAB includes a dialog processor implementation.

It connects an audio sink and source to the voice services (Keyword Spotting, Speech-to-Text, Human Language Interpreters and Text-to-Speech).

It allows basic voice command execution (wakeword detection → voice transcription → command interpretation → response synthesis).

In openHAB 4.0.0 the "register dialog" functionality is available; it allows you to describe the desired dialog services and configuration, and puts the openHAB core in charge of keeping that dialog running whenever possible (i.e. while the configured services are available).

openHAB persists the dialog registrations, so the dialogs are restored after a system restart.

I'll try to make a PR for the pulseaudio binding to take advantage of this and integrate the dialog registration into the bridge thing configuration, but for now you need to use the console to create the dialog registration.

You need to access the openHAB console.
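If you haven't used it before, this is how it's usually opened (the default console password is habopen):

# On the machine running openHAB, if installed through the official packages
$ openhab-cli console
# Or over ssh, the console listens on port 8101 by default
$ ssh -p 8101 openhab@localhost

Once in the console you can: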


# List available audio sinks to get your sink id

$ audio:sinks

# List available audio sources to get your source id

$ audio:sources

# Register the dialog; change the options in the command as needed.

# Leave the options unset to use the defaults you have configured.

# Note that you can fall back to other interpreters by passing a comma separated list of ids to the --hlis option.

$ voice registerdialog --source <SOURCE_ID> --sink <SINK_ID> --keyword "ok casa"

# List dialog registrations

$ voice:dialogregs

# List started dialogs, note this command display the actual services in use

$ voice:dialogs

You can verify using the 'voice:dialogs' command that the dialog disappears from the list if you disable your pulseaudio bridge thing (the sink and source will become unavailable) and appears again after some seconds when you re-enable the thing.
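In case you need to change the registration options later, you can remove the registration and create it again. For reference (check the usage help printed by the voice command if the syntax differs on your version):

# Print the voice command usage
$ voice
# Remove the dialog registration tied to a source
$ voice unregisterdialog --source <SOURCE_ID>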

If your dialog does not appear, you can enable the debug logs of the voice bundle to check what the problem is.


$ log:set DEBUG org.openhab.core.voice

Span section; Some voice services under development.

Custom rules using the standard interpreter (for 4.1.0 version)

You can find in this PR's first comment a write-up of the standard interpreter features in v4.1.0; I've tried to summarize all that I know there. It's available in the 4.1.0 snapshots from December 6th.

Whisper (PR for v4.1.0)

Thanks to the existence of the whisper.cpp project you can use the whisper speech recognition capabilities in your openHAB server.

The limited grammar feature recently added to whisper.cpp is integrated in the add-on and allows getting pretty good results using the tiny/base models, which are much faster than the bigger models.

I have an open PR for this add-on; if you can give it a try before it's merged, in case some bug exists, it would be great. I'm already using it on my installation and it seems to work great.

The library built into the add-on runs only on CPU; you can build your own library if you need to. I found the performance to be great on x86_64 platforms with recent CPUs. On ARM the performance is not so great, as f16 operations are only supported on modern CPUs, which is not the case for the Raspberry Pi 4.

You can find the readme here.

Addon download:

Please download it from the PR or build it yourself.

After installation you should place a model in the <OPENHAB_USERDATA>/whisperstt folder.


cd whisperstt

curl -OL https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.bin

Now you should configure the modelName option to tiny (Main UI - Settings/Other Services/Whisper), and it should be ready to use.

It’s recommended to use the preloadModel option for faster execution if you have enough ram.
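If the transcription doesn't work as expected, the add-on debug logs are very helpful; assuming it follows the same bundle naming pattern as the other voice add-ons, they can be enabled from the console with:

$ log:set DEBUG org.openhab.voice.whisperstt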

HABSpeaker (v4.1.0)

HABSpeaker can be installed through the marketplace; it consists of a web interface that allows you to use the voice system. It's unfinished, but the current version (beta 24) seems to work correctly. I hope I can complete this project within this year.

Due to browser security measures, it needs to be accessed over HTTPS, and it's currently not usable over the openHAB Cloud Connector because that does not support websocket connections (I hope I can add that in the future, I need to ask the project maintainers if it's ok). So the quickest way to give it a try right now is to temporarily disable the browser security for your server URL.

This can be done on Chrome following these steps:

  • Open Chrome and enter chrome://flags/ as the URL; it will display a list of options.

  • Filter the list of options by ‘unsafely-treat-insecure-origin-as-secure’.

  • Enter your openHAB URL in the option's text field, like "http://192.168.1.99:8080".

  • Toggle the option to "Enabled"; the browser will ask you to relaunch.

After that, if you have the add-on installed and the default voice services configured, you should already be able to use it over HTTP at the URL http://<your openhab ip>:8080/habspeaker/. It needs an initial click to set up the audio and start the connection. Keyword spotting is disabled by default; after a successful connection the dialog can be triggered by another click on the button (the circle in the center).

If you want to report any problem, please use the marketplace thread.

Rustpotter v3 (v4.1.0)

I've created another version of the library sooner than expected because I found a great machine learning library that allowed me to easily implement something I had read about in many articles: training "basic" linear layer networks (which are commonly used for image classification) on the audio MFCCs, which is what I was already using in rustpotter. The results seem very promising, with the drawback that you need to take a lot more records to create compelling training and testing sets.

So in v3 the major change is that you can generate two kinds of wakeword files: the "wakeword references", which offer the previous detection-by-comparison functionality, and the "wakeword models", which offer detection by classification.

Both kinds of wakewords can be generated using the rustpotter CLI. Also, a new "record on partial spot" option has been added that helps with dataset creation and with detecting library/audio problems.

Files generated with v2 work in v3, but not the other way around.

There is already an open PR to make this version available to OpenHAB here. Rustpotter v3 is available since OpenHAB 4.1.0 milestone 2 :slight_smile:

The task of creating/sharing trained models that work well across different devices and with different persons seems a little overwhelming for me at this moment, so I have it on my TODO list. I'm not sure I can distribute something that works as well as a model trained by the final user/users on the target mic/mics, but I will try to distribute something that can be used as a base.

If you make some tests training your own wakeword, you can ping me with the results here or by PM. I recommend creating a wakeword reference file first and using it with the record option to collect the training/testing set, as it emits records of the same duration.
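There is no written guide for the training process yet, so the extended help of the CLI train command is the best reference for now:

# Shows the extended instructions for training a wakeword model
$ rustpotter-cli train --help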

Piper (PR for v4.1.0)

There is a pending review PR for integrating Piper as Text-to-Speech service.

In my experience the Piper medium models for Spanish work great, pretty fast, using around 100MB of RAM.

Extra: Script that I use to control Jellyfin Android TV devices

Sharing the JavaScript rule that I'm using to play media from Jellyfin on my Android TV devices. Do not take it as something ready or easy to use; I just want to share it in case someone can take advantage of it to do something cooler. It uses the androiddebugbridge, androidtv and jellyfin bindings.

It may have something wrong, as it's not something I have worked on too much. The major problem I have is that sometimes it takes too much time to start playing when the device is offline, or it ends up not playing the title; I don't know if it's related to the adb connection or to the Jellyfin client itself. But most times it works fine for me.

Note that it doesn’t fetch the content data from the Jellyfin server as I have some files with my media info on the file system.

example_android_jellifin.txt (8.1 KB)

As the androiddebugbridge binding supports sending intents it should be possible to do something similar integrating personal and commercial media services. It’s something I couldn’t find time to look at.

Comment

Hope this guide is interesting to some of you, let me know if you think some point should be improved.

Best regards!

Change log:

  • Removed the noise cancellation part, as it was giving me bad results with the STT services, and added notes about Rustpotter v3.
  • 6 Dec 2023; Add piper info, link to standard interpreter summary, remove actiontemplatehli info, add example media control script and minor changes.

Hello,

As usual, thanks for your effort to make this accessible for everyone !

Don't you think it could be a core/Main UI feature?
Amongst the many projects I hope to have time for, I have always wanted to try to make/propose a new dialog/AI/whatever section in the left openHAB menu panel, with the same look as the thing/item ones.
It would have a list and a details page for the registered dialogs, with the same kind of openHAB "theme" (edit, remove, + button in the bottom right to add a new one, etc.).
Dialog registering is probably the last main piece that can't currently be done through the GUI alone.
I'm more of a backend type of dev, so I'm not very comfortable with the idea of taking this on by myself, but I certainly will try to put some effort into this, some day.

And about your idea of the pulseaudio binding being able to register a dialog:
This new "dialog registration" section could have a kind of "inbox", like the one we have for things? And bindings, like pulseaudio, could be able to add a "ready to use" entry to it.
And another example: I'm currently working (very early beginning) on a Java project that would have the functionality of an intercom, to make calls between rooms in a house (hopefully SIP compliant).
I plan (very long term) to make a corresponding binding that would auto-discover these satellites and register a sink/source for each one, and also maybe a dialog registration. This kind of project would greatly benefit from this kind of "inbox" dialog registration capability.
(I also want to offload the wake word responsibility to this “satellite”, but it’s another subject)

Sorry for going slightly off topic, I didn't want to transform this into a development thread but I got carried away by the prospect :sweat_smile:


Glad you commented on that. When I introduced the register dialog feature I did it thinking of something like that, but as we use the source id as the dialog processor id, I now think that having the dialog configuration integrated on the thing that registers the source, or on its bridge thing, also makes sense and can be done using the current UI/core features, even if you have to show all the sinks there to allow dialogs with audio services from different bindings.

As a summary, the dialog registration is also identified by the source id. I think the dialog registration can be integrated into the bindings by adding the dialog options to the thing config along with a registerDialog boolean that instructs the binding to use the voiceManager to register/unregister the dialog on the thing modification events. This way there is no difference whether you use the console or the thing config to create the dialog registration. This can generate some collisions if people mix using the console and the binding config, but I think the console commands are something most people will not use, and those who do will understand that touching the registrations managed by the bindings can generate state problems.

I think it's a good first way to go: it does not require any UI enhancements, and it does not require too much code in the separate bindings, just adding all the audio options and the calls to the voice manager. And as the dialog registration is persisted on its own, it will not require any data migration in order to be moved to a centralized panel in the future; it would just need that panel to be available and the code removed from the bindings (which for now I think is only the pulseaudio binding, so not much work).

But I will open an issue about it before doing anything related, and ping you there. Ping me if you end up opening a related discussion.

Sounds amazing. I have thought a couple of times about trying to implement that on the pulseaudio binding, but the source generates a raw audio stream and the sink expects a stream with a header; also the core does not provide anything to represent that communication, so I see a lot of work to do to have that correctly integrated, but it's something I would love to see in the future.

About that, I'm currently working on the next version of the library, probably the last one. I finally got it to perform better by using a classification network to make the detections. Really happy to give that project a good conclusion :slight_smile:.

The process of training a model requires some time, but the result outperforms the previous detection method (which is still available) by a lot.

The v3 betas of the CLI have the train command available; running "… train --help" displays extended instructions. In case you end up trying it, let me know how it goes or if you need more guidance; I still need to write a guide in the readme or something.

Another off-topic: I think I have found a bug in the pulseaudio binding, the source volume changes sometimes and I think it's caused by the binding; I still need to look at it properly. Edit: it was caused by the echo cancellation module, not by the binding.

I didn't try rustpotter recently and was always a little afraid of the "training" part.
Especially because I have a family with kids and can't imagine putting them in front of the mic :sweat_smile:

By the way, do you know how well rustpotter behaves with voices from different persons? I read that CPU usage is dependent on the number of samples, which seems not very convenient for a "universal" keyword that can be used by not only the family, but also friends, etc.? (But I may be totally wrong and would be glad to be.)

I think I have a volume change bug with pulseaudio and was glad to see that I was not alone, but reading you again I'm a little disappointed: I don't use the echo cancellation module (yet), so it doesn't seem to be the same issue. I will need to take a look.
Is the bug in the pulseaudio module? Is it a core bug or a bug related to your particular usage?

In the latest version I'm adding an option that allows generating records on usage. I made that for debugging purposes, but maybe it helps with the task of recording other people's usage.

In version 3 there will be two kinds of wakewords (detection methods):

  • The one that was already present in v2, which I have renamed to WakewordRef (wakeword reference), as it just compares the live audio with a reference extracted from the provided audios. These are generated using the rustpotter-cli build command.
  • The new option is to create a classification network to do the detection, by providing tagged and untagged audios as training and testing data (tagged audios here means they have a [label] as part of their name; the untagged data will be considered under the tag "none" and will not produce a detection). These should be a better option for creating "universal" files because the amount of training data does not increase the execution time, and as you can train them to detect different labels they can be used for detecting different speakers and things like that, though I haven't tried any of that yet. These are generated using the rustpotter-cli train command.

I still need to do more testing to reach clearer conclusions, but for me, after setting the threshold and min scores options to appropriate values, the new detection method works quite a bit better than the previous one (fewer false or missed detections).

For me it was caused by the usage of the webrtc cancellation; once I added aec_args="analog_gain_control=0 digital_gain_control=1" use_volume_sharing=true and restarted pulse it was fixed.

I have removed the noise reduction configuration section, as after testing it for a while I don't think it was working too well for me; I have also added some notes about the rustpotter v3 version.

I saw that you want to create some kind of "generic" model for people to test/use rustpotter KS.
Did you think about using already open and available datasets to train it?
This is what I was planning to do (I would like to avoid putting all my family in front of a mic :sweat_smile: )
There is this article from Picovoice which gives several links:

I would prefer not to mimic the available words for Porcupine, nor the Alexa or Google ones.

This is what I was planning to do (I would like to avoid putting all my family in front of a mic :sweat_smile: )

With the record feature you can record on usage. You can train a model with the suggested records and then just leave that option enabled for some days in order to collect records from everybody, to later add those to the dataset and train a better version. It's also useful for collecting false detections.

I am a little bit stuck and need help please.

I am on openHAB 4.0.2 but there is no “Settings/Other Services/Rustpotter”. There is no “Settings/Other Services” but Rustpotter is installed.

This command could not work for me:
voice registerdialog --source pulseaudio:source:f0516c6bd7:echocancelsource --sink pulseaudio:sink:f0516c6bd7:echocancelsink --hlis actiontemplatehli,system --keyword "ok casa"

I have no audio source echocancelsource and sink echocancelsink.
These are the two that were found:
pulseaudio:source:livingroom:alsa_input_platform_soc_sound_multichannel_input
pulseaudio:sink:livingroom:alsa_output_platform_soc_sound_multichannel_output
plus the monitor channel, which I didn't add.

I modified the command to this
registerdialog --source pulseaudio:source:livingroom:alsa_input_platform_soc_sound_multichannel_input --sink pulseaudio:sink:livingroom:alsa_output_platform_soc_sound_multichannel_output --hlis rulehli --keyword "ok haus"

In general there are no errors, but it's not working. And when I register the dialog there is some noise on the speaker of my satellite. When I unregister the dialog it's gone.

You need to go to the add-on page and press the "cog wheel" icon. In the openHAB 4.1.0 snapshots the "Settings/Other Services/Rustpotter" link is back there.

Yes it was a bad idea to put my ids there, I have edited the command. Thanks for the input.

You can enable the debug logs for the rustpotterks add-on to see if it starts correctly. It prints something like:

19:09:53.534 [DEBUG] [potterks.internal.RustpotterKSService] - Audio wav spec: sample rate 16000, 16 bits, 1 channels, little-endian
19:09:53.883 [DEBUG] [potterks.internal.RustpotterKSService] - Wakeword '/openhab/userdata/rustpotter/ok_casa.rpw' loaded

Also, you can verify the sink and source are working correctly using the voice listenandanswer command, which summons a single-shot dialog execution without involving the keyword spotting. I'm going to give it a try and add the command there.
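For reference, this is the basic form of the command; without additional parameters it uses the default services and the default sink/source:

$ voice listenandanswer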

When you said "there is some noise", do you mean a continuous noise, or just some initial noise?

Thanks, I didn't know that there are settings there as well.

The last relevant logline is
bundle org.openhab.voice.rustpotterks:4.0.2 (448)[org.openhab.voice.rustpotterks.internal.RustpotterKSService(457)] : Declared Method org.openhab.voice.rustpotterks.internal.RustpotterKSService.activate([interface org.osgi.service.component.ComponentContext]) not found
Don't know what's going wrong. There is really no method with this parameter.
The logline before is
bundle org.openhab.voice.rustpotterks:4.0.2 (448)[org.openhab.voice.rustpotterks.internal.RustpotterKSService(457)] : Found activate method: protected void org.openhab.voice.rustpotterks.internal.RustpotterKSService.activate(java.util.Map)

While using listenandanswer (without additional parameters) there are two sounds, a longer lower one and a shorter higher one, and nothing more.

It's a continuous cracking noise.
By the way, teaching the hotword worked fine on the satellite (directly on the Raspberry Pi) → clear sound.
And using "voice say" on the server also results in clear audio output on the satellite.

I have configured “Eb:300” as “Listening Melody” in the “System Settings/Voice”. So when I execute the “listenandanswer” command I hear a quick beep that indicates I should start talking when it ends.

These are my logs of one execution, in case it helps. I have enabled the voice system logs with log:set DEBUG org.openhab.core.voice and I'm using the whisper add-on with the small model (note I'm using openHAB in Spanish):

08:53:28.207 [DEBUG] [b.core.voice.internal.DialogProcessor] - Sounds enabled
08:53:28.216 [DEBUG] [.core.voice.internal.VoiceManagerImpl] - Executing a simple dialog for source jabra_microphone (pulseaudio:source:f0516c6bd7:alsa_input_usb_0b0e_Jabra_SPEAK_510_USB_50C2ED0C3D3A0220000_00_mono_fallback)
08:53:28.241 [DEBUG] [b.core.voice.internal.DialogProcessor] - Acquiring dialog group 'default'
08:53:28.295 [DEBUG] [whisperstt.internal.WhisperSTTService] - Creating whisper state...
08:53:28.622 [DEBUG] [whisperstt.internal.WhisperSTTService] - Whisper state created
08:53:28.623 [DEBUG] [whisperstt.internal.WhisperSTTService] - Creating VAD instance...
08:53:28.624 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD instance created
08:53:28.624 [DEBUG] [whisperstt.internal.WhisperSTTService] - Samples per step 16000
08:53:28.625 [DEBUG] [whisperstt.internal.WhisperSTTService] - Max transcription samples 160000
08:53:28.625 [DEBUG] [whisperstt.internal.WhisperSTTService] - Max init silence samples 48000
08:53:28.625 [DEBUG] [whisperstt.internal.WhisperSTTService] - Max silence samples 8000
08:53:29.502 [DEBUG] [b.voice.whisperstt.internal.utils.VAD] - VAD: 2/50 - required: 10
08:53:29.505 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD: silence detected 1/3
08:53:29.507 [DEBUG] [whisperstt.internal.WhisperSTTService] - removing start silence
08:53:30.476 [DEBUG] [b.voice.whisperstt.internal.utils.VAD] - VAD: 41/50 - required: 10
08:53:30.478 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD: voice detected
08:53:31.450 [DEBUG] [b.voice.whisperstt.internal.utils.VAD] - VAD: 49/50 - required: 10
08:53:31.454 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD: voice detected
08:53:32.426 [DEBUG] [b.voice.whisperstt.internal.utils.VAD] - VAD: 16/50 - required: 10
08:53:32.429 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD: voice detected
08:53:33.401 [DEBUG] [b.voice.whisperstt.internal.utils.VAD] - VAD: 0/50 - required: 10
08:53:33.407 [DEBUG] [whisperstt.internal.WhisperSTTService] - VAD: silence detected
08:53:33.448 [WARN ] [eaudio.internal.PulseAudioAudioSource] - IOException while reading from pulse source: Socket closed
08:53:33.494 [DEBUG] [whisperstt.internal.WhisperSTTService] - removing end silence
08:53:33.501 [DEBUG] [whisperstt.internal.WhisperSTTService] - running whisper...
08:53:38.089 [DEBUG] [whisperstt.internal.WhisperSTTService] - whisper result code 0
08:53:38.089 [DEBUG] [whisperstt.internal.WhisperSTTService] - Available transcription segments 1
08:53:38.090 [DEBUG] [whisperstt.internal.WhisperSTTService] - Saving audio file: /openhab/userdata/whisper/samples/2023-10-07.08.53.38.90(_apagar_la_l_mpara.).wav
08:53:38.091 [DEBUG] [whisperstt.internal.WhisperSTTService] - Saving properties file: /openhab/userdata/whisper/samples/2023-10-07.08.53.38.90(_apagar_la_l_mpara.).props
08:53:38.092 [DEBUG] [whisperstt.internal.WhisperSTTService] - single utterance mode, ending transcription
08:53:38.115 [DEBUG] [b.core.voice.internal.DialogProcessor] - RecognitionStopEvent event received
08:53:38.116 [DEBUG] [whisperstt.internal.WhisperSTTService] - Final text: apagar la lámpara.
08:53:38.117 [DEBUG] [whisperstt.internal.WhisperSTTService] - Final text no specials: apagar la lámpara
08:53:38.117 [DEBUG] [b.core.voice.internal.DialogProcessor] - SpeechRecognitionEvent event received
08:53:38.118 [DEBUG] [b.core.voice.internal.DialogProcessor] - Text recognized: apagar la lámpara
08:53:38.119 [DEBUG] [b.core.voice.internal.DialogProcessor] - Interpretation exception: No se que hacer
08:53:38.121 [DEBUG] [ice.text.AbstractRuleBasedInterpreter] - Partial matched items against [lámpara] that accept OnOffType: LamparaHabitacion, LamparaSalon
08:53:38.121 [DEBUG] [ice.text.AbstractRuleBasedInterpreter] - Exact matched items against [lámpara] that accept OnOffType: LamparaHabitacion, LamparaSalon
08:53:38.122 [DEBUG] [b.core.voice.internal.DialogProcessor] - Interpretation exception: Hay más de un objeto con un nombre similar.
08:53:38.147 [DEBUG] [b.core.voice.internal.DialogProcessor] - Releasing dialog group 'default'

Two minor notes: there is a warning coming from the pulseaudio source when closing the stream, but that doesn't seem to be a problem; some rework is needed there to avoid logging it. Also, the logs about the "dialog group", like "Acquiring dialog group 'default'", will not show up in the 4.0.x version, as it's something recently added in 4.1.0.

Maybe there is some problem with pulseaudio; I don't know whether it's on the device side or on the openHAB side.
Can you check if the following command, run on the "satellite", displays more than one audio source?

sudo PULSE_RUNTIME_PATH=/var/run/pulse -u pulse pacmd list-sources | grep name:

In my case I have these two, and only the “mono-fallback” seems to be working for me:

	name: <alsa_output.usb-0b0e_Jabra_SPEAK_510_USB_50C2ED0C3D3A0220000-00.analog-stereo.monitor>
	name: <alsa_input.usb-0b0e_Jabra_SPEAK_510_USB_50C2ED0C3D3A0220000-00.mono-fallback>

Also, if you are using openHAB 4.0.x please use the latest rustpotter-cli 2.x.x version; the PR for rustpotter 3.x.x was already merged but it's still not available outside the 4.1.x snapshots.

rustpotter-cli -V
rustpotter-cli 3.0.1

If you run the rustpotter-cli with the pulse user you will see pulse as one of the available devices:

$ sudo -u pulse rustpotter-cli devices
0 - default
1 - pulse
2 - hw:CARD=USB,DEV=0
3 - plughw:CARD=USB,DEV=0
4 - sysdefault:CARD=USB
5 - front:CARD=USB,DEV=0
6 - surround40:CARD=USB,DEV=0
7 - iec958:CARD=USB,DEV=0
8 - dsnoop:CARD=USB,DEV=0

You can try to record using it to see if the noise error appears:

cd /tmp
# in my case "pulse" is the device index 1
sudo -u pulse rustpotter-cli record -i 1 test.wav

Hope something helps; I'm a little lost on what the problem could be.

I am planning to use a lot of the work you’ve done here. I just posted this:

If you have any information on low power edge devices, I am interested.

No, I have very little experience with those platforms; I did some tests with the ESP32 some years ago, but that was all. Mostly I don't know anything about electronics.

The cracking noise is gone, don’t know why, maybe because of some restarts…

My log output differs from yours, because I only have the first two lines when executing listenandanswer.
I hear the sound (changed to Eb:300 as well) and then the prompt returns, no time to say something.
The dialog is registered and is visible with dialogs / dialogregs.

This gave me

	name: <alsa_output.platform-soc_sound.multichannel-output.monitor>
	name: <alsa_input.platform-soc_sound.multichannel-input>

and I am using the second one.

Recording works fine with rustpotter-cli on the satellite.

What speech-to-text add-on are you using? You can try to enable its debug logs.

Yes, the listen-and-answer command does not wait for the dialog execution, just triggers it.

Then I assume it should be a problem in the speech-to-text service or the pulseaudio binding. I hope there are some more clues in the logs.

I have added a small video that shows how it works for me. Hope it helps.

Why does it take so long from voice command to the real action?

Whisper execution time (the bigger the model used, the longer it takes); the rest of it works almost immediately. It works faster with the available cloud STT services (Google or Watson) and also with Vosk.

But personally I want something that runs locally and can be fine-tuned, and Whisper is the only thing I found that meets those requirements. I didn't find information on how to fine-tune the Vosk models and I don't think they will release it, because I think offering fine-tuned models is the business model of the company behind it.

I'm good with the delay as long as the voice recognition behaves correctly, and my experience is better than with Vosk, because some specific phrases were always or most of the time recognized incorrectly by the Vosk models for Spanish. Whisper makes a bad transcription sometimes, but it tends to work for me with the small model. Now I'm using a small model version fine-tuned for Spanish that I found on Hugging Face that behaves a little better than the original one, so my plan is to eventually try to replicate its training, adding some personal records to the dataset to improve its behavior with some phrases that fail more often than others. But I'm pretty good for now; I just wanted to have a non-proprietary voice interface to my openHAB server, because I found some things hard to accomplish within the Alexa limits and I wanted some freedom.

I hope to find and be able to integrate a local TTS service that works well with Spanish to run the voice control fully offline, but as the TTS cache works great it’s not something very urgent for me.