I wanted a few small units for speech-to-text (voice recognition) and text-to-speech (voice output) for my home automation environment, which includes openHAB and Rhasspy (details here).
Previously, I have used Rhasspy satellites built with a Raspberry Pi and a small speaker (described here), but now I wanted something simpler and more compact, and with the voice recognition happening in the satellite device itself, without streaming audio to a server, not even to an in-house server.
The solution was to package an ESP32-S3 module with a small speaker, and to write software that uses the speech recognition library provided by Espressif (esp-sr) and integrates with my existing home automation environment.
Key features are:
low cost: a small device based on an ESP32-S3, total cost of materials ca. €20
privacy: self-contained speech recognition in the device, no audio streaming to a server
dynamic: automatic configuration for voice-controlled items defined in openHAB
Source code and schematics are on GitHub, and there is more information in my blog.
Once that core PR is merged, there will be a WebSocket API to transmit PCM audio to the openHAB server and do STT, intent interpretation, and TTS there, so yes, you won't need Rhasspy anymore. But existing projects will need to be adapted to openHAB's audio WebSocket API.
Yes, it shouldn't be too difficult to adapt my firmware to the onju-voice hardware. I don't have that hardware, though, so I can't test it myself.
One caveat: the onju-voice hardware uses the same I2S interface for audio input and output. Therefore the sample rates for microphone and speaker must be the same. The esp-sr library expects a microphone signal sampled at 16,000 Hz, but some of the TTS voices in Rhasspy produce a WAV file sampled at 22,050 Hz, including the Larynx blizzard_lessac voice at medium quality that I have used so far.
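To illustrate the mismatch, here is a small stdlib-only Python sketch that checks a WAV file's sample rate and, if necessary, naively resamples mono 16-bit audio to the 16,000 Hz that esp-sr expects. This is only a demonstration of the problem; for real audio quality you would resample with a proper tool (e.g. sox or ffmpeg), and the nearest-neighbor approach here is my simplification, not what the firmware does.

```python
import io
import wave

TARGET_RATE = 16_000  # esp-sr expects microphone audio at 16 kHz


def resample_wav(data: bytes, target_rate: int = TARGET_RATE) -> bytes:
    """Return WAV data resampled to target_rate (nearest-neighbor).

    Handles only mono 16-bit PCM; that is enough to show why a
    22,050 Hz TTS voice cannot share an I2S interface configured
    for a 16,000 Hz microphone.
    """
    with wave.open(io.BytesIO(data), "rb") as src:
        rate = src.getframerate()
        width = src.getsampwidth()
        channels = src.getnchannels()
        frames = src.readframes(src.getnframes())

    if rate == target_rate:
        return data

    assert width == 2 and channels == 1, "sketch handles mono 16-bit only"

    # Pick the nearest source frame for each target frame.
    n_src = len(frames) // 2
    n_dst = int(n_src * target_rate / rate)
    out = bytearray()
    for i in range(n_dst):
        j = min(int(i * rate / target_rate), n_src - 1)
        out += frames[2 * j : 2 * j + 2]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(bytes(out))
    return buf.getvalue()
```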
I don't understand what the PCM audio WebSocket feature for openHAB has to do with my project. The ESP32-S3 machine communicates with openHAB over MQTT only, to report recognized commands. The ESP32-S3 machine communicates with Rhasspy over MQTT only, to receive the WAV file. There is no audio streaming between the device and openHAB.
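To make that separation concrete, here is a minimal sketch of the kind of MQTT message the satellite might publish to openHAB after recognizing a command locally. The topic layout and payload fields are my illustrative assumptions, not the project's actual schema; adapt them to whatever the openHAB MQTT binding is configured to expect.

```python
import json
import time


def build_command_message(site_id: str, intent: str, slots: dict) -> tuple[str, str]:
    """Build (topic, payload) for reporting a locally recognized command.

    Topic and field names are hypothetical examples: only the
    recognition *result* travels over MQTT, never any audio.
    """
    topic = f"voice/{site_id}/command"
    payload = json.dumps({
        "intent": intent,
        "slots": slots,
        "timestamp": int(time.time()),
    })
    return topic, payload


# Example: report "turn on the kitchen light"
topic, payload = build_command_message(
    "kitchen", "SwitchOn", {"item": "Kitchen_Light"})
```

The point of the sketch is the payload size: a few hundred bytes of JSON per command, versus a continuous PCM stream if audio left the device.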
I am currently experimenting with a standalone TTS server in a VM, as an alternative to using Rhasspy. TTS is then done via a small shell script called by openHAB, which sends the text to the TTS server, retrieves the WAV file (via HTTP), and feeds it to the ESP32-S3 machine (via MQTT, pretending to be Rhasspy).
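A Python sketch of that script's two steps, for illustration: fetch a WAV from the TTS server over HTTP, then publish it on a Rhasspy-style playBytes topic so the satellite treats it like output from Rhasspy. The server URL and its query parameter are assumptions about a generic piper-style HTTP wrapper, not the actual setup; the `hermes/audioServer/<siteId>/playBytes/<requestId>` topic follows the Rhasspy Hermes convention, but check it against your Rhasspy version.

```python
import urllib.parse
import urllib.request
import uuid

# Assumed endpoint of the standalone TTS server -- adjust for your setup.
TTS_URL = "http://tts-server.local:5002/api/tts"


def fetch_wav(text: str) -> bytes:
    """Ask the TTS server for a WAV of the given text (HTTP GET)."""
    url = TTS_URL + "?" + urllib.parse.urlencode({"text": text})
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def play_bytes_topic(site_id: str) -> str:
    """Rhasspy-style playBytes topic the satellite listens on."""
    return f"hermes/audioServer/{site_id}/playBytes/{uuid.uuid4().hex}"


# Publishing the WAV bytes on this topic (e.g. with mosquitto_pub -f)
# makes the script look like Rhasspy to the ESP32-S3 machine.
```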
Regarding âno need for Rhasspy anymoreâ: if you donât need commands that include numbers, such as âstart kitchen timer for 15 minutesâ, then yes, the combination of OpenHAB with the ESP32-S3 machine and a standalone TTS server, e.g. using piper, can satisfy all your voice control needs.