I wanted a few small units for speech-to-text (voice recognition) and text-to-speech (voice output) for my home automation environment, which includes openHAB and Rhasspy (details here).
Previously, I have used Rhasspy satellites built with a Raspberry Pi and a small speaker (described here), but now I wanted something simpler and more compact, and with the voice recognition happening in the satellite device itself, without streaming audio to a server, not even to an in-house server.
The solution was to package an ESP32-S3 module with a small speaker, and to write software that uses the speech recognition library provided by Espressif (esp-sr) and integrates with my existing home automation environment.
Key features are:
low cost: a small device based on an ESP32-S3, total cost of materials ca. €20
privacy: self-contained speech recognition in the device, no audio streaming to a server
dynamic: automatic configuration for voice-controlled items defined in openHAB
Source code and schematics are on GitHub, and there is more information in my blog.
Once that core PR is merged, there will be a WebSocket API to transmit PCM audio to the openHAB server and do STT, intent interpretation, and TTS there, so yes, you won't need Rhasspy anymore. But existing projects will need to be adapted to openHAB's audio WebSocket API.
Yes, it shouldn't be too difficult to adapt my firmware to the onju-voice hardware. I don't have that hardware, though, so I can't test it myself.
One caveat: the onju-voice hardware uses the same I2S interface for audio input and output. Therefore the sample rates for microphone and speaker must be the same. The esp-sr library expects a microphone signal sampled at 16,000 Hz, but some of the TTS voices in Rhasspy produce a WAV file sampled at 22,050 Hz, including the Larynx blizzard_lessac voice at medium quality that I have used so far.
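To illustrate the mismatch, here is a small stdlib-only Python sketch that checks a WAV file's sample rate and, if necessary, naively resamples mono 16-bit audio to the 16,000 Hz that esp-sr expects. This is only a demonstration of the problem; for real audio quality you would resample with a proper tool (e.g. sox or ffmpeg), and the nearest-neighbor approach here is my simplification, not what the firmware does.

```python
import io
import wave

TARGET_RATE = 16_000  # esp-sr expects microphone audio at 16 kHz


def resample_wav(data: bytes, target_rate: int = TARGET_RATE) -> bytes:
    """Return WAV data resampled to target_rate (nearest-neighbor).

    Handles only mono 16-bit PCM; that is enough to show why a
    22,050 Hz TTS voice cannot share an I2S interface configured
    for a 16,000 Hz microphone.
    """
    with wave.open(io.BytesIO(data), "rb") as src:
        rate = src.getframerate()
        width = src.getsampwidth()
        channels = src.getnchannels()
        frames = src.readframes(src.getnframes())

    if rate == target_rate:
        return data

    assert width == 2 and channels == 1, "sketch handles mono 16-bit only"

    # Pick the nearest source frame for each target frame.
    n_src = len(frames) // 2
    n_dst = int(n_src * target_rate / rate)
    out = bytearray()
    for i in range(n_dst):
        j = min(int(i * rate / target_rate), n_src - 1)
        out += frames[2 * j : 2 * j + 2]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(bytes(out))
    return buf.getvalue()
```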
I don't understand what the PCM audio WebSocket feature for openHAB has to do with my project. The ESP32-S3 machine communicates with openHAB over MQTT only, to report recognized commands. The ESP32-S3 machine communicates with Rhasspy over MQTT only, to receive the WAV file. There is no audio streaming between the device and openHAB.
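To make that separation concrete, here is a minimal sketch of the kind of MQTT message the satellite might publish to openHAB after recognizing a command locally. The topic layout and payload fields are my illustrative assumptions, not the project's actual schema; adapt them to whatever the openHAB MQTT binding is configured to expect.

```python
import json
import time


def build_command_message(site_id: str, intent: str, slots: dict) -> tuple[str, str]:
    """Build (topic, payload) for reporting a locally recognized command.

    Topic and field names are hypothetical examples: only the
    recognition *result* travels over MQTT, never any audio.
    """
    topic = f"voice/{site_id}/command"
    payload = json.dumps({
        "intent": intent,
        "slots": slots,
        "timestamp": int(time.time()),
    })
    return topic, payload


# Example: report "turn on the kitchen light"
topic, payload = build_command_message(
    "kitchen", "SwitchOn", {"item": "Kitchen_Light"})
```

The point of the sketch is the payload size: a few hundred bytes of JSON per command, versus a continuous PCM stream if audio left the device.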
I am currently experimenting with a standalone TTS server in a VM, as an alternative to using Rhasspy. TTS is then done via a small shell script called by openHAB, which sends the text to the TTS server, retrieves the WAV file (via HTTP), and feeds it to the ESP32-S3 machine (via MQTT, pretending to be Rhasspy).
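A Python sketch of that script's two steps, for illustration: fetch a WAV from the TTS server over HTTP, then publish it on a Rhasspy-style playBytes topic so the satellite treats it like output from Rhasspy. The server URL and its query parameter are assumptions about a generic piper-style HTTP wrapper, not the actual setup; the `hermes/audioServer/<siteId>/playBytes/<requestId>` topic follows the Rhasspy Hermes convention, but check it against your Rhasspy version.

```python
import urllib.parse
import urllib.request
import uuid

# Assumed endpoint of the standalone TTS server -- adjust for your setup.
TTS_URL = "http://tts-server.local:5002/api/tts"


def fetch_wav(text: str) -> bytes:
    """Ask the TTS server for a WAV of the given text (HTTP GET)."""
    url = TTS_URL + "?" + urllib.parse.urlencode({"text": text})
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def play_bytes_topic(site_id: str) -> str:
    """Rhasspy-style playBytes topic the satellite listens on."""
    return f"hermes/audioServer/{site_id}/playBytes/{uuid.uuid4().hex}"


# Publishing the WAV bytes on this topic (e.g. with mosquitto_pub -f)
# makes the script look like Rhasspy to the ESP32-S3 machine.
```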
Regarding âno need for Rhasspy anymoreâ: if you donât need commands that include numbers, such as âstart kitchen timer for 15 minutesâ, then yes, the combination of OpenHAB with the ESP32-S3 machine and a standalone TTS server, e.g. using piper, can satisfy all your voice control needs.