Yet another cloud-free voice control

I wanted a few small units for speech-to-text (voice recognition) and text-to-speech (voice output) for my home automation environment, which includes openHAB and Rhasspy (details here).

Previously, I used Rhasspy satellites built with a Raspberry Pi and a small speaker (described here). This time I wanted something simpler and more compact, with the voice recognition happening in the satellite device itself and no audio streamed to a server, not even to an in-house one.

The solution was to package an ESP32-S3 module with a small speaker, and to write software that uses the speech recognition library provided by Espressif and integrates with my existing home automation environment.

Key features are

  • low cost: a small device based on an ESP32-S3, total cost of materials ca. €20

  • privacy: self-contained speech recognition in the device, no audio streaming to a server

  • dynamic: automatic configuration for voice-controlled items defined in openHAB

Source code and schematics are on GitHub, and more information is in my blog.


Awesome work! Can’t wait to play with it a bit.


Should it also work with this?

If this is merged, there will be no need for Rhasspy anymore, right?

Once that core PR is merged, there will be a WebSocket API to transmit PCM audio to the openHAB server and handle STT, interpretation and TTS there, so yes, you won’t need Rhasspy anymore. But you will need to adapt existing projects to openHAB’s audio WebSocket API.


Yes, shouldn’t be too difficult to adapt my firmware to the onju-voice hardware. I don’t have that hardware, so I can’t test it, obviously.

One caveat: the onju-voice hardware uses the same I2S interface for audio input and output. Therefore the sample rates for microphones and speaker must be the same. The esp-sr library expects a microphone signal sampled at 16’000 Hz, but some of the TTS voices in Rhasspy produce a WAV file sampled at 22’050 Hz, including the Larynx blizzard_lessac voice at medium quality that I have used so far.
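To make the caveat concrete, here is a minimal Python sketch that reads the sample rate from a WAV header and compares it with the 16 kHz that esp-sr expects. It is not part of the firmware; the function names and the silent test file are made up for illustration.

```python
import io
import wave

ESP_SR_RATE_HZ = 16000  # microphone rate expected by esp-sr, per the post above


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate stored in the header of a WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def matches_mic_rate(wav_bytes: bytes, mic_rate_hz: int = ESP_SR_RATE_HZ) -> bool:
    """True if the TTS output could share one I2S clock with the microphones."""
    return wav_sample_rate(wav_bytes) == mic_rate_hz


def make_silent_wav(rate_hz: int, n_frames: int = 160) -> bytes:
    """Build a short mono 16-bit WAV of silence (demonstration data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


if __name__ == "__main__":
    for rate in (16000, 22050):
        ok = matches_mic_rate(make_silent_wav(rate))
        print(f"{rate} Hz WAV usable on a shared I2S bus: {ok}")
```

A voice that fails this check would have to be resampled, or replaced by a voice that produces 16 kHz output, before it could be played on hardware with a single shared I2S interface.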

I don’t understand what the PCM audio WebSocket feature for openHAB has to do with my project. The ESP32-S3 machine communicates with openHAB over MQTT only, to report recognized commands, and with Rhasspy over MQTT only, to receive the WAV file. There is no audio streaming between the device and openHAB.

I am currently experimenting with a standalone TTS server in a VM, as an alternative to using Rhasspy. TTS is then done via a small shell script called by openHAB, which sends the text to the TTS server, retrieves the WAV file (via HTTP) and feeds it to the ESP32-S3 machine (via MQTT, pretending to be Rhasspy).
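The relay described above could look roughly like the following Python sketch. It is only an illustration of the flow (text in, WAV over HTTP, WAV out over MQTT), not the actual shell script. The TTS URL, its `text` query parameter, the site id and the request id are hypothetical placeholders, and the topic layout is an assumption modelled on Rhasspy's Hermes audio server topics. The MQTT client is passed in as a plain `publish(topic, payload)` callable so the sketch needs only the standard library.

```python
import urllib.parse
import urllib.request
from typing import Callable

# Hypothetical placeholders, not taken from the project.
TTS_URL = "http://tts-server.local:5000/api/tts"
SITE_ID = "esp32-satellite"


def fetch_wav(text: str, url: str = TTS_URL) -> bytes:
    """Ask the TTS server to synthesize `text` and return the WAV bytes."""
    query = urllib.parse.urlencode({"text": text})
    with urllib.request.urlopen(f"{url}?{query}", timeout=10) as resp:
        return resp.read()


def play_topic(site_id: str, request_id: str) -> str:
    """Topic in the style of Rhasspy's Hermes audio server (assumed layout)."""
    return f"hermes/audioServer/{site_id}/playBytes/{request_id}"


def speak(
    text: str,
    publish: Callable[[str, bytes], object],
    fetch: Callable[[str], bytes] = fetch_wav,
    site_id: str = SITE_ID,
    request_id: str = "tts-1",
) -> None:
    """Fetch speech for `text` and hand the WAV to the device over MQTT."""
    wav = fetch(text)
    if wav[:4] != b"RIFF":  # every WAV file starts with a RIFF chunk
        raise ValueError("TTS server did not return a WAV file")
    publish(play_topic(site_id, request_id), wav)
```

With the paho-mqtt package, `publish` could be the `publish` method of a connected client; any other MQTT client with a topic-and-payload call would do as well.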

Regarding “no need for Rhasspy anymore”: if you don’t need commands that include numbers, such as “start kitchen timer for 15 minutes”, then yes, the combination of openHAB with the ESP32-S3 machine and a standalone TTS server, e.g. using piper, can satisfy all your voice control needs.

But can your version handle this?

start kitchen timer for 15 minutes

And can it also stream back the audio that was generated by TTS in openHAB?

IIRC the PR allows bidirectional audio, so yes.

No, my ESP32-S3 gadget can’t handle this type of command. That’s why I will keep one Rhasspy satellite based on a Raspberry Pi, in the kitchen.


So it’s finally merged.

It would now be easy to have the voice assistant integrated.