I hope this way it can be easily indexed by search engines and many people who look for it can find it online.
The steps are elaborate but not hard: it basically boils down to installing all the add-ons and configuring them, recording a whole bunch of audio to train the wake word detection, and creating some glue logic items and automations.
I wanted to submit this topic here so users could try out to get the setup working for them and so that I could help them with a discussion here if things are unclear. I can then go back and iteratively improve the blog post.
I’ve been looking at how to integrate something like this, but got stuck after getting Whisper and Piper working, so I will definitely be checking out the blog for implementing wake word detection, a working interpreter, and an LLM when I get the time to dive into this.
I will definitely post back once I try this out. Also, once you have been running this for some time, do feel free to update the post with your experiences with the setup.
Also, what language have you ended up using? I haven’t read your blog in detail, but I see Dutch there; how has that fared? I’m in Denmark, and my initial trials with both TTS and STT weren’t great in terms of accuracy.
I have done everything in English because I found there was a large gap in accuracy and quality between multilingual models and the ones that solely recognize English. For TTS, in Piper specifically, I use ‘Ljspeech’. For Whisper I use ‘tiny.en’ because otherwise the delay before the response becomes too large.
For a similar reason I use ‘Qwen3:1.7b’ in Ollama, because larger models would add too much delay. For rustpotter I do use the ‘large’ model, because otherwise I found it too inaccurate, triggering on many false positives. The delay of rustpotter mainly seems related to the length of your wake word, with shorter wake words having a smaller delay.
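For anyone running these pieces outside the Home Assistant add-ons, a stack with the same model choices can be sketched as a docker-compose file. This is only a sketch based on the Wyoming project's commonly published images and default ports, not the exact configuration from the blog post; image tags, ports, and the exact voice name may need adjusting for your setup:

```yaml
# Sketch: Wyoming STT/TTS services plus Ollama, using the small
# English-only models mentioned above. Ports are the usual Wyoming
# defaults; verify them against the images you actually pull.
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model tiny.en --language en   # English-only model keeps response delay low
    ports:
      - "10300:10300"
    volumes:
      - ./whisper-data:/data

  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-ljspeech-high     # English ljspeech voice (name is an assumption)
    ports:
      - "10200:10200"
    volumes:
      - ./piper-data:/data

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ./ollama-data:/root/.ollama
```

After the Ollama container is up, the small model can be pulled once with something like `docker exec -it <ollama-container> ollama pull qwen3:1.7b`, and Home Assistant can then point its conversation agent at `http://<host>:11434`.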
All of this together runs on an i5 4690K, and it struggles a bit at times.