Willow - Open Source Echo/Google Home Quality Speech Hardware for $50

Yes, that’s what I was thinking.

Ah, that’s a good idea. I hadn’t thought of that. This way you get the best of both worlds. You get the text, which you can pass to the NLI, AND all the other metadata.

Yes, please!

The project name was specifically and carefully chosen with the wake word in mind: three syllables (with “Hey”), uncommon in regular speech, a well-known proper noun (a tree, a first name), and near-universal pronunciation that also happens to be gender neutral (for flexibility of male/female voice in the TTS response), etc.

I think it’s a good name too, but you would be amazed at how passionate people get on this topic… Everyone seems to want to train their own wake word, even though I repeatedly enumerate how bad an idea that is. People can always learn this the hard way, but I’d rather not have to deal with the inevitable flood of issues (which reflect poorly on the project) caused by the fundamental (yet understandable) lack of knowledge on this topic. People want one-syllable words, common words, words that are almost impossible to repeat consistently, etc. I’ve had to activate BDFL mode much sooner than expected!

“Hey openHAB” is one that would likely work well. We plan on selecting finalists and starting Kickstarter campaigns for them (with pass-through cost).

The inference server natively returns a wealth of information as JSON. Combined with known Willow device parameters (such as the location you mentioned), we have, and can provide, nearly anything relevant. The issue is that, due to the lack of support for a common format, we currently strip everything but the transcribed text for HA and openHAB. The REST endpoint already does pass-through for full processing when used with a custom REST endpoint.
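
To give a rough idea (illustrative only - these are not the exact field names), an ASR response carries something like:

    {
      "language": "en",
      "text": "turn off the living room light"
    }

along with timing and other inference metadata.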

I agree completely. Given that we’re platform agnostic, I’ve taken the stance that add-ons should be left up to their respective communities (openHAB, HA, Hubitat, etc). We will provide protocol implementation from our side, support, and documentation, but there’s no way we could possibly take on all of these platforms and their respective add-ons internally.

I wouldn’t expect any differently.

I just want to make it clear for the audience here that the current integration is a stepping stone, not where the integration will end up. The current limitations should be temporary.

Apologies but I’m not quite following. Are you saying that if I enter the openHAB REST endpoint as a custom REST endpoint in Willow that Willow will pass the full JSON object? Or, if I do that, will there be an auth issue as a result of not having the openHAB auth token?

I’ve opened 3 issues. Please let me know if there’s any additional information I can provide, or if I should change the placement of the devices. Also please let me know if there are any config parameters I should try tweaking.

Your help is greatly appreciated!

Ha ha, yeah, somebody needs to be the bearer of bad news. :rofl:

I saw them, thanks!

I’ll be getting back to you there shortly.

Do we have a quote from Espressif Systems for a custom wake word?
The requirements sound quite tough: Espressif Speech Wake-up Solution Customization Process - ESP32-S3 (ESP-SR documentation)

Got my S3 today! Looking forward to getting it up and running!

Hello,

It sounds great! I just ordered an ESP BOX to try Willow.

I have a question that I don’t think has been answered yet (please forgive me if it has):
Is there an API to access the raw audio, in both directions?

openHAB has AudioSource and AudioSink add-ons (to receive audio from, and send audio to, remote components). For example, we can send/receive raw sound from any PulseAudio server on the network, or send voice notifications to Sonos, Chromecast, etc.
This is a key part of using custom TTS and STT engines.
If Willow has such an API, we can design an add-on for it.

Hey @kristiankielhofner, sorry for the late reply. If using a WebSocket connection is an option for you, I have a project that I think can help you create a Willow add-on.

You can find the source code on my fork of the addons repo.

I didn’t share it before because the project is still unfinished and I had some problems with audio resampling, but I think the server part is now correctly implemented, and the migration to the openHAB 4 APIs has simplified the implementation. Don’t judge the code too harshly :slight_smile:.

It consists of a web UI that works with raw WAV audio, registers a sink and source in openHAB, and starts a dialog processor instance with them that can be triggered remotely. It also allows discovering the clients as openHAB Things, exposing the device state and options to openHAB, so I think the design can match what you need.

Let me know if you end up using it and need any clarification.

Hope I can get an ESP BOX soon to try this project.

Best regards!

Hi there,
I would also like to know if there is a possibility to tap the audio stream with openHAB in order to use the ESP BOX as a baby monitor as well?
Cheers, and I really appreciate your work!

I see that the new versions support sending a reply from OH to Willow and having it display the text on the display, optionally speaking it. Has anyone gotten that to work? I’m really confused about how that would even be set up. I currently have it set up with the Rule-based Interpreter under Voice (everything else on that page at its default) and can get the command from Willow into my item for processing in a rule, but how do I then send a reply? I don’t currently have any TTS add-ons installed, but I wouldn’t think I’d need one just to send text back.

I’ve just put a (back)order for the ESP32-S3-BOX-3. Looking forward to playing with Willow.

Just bumping this in case someone has any tips on configuring the openHAB side for replies. I’ve tried asking on their GitHub, and they just said I need to configure the Willow side to use the openHAB command endpoint, but otherwise, since nobody there uses openHAB, they can’t help with this side of the configuration. I can get an item populated with the incoming text, but I’m very lost on how, or whether, a reply can be sent back. I wouldn’t think I need to set up a text-to-speech service, since that’s handled on the Willow side; I think that’s just for doing TTS on an attached speaker? Reading pull requests on the Willow GitHub suggests it’s possible, if I understand correctly, but I’m not entirely sure.

My ESP32-S3-BOX-3 just arrived today. I’ve installed Willow on it, and also set up WAS. I was able to get it “working”.

Basically, you say “Hi ESP” (or “Alexa”) - the only two wake words you can use. When set up correctly, it will send a command to a VoiceCommand item in openHAB. You’ll receive the text of what you said. Willow does the STT for you (through WIS).
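
If you don’t have the item yet, it’s just a plain String item; in an .items file that’s one line (the name here is assumed to match what the Rule-based Interpreter expects):

    String VoiceCommand "Voice Command [%s]"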

That’s ALL!


You’ll need to write a rule to handle the received command / text to do what you want to do.

Currently Willow doesn’t do TTS, nor does it act as an audio sink, so if you want to respond with voice, you’ll have to use another means, e.g. play it through Alexa TTS via an Amazon Echo device.

So I installed HABot and call Voice.interpret(event.command, "opennlp"). Now I can say “Hi ESP, turn off the living room light”, and the VoiceCommand item will receive the command “turn off the living room light”; HABot is smart enough to understand it and actually turn off the corresponding light.

Here’s the actual rule written in JRuby, a one-liner:

received_command(VoiceCommand) { |event| Voice.interpret(event.command, "opennlp") }
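
If you’d rather not use HABot, a plain rule can match the text directly. A rough sketch (the item name LivingRoom_Light is just an example):

    received_command(VoiceCommand) do |event|
      text = event.command.to_s.downcase
      case text
      when /turn on the living room light/
        LivingRoom_Light.on
      when /turn off the living room light/
        LivingRoom_Light.off
      else
        logger.info("No rule matched: #{text}")
      end
    end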

I tried hooking it up to ChatGPT, but my free account doesn’t let me make many queries.

Yeah, that lines up with my experience too. It seems that even though the Willow side is capable of receiving a reply from the endpoint, openHAB only sends an HTTP status code instead of allowing you to send a custom reply. They are releasing a way to do TTS over a REST endpoint soon, so hopefully that can work as an (admittedly hacky) workaround until someone develops a more native way. Unfortunately, my development experience is far too lacking to contribute that, so I will just need to be patient.

I was just confused because there seem to be pull requests for the openHAB endpoint code on the Willow side that allow that, so I figured that if that’s accurate, there must be a way to take advantage of it.

Willow does TTS across all supported command endpoints. Make sure Willow Audio Response Type in the Willow Application Server web UI is set to “Text to Speech” (under Configuration → General). Our current TTS engine audio level is also really soft (fixed in the pending release), so you’ll probably want to crank the playback volume up to 80 or so.

Better yet, set your TTS URL to our preview WIS implementation, which has more reasonable volume and better voices:

https://wisng.tovera.io/api/tts

This is available under a branch in the Willow Inference Server repo today if you are self-hosting, and it will be released soon.

The issue with openHAB is that it doesn’t say much (read on).

Willow does TTS for all supported command endpoints - HA, REST, and openHAB if enabled (see above). With HA we parse out all of the responses, across both the legacy conversation API and the new pipeline WebSocket support. With REST we currently expect the endpoint to return a plain text body containing the text for TTS. We’ll be improving that soon - we weren’t sure how people would use it and wanted to make it as simple as possible.

The speech text/values for all of these are also what is shown on the display.

The issue with openHAB is (as you say) that it only responds with an HTTP status code, so all we can do is print and say “Success” (HTTP 2xx) or “Error” (anything else). With openHAB it likely makes more sense to leave the audio response type set to Chimes, because that’s the level of granularity you get anyway.
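
If you use Willow’s custom REST endpoint instead, your own endpoint can return whatever text you want spoken and displayed. A minimal sketch in Ruby (assuming, per the above, that Willow POSTs the transcript as a plain-text body and speaks the plain-text response; path and port are arbitrary):

    require "webrick"

    server = WEBrick::HTTPServer.new(Port: 8000)
    server.mount_proc "/willow" do |req, res|
      command = req.body.to_s          # transcribed text from Willow
      # ...do whatever you want with the command here...
      res.content_type = "text/plain"
      res.body = "I heard: #{command}" # Willow speaks/displays this body
    end
    trap("INT") { server.shutdown }
    server.start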

Does anyone know if there are plans for the openHAB API to respond with some kind of real text status, so we can say something meaningful :slight_smile:?

What would it say?

With the current approach, Willow sends an HTTP POST to the default interpreter? So OH responds with the HTTP code indicating success or not. Because it uses the same Item for all commands, even if it did respond with some text, it wouldn’t be terribly useful text: “Successfully commanded VoiceCommand to turn on the bedroom light”. Then there’s all the fun of internationalization (OH users speak a lot of different languages).

However, having said that, it might be possible to create a new interpreter and modify the API to return a response. But that’s not going to be very configurable from the end user’s side, at least not without creating a huge addition to this part of OH.

What @tardismechanic and others want is to be able to push text to Willow for it to say. Then OH could push some custom response to Willow during or after processing the command in a rule. Which leads us back to the whole conversation that to really have a useful integration between Willow and OH, it’s going to require an add-on to handle the communication between the two.

I would look to HA for inspiration on this. Their conversation REST API and Assist pipelines via WebSocket can say anything. The built-in intent handling is pretty basic, but people are going wild expanding it. Out of the box, HA will say “turned [on|off] $entity_type” where that is a switch, light, fan, cover, etc. You can ask it things like “How many lights are on?” and it will respond with 36 (or whatever). There are various other intents for garage doors, setting light brightness, etc. They are expanding these built-in intents at a brisk pace.

Then of course there’s well beyond with things like asking “What’s the thermostat set to?” and any number of other things. People have wired up all kinds of stuff as you can imagine.

I don’t want to keep making the comparison to HA, but with Home Assistant, Willow supports any language supported by Whisper (speech to text) - around 40 or so. Willow sends Home Assistant the automatically detected or preconfigured speech-language ISO code per request, and Home Assistant can also provide the language in the response. Full internationalization of strings is supported everywhere. Willow has users speaking Korean, Japanese, Russian, etc. without issue, so even non-Roman character sets are supported.

Other than passing around the language code, Willow doesn’t care about or even know the difference when it comes to language and speech.

Great timing! We have a notification API coming in the next release. You can send an HTTP POST to the Willow Application Server with a JSON body containing all kinds of stuff - send to a single defined device or broadcast to all, flash the display (and at what interval), set the volume, set the display brightness, set the text to show on the display, set the notification repeat count, play back an audio file URL, play back TTS, etc.
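
For a rough illustration, something like this (the URL, port, and field names here are hypothetical placeholders - the notification playground mentioned below renders the real curl command and schema as you click around):

    require "net/http"
    require "json"

    # Hypothetical endpoint and fields - check the WAS playground for the real schema
    uri  = URI("http://your-was-host:8502/api/notify")
    body = { text: "Laundry is done", volume: 80, repeat: 2, backlight: true }
    res  = Net::HTTP.post(uri, body.to_json, "Content-Type" => "application/json")
    puts res.code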

We also make use of the “Cancel” button on the display, so in the case of broadcast, high repeat counts, whatever - when you hit Cancel on one device, it stops/cancels that specific notification on all of them. I use this as an alarm clock every day, and when my half-dozen devices get going with max display brightness, max volume, and flashing, it tends to get you out of bed :wink:

We also take care to make sure the audio playback is synced in the case of broadcast to multiple devices.

Our web UI also has a notification playground where you can play with all of these params and send your notifications. It dynamically renders the equivalent curl command in real time as you type/play/click around, and it does the same for the HA config necessary for HA to call the notification. If someone can give an example of how to do this with openHAB, we’ll definitely add it.

This could be leveraged by OH in the way you describe, but the UX will be janky. What you’ll end up with is the initial Success/Error, and then some milliseconds later (tens to hundreds, depending on a variety of factors) the display will redraw and do whatever the notification does. You’ll also have the challenge of determining which device issued the command so you know where to send the notification back.

All I can say is OH isn’t HA, and we don’t have the advantage of being backed by a for-profit company with roadmaps and top-down directed development. What gets implemented is what people volunteer to implement.

What OH lacks right now is:

  1. a volunteer to make changes to that REST API endpoint, along with the text interpreters, to support it
  2. text interpreters that support meaningful responses in the first place
  3. anything in the architecture that would allow an end user to customize what is returned by that REST API endpoint, assuming we had item 2
  4. a UI to support customizing what is returned by that endpoint
  5. a volunteer to develop an add-on implementing a more native interaction between OH and Willow

Until we have 1-4, or 5, :person_shrugging: all we have are rules where we can receive the text, interpret it, do the action, and generate a customized response.
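
In the meantime, those pieces can be combined in a single rule: receive the text, interpret it, and push a custom spoken response back through the Willow notification API described above. A sketch only: the WAS URL and payload fields are placeholders, it assumes Voice.interpret returns the interpreter’s reply (worth verifying), and it doesn’t solve the which-device-issued-the-command problem noted earlier:

    require "json"

    received_command(VoiceCommand) do |event|
      # Let an interpreter handle the command; assume its answer comes back as a string
      answer = Voice.interpret(event.command, "opennlp").to_s
      # Push a spoken/displayed response back via the (hypothetical) WAS notification API
      payload = { text: answer.empty? ? "Done" : answer }.to_json
      org.openhab.core.model.script.actions.HTTP.send_http_post_request(
        "http://your-was-host:8502/api/notify", "application/json", payload, 3000
      )
    end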