Willow - Open Source Echo/Google Home Quality Speech Hardware for $50

Intent recognition is no easy task, but there are already countless approaches being taken in different projects.
I use Rhasspy as a voice assistant, and there is a nice overview of intent recognition approaches in its docs.
I’d use that as a starting point for my research.
Once you have the spoken sentence as a string, intent recognition is (more or less) easy, imho.
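To illustrate the idea, here’s a naive keyword-based matcher in Python, roughly the simplest of the approaches the Rhasspy docs compare. This is purely a sketch - the intent names and phrases are made up for the example:

```python
# Illustrative sketch only: map a transcribed sentence to an intent by
# keyword/regex matching. Intent names and patterns are invented examples.
import re

INTENTS = {
    "turn_on_light":  re.compile(r"\bturn on\b.*\blight", re.IGNORECASE),
    "turn_off_light": re.compile(r"\bturn off\b.*\blight", re.IGNORECASE),
    "open_blinds":    re.compile(r"\bopen\b.*\bblinds\b", re.IGNORECASE),
}

def recognize_intent(sentence: str) -> str | None:
    """Return the first matching intent for a transcript, or None."""
    for intent, pattern in INTENTS.items():
        if pattern.search(sentence):
            return intent
    return None

print(recognize_intent("Please turn on the kitchen light"))  # -> turn_on_light
```

Real projects use more robust techniques (grammars, fuzzy matching, ML models), but the input is the same: the transcript string.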

2 Likes

Great! On wake word, we have an issue tracking that. There are parameters that we can change but then there are also some “fundamentals” in terms of what comes “baked in” the wake word models from Espressif we currently use. We want to get a lot of feedback and testing to not only improve the built-in wake words with parameters we have control over but also tweak the speaker selection, pronunciation, etc for “Hi Willow” or “Hi openHAB”. We’ll likely do “Hi Willow” and as part of that we would be happy to do “Hi openHAB” or whatever the community wants with pass-through cost from Espressif. If the OH community is interested in Willow, wants a wake word, and can fund it we would do it in a second!

If you want to jump in that issue and/or Github discussions we can walk through some basic steps to debug your wake failure issue.

Yeah… I used to have an AGX myself, but even I sold it because, while they’re neat, the economics just don’t work out for this application: a Xavier NX dev kit currently costs more than 2x as much as a used gaming laptop with a GTX 1070 that blows away the Xavier NX in every regard (except for maybe $1/mo in electricity). If you want a quadcopter, a robot, or a polished (expensive) commercial product to have local CUDA, they’re the way to go; otherwise I’m just kind of “meh” on them.

That said, I’m not fundamentally opposed to supporting the NX/AGX, etc. for people that already have them, but there are issues there now and there will be more as we eye standardizing on CUDA 12, etc. I’d also be really curious to see what the performance is like - as I have talked to death re: TPUs, they just weren’t really designed for use with larger models like speech recognition.

This is correct.

This is exactly our approach - our current focus is to support great far-field speech hardware in the physical world, wake, do the best speech recognition possible, and send the results wherever you want. Cheap. FAST. Accurate.

We’ll be getting into the intent stuff with the forthcoming Willow Application Server. I have quite a bit of experience with intent recognition, NLP, etc. and I think we’ll be able to do some really interesting things not seen before in the open source space. Willow and WIS fundamentally enable speech patterns, languages, grammar, speed, and quality that were more-or-less thought to be impossible/impractical previously, and we want to have broader ecosystem support for those kinds of things. Such as:

“Put a lunch with Homer Simpson on my calendar for noon tomorrow at Giordano’s on Clark Street” (or whatever). We currently recognize and transcribe such speech segments very well - all of that intent recognition and API integration “just” needs to get wired up :slight_smile:.
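To make that concrete, this is the kind of structured result an intent layer would eventually need to produce from such a transcript before any calendar API gets called - the field names here are hypothetical, not a Willow schema:

```python
# Hypothetical structured intent for the calendar example above; the schema
# is illustrative, not what Willow or WAS actually emits.
calendar_intent = {
    "intent": "create_calendar_event",
    "title": "Lunch with Homer Simpson",
    "date": "tomorrow",          # would be normalized to a concrete date
    "time": "12:00",
    "location": "Giordano's on Clark Street",
}
```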

1 Like

Funding it might be a challenge. I’m not sure this would fit with the allowable expenses the foundation can spend money on and retain its non-profit charity status in Germany. It’s definitely borderline. @hmerk, I’m thinking the answer would be no, but you are much more knowledgeable on the legal ins and outs of what the foundation can spend money on.

Your guess is correct: this cannot be funded by the foundation, as it is not covered by its purpose, which is consumer advice and consumer protection as stated in the constitution.
Spending money for other purposes puts our non-profit / charity tax status at risk.

The only chance I can see is a “private” fundraising effort by members of the openHAB community.
But it would need someone to organise it.

Yes, that’s understandable. No idea what Espressif would charge (@kristiankielhofner ?); if it is reasonable, we could perhaps arrange “private” funding. Made some returns on nvidia stock recently ;-).

Apologies, I should have been clearer - direct foundation funding is not what I was referring to.

Wake words certainly aren’t specific to the “command endpoint” (HA, OH, etc) of Willow. They are specific to Willow or any other ESP-SR based project/product.

Our plan is to (essentially) crowdsource potential wake names. I will be a little “BDFL” in that I won’t support/submit wake words that are completely impractical - less than three syllables, for example. Many people have asked for common words, words with fewer than three syllables, etc., and I don’t want others to donate their money with a certain expectation of user experience that such wake words fundamentally can’t meet.

Other than that we will take each of the finalist wake words and make a Kickstarter (or similar) campaign for each of them. The ones that reach the target goal for production will get made and added to Willow. I think “putting your money where your mouth is” for wake words is very apropos in this scenario :slight_smile:. For “Hi openHAB” or similar the OH community can independently and separately from OH officially contribute and get it done.

I haven’t reached out to Espressif yet, but I imagine it’s more expensive than people unfamiliar with the process would guess. The monetization strategy of Willow is for commercial applications in privacy-sensitive use cases (healthcare, etc.) where tens of thousands of dollars (for purposes of discussion) for a custom branded wake word is nothing in the larger scale of the project.

Regardless, we will almost certainly be self-funding and creating “Hi/Hey Willow” or similar and will make that available to the community as the default. This benefits us for various reasons, not the least of which is “Hi ESP” is actually fairly difficult for a lot of people to trigger reliably as noted by @mvbergen. It’s a pretty terrible wake word and many people have much better success with Alexa (which we also include) - they just don’t like having to say it!

1 Like

For all the Marvel fans, Jarvis would be nice as a wake word :wink:

2 Likes

Hah, this is actually a great example! As a Marvel fan I appreciate it but the broader industry has learned over the years that three syllables or more is really the only way to do wake in the real-world without causing a significant number of false wakes. It’s essentially a fundamental rule at this point.

I’m also a Star Trek fan, and many have suggested “Computer”. Interestingly, Amazon supports this as an alternate wake word for Alexa, but I fear it would also cause a significant number of false wakes due to its broad and frequent use in casual conversation, media playing in the background, etc.

I don’t want to be “Mr. No” but with Willow I’d like to try to find some balance of enabling the open source community to do the fun, different, and somewhat quirky maker things we’re known for while also not handing users footguns left and right.

I am using it for my Echo in my office and have not had any issues with that.

On the other hand, years ago we watched the TV series “Castle” a lot, and our Alexa in the living room had many false wakes when Mr. Castle called his daughter, Alexis…

I respect and appreciate this but another challenge with wake is that, more than ever, anecdotal experiences extrapolated to broader use are very dangerous.

You know how people can almost immediately recognize the voice of a friend, family member, actor, or singer? Or how we can immediately discern unique regional accents (I can tell if you’re from Chicago with a single word)? Wake attempts to turn the fundamental human quality of distinct speech patterns on its head - on a resource-constrained device, in often challenging acoustic environments.

We have a worldwide base of users with drastically different environments, accents, languages, etc. Even in the early days of Willow, we are already seeing and hearing reports of drastically different experiences with wake word function and performance. There was a Willow user recently for whom wake is 100% reliable - but his wife almost can’t get it to work at all (with “Hi ESP”). He had a somewhat humorous anecdote: his wife was repeatedly failing to wake Willow, so he demonstrated by speaking from roughly 30ft away (with music in the background), and Willow activated and successfully executed the command. She then attempted to mimic him while standing next to the device, and wake still failed for her.

Then we have other examples of users reporting their families with children across ages, etc essentially playing with Willow for extended periods of time.

This is why wake word choice is important - many users find they simply cannot get their voice - pitch, accent, cadence, etc. - to work with a given wake word. Or they have an environment where (for a variety of reasons) they get failure to wake or false wakes very frequently.

Don’t get me wrong, I’m not trying to convince you of anything; I just wanted to describe my experience so far.

Oh I understand that, this is just a very unique field and challenge. I’m learning again and again that people who haven’t spent years in this field (surprise surprise) don’t have a complete understanding of the vast issues at play.

In these explanations I’m attempting to impart these challenges and why it’s important to heed the hard lessons learned over the decades.

1 Like

So “HAL” isn’t a good choice. Though you couldn’t trust that it would do what you ask anyway. “HAL, open the garage door.” “I’m sorry, I can’t do that Dave.” “Who’s Dave?”

“GLaDOS” is probably not a good idea either. Always the promise of cake and never delivering.

“WOPR” would just want to play tic-tac-toe.

“MU-TH-UR” would just generally be unhelpful.

“Omnius” would just want to take over. (I’m probably getting too obscure now.)

4 Likes

The hardware finally came in stock. Yay! I bought 2 and flashed them with Willow.

Overall, I’m quite happy, so kudos on what you’ve created so far! However, there are a couple of things that I find a bit frustrating.

  • After not being used for a while, almost always I need to say “Alexa” 2 or 3 times before the screen wakes up. This occurs even when I’m “close” to the device (2-3 feet away) and there’s no background noise.

  • I find that I need to speak pretty loudly even when I’m relatively close to the device (5 feet) and there’s no background noise.

  • There are a few words I use pretty frequently that are detected inconsistently. I suppose this is the nature of the beast, but thought I would mention anyway. I’m training myself to speak a bit more slowly and clearly to reduce the occurrences of this.

    • “pool table” is sometimes also detected as “pull table”
    • “rec room” is sometimes also detected as “wreck room”
    • “sun room” is sometimes also detected as “summer”, “sunroom” and “sub room”

Definitely looking forward to wake words other than “Alexa” and “Hey ESP”. I don’t like saying “Hey ESP”, as it just feels awkward to say. And I thoroughly dislike having to say “Alexa” lol.

Also, it would be really nice to know which device generated the command, but I don’t know how that can be done given the current openHAB implementation.

Again, thanks for the excellent work you’ve done so far!

Great! I’m glad it’s gone fairly well so far.

You seem to be experiencing some odd issues we don’t see a lot of…

Of course, we Willow developers “eat our own dogfood” and use our Willow devices throughout the day. As one example, I have a Willow device in my bedroom. All it does (more or less) is turn off lights, switches, etc. when I go to bed - so there are roughly 24 hours between commands. I haven’t experienced this issue, and we haven’t had any other reports of it. One of the many challenges in this field is the limitless number of variables - speakers and environments being the two biggest. We also run a variety of automated tests that check for this.

Is this for wake, speech recognition, or both? Again with the dogfood approach: I have another Willow device in my loft, and I’m able to wake it and issue commands at normal speaking volume from downstairs (around several corners, with multiple large monitors in the way - Willow is on my desk). We’ve also had many users give glowing reports of issuing commands from 25ft away. Same with commands to my bedroom Willow device from the bathroom, which is probably 20ft away and around a corner.

This is the beast of speech recognition… We use Whisper (like almost everyone else does), and it has very high quality. That said, there are many parameters that can be tuned on a per-request basis. Our defaults balance accuracy vs speed well, but I’d be interested to see how much better your experience could be by tuning some of these in your configuration.
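As a rough illustration of the kind of knobs involved, here is a minimal sketch using the open-source openai-whisper package - Willow’s inference server (WIS) exposes its own configuration, whose names and defaults may differ:

```python
# Sketch using the open-source openai-whisper package; WIS parameter names
# and defaults may differ from these.
import whisper

model = whisper.load_model("medium")  # larger models: slower but more accurate
result = model.transcribe(
    "command.wav",
    language="en",       # skip language detection when the language is known
    beam_size=5,         # wider beam search: slower, usually more accurate
    best_of=5,           # number of candidates considered when sampling
    temperature=0.0,     # deterministic decoding; raised only as a fallback
    initial_prompt="pool table, rec room, sun room",  # bias toward domain vocabulary
)
print(result["text"])
```

An initial prompt like the one above is also one possible mitigation for domain-specific words such as “rec room” being transcribed inconsistently.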

Haha, don’t I know! Possibly related to your first point - the wake word is actually “Hi ESP”, so depending on which you use more regularly that could be one of the reasons for wake failures. Either way, it’s just about the worst wake word I can think of, and it’s unfortunate that it’s one of the few out-of-the-box options from Espressif… We’re really looking forward to getting more feedback on additional wake words so we can commission one or more that are more palatable to the community.

This is something we’re working towards - there are some really interesting possibilities I’ve highlighted before, like being able to say “turn off the lights” and having the Willow device use an association between the area where it is physically installed and the associated devices. We’d also use this as the base for addressing another issue - multiple devices in proximity simultaneously waking and attempting to issue commands.
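A toy sketch of that association, purely illustrative - the device IDs, area mapping, and resolution logic below are assumptions, not Willow’s actual data model:

```python
# Hypothetical device-to-area association for resolving room-relative
# commands like "turn off the lights". All names are invented examples.
AREAS = {
    "willow-kitchen": ["kitchen_light", "kitchen_blinds"],
    "willow-bedroom": ["bedroom_light", "bedroom_lamp"],
}

def resolve_targets(device_id: str, command: str) -> list[str]:
    """Return the entity IDs a room-relative command should act on."""
    if "light" in command.lower():
        return [e for e in AREAS.get(device_id, []) if "light" in e or "lamp" in e]
    return AREAS.get(device_id, [])

# "turn off the lights" spoken to the bedroom device:
print(resolve_targets("willow-bedroom", "turn off the lights"))
# -> ['bedroom_light', 'bedroom_lamp']
```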

With no shade to openHAB, the current API is very, very limited. All it supports is the transcript, unfortunately, and the response has virtually no usable status other than the command result. We experience limitations like this with HA as well. I suspect that because Willow more-or-less came out of nowhere and offers functionality never before seen in the ecosystem (free-form speech, speaker identification, automatic multi-language support, etc.), the various APIs we integrate with will have to “catch up”, so to speak.

For your issues, do you mind going to our GitHub and creating issues for them? We’re preparing for the 1.0 release (early July), and we’d like to get these addressed for you.

I’m not surprised haha. I always seem to run into stuff that nobody else does. lol

With respect to the multiple “Alexa” commands to wake up, as well as the distance issue, would it be helpful to see a picture of where the two devices are located?

Happy to try out tuning some of the parameters. Could you suggest what parameters to try as starting point?

Actually, I never use the ESP wake word, which probably is why I wrote “Hey ESP” instead of “Hi ESP”. :wink:

Yes, exactly the scenarios I was thinking about.

I would like to say “open the blinds” or “close the blinds”. It should know if I’m in the kitchen, the bedroom, the sun room, or the rec room.

None taken. Instead of getting just the text, I was thinking it would be nice to get a JSON object containing the text, language, a Willow device id, and probably several other bits of data. But, you’ve certainly thought about this more than me, so would be interested to hear your idea(s) about how best to do this.

Not at all. Would you prefer separate issues for the wake word and the distance issue?

Hah, yep! There are always special users (the best kind) that seem to be able to break anything.

Sure! You can upload pictures with your GitHub issue(s).

If possible we should continue this on GitHub.

Exactly what I’ve been thinking. The issue is we’re a bit ahead here, but the benefit is we’re in a position to essentially define the open standard for this. We can send anything, but the command endpoint needs to be able to parse it.
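For a sense of what that could look like on the receiving side, here is a minimal sketch of a command endpoint that accepts either today’s plain-text transcript or a hypothetical richer JSON payload - the field names (text, language, device_id) are illustrative assumptions, not a spec:

```python
# Minimal sketch of a command endpoint; accepts plain text (current behavior)
# or a hypothetical JSON payload. Field names are assumptions, not a spec.
from flask import Flask, request

app = Flask(__name__)

@app.route("/willow", methods=["POST"])
def willow_command():
    if request.is_json:
        payload = request.get_json()
        text = payload.get("text", "")
        device_id = payload.get("device_id")  # hypothetical field
        language = payload.get("language")    # hypothetical field
    else:
        text, device_id, language = request.get_data(as_text=True), None, None
    # ...hand `text` (plus context, if present) to the intent layer here...
    return {"result": "ok", "echo": text}

if __name__ == "__main__":
    app.run(port=8080)
```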

Yes please, separate issues would be great!

Would you like me also to open an issue for the words that are detected inconsistently (i.e. sun room, pool table, rec room)?

Well, “Hey Willow” (or something like that) seems like a good starting point. It seems easy enough to say, and it doesn’t seem like there would be too many opportunities for false wake-ups (of course I’m certainly no expert). And, more parochially, I would be interested in “Hey openHAB”.

I suppose you could make it configurable in Willow whether to return just the plain text, or a JSON object. Then it’s up to the end user to configure their Willow(s) based on how they want to handle it. Is there any chance at all that this would get into the first release?

@rlkoshak WDYT? I suppose configuring a JSON response would break HABot if it sees a JSON object. Probably any other NLI that the user has configured, as well. So until these are updated to deal with the JSON object, users would need to keep their Willow configured for just the plain text.

I think the only way to process the JSON would be to set the default human language interpreter to the rules-based one. Then, in the rule, you can pull out the relevant information and go from there; from the rule, you can use the interpret action to direct just the transcript text to the interpreter of your choice (e.g. HABot).

That’s how I’d work around that for the time being. Ultimately I still do think that an add-on is going to be needed to fully take advantage of Willow in OH.

I’ve got my hardware and I’ve been playing with it. I need to set aside a couple of hours to write up a summary of my thoughts on it, though. I have some more things I want to try, too. I’m way behind where I want to be on this.

3 Likes