Willow - Open Source Echo/Google Home Quality Speech Hardware for $50

Well, at least in my home the wrong Alexa quite often reacts, i.e. not the closest one. I should add that it only happens with Echos of different hardware generations, maybe because of different microphones. It's quite annoying, and the only way to stop it is to give them different wake words. I wish Alexa had a calibration function to avoid this.
So maybe this is an area where Willow could do better. I'm really looking forward to this gadget.

That’s a really good point and a great example of one of the issues with far-field voice.

Many (all?) open source speech efforts that have come before Willow operated under the assumption that a "microphone is a microphone" and you can just throw together random hardware components (a Raspberry Pi and whatever mic you pick) and have an Alexa experience.

You say otherwise, and you're right! There is a tremendous amount of audio and acoustic engineering that goes into these things, down to the specification of the enclosure plastic, the microphones, the other internal components (in the "cavity") - even the microphone holes in the enclosure come into play. It's hard, to the point where Amazon can't quite get it right between different hardware revisions (though I've still always had a very good experience).

I love the ESP BOX because - get this - Espressif actually had it tested and qualified by Amazon themselves for wake word detection, audio quality, etc. as an Alexa enabled device:

So when we say Alexa quality, we really mean it - and Amazon agrees!

Our goal is to best Echo+Alexa in every way possible while being open source, private, and trustworthy. So we'll certainly try!

Since others have mentioned it, I'll list my current Google Assistant (GA) and Alexa use cases as just another data point.

Our household is maybe a little different from most here in that we have tons of Nest Home Hubs and speakers around, but we almost never use them for home automation, not even the TTS announcements I brought up previously. I had those for a while but my SO found them to be creepy.

About the only thing we do on occasion, from a home automation perspective, is trigger the garage door opener from a tile driven through the Home app (though that changed in Android 13 and I never went back to fix it) or turn all the lights on/off.

Our house is not a Star Trek house. “Computer, earl grey, hot”.

But we do play music and will sync some or all the speakers in the house, especially on cleaning days. My 10-year-old watches YT videos on the kitchen screen (mostly Minecraft stuff and Mark Rober, the latter of which I highly recommend), plays his favorite music, and sends broadcast messages throughout the house. And of course to fact check his parents in real-time.

All the Hubs show highlights from our latest and favorite photos (wish they did videos too), along with the weather and such.

In the kitchen we use the Hub as a cookbook.

The grandparents use theirs to video call us.

When my internet goes offline, the Hubs are usually the first to know and change their screen to tell us, making them excellent visual Internet status indicators.

For the most part what we love about them is they fade into the background when not needed but respond well (usually) when called upon.

I have one Alexa Echo Show, given to me for free because Amazon ended the program that allowed our printer to order toner directly when it runs low; that now has to go through the Echo. I find the Echo to be way busier and pushier than the Hubs, showing lots of current news, pushing Echo features, and such. It definitely demands way too much attention, so here it sits on my desk, turned around. The only reasons I keep it are that the printer's ability to order toner has been a life saver, and that through the Amazon Control binding I can do minimal interaction with my thermostat and outdoor low-voltage lighting, which support Alexa but have no openHAB binding.

To summarize from this I’d like to see/use Willow to:

  • play music based on a voice search
  • play videos based on a voice search
  • sync speakers in groups (stereo too, though danger: here be patents)
  • ad hoc internet searches
  • recording and sending a voice message to another device in the house or to all devices in the house
  • rotating photo display, automatically updated (for a cloudless setup, Nextcloud or PhotoPrism integration would be cool)
  • cookbook
  • weather (obtained from OH?)
  • video streams pushed from IP cameras in cases where an event was detected (e.g. someone rings the doorbell, show the doorbell camera feed). My account got borked when I moved from Nest accounts to Google accounts so this has never worked for me but it would be nice. OH has the IP Camera binding which could be a source for this.
    a. I’d say that this implies an ability to push data to the Willow, not just receive queries from the Willow.

Would this reasoning take place on the Willow or on the openHAB side? If on the openHAB side, I'd recommend HABot as the reasoner. It's semantic-model aware and supports the NLP needed to interpret commands like these. Willow would just need to add "in the location" to the end of whatever the person said, based on the room the Willow knows it's in (a rough sketch follows the example exchange below). Some users are using HABot with chat (Nextcloud Talk, Telegram, etc.) to carry on two-way conversations with openHAB.

  • “The garage door has been open for a long time”
  • “Please close the garage door”
  • "The garage door is now closed

I’ve had the same experience with Google devices, though I get a nice little toast on my phone asking if the right device answered. Not sure it changes anything. Most of the time it gets it right though.


Generally speaking, the overall scope of Willow is (currently) quite narrow. I've done too much too early in the past and it doesn't work out well! Once you get past all of the fancy audio processing, wake word detection, speech recognition, etc., all we really do is:

  1. Wake
  2. Speech to text
  3. Send text somewhere (OH in this conversation)
  4. OH does whatever you have configured with the transcribed text
  5. Display speech to text transcript on LCD, result/status from OH, and give tone-based audio feedback for success/failure (more on this later)

I'm VERY new to OH (first install ever yesterday!) but here's a curl equivalent of what we do from the Willow device for OH:

curl -u $OH_TOKEN: https://$OH_INSTANCE/rest/voice/interpreters -H 'accept: application/json' -H 'Accept-Language: en' -H 'Content-Type: text/plain' -d 'turn off upstairs fan'

This uses the system default OH Human Language Interpreter, which in my case is currently set to the Built-in Interpreter.

Where the OH console reports:

10:50:59.543 [INFO ] [openhab.event.ItemCommandEvent       ] - Item 'upstairs_fan_Switch' received command OFF
10:50:59.544 [INFO ] [openhab.event.ItemStatePredictedEvent] - Item 'upstairs_fan_Switch' predicted to become OFF
10:50:59.546 [INFO ] [openhab.event.ItemStateChangedEvent  ] - Item 'upstairs_fan_Switch' changed from ON to OFF

For the audio feedback (basically text to speech) like what you're describing, we're still hashing out our overall strategy, but it will be similarly straightforward, with the initial approach likely being something as simple as "if the response contains audio, play it instead of the tones or built-in audio chimes for success/failure". We have a TTS engine in our Willow Inference Server and there is ongoing work to do TTS on device as well. I'm torn on allowing users to plug in a variety of STT/TTS engines: we are completely committed to an "Alexa or better" experience, and engines we haven't validated for quality and response time have the very real potential of ruining the experience for the user.

Thanks for the pointer to HABot! This is exactly the kind of feedback I was looking for in this thread. Being so new to OH I'm not in touch with the ecosystem and how the community is actually using it. I'll look into it, but generally speaking we're currently aiming for broader compatibility as opposed to requiring any extra steps on the part of the user. The onboarding for Willow and your existing install should be as simple as "point us there and we'll figure it out". For now, at least :slight_smile: .

This is what I was trying to say. I can’t comment on Alexa, but Google is wrong often enough that it erodes my confidence.

Honestly, a big part of the problem might be “me”. My voice projects so far that I can be heard across crowded rooms, even when I’m talking quietly. I can only speak to my own experience, but maybe I’m an extreme edge case for voice detection. :wink:

In a nutshell: room-specific interaction is great if it works super reliably, so it’s definitely worth shooting for. I’m just saying that I wouldn’t be disappointed to fall back on different wake words in order to get 100% reliability for my circumstances.

Now I get why there’s a strong “Amazon is too pushy” sentiment. My Nest Hubs are set to show photo slideshows, so that’s all they do. I had no idea that Amazon uses the Echo Show like a billboard.

Yes @kristiankielhofner, check out HABot - Yannick already put a ton of work into this project before he built the new UI.
Also check out the really cool project by Miguel.


It would be cool if that could be user configurable. With HABot installed (it's an add-on) there are three choices for the interpreter: built-in, HABot, and rules-based. The rules-based one requires/lets users essentially dump the text into a rule where it can be parsed and handled however they choose (call out to an NLP processor on the command line, a cloud service, or just brute force it).

OK, if I'm interpreting that REST API call correctly, it sends the text to whatever is configured as the default voice interpreter. So the end user can choose how to handle the text, and the OH user can change between interpreters independent of Willow, which I think is the proper way to do it. That's good! I can choose to have it handled by HABApp or write my own rules to process the text, in addition to using the built-in interpreter (which I admit I don't know the full capabilities of).

Settings → Voice
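
If you prefer poking at it from the command line, the same choices are exposed over the REST API. A rough sketch (the HABot interpreter id below is a guess - check the output of the GET call for the real ids on your install):

# List the human language interpreters installed on this openHAB instance
curl -u $OH_TOKEN: "https://$OH_INSTANCE/rest/voice/interpreters" -H 'accept: application/json'

# Send text to one specific interpreter instead of the system default
# ("opennlp" for HABot is an assumption - use whatever id the list above reports)
curl -u $OH_TOKEN: "https://$OH_INSTANCE/rest/voice/interpreters/opennlp" -H 'Content-Type: text/plain' -d 'turn on the kitchen light'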

It’s relatively new but OH’s Voice subsystem also now supports multi-step dialog (similar to the back and forth shown above). I’ve not played with that yet but at some point that’s going to be a pretty powerful way of interaction.

  • Are the lights in the master bedroom on?
  • Yes, three lights are on in the master bedroom
  • Turn them off

All I really know about them (beyond having watched the PRs go by) is the voice/dialog/start and voice/dialog/end REST API endpoints. There’s also the listenandanswer endpoint.

OH does support network discovery. So the onboarding theoretically could be as simple as Willow sending out a broadcast packet, listening for the responses, and looking for OH among them. Then you'll have the URL(s) (some of us have more than one OH instance running) to send the text to. If you use the URL you show above, the rest is just options in the standard OH settings, assuming the user wants to change the defaults. I can't think of an easier integration (though something will have to show on the Willow to choose which OH to send the text to; I'm not sure "all of them" is the right answer).
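
For example, from a Linux box you can see roughly what Willow would be looking for (the service type is from memory, so verify it; avahi-browse is just a stand-in for whatever mDNS library Willow would use):

# Browse for openHAB instances advertised over mDNS on the local network.
# The _openhab-server._tcp service type is my recollection - double check it.
avahi-browse -rt _openhab-server._tcp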

Another option/integration approach could be to create an add-on which presents a normalized API to Willow. Rather than deal with different APIs for each and every home automation system out there, you can settle on one and then integration becomes writing add-ons/modules/plug-ins on the HUB to speak your API. I’m not sure if that would be less work but it might make the code you need to write on the Willow itself cleaner.

And whether you support it or not, someone somewhere will eventually write a willow2mqtt bridge. :wink: MQTT could be another option. OH supports autodiscovery of a number of “standards” including HomeAssistant and Homie among others.
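
Just to make that concrete (nothing here exists and the topic layout is entirely made up):

# Entirely hypothetical topics: a willow2mqtt bridge republishes Willow's transcript
# as a small, discrete MQTT message that any hub can subscribe to.
mosquitto_pub -h $MQTT_BROKER -t 'willow/kitchen/transcript' -m 'turn off upstairs fan'

# ...and the hub (OH, HA, Node-RED, whatever) just subscribes:
mosquitto_sub -h $MQTT_BROKER -t 'willow/+/transcript' -v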

I think Google makes the device determination in the cloud instead of locally which makes it sensitive to network jitter and system clocks being even slightly off.

I’m just guessing though.

Yes, for example, I just installed HABot, set it as default HLI, and it works without any changes on our side.

Yes but authentication is still an issue. For Willow applications best practice is to use a long-lived API key.
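
For reference, we just use a standard OH API token. As far as I know it can be passed either way (the Bearer form is from memory, so verify it against your version):

# Basic auth with the API token as the username and an empty password
# (this is what the earlier curl in this thread does with -u $OH_TOKEN:)
curl -u $OH_TOKEN: "https://$OH_INSTANCE/rest/voice/interpreters" -H 'Content-Type: text/plain' -d 'turn off upstairs fan'

# The same token as a Bearer header - I believe OH accepts this too
curl -H "Authorization: Bearer $OH_TOKEN" "https://$OH_INSTANCE/rest/voice/interpreters" -H 'Content-Type: text/plain' -d 'turn off upstairs fan'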

I've talked about it a bit, but the monetization strategy for Willow is enterprise, where we'll have our own protocol going to a middle/application server to do whatever wild things they want to do. But that's just yet another thing for home users, so my thinking is:

  • Home Assistant
  • OH
  • Maybe one more
  • Generic (simple HTTP POST or something to an API with fixed grammar - sketched below)
  • WAS (Willow Application Server) protocol
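
To make the "Generic" bullet concrete - and to be clear, none of this exists yet; the endpoint and field names are invented purely for illustration - it would be something like a fixed-grammar POST:

# Purely hypothetical example of the "Generic" integration - endpoint and JSON fields are made up
curl -X POST "https://$YOUR_ENDPOINT/willow" -H 'Content-Type: application/json' -d '{"text": "turn off upstairs fan", "language": "en", "device": "kitchen"}'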

The WAS client support and specs will be open, so if we get important or interesting enough and people want to develop WAS modules for XYZ, they certainly can. That would almost certainly provide an even better user experience than the somewhat "lowest common denominator" result we get with the various APIs exposed by differing systems.

These are all really cool projects but (at the risk of sounding smug) Willow and the Willow Inference Server are already leagues beyond these efforts and engines. I keep running into this question, and the why and how behind it, so I made a technical deep dive/demo video:

Oh sorry - there’s also a quick demo from a Willow user in the UK showing speech recognition quality and speed for free-form text:

Sorry, one last edit: his response times are slow because the community WIS server is in Chicago and he's across the pond.

So instead of putting HA specific and OH specific stuff on the Willow side, leave it talking your own WAS protocol and create add-ons/plug-ins to connect that up to those systems. That’s what I’m proposing.

I can’t speak for HA, but installing an add-on to support a new device/technology is standard operating procedure for OH so that doesn’t really add to the installation/configuration in any significant way.

That lets you keep your Willow boxes clean and simple, and it means adding support for each new system becomes an almost completely separate effort that doesn't have to touch the core Willow code. It also lets you avoid that lowest-common-denominator problem, even between HA and OH: your WAS can expose everything Willow can support, and each home automation system's add-on can take what it can from WAS and ignore the rest.

When/if you support OH pushing stuff to the Willow boxes (TTS, audio/video streams, etc.) it’s almost certainly going to require an add-on anyway. May as well start out with one.

It’s just a thought.

I’m tracking but let’s use OH as an example. I’ve never even seen it before yesterday.

But I know how to read API documentation (or even reverse engineer it), and I went from pulling the Docker container to having our side built out in about an hour. When released, a user can point Willow at any OH install going back who knows how long and it just works. It also makes support on our side easier, because APIs tend to have better long-term support, versioning, etc. Add-ons tend to deprecate faster, and the only thing worse than not having one is having an old/broken one. We don't have the resources to maintain a bunch of modules in random programming languages across vast ecosystems.
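
For anyone who wants to reproduce that quick test, this is roughly all it took (a minimal throwaway setup - the official image documents more volumes and options):

# Minimal openHAB instance for API poking - not a production setup.
# See the openhab/openhab image docs for persistent volumes and environment variables.
docker run -d --name openhab -p 8080:8080 openhab/openhab:latest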

The platform-module-based approach means we have to do software development in a completely foreign ecosystem, figure out how to get it listed in whatever component manager, keep up-to-date documentation with screenshots, etc., etc. OH is Java and I've spent about an hour of my entire life programming in Java. I don't know how long it would take me, but let's just say I wouldn't be proud of the result!

So the plan is a handful of well-abstracted Willow integration modules on device, with the Tier 1 module being the WAS protocol. If community members (or even us, potentially down the road) are interested in developing an add-on for Home Assistant, OH, something I don't even know exists, etc., they can do so using the WAS protocol (basically impersonating the real Willow Application Server) and enjoy the same Tier 1 support in the Willow ecosystem that enables absolutely everything we're capable of.

The danger is that every time a major new version is released (OH 4 comes out in June), breaking changes to the APIs are allowed. I don't think there are any breaking changes to this part of the REST API this time, but that might be the case in the future for OH 5. We can't predict what will change, but even in cases where the API needs to change, breaking changes are kept to a minimum, so it should be relatively safe.

On the other hand, if someone wrote an add-on and submitted it to the OH project (not a requirement by any means, and it doesn't have to be you), the add-on maintainers often fix your add-on for you when breaking changes happen in core. We tend to go to great lengths to keep add-ons working once they are submitted to the project, to avoid their becoming deprecated or broken.

Anyway, all of your arguments are valid as well. But ultimately it means Willow's ability to interact with OH, and OH's ability to interact with Willow (is that even possible with what's proposed?), is going to be significantly limited, and potentially broken on future versions of OH if that API does change, until someone steps up to write an add-on.

If there is interest in Willow (which seems to be the case) and the WAS protocol is reasonable (I've no reason to think it won't be), I'm sure someone will write an add-on for OH to support it, or a WAS-protocol-to-MQTT bridge, which seems to be the approach everyone is taking these days anyway. That makes sense: why implement it X times for X hubs when you can implement MQTT once and all the hubs support MQTT?

My concern is that a REST-API-only interface, where Willow only pushes text to OH on that one API endpoint, is pretty limiting even compared to what's possible with Google Assistant, Alexa, HomeKit, and Mycroft.AI. If that's the only integration possible or planned, it's not very compelling. But as a first thing to get out the door, it sounds great!

I look forward to seeing what this WAS protocol becomes.

If there is enough interest, traction, etc. with Willow and a Willow module ends up in tree with support from the core devs, that would be ideal!!! We would LOVE that.

I just don't want to presume that will be the case, and without getting into it too much, oftentimes there can be misalignment in situations where some platforms consider us to be "competing" with whatever their native/alternative approaches may be. I don't see that being the case with OH, but it's bitten me before.

Thanks for the heads up on OH 5, I guess we’ll just have to see. We already have some compatibility API code for HA (I don’t like it, believe me) because they deprecated one of their voice APIs in a matter of months.

This is actually related to WebSockets - we have to detect feature/platform support for HA and switch between REST and WebSockets depending on whether support is detected. I would strongly prefer WebSockets with eventing to OH for the kind of functionality you describe, but I'm a little concerned about the fairly unique DIY "ping/pong within 10 seconds or we drop you" keep-alive requirement of OH. It might be a non-issue, but it does give me pause considering Willow runs on a microcontroller, the #1 priority is audio, and we have to count every byte and CPU cycle to ensure the best end-user experience while doing our best to support all of the functionality on the roadmap and whatever else is to come.

I have to say I'm a little perplexed by what I see some people/projects/etc. doing with MQTT. I like MQTT for certain applications, and I don't pretend to be the final authority on protocol design, but some of it just feels "wrong" to me.

One of the reasons I started Willow was a sense that this corner of the open source ecosystem does not fundamentally understand the nuances of voice. My former life (thankfully behind me) was 15 years in VoIP, WebRTC, etc., so I like to think I know voice and audio (and video, ugh) pretty well. I've observed an ever-growing graveyard of failed assistants, speech projects, etc. that (to me) seem to all stem from this fundamental lack of understanding. My "favorite" is all of the teams who seem to think you can slap a random microphone on a Raspberry Pi and get clean speech from across the room. It hasn't worked for the last 10 years because it never will, but you see it time and time again.

We are very open to alternative and competing products. There are a few rules on our marketplace to avoid confusion (e.g. we don't allow multiple versions of a binding that does the same thing/supports the same technology), but we do support alternate unofficial marketplaces.

In general we want people to find the best solutions for them. That's one reason, I think, we are not talked about on the wider Internet the way HA is. We aren't partisan and we aren't evangelists. We are home automation enthusiasts.

It’s one of the things I love about the OH community.

I'd say that's the number one reason people who switch from HA to OH give for the switch. OH tends to be very deliberate in its progress and puts a lot of value on backwards compatibility.

And I didn't mean to be alarmist about OH 5. OH 4 isn't even out yet and it's usually at least two years between major versions. I would hope someone has written a WAS add-on by then, or that some other, more complete approach is found.

If the changes are not breaking and they fit with the overall architecture, the OH core developers are amenable to changing things like that to better support external third-party connections. I can't promise anything, but a lot of the REST API endpoints exist only to support third-party integrations. If you find the way it works now to be a problem and have an approach that might work better, it would be worth opening an issue.

MQTT has some features that make it good at bridging between technologies. It's an open standard, has libraries for most popular programming languages, is relatively easy to implement and use, is lightweight enough to put on a microcontroller, and most services in the FOSS home automation realm support it out of the box. It may not be the best, but it has a low barrier to entry, which makes it good enough most of the time.

I’m not saying that a WAS to MQTT bridge is the best idea or even a good idea. But I can almost guarantee you that someone somewhere is going to write one and put it on GitHub.


Understandable and not an issue at all.

On that timeline, I’m sure of it!

I really appreciate this! I'm not saying it is or will be a problem necessarily, it just jumped out at me. It's not unheard of in terms of WebSockets, but compared to the WS support we have for HA, which doesn't have that requirement, it would need a lot of testing and validation on our side.

I don't disagree on any of these points! I'm just traumatized by VoIP. After A LOT of pain, the entire industry and ecosystem finally realized that realtime, latency-sensitive data isn't just "stuff some data in a packet and transport it however", as it was originally conceptualized. We KNOW that doesn't end well. Audio and video are a unique sensory experience for humans compared to a web page loading or a notification showing up. It's a completely different animal, and when I look at MQTT, and more specifically the architectures people assemble with it, I get flashbacks to the early and very, very rough days of VoIP.

Hah, I guarantee it. Caveat emptor.

MQTT is definitely not suitable for real-time streaming data of any sort, but it doesn't even pretend to address that. It excels at delivering small discrete messages, with or without guarantees of delivery, persistence, etc. And it's amazing how much of the home automation world can be broken up into small discrete messages, including the text that a Willow captured from spoken audio or a state change in an OH Item.

However, I've seen a lot of bad MQTT implementations out there which violate the underlying principles of MQTT, especially "don't impose work on your clients." I can't believe how many times I've seen some device publish one huge JSON document, requiring the clients to parse out the data they want.
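
To illustrate the difference (topic names made up, just a sketch):

# The anti-pattern: one giant JSON blob that every subscriber has to parse
mosquitto_pub -h $MQTT_BROKER -t 'mydevice/state' -m '{"temp":21.5,"humidity":40,"battery":87,"firmware":"1.2.3","rssi":-60}'

# More MQTT-friendly: one small, discrete message per value
mosquitto_pub -h $MQTT_BROKER -t 'mydevice/temperature' -m '21.5'
mosquitto_pub -h $MQTT_BROKER -t 'mydevice/humidity' -m '40'
mosquitto_pub -h $MQTT_BROKER -t 'mydevice/battery' -m '87'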

Yes - I would have expected Willow to send the recognized text over MQTT, not the raw voice data.

Will it be possible to display a custom text on the display as a notification or an answer to a question?

Sounds perfect. Does it support the Dutch language?

We include support for MQTT in the Willow binary but don't currently utilize it for anything. We included it because interacting with platforms via MQTT for status text and the like (which MQTT is good for) is something on our roadmap.

Wake word detection is currently English-only, not that it matters much, because a proper noun is the same in any language AFAIK. I have seen videos from various users around the world and the wake engine does quite well even with accents. It's really just the sound your voice makes, after all :). The Willow Inference Server, using our highly optimized Whisper implementation, supports something like 30 languages, of which Dutch is one. You can find the available languages and expected accuracy across them here. Dutch is a "top 10" supported language, so I would expect the accuracy to be very good in practice.

We’re working through some font issues on the display for Unicode characters to have full character set support across all of these languages but other than that the platform is ready.

This is so nice!
I've had the hardware sitting on my desk for over a year and, apart from the initial "wow" factor, I was never able to get it to do more than just turn on the red LED.
I'll make sure to experiment with this in the coming weeks, as I already have an openHAB instance running on an Arch-based Linux distribution (Manjaro).
I read the instructions, and yes they are quite convoluted, but nothing that I have not already done in the past.
Once again, thank you for putting such an effort into this.

And in Europe, all ESP Boxes seem to be sold out currently… :frowning:

But I wanted to ask about the server-side hardware. @kristiankielhofner, could you give a rough indication of what kind of hardware is needed to run the willow server locally?

My home server runs on a Celeron G3930 @ 2.90GHz, about half of which is already used by a number of Docker containers (such as OH3). It has plenty of RAM, though (24 GB). Am I right in guessing that Willow will not run on it with its default settings (which, I believe, include Whisper large-v2)? Could you indicate how many orders of magnitude my existing capacity is off?
