Willow - Open Source Echo/Google Home Quality Speech Hardware for $50

Generally speaking, the overall scope of Willow is (currently) quite narrow. I’ve done too much too early in the past and it doesn’t work out well! Once you get past all of the fancy audio processing, wake, speech recognition, etc., all we really do is:

  1. Wake
  2. Speech to text
  3. Send text somewhere (OH in this conversation)
  4. OH does whatever you have configured with the transcribed text
  5. Display speech to text transcript on LCD, result/status from OH, and give tone-based audio feedback for success/failure (more on this later)

I’m VERY new to OH (first install ever yesterday!) but here’s an equivalent curl of what we do from the Willow device for OH:

curl -u $OH_TOKEN: https://$OH_INSTANCE/rest/voice/interpreters -H 'accept: application/json' -H 'Accept-Language: en' -H 'Content-Type: text/plain' -d 'turn off upstairs fan'

This uses the default system configured OH Human Language Interpreter which in my case is currently set to Built-in Interpreter.

Where the OH console reports:

10:50:59.543 [INFO ] [openhab.event.ItemCommandEvent       ] - Item 'upstairs_fan_Switch' received command OFF
10:50:59.544 [INFO ] [openhab.event.ItemStatePredictedEvent] - Item 'upstairs_fan_Switch' predicted to become OFF
10:50:59.546 [INFO ] [openhab.event.ItemStateChangedEvent  ] - Item 'upstairs_fan_Switch' changed from ON to OFF
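As an aside (hedging here, since I’m brand new to OH): I believe the same REST resource also supports a GET to list the interpreters an install has available, which is handy for seeing what the text could be routed to. Something like:

curl -u $OH_TOKEN: https://$OH_INSTANCE/rest/voice/interpreters -H 'accept: application/json' -H 'Accept-Language: en'

Worth verifying against your own install’s API Explorer.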

For the audio feedback (basically text to speech) like what you’re describing, we’re still hashing out our overall strategy, but it will be similarly straightforward, with the initial approach likely being something as simple as “if the response contains audio, play it instead of the tones or built-in audio chimes for success/failure”. We have a TTS engine in our Willow Inference Server and there is ongoing work to do TTS on device as well. I’m torn about allowing users to plug in a variety of STT/TTS engines: we are completely committed to an “Alexa or better” experience, and engines we haven’t validated for quality and response time have the very real potential of ruining the experience for the user.

Thanks for the pointer to HABot! This is exactly the kind of feedback I was looking for in this thread. Being so new to OH, I’m not in touch with the ecosystem and how the community is actually using it. I’ll look into it, but generally speaking we’re currently more or less aiming for broader compatibility as opposed to requiring any extra steps on the part of the user. The onboarding for Willow and your existing install should be as simple as “point us there and we’ll figure it out”. For now, at least :slight_smile: .

This is what I was trying to say. I can’t comment on Alexa, but Google is wrong often enough that it erodes my confidence.

Honestly, a big part of the problem might be “me”. My voice projects so well that I can be heard across crowded rooms, even when I’m talking quietly. I can only speak to my own experience, but maybe I’m an extreme edge case for voice detection. :wink:

In a nutshell: room-specific interaction is great if it works super reliably, so it’s definitely worth shooting for. I’m just saying that I wouldn’t be disappointed to fall back on different wake words in order to get 100% reliability for my circumstances.

Now I get why there’s a strong “Amazon is too pushy” sentiment. My Nest Hubs are set to show photo slideshows, so that’s all they do. I had no idea that Amazon uses the Echo Show like a billboard.

Yes @kristiankielhofner, check out HABot; Yannick had already put a ton of work into that project before he built the new UI.
Also check out the really cool project by Miguel.


It would be cool if that could be user configurable. With HABot installed (it’s an add-on) there are three choices for interpreter: built-in, HABot, and rule-based. The rule-based option lets users essentially dump the text into a rule where it can be parsed and handled however they choose (call out to an NLP processor on the command line, a cloud service, or just brute force it).

OK, if I’m interpreting that REST API call correctly, it sends the text to whatever is configured as the default voice interpreter. So the end user can choose how to handle the text, and the OH user can change between interpreters independently of Willow, which I think is the proper way to do it. That’s good! I can choose to have it handled by HABot or write my own rules to process the text, in addition to using the built-in interpreter (whose full capabilities I admit I don’t know).

Settings → Voice

It’s relatively new but OH’s Voice subsystem also now supports multi-step dialog (similar to the back and forth shown above). I’ve not played with that yet but at some point that’s going to be a pretty powerful way of interaction.

  • Are the lights in the master bedroom on?
  • Yes, three lights are on in the master bedroom
  • Turn them off

All I really know about them (beyond having watched the PRs go by) is the voice/dialog/start and voice/dialog/end REST API endpoints. There’s also the listenandanswer endpoint.

OH does support network discovery. So the onboarding theoretically could be as simple as Willow sending out a broadcast packet, listening for the responses, and looking for OH among them. Then you’ll get the URL(s) (some have more than one OH running) to send the text to. If you use the URL you show above, the rest will be options in the standard OH settings, assuming the user wants to change the defaults. I can’t think of an easier integration (though something will have to show on the Willow to choose which OH to send the text to; I’m not sure “all of them” is the right answer).
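To sketch what that could look like from the Willow side: OH advertises itself over mDNS, so from any Linux box you can see the announcement with something like the following (the service type names are from memory, so verify them against your install):

avahi-browse -rt _openhab-server._tcp
avahi-browse -rt _openhab-server-ssl._tcp

The resolved records give you the host and port to send the text to.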

Another option/integration approach could be to create an add-on which presents a normalized API to Willow. Rather than deal with different APIs for each and every home automation system out there, you can settle on one and then integration becomes writing add-ons/modules/plug-ins on the HUB to speak your API. I’m not sure if that would be less work but it might make the code you need to write on the Willow itself cleaner.

And whether you support it or not, someone somewhere will eventually write a willow2mqtt bridge. :wink: MQTT could be another option. OH supports autodiscovery of a number of “standards” including HomeAssistant and Homie among others.
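To make that concrete (the topic name is entirely made up, just to illustrate how thin such a bridge could be), the bridge would really only need to republish the transcript:

mosquitto_pub -h $MQTT_BROKER -t 'willow/kitchen/transcript' -m 'turn off upstairs fan'

On the OH side, a Generic MQTT Thing subscribed to that topic could feed the text into whatever rule or interpreter you like.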

I think Google makes the device determination in the cloud instead of locally which makes it sensitive to network jitter and system clocks being even slightly off.

I’m just guessing though.

Yes, for example, I just installed HABot, set it as default HLI, and it works without any changes on our side.

Yes but authentication is still an issue. For Willow applications best practice is to use a long-lived API key.

I’ve talked about it a bit, but the monetization strategy for Willow is enterprise, where we will have our own protocol going to a middle/application server to do whatever wild things they want to do. But that’s just yet another thing for home users, so my thinking is:

  • Home Assistant
  • OH
  • Maybe one more
  • Generic (simple HTTP POST or something to an API with fixed grammar; rough sketch just below this list)
  • WAS (Willow Application Server) protocol
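For the “Generic” item, to give a rough idea of the shape I have in mind (the endpoint and field names below are made up, purely illustrative), Willow would just POST the transcript in a fixed JSON structure:

curl -X POST "$YOUR_ENDPOINT" -H 'Content-Type: application/json' -d '{"text": "turn off upstairs fan", "language": "en"}'

Whatever is on the other end can then do with it what it wants.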

The WAS client support and specs will be open, so if we get important or interesting enough and people want to develop WAS modules for XYZ they certainly can and that would almost certainly provide an even better user experience than the somewhat “lowest common denominator” result we have with the various APIs exposed by differing systems.

These are all really cool projects but (at the risk of being smug) Willow and the Willow Inference Server are already leagues beyond these efforts and engines. I keep running into this and the why and how behind it so I made a technical deep dive/demo video:

Oh sorry - there’s also a quick demo from a Willow user in the UK showing speech recognition quality and speed for free-form text:

Sorry, one last edit: his response times are slow because the community WIS server is in Chicago and he’s across the pond.

So instead of putting HA specific and OH specific stuff on the Willow side, leave it talking your own WAS protocol and create add-ons/plug-ins to connect that up to those systems. That’s what I’m proposing.

I can’t speak for HA, but installing an add-on to support a new device/technology is standard operating procedure for OH so that doesn’t really add to the installation/configuration in any significant way.

That lets you leave your Willow boxes clean and simple, means adding support for each new system is almost a completely separate, independent effort that doesn’t have to touch the core Willow code, and lets you avoid that lowest common denominator problem even between HA and OH. Your WAS can include everything Willow can support. Then each home automation system’s add-ons can take what they can from WAS and ignore what they can’t handle.

When/if you support OH pushing stuff to the Willow boxes (TTS, audio/video streams, etc.) it’s almost certainly going to require an add-on anyway. May as well start out with one.

It’s just a thought.

I’m tracking, but let’s use OH as an example. I’d never even seen it before yesterday.

But I know how to read API documentation (or even reverse engineer it), and I went from pulling the Docker container to having our side built out in about an hour. When released, a user can point Willow at any OH install going back who knows how long and it just works. It also makes support on our side easier, because APIs tend to have better long-term support, versioning, etc. Add-ons tend to be deprecated faster, and the only thing worse than not having one is having an old/broken one. We don’t have the resources to maintain a bunch of modules in random programming languages across vast ecosystems.

The platform-module approach means we have to do software development in a completely foreign ecosystem, figure out how to get it listed in whatever component manager is used, keep up-to-date documentation with screenshots, etc., etc. OH is Java and I’ve spent about an hour in my entire life programming in Java. I don’t know how long it would take me, but I wouldn’t be proud of the result, let’s just say that!

So the plan is a handful of well abstracted Willow integration modules on device with the Tier 1 module being the WAS protocol. If community members (or even us, potentially down the road) are interested in developing an add-on or whatever for Home Assistant, OH, something I don’t know exists, etc they can do so using the WAS protocol (basically faking the real Willow Application Server) and enjoy the same Tier 1 support in the Willow ecosystem that enables absolutely everything we’re capable of.

The danger is that every time a major new version is released (OH 4 comes out in June), breaking changes to the APIs are allowed. I don’t think there are any breaking changes to this part of the REST API, but that might be the case in the future for OH 5. We can’t predict what will change, but even in cases where the API needs to change, breaking changes are kept to a minimum, so it should be relatively safe.

On the other hand, if someone wrote an add-on and submitted it to the OH project (not a requirement by any means, and it doesn’t have to be you), the add-on maintainers group will often fix your add-on for you when breaking changes happen in core. We tend to go to great lengths to keep plug-ins working once they are submitted to the project, to avoid their becoming deprecated or broken.

Anyway, all of your arguments are valid as well. But ultimately it means that Willow’s ability to interact with OH, and OH’s ability to interact with Willow (is that even possible with what’s proposed?), will be significantly limited, and could break on future versions of OH if that API does change, until someone steps up to write an add-on.

If there is interest in Willow (which seems to be the case) and the WAS protocol is reasonable (I’ve no reason to think it won’t be), I’m sure someone will write an add-on for OH to support it, or a WAS-to-MQTT bridge, which seems to be the approach everyone is taking these days anyway. It makes sense: why implement it X times for X hubs when you can implement MQTT once and all the hubs support MQTT?

My concern is that a REST-API-only interface, where Willow just pushes text to OH on that one endpoint, is pretty limiting even compared to what’s possible with Google Assistant, Alexa, HomeKit, and Mycroft.AI. If that’s the only integration possible or planned, it’s not very compelling. But as a first something to get out the door, it sounds great!

I look forward to seeing what this WAS protocol becomes.

If there is enough interest, traction, etc. with Willow and a Willow module ends up in-tree with support from the core devs, that would be ideal!!! We would LOVE that.

I just don’t want to presume that will be the case, and without getting into it too much, oftentimes there can be misalignment in situations where a platform considers us to be “competing” with whatever its native/alternative approaches may be. I don’t see that being the case with OH, but it’s bitten me before.

Thanks for the heads up on OH 5, I guess we’ll just have to see. We already have some compatibility API code for HA (I don’t like it, believe me) because they deprecated one of their voice APIs in a matter of months.

This is actually related to WebSockets - we have to detect feature/platform support for HA and switch between REST and WebSockets depending on what is detected. I would strongly prefer WebSockets with eventing to OH for the kind of functionality you describe, but I’m a little concerned about the fairly unusual DIY “ping/pong within 10 seconds or we drop you” keep-alive requirement of OH. It might be a non-issue, but it does give me pause considering Willow runs on a microcontroller, the #1 priority is audio, and we have to count every byte and CPU cycle to ensure the best end-user experience while doing our best to support all of the functionality on the roadmap and whatever else is to come.

I have to say I’m a little perplexed by what I see some people/projects/etc doing with MQTT. I like MQTT for certain applications, and I don’t pretend to be the end-all be-all of protocol design, but it just feels “wrong” to me.

One of the reasons I started Willow was from a sense that this corner of the open source ecosystem does not fundamentally understand the nuances of voice. My former life (thankfully) was 15 years in VoIP, WebRTC, etc so I like to think I know voice and audio (and video, ugh) pretty well. I’ve observed an ever-growing graveyard of failed assistants, speech projects, etc that (to me) seem to all stem from this fundamental lack of understanding. My “favorite” being all of the teams who seem to think you can slap a random microphone on a Raspberry Pi and get clean speech from across the room. It hasn’t worked for the last 10 years because it never will but you see it time and time again.

We are very open to alternative/competing products. There are a few rules on our marketplace to avoid confusion (e.g. we don’t allow multiple versions of a binding that does the same thing/supports the same technology), but we do support alternate, unofficial marketplaces.

In general we want people to find the best solutions for them. That’s one reason I think we are not talked about as much on the wider Internet as HA is. We aren’t partisan and we aren’t evangelists. We are home automation enthusiasts.

It’s one of the things I love about the OH community.

I’d say that’s the number one reason people who switch from HA to OH give for the switch. OH tends to be very deliberate in its progress and puts a lot of value on backwards compatibility.

And I didn’t mean to be alarmist about OH 5. OH 4 isn’t even out yet and it’s usually at least two years between major versions. I would hope someone has written a WAS add-on by then, or some other and more complete approach is found.

If the changes are not breaking and they fit with the overall architecture, the OH core developers are amenable to changing things like that to better support external third-party connections. I can’t promise anything, but a lot of the REST API endpoints exist only to support third-party integrations. If you find the way it works now to be a problem and have an approach that might work better, it would be worth opening an issue.

It has some features that make it good at bridging between technologies. It’s an open standard, has libraries for most popular programming languages, is relatively easy to implement/use, is lightweight enough to put on a microcontroller, and most services in the FOSS home automation realm support it out of the box. It may not be the best, but it has a low barrier to entry, which makes it good enough most of the time.

I’m not saying that a WAS to MQTT bridge is the best idea or even a good idea. But I can almost guarantee you that someone somewhere is going to write one and put it on GitHub.


Understandable and not an issue at all.

On that timeline, I’m sure of it!

I really appreciate this! I’m not saying it is or necessarily will be a problem, it just jumped out at me. It’s not unheard of in terms of WebSockets, but compared to the WS support we have for HA, which doesn’t have that requirement, it would need a lot of testing and validation on our side.

I don’t disagree on any of these points! I’m just traumatized from VoIP. After A LOT of pain, the entire industry and ecosystem finally realized that realtime, latency-sensitive data isn’t just “stuff some data in a packet and transport it however”, as it was originally conceptualized. We KNOW that doesn’t end well. Audio and video are a unique sensory experience for humans compared to a web page loading or a notification showing up. It’s a completely different animal, and when I look at MQTT, and more specifically the architectures people assemble with it, I get flashbacks to the early and very, very rough days of VoIP.

Hah, I guarantee it. Caveat emptor.

MQTT is definitely not suitable for real-time streaming data of any sort, but it doesn’t even pretend to address that. It excels at delivering small, discrete messages, with or without guarantees of delivery, persistence, etc. And it’s amazing how much of the home automation world in general can be broken up into small discrete messages, including the text that a Willow captured from spoken audio or a state change in an OH Item.

However, I’ve seen a lot of bad MQTT implementations out there which violate the underlying principles of MQTT, especially “don’t impose work on your clients”. I can’t believe how many times I’ve seen a device publish some huge JSON blob, requiring the clients to parse out the data they want.
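To illustrate (topics and payloads invented for the example): instead of one blob that every client has to parse, like

mosquitto_pub -t 'willow/event' -m '{"room": "kitchen", "transcript": "turn off upstairs fan", "confidence": 0.92}'

the MQTT-friendly way is one value per topic, so clients subscribe to exactly what they care about and nothing more:

mosquitto_pub -t 'willow/kitchen/transcript' -m 'turn off upstairs fan'
mosquitto_pub -t 'willow/kitchen/confidence' -m '0.92'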

Yes - I would have expected Willow to send the recognized text over MQTT, not the raw voice data.

Will it be possible to display custom text on the display as a notification or an answer to a question?

Sounds perfect. Does it support the Dutch language?

We include support for MQTT in the Willow binary but don’t currently utilize it for anything. We included it because interacting with platforms via MQTT with status text (which MQTT is good for), etc is something on our roadmap.

Wake is currently English-only, not that it matters much because a proper noun is the same in any language AFAIK. I have seen videos from various users around the world and the wake engine does quite well even with accents. It’s really just the sound your voice makes after all :). The Willow Inference Server using our highly optimized Whisper implementation supports something like 30 languages, of which Dutch is one. You can find the available languages and expected accuracy across them here. Dutch is a “top 10” supported language so I would expect the accuracy to be very good in practice.

We’re working through some font issues on the display for Unicode characters to have full character set support across all of these languages but other than that the platform is ready.

This is so nice!
I’ve had the hardware sitting on my desk for over a year and, apart from the initial “wow” factor, I was never able to get it to do more than just turn on the red LED.
I’ll make sure to experiment with this in the coming weeks, as I already have an openHAB instance running on an Arch-based Linux distribution (Manjaro).
I read the instructions, and yes they are quite convoluted, but nothing I have not already done in the past.
Once again, thank you for putting such effort into this.

And in Europe, all ESP Boxes seem to be sold out currently… :frowning:

But I wanted to ask about the server-side hardware. @kristiankielhofner, could you give a rough indication of what kind of hardware is needed to run the willow server locally?

My home server runs on a Celeron G3930 @ 2.90GHz, about half of which is already used by a number of Docker containers (such as OH3). It has plenty of RAM, though (24 GB). Am I right in guessing that Willow will not run on it in its default settings (which, I believe, include Whisper large-v2)? Could you indicate how many orders of magnitude my existing capacity is off by?


Hi @kristiankielhofner, welcome. I am just catching up with your thread (have not read the whole thing yet), but wanted to drop in and say HI :wave:

I’m super excited to hear from you. I actually bought an ESP BOX right after I saw your project on Hacker News; it’s sitting on my desk right now. Historically I have been pretty involved with early versions of our Alexa skill (having written version 1.0) and have a real motivation to see local voice.

I have also been running Whisper (along with all the llama variations for AI) locally and was tinkering with a solution for openHAB, so your project has come along at the perfect time :ok_hand:

Would be happy to help any way I can!

Also, @Miguel_M.A.D might be interested in this as well; he has been working quite a bit on the different parts of our voice pipeline.


Great! Initial OH support has landed in Willow and I’ve updated the docs with instructions. It’s pretty simple - create an OH API key, put in your OH server address and API key, build and flash.

We’ve put quite a bit of work into WIS to support a wide range of hardware configurations. Through approaches such as model quantization, etc we currently require about 6GB of RAM/VRAM for full STT and TTS functionality - even while loading Whisper base, medium, and large-v2 simultaneously so they can be utilized dynamically depending on user configuration on a per-request basis.

One caveat - while we support CPU-only configurations, CPUs are ill-suited to the task of speech recognition at the accuracy and speed requirements of our intended goal: a self-hosted, private, Alexa-like experience without compromise. It will “work” on your hardware, but you will likely find the speed and/or accuracy frustrating.

I get into this elsewhere, but a six-year-old used $100 Nvidia GPU offers performance and accuracy that bests the fastest CPUs on the market, at significantly lower cost and power utilization. GPUs are just so different architecturally that they excel at these types of tasks.

However, we provide two alternatives:

  1. A Tovera-hosted best-effort GPU-powered WIS instance for community use. While this goes against our goal of Willow being completely self-hosted it’s a great way for users to experiment and witness the accuracy and speed of GPU-hosted WIS themselves. I’ve had multiple users who were initially reluctant to pursue a self-hosted GPU approach come back and tell me they bought a GPU :slight_smile:.

  2. On device command recognition. The ESP BOX itself has built in speech command recognition for up to 400 commands that we support. However, we don’t currently dynamically configure this with OH like we do with Home Assistant. I’m VERY new to openHAB and this is certainly something we would be happy to work with the OH community on implementing.

Hi @digitaldan! Also fantastic news, as I told @obones you can follow the instructions in the Willow README to use Willow with OH today. We are super interested in hearing feedback from the OH community as we are new to OH and have only tested against a single OH instance. It would be great to get your feedback generally but also validate Willow against multiple disparate “real world” OH installs.

EDIT - Sorry, left out @Miguel_M.A.D. Stop by! Very curious to hear your thoughts as well.


Hello friend! Exciting times; seeing everyone working towards voice interaction makes my heart pump faster!
I look forward to seeing this project move forward and to seeing how well it integrates with OH :slight_smile: