Willow - Open Source Echo/Google Home Quality Speech Hardware for $50

I very much appreciate your taking the time to answer these questions. And it’s great to know that you are providing a WIS instance for the community to use. (I had to look up what best-effort means in this context, so if anyone else is wondering, here is what GPT-4 told me:

“Best-effort” is a term used in computer networking to describe a service that provides the best possible effort to deliver data packets but does not guarantee delivery.”)

I have no problem using that server to get started (once my ESP box arrives in July :roll_eyes:). I suppose information about how to access it will reveal itself once I start digging into the documentation…

As others have noted, the documentation is currently not very accessible (and I completely understand that making it more accessible is not a priority at this point. It may even be a good gatekeeper, intentionally or not). For example, I am confused by there being two repositories, one for “Willow” and one for “Willow Inference Server”…

Anyway, regarding the self-hosted solution with a GPU: never mind the cost to buy a GPU, but when the thing is running 24/7, aren’t we incurring significant electricity costs? According to this page, the GTX 1070 idles around 7-10 Watts. Is that correct for our use case here?

When I built my home server it used around 16 W (yes, the whole machine). Since then, I have added some HDDs so consumption is surely higher now, but adding 10 W to whatever it is at now seems considerable…

One more question: I read that in order to optimize for speed, you are not using auto-detection of the language. I’d be curious, though, how much delay is added when autodetecting the language. I’m not sure if it is possible with whisper, but for our purposes it would suffice to limit the range of possible languages to three, which should reduce the delay.

Background: in our household, three languages are spoken and while we have gotten used to Alexa and Google only understanding one of them, the annoying limitation is that they don’t understand song titles or shopping list items in the other languages and while we have invented English names for some frequently needed grocery items, there is no workaround when it comes to song titles…

I suppose that in US centric platforms this isn’t a problem. But many EU households are multi language.
In mine we speak English daily and at work, but whenever my parents are involved it’s in Portuguese. Additionally, my wife speaks French and likes to hear French music frequently. Google SUUuUcks at understanding places, streets, or really anything asked in a different language mixed with English.
For example, asking “hey google is Carramona open right now?”
Carramona being a local restaurant, it basically never gets it right. But if it’s McDonald’s or something international then it works fine.
That is a conundrum I’d very much like to see fixed.

That is generally accurate.

Fast and accurate speech recognition across 30 languages has trade-offs. GPU is required for anything resembling usable accuracy and responsiveness. At a high level the choices are:

  1. Use Alexa or commercial equivalent that offloads speech processing (and everything else) to the cloud for you.

  2. Local speech recognition.

2a) Use full language support - CPU. In my last testing, a $100 GPU using 5-10 watts of idle power (a GTX 1060, so actually a bit slower than a GTX 1070) was roughly 88x faster than a CPU of the class of machine you’re likely referring to. Yes, nearly two orders of magnitude. For the model WIS uses by default, which gives what we consider to be Alexa quality, a CPU in that class will take roughly 51 seconds to transcribe 3.8 seconds of audio - without language detection, which on CPU will almost double this. So you would quite literally be looking at response times measured in minutes for voice commands. It would be faster to walk around your house, look for your phone, unlock it, open an app, and use a GUI.

2b) Use a stripped down model with poor accuracy - CPU. This is the route many other projects take. We can certainly do this, but I’m reluctant to suggest it officially. On that class of machine you will still be looking at somewhere around three or more seconds per action with language detection. In the event the speech recognition is wrong (which is often the case with this model), you will repeat yourself, hope it gets it right, and at least double this time - not to mention cause a lot of frustration.

2c) Use commands on the device. The ESP BOX supports local detection for up to 400 specific commands of up to 60 characters each you can define. Response time is typically around 500ms and the audio doesn’t leave the device. Caveat is this currently only works for English.

2d) Use a $100 GPU that uses 10 watts of power and have Alexa grade locally self hosted speech recognition with dynamic language detection (walk up, speak your language, it figures it out, transcribes) that can essentially perfectly transcribe 3.8 seconds of speech in any of 30 languages in under one second. Longer speech segments are fully supported and the realtime multiple (speech length vs transcribe time) actually increases dramatically the longer the speech segment is. I use 3.8 seconds because it’s the lowest we test with and, when compared to realtime, the “slowest”.
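To make the "realtime multiple" concrete, here is a minimal Python sketch using only the figures quoted in the options above (3.8 seconds of audio, roughly 51 seconds on CPU, roughly 1 second worst case on GPU). The function name is made up for illustration and is not part of Willow or WIS.

```python
def realtime_multiple(audio_seconds: float, transcribe_seconds: float) -> float:
    """Return seconds of speech transcribed per second of compute."""
    if transcribe_seconds <= 0:
        raise ValueError("transcribe_seconds must be positive")
    return audio_seconds / transcribe_seconds


cpu = realtime_multiple(3.8, 51.0)  # CPU figure quoted in option 2a
gpu = realtime_multiple(3.8, 1.0)   # GPU worst case: about one second

print(f"CPU: {cpu:.2f}x realtime, GPU: {gpu:.1f}x realtime")
```

Anything below 1.0x means the transcription takes longer than the speech itself, which is why the CPU option feels unusable for voice commands.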

I understand your concerns regarding energy usage but this is quite literally a physical limitation, practically physics. Some of the best minds in the world have worked on optimizing highly parallel tasks like speech recognition and other ML applications for CPU but there are such dramatic physical, architectural differences between CPU and GPU that CPUs just aren’t designed for these tasks. GPUs are, that’s why they exist for GPGPU and ML/“AI” tasks.

What I’m finding trips up people the most with GPU power consumption is hearing idle and then assuming 150 watts or so after that. This is accurate and it isn’t. First, ML inference rarely causes max TDP on GPUs the way gaming does. We typically see somewhere around 50-60% max TDP for speech commands, often much lower. Second, let’s look at these speech recognition spikes for voice commands:

A GTX 10X0 can do Alexa-grade speech recognition with language detection on 3.8 seconds of speech in (worst case) roughly 1 second. For 100 speech commands per day in any of > 30 languages that is 100 seconds a day at (roughly) 75 watts or so. Electricity is billed not by power (watts, kW) but by energy, i.e. power over time (kWh). So the spikes come out to well under 1 kWh/mo (roughly 0.06 kWh). With idle at 10 watts 24/7 added (about 7.2 kWh/mo) this rounds up to a total of roughly 10 kWh/mo. For electricity costs in Germany, some of the highest in the world, using $.40 / kWh this comes to a total worst case added electricity cost of $4/mo - assuming I did the math correctly here and that’s accurate (looks right to me). In practice it’s probably closer to half of that.
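For anyone who wants to redo this with their own numbers, here is a small Python sketch of the arithmetic. It uses only the figures from the paragraph above (100 commands a day at about 1 second and 75 watts each, 10 watts idle, $0.40 per kWh); the 30-day month and the function name are my own assumptions.

```python
def monthly_kwh(watts: float, seconds_per_day: float, days: int = 30) -> float:
    """Energy in kWh for a load of `watts` running `seconds_per_day` each day."""
    joules = watts * seconds_per_day * days
    return joules / 3_600_000  # 1 kWh = 3.6 million joules


spike_kwh = monthly_kwh(75, 100)        # 100 commands/day, ~1 s each at ~75 W
idle_kwh = monthly_kwh(10, 24 * 3600)   # 10 W idle, around the clock
total_kwh = spike_kwh + idle_kwh
cost = total_kwh * 0.40                 # $0.40 per kWh

print(f"spikes {spike_kwh:.3f} kWh, idle {idle_kwh:.1f} kWh, "
      f"total {total_kwh:.2f} kWh, cost ${cost:.2f}/mo")
```

Under these assumptions the idle draw dominates, and the total stays below the rounded-up 10 kWh and $4 per month.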

All in, I think $4/mo (all numbers rounded up, absolute worst case) is a more than worthwhile trade-off to have a locally hosted, completely under your control, private, Alexa-grade voice interface in your language(s) to your favorite platform (OH). But, of course, that decision is up to you, and you can pick whichever of the options above suits you best.

In terms of language detection - all you do is add a parameter (detect_language) when configuring the WIS URL in Willow and the server will automatically detect the language and return the text (full Unicode) with an additional JSON parameter of “language” with the ISO language code. This works with Home Assistant. However, in looking through the OH API documentation I didn’t see a way using the API to provide the speech language dynamically. Currently, with OH, we pass the text in whatever language is detected (without language code specifier) and hope OH can figure it out. As far as response times, as you can see from the screed above total time from end of speech to command action with language detection is still under our target of 1 second with GTX 10X0 hardware.
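As a rough illustration of consuming such a response, here is a minimal Python sketch. Only the "language" key comes from the description above; the "text" key name and the sample body are assumptions made up for this example, so check the actual WIS output before relying on them.

```python
import json
from typing import Optional, Tuple


def parse_wis_response(body: str) -> Tuple[str, Optional[str]]:
    """Return (text, language) from a JSON response body.

    The "language" key is described in the post; the "text" key name is an
    assumption made for this sketch.
    """
    data = json.loads(body)
    return data["text"], data.get("language")


# Hypothetical response body, written by hand for illustration only.
sample = '{"text": "schalte das Licht ein", "language": "de"}'
text, language = parse_wis_response(sample)
print(language, text)
```

The language code could then be forwarded to whatever platform accepts one, and simply dropped where the API has no place for it.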

Please correct me if I’m wrong on this last point, we would love to work with the OH community on this!

See my response above. With language detection enabled you can walk up to Willow, use the wake word, speak any of 30+ languages, and it will return results and execute your command in less than one second (with GPU).

We use proper nouns in other languages mixed in with English sentences all of the time too and have the same issues with Google, etc you describe here :slight_smile:. I’ve found Whisper medium (our default model for Willow) to be extremely accurate in this case.

A 1060 or 1070 can double as a plex transcoding solution, or empower an nvr as well. Not a bad trade off and opens other avenues of discovery.
Oh!

Fantastic

This has been another aspect of my position on this:

“Plus, now you have a GPU that’s sitting idle the vast majority of the time that you can also leverage for other applications”.

Because we are so memory efficient on models, you have plenty of VRAM left over for these types of use cases to run concurrently with Willow + WIS.

Have you done any tests with intel quicksync and ram memory as vram?
That would be the golden standard :wink:

I have to admit I don’t really know what you’re asking here. I’m familiar with quicksync, RAM, and VRAM, but I’m not understanding how they are related together in this sentence. Can you elaborate?

Maybe I’m using the wrong keywords. My bad.
So you know that intel igpus have preallocated I don’t know how many megabytes of memory to it, right?
You can, iirc at the bios level, assign up to half of the system memory to it, to act as, video memory.

So assuming that everyone and their mother already have intel igpus in their cpus, if the inference server (as I understood) uses about 6 gigs of vram on your gpu test… we could hypothetically add extra system memory to allocate to the igpu to account for the extra requirements.

So basically what I’m asking is, have you tested with an intel igpu? If so, how did it behave? If it lacked memory (my assumption) have you tried adding system memory to it?
If this works properly an intel nuc would immediately be able to act as a server for willow!
And with tons of memory too, for “cheap”. Since it’s easier to buy ddr4 memory than a new gpu.

Edit:
Can I increase the Dedicated Video Memory for Intel Graphics?.

Pending motherboard vendor’s BIOS decisions apparently…

Wow, thanks again for an incredibly elaborate and helpful answer. It is also very satisfying to see someone who really thinks things through, understands problems and figures them out. :star_struck:

So, just to understand how this works: does this mean that language is detected on a word-to-word basis, i.e. every word gets its language-tag?

Speaking of Dutch, unless this was you it’s quite the coincidence!

Yep!

This is a nice idea but it’s a non-starter for ML frameworks. iGPU (to my knowledge) has no support for ML frameworks or even the concept of GPGPU. There is some early work being done for Intel ARC discrete GPUs but I stress early. It’s EARLY.

I’m aware that Nvidia/CUDA/etc can be a pain and it’s even controversial in some circles. However, Nvidia decided to throw full support into GPGPU with CUDA 15 years ago and that investment shows - they have 80%-90% market share with GPGPU/ML/AI for good reason…

Even AMD with ROCm is a decade behind. I attempt very basic PyTorch, etc projects with ROCm about once a year or so and frankly, it’s comically bad. I hope Intel can do a better job with ARC but they have the same issue of being 15 years behind Nvidia…

Language detection runs in a separate pass through the model. The model (like any model) returns probability of “language” based on the entire context of the audio passed (more or less) so you can give it fairly complex mixes of “language” and it will generally output the correct language to provide to the transcription step of the model. Interestingly, Whisper can also do translation as well. I only speak English (typical American, haha) but from what people tell me and from what I’ve seen it does quite well.

OK, I note the singular. I may have gotten overambitious. So that means not even Whisper can handle “Play Non, je ne regrette rien by Edith Piaf in the Bathroom”?

I’m not sure if this is a stupid question but are there plans for satellite devices that don’t run on the ESP BOX but instead run on, say, Android? The use-case I have in mind is when away from home, I’d like to be able to execute voice commands using the same commands and syntax and getting the same responses as I would if at home. So I could switch on the lights or open the gates as I approach or control the heating while away without having two separate systems. I envisage a lightweight Android client with wake word detection but the language processing could be done on the Inference Server back at home.

Or am I getting ahead of myself?

Yes, Whisper is singular as it runs two passes:

  1. Detect language (singular)

  2. Transcribe with provided language from step 1
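The two passes can be sketched in Python roughly like this. The detect and transcribe functions here are hard-coded stand-ins, not Whisper or WIS calls; they only illustrate why a mixed-language sentence ends up decoded under a single language tag.

```python
from typing import Callable, Dict, Tuple


def two_pass_transcribe(
    audio: bytes,
    detect: Callable[[bytes], Dict[str, float]],
    transcribe: Callable[[bytes, str], str],
) -> Tuple[str, str]:
    """Pass 1 picks the single most probable language; pass 2 transcribes with it."""
    probabilities = detect(audio)
    language = max(probabilities, key=probabilities.get)
    return language, transcribe(audio, language)


# Stand-ins for the model, hypothetical and hard-coded for illustration.
def fake_detect(audio: bytes) -> Dict[str, float]:
    return {"en": 0.62, "fr": 0.35, "de": 0.03}


def fake_transcribe(audio: bytes, language: str) -> str:
    return f"<transcript decoded as {language}>"


language, text = two_pass_transcribe(b"", fake_detect, fake_transcribe)
print(language, text)
```

In this made-up example the French share of the utterance loses out to English in pass 1, so the whole clip is transcribed as English.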

Meta has recently (this week) released a new VERY large multilingual model. We’ll be investigating its use with Willow.

WIS already has the support for this. There is a real time streaming WebRTC endpoint that can work from any browser/device/application that supports WebRTC. The WIS side is very stable but our current TypeScript client library has some browser compatibility issues (of course) because WebRTC is still a little fickle. It works very well with Chrome. You can try it but do note - it currently uses the highest possible speech recognition settings so it is “slow” compared to the default for Willow (medium, 1). We will make this easier to configure/select in the client library.

Just a follow up to say that I had it up and running in less than an hour, very impressive work!
Now the hardest part for me will be to teach openHAB to handle the received text commands. No small feat, but I’m really impressed by the work you did on willow, and especially grateful that you made it opensource!

Just a word of caution:
Whisper ‘hears’ things that just aren’t there. OpenAI calls it ‘hallucination’, but IMHO it is just a bug inherent to the algorithms used. You cannot rely on the transcripts.

Example:
https://mediandr-a.akamaihd.net/progressive/2023/0314/TV-20230314-1537-1100.hd.mp4 (might be geo-blocked).

Extract audio:
ffmpeg -i TV-20230314-1537-1100.hd.mp4 -vn -map 0:a -f flac test.flac

Whisper it to whisper:
whisper --language German test.flac --model medium

Synoptic results for tiny (T), base (B), small (S), medium (M), large-v2 (2), -/o/+: my personal judgement:

[...]
-T[02:47.800 --> 02:51.280]  Ihr Ehemann John sucht er weiter nach dem Sohn.
-B[02:47.000 --> 02:51.000]  Ihr Ehemann John sucht der Wahl weiter nach dem Sohn.
+S[02:47.000 --> 02:51.000]  Ihr Ehemann John sucht derweil weiter nach dem Sohn.
+M[02:47.720 --> 02:51.160]  Ihr Ehemann John sucht derweil weiter nach dem Sohn.
-2[02:47.600 --> 02:50.600]  Ihr Ehemann John sucht weiter nach dem Sohn.

?T[02:51.280 --> 02:52.800]  Und es ist sehr gut.          <- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
xB ---
xS ---
xM ---
x2 ---

-T[02:52.800 --> 02:54.800]  Bitte, bitte mit mehr.
xB ---
-S[02:51.000 --> 02:54.000]  Bitte, redet mit mir.
+M[02:51.240 --> 02:52.880]  Bitte, rede mit mir.
+2[02:52.680 --> 02:54.000]  Bitte rede mit mir.

-T[02:54.800 --> 02:56.800]  Haupt ab, du näherst.
xB ---
-S[02:54.000 --> 02:57.000]  Hau ab, du nirst.
+M[02:54.880 --> 02:56.280]  Hau ab, du nervst.
-2[02:54.840 --> 02:56.080]  Hau ab, du närst.

[...]

-T[03:42.840 --> 03:46.640]  Sie sehen Sie am 16. April, wie immer im ersten.
oB[03:42.000 --> 03:46.000]  Sehen Sie am 16. April, wie immer, im ersten.
oS[03:43.000 --> 03:46.000]  sehen sie am 16. April, wie immer, im 1.
+M[03:43.840 --> 03:46.840]  sehen Sie am 16. April wie immer im Ersten.
+2[03:39.920 --> 03:42.880]  sehen Sie am 16. April wie immer im Ersten.

T ---
B ---
S ---
M[03:48.040 --> 03:50.080]  Copyright WDR 2021     <- !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2 ---

The last one is well known to the Whisper community, see Dataset bias ("❤️ Translated by Amara.org Community") · openai/whisper · Discussion #928 · GitHub.

I’m not sure about the relevance in the context of home automation though, where short commands prevail …

My full analysis (sorry for using a German example):
whisper_test_upload.txt (28.2 KB)

The anecdotes and in some (rare) cases data on hallucination with Whisper are a little overblown in my opinion.

I’ve easily done tens of thousands of hours of speech across languages through Whisper ranging from speech commands (a few seconds) to hours of transcripts. The only “troubling” hallucination I’ve seen is somewhat humorously always detecting no speech as “you”. We use voice activity detection with Willow on the audio front end so even this case is almost completely eliminated. Almost all of the cases I’ve seen with hallucination come from poor quality audio and the AFE also provides extremely clean speech even at far distances in noisy environments.

Can you explain where the “caution” and somewhat alarming tone of your post comes in? I’m not understanding how this (very rare and unlikely) case is in any way dangerous or of concern but I’d like to understand better.

In any case at this point many, many users around the world are successfully using Willow across a wide variety of languages (at least English, Spanish, Portuguese, Dutch, French, Korean, and German just from what I’ve heard/seen directly) and we have yet to hear a peep on this.

I’m also really excited because I happen to have done a significant amount of work on the Willow Inference Server this weekend that almost completely revamps multi-language support as well as the fundamental serving architecture, which provides many improvements.

Well, I’ve done only two tests (German only) - and I’m not quite convinced …

My second test:

Input:
https://download.deutschlandfunk.de/file/dradio/2023/05/20/lange_nacht_gustav_mahler_komponist_und_dirigent_drk_20230520_0005_db5a60ba.mp3 (might be geo-blocked - quite complex text, partly in ‘ancient’ German language, mixed and partially dubbed with classical music; professional production, so sound quality is excellent).

Output from Whisper (--model tiny, --language German):

[00:00.000 --> 00:28.000]  Deutschlandfunk Kultur, lange Nacht.
[00:29.000 --> 00:32.000]  In der Schule war Gustaf Vari, zerstreut.
[00:32.000 --> 00:36.000]  Einer seiner Lehrer am Gymnasium nannte ihn das leiphaftige Quaxilber,
[00:36.000 --> 00:39.000]  oft in seine eigenen Gedanken vertieft,
[00:39.000 --> 00:42.000]  wie versonnen und der Wirklichkeit entrückt,
[00:42.000 --> 00:45.000]  was ihm den Tatel seiner Lehrer zu zog.
[00:45.000 --> 00:49.000]  Wenn gleich er ist oft, wie ein kleiner Türen verstand,
[00:49.000 --> 00:53.000]  seinen Willen den Geschwistern und den Spielkammeraden aufzuzwing,
[00:53.000 --> 00:57.000]  war der Grundzugs eines Wesens ein hoher Gerächtigkeitsinn,
[00:57.000 --> 01:00.000]  der ihn weder unrecht tun, noch leiden lässt,
[01:00.000 --> 01:06.000]  andererseits aber wieder Dulltsamkeit, Menschenliebe und Mitleid mit Armut und Elend.

Absolutely stunning automatic separation of voice from music - I’m really impressed.

[02:50.000 --> 02:53.000]  Am 7. Juli 1860 kommt Gustaf Mala an.
[03:15.000 --> 03:18.000]  Am 7. Juli 1860 kommt Gustaf Mala.
[03:18.000 --> 03:24.000]  Kommt Gustaf Mala als Kind Jüdischer Eltern in dem Dorf Kalist in Bömen zur Welt.

Wrong reduplication.

[04:15.000 --> 04:29.000]  Gustaf Mala singt zuerst, was er hört.
[04:29.000 --> 04:32.000]  Dann spielte es auf seine Ziharmonikanach,
[04:32.000 --> 04:34.000]  und er hat einen Kugel.
[04:34.000 --> 04:38.000]  Es ist ein Kugel.
[04:38.000 --> 04:41.000]  Gustaf Mala singt zuerst, was er hört.
[04:41.000 --> 04:44.000]  Dann spielte es auf seine Ziharmonikanach,
[04:44.000 --> 04:47.000]  einige Zeit später auf einem Klavier.

Wrong reduplication and hallucination about a ‘Kugel’ (ball/sphere).

[06:34.000 --> 06:38.000]  Ich erlaube mir nun noch einmal um Schulgeldbefreiung anzusuchen,
[06:38.000 --> 06:42.000]  weil ich aus den oben angeführten Gründen nicht im Stande bin,
[06:42.000 --> 06:46.000]  das Schulgeld zu entrichten und also zu meinem größten Leidwesen
[06:46.000 --> 06:50.000]  gezwungen wäre, auf den mir so werten Unterricht im Konservatorium
[06:50.000 --> 06:52.000]  verzichtleistend zu müssen.
[07:58.000 --> 08:24.000]  Musik

Impressive transcription of ‘ancient’ German and automatic identification of a passage consisting of music only (but Whisper should mark ‘Musik’ as a comment …).

[13:36.000 --> 13:40.000]  So komm, tut reime Licht entrat.
[13:40.000 --> 13:44.000]  Was Milch du heute mit dem Mann?
[13:44.000 --> 13:46.000]  Du freundelst es gut.
[13:46.000 --> 13:52.000]  Du schämschet, weil das Milch zu den Lied fahre.

Whisper fails at transcribing a cantata - to be honest, I cannot understand the words sung either. Here Whisper shouldn’t provide a transcription at all.

Several more reduplications, e.g.:

[31:49.000 --> 31:57.000]  Ich bin Mittwoch um Donnerstag interesst.
[31:57.000 --> 32:03.000]  Von daraus gehe ich nach Wenedig.
[32:03.000 --> 32:10.000]  Die Reise mache ich mit Justi umsonst.
[32:10.000 --> 32:14.000]  Ich bin Mittwoch um Donnerstag interesst.
[32:14.000 --> 32:17.000]  Von daraus gehe ich nach Wenedig.
[32:17.000 --> 32:19.000]  Die Reise mache ich mit Justi umsonst.

Why not intersperse some Korean (despite explicitly being told to use German)?

[01:03:13.000 --> 01:03:18.000]  Hallo Herr hart,te Mus 사랑 abgibt …
[01:03:18.000 --> 01:03:29.000]  Tehrann habe ich, Tehrann habe gibt …
[01:03:29.000 --> 01:03:38.000]  Moan mit Wumpieces vermer página,
[01:03:39.000 --> 01:03:53.000]  Mich darum, die sich mitatcher ...

Well, doesn’t quite look like Mahler’s ‘Lied um Mitternacht’ … :slight_smile:

Factoring in hallucinations, Whisper’s tendency to produce text by hook or by crook (even if the audio is unintelligible) and long range listening, I am curious about the results when using the ESP32-S3-BOX and Willow.

I will definitely test Willow (with focus on German) and then report back here. Unfortunately, the availability of the ESP32-S3-BOX[-LITE] is anything but good …

I should have gone back to the basics a bit on this. As noted in the README:

  • Great far-field performance. We’ve tested wake and speech recognition from roughly 25 feet away in challenging environments with good results.
  • Great audio quality - Willow provides features such as automatic gain control, acoustic echo cancellation, noise suppression, blind source separation, etc.

The ESP BOX is an acoustically engineered enclosure. It has dual microphones and everything for clean, high quality far-field speech down to the microphone cavities has been accounted for. Then ESP-SR goes to work. The AFE (audio front end) provides AGC, AEC, NS, and BSS. The wake, speech, and audio engines all get the three channel dual microphone input with a reference (calibration) channel.

The ESP SR library from Espressif, along with their audio and wake components, has been tested and qualified by Amazon themselves as an Alexa platform device. When we say Alexa quality or better we mean that literally.

We get good results from Whisper because we work very hard and very deliberately to get clean audio to Whisper. Many of the other open source solutions in this space feed trash audio to Whisper and get trash results.

We’re not repeating that mistake.

Hi Kristian,

Thank you for your work on Willow, really cool, and even more for your consideration of openHAB! When I saw your first post I ordered an ESP32-S3 box which arrived a few days ago. Building and flashing Willow was easy as well as hooking it up to openHAB rule voice interpreter.

The voice recognition quality is very good, wake word sometimes takes a few tries. Would be really nice if you could get “Hey Willow” or even better “Hey openHAB” :wink: as wake word!

I also tried setting up Willow Inference Server on my Jetson Xavier NX but that fails on some dependencies (nvidia-cublas-cu11). No time to dive into this right now, all required packages exist for the Jetpack version. I read on your Github project page that you don’t recommend nVidia Jetson hardware but I think the Xavier NX is a nice option because of the low energy consumption and it should have enough processing power and memory, I use it in my rack to process live security video footage.

In our house voice control is the most important means of controlling things and currently I am using Google devices to take care of this. I am quite eager to replace them with a cloudless solution and Willow looks very promising (after the Mycroft debacle). I don’t have a lot of time or developer skills to help but let me know if there is anything I can do to support.

Next challenge is on the openHAB side, I need to figure out how to turn text sent to the “Voice Command Item” into meaningful actions, but that is outside of Willow’s scope. Wondering if I need to write rules to interpret commands or if I can pipe it into Habot or something, that’s a question to openHAB wizards :slight_smile:.

For now thank you again for your work and for being so responsive here and in the GitHub project discussions.

My understanding, and I hope to actually start experimenting myself next week, is that the Willow integration is using openHAB’s /voice/interpreters REST API end point using the default OH NLP interpreter.
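If you want to poke at that end point by hand, a request could be built roughly like this in Python. This is a sketch under assumptions: the host name is made up, and the /rest prefix, plain-text body and Accept-Language header are my reading of the openHAB REST API, so verify them in your instance's API explorer. Recent openHAB versions may also require an API token, which is not shown here.

```python
import urllib.request


def build_interpret_request(
    base_url: str, text: str, language: str = "en"
) -> urllib.request.Request:
    """Build (but do not send) a POST to openHAB's voice interpreter end point."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rest/voice/interpreters",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Accept-Language": language},
        method="POST",
    )


# Hypothetical host name; replace with your own openHAB instance.
request = build_interpret_request(
    "http://openhab.local:8080", "turn on the kitchen light"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Sending text this way should hand it to whichever interpreter is currently set as the default, which makes it a handy way to test the OH side without a Willow device.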

The good news is you can choose the default independently from Willow at Settings → Voice → Default Human language interpreter. Changing the default there will change where the text from Willow gets processed on the OH side.

I believe we have three options (if HABot is installed). See Multimedia | openHAB for details.

To use HABot you’ll need to have your semantic model well configured and perhaps some synonyms configured for certain items.

I don’t know much about the built in HLP beyond what’s said at the above link.

If you choose Rules-based the parsing of the text is up to you.
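To give an idea of what "up to you" means, here is a minimal Python sketch of pattern-based parsing. It is not openHAB rule syntax, and the phrase pattern and return shape are invented for illustration; a real rule would do the equivalent in whichever rule language you use and then send the command to the matching item.

```python
import re
from typing import Optional, Tuple

# Hypothetical phrase pattern for this sketch: "turn on/off the <thing>".
COMMAND = re.compile(
    r"^\s*turn\s+(on|off)\s+(?:the\s+)?(.+?)[.!]?\s*$", re.IGNORECASE
)


def parse_command(text: str) -> Optional[Tuple[str, str]]:
    """Map transcribed text to (item_label, state), or None if nothing matches."""
    match = COMMAND.match(text)
    if match is None:
        return None
    state, label = match.group(1).upper(), match.group(2).lower()
    return label, state


print(parse_command("Turn on the kitchen light."))
print(parse_command("What is the weather?"))
```

Each additional phrasing needs its own pattern, which is the main reason to consider HABot or the built-in interpreter first.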
