openHAB crashing with out of memory exception (Sonos, 1.7.1)

steve1 · December 4, 2015, 7:00pm

If that’s the case, it might not be worth the effort to fix the OH1 binding since it will be obsolete relatively soon.

Marty56 · December 5, 2015, 6:29pm

You are an optimist. with respect of the quality of the Sonos in OH2.
I will start to believe when I see it working

DieterL · December 7, 2015, 7:17am

Hi

I personally start thinking that the root-cause for the problem is probably not the binding but something else. I came to this idea because I didn’t have problems with the binding myself so far, even if the system runs for weeks (hopefully it will stay this way…).
Even when I say that my OH runs on a HP Proliant Server contains 8GB RAM, I don’t think that it could handle 5000 threads. As OS I use the Ubuntu 14.04LTS server distribution but I don’t use OpenJDK but the Oracle Java (guess V8, but not sure) as there were other problems with the OpenJDK that comes with the distribution.
Also the problems seems to be “event triggered” when looking at the diagrams @steve1 provided. So probably something happens in your systems (network, runtime, (other) binding) that causes the sonos binding to crash but did not occur on my system.
What do you think about this?
Regards
Dieter

steve1 · December 7, 2015, 11:34am

I ran the binding for a couple of months before the problem started occurring regularly. There were several changes like adding additional Sonos devices and changing one of the wifi access points. After that, openHAB would crash about every 24-36 hours with an out of memory error. This is the same behavior that’s been reported by others.

I’m running on a Raspberry PI 2 with 1 GB RAM and it was able create 5000+ threads before crashing. Even then it didn’t always crash. Sometimes it survived although the memory consumption was higher than before the incident. I’d expect 8 GB RAM and the extra virtual memory availability would increase the survivability of the server.

I’m also using Oracle Java 8.

I’m 100% sure it’s the Sonos binding based on memory analysis (thread dumps) of the crashed process. It seems likely to be that it’s an interaction between the binding and network events/status related to the Sonos devices. In my opinion, that’s a binding problem. The current UPNP library used by the binding allows an unbounded task queue size. That’s dangerous but the root case is the binding generating too many tasks. Even with the new UPNP implementation that uses a bounded queue, the excessive task generation was indirectly consuming large amounts of disk space due to logged warnings about rejected tasks.

My wife and I run Sonos controllers on our Macs and iOS devices and don’t have this issue with that software in the same network context as the openHAB crashes. I happy the binding is working well for you and I hope your good fortune continues.

DieterL · December 7, 2015, 12:20pm

I don’t really doubt that the problem has something to do with the binding (or the upnp library) but in my oppinion there must be something else like you’ve said network issues that I don’t have yet. There might be one thing as I currently only have 1 sonos device. Maybe I have to buy some more to get the problem as well… (another good reason to spend the money ;-))

guessed · December 7, 2015, 3:52pm

It’s interesting to see what the Sonos Binding is reaching out to on your Network during it’s UPnP discovery phase.

I was installing something on Saturday, and it required me to run a special tool on my Mac. In the middle of this, I noticed a bunch of errors in my openHAB logs… all pointing to my Mac. It looked like the UPnP Discovery process used by the Sonos Binding was reaching out to a few more devices than I’d planned.

These problems were only visible because I’m running this in my logback.xml (to debug something else):

<logger name="org.apache.http" level="TRACE" />

These calls were occurring every minute, so perhaps there’s something on the Network that’s triggering a bug in the Sonos code.

I haven’t looked, but there may also be a way to disable the UPnP discovery, and instead rely upon the manually configured list.

Marty56 · December 8, 2015, 8:16pm

Confirmed. Also the new binding is not working.
Have to switch back to my php workaround.
a bit disappointing that the openhab poster child binding is not working.

steve1 · December 8, 2015, 8:21pm

The binding does discover all UPnP devices on the LAN. For example it discovers my Synology NAS, Fire TV, and so on. However, from what I see in the code it should be ignoring those devices once they are discovered. It’s not clear to me at this point if the discovery aspect of UPnP is the cause of the problems or just some periodic task in the binding that transitions to a bad state and starts generating numerous tasks.

guessed · December 9, 2015, 8:19am

@steve1,
My logic here was that, given the bogus calls it was making to my Mac, it’s possible it’s making similar calls to something else that causes it to mis-behave in some manner.

In my case, it seems benign, it’s making a call to myMac/upnp, which wasn’t part of the SSDP advertisement. If you’re seeing leakages over time, then it might pay to enable the extra logging, and see if there’s a correlation between the machine becoming overwhelmed, and what’s being called at/near that point in time.

Anyhow, the extra logs would give you the data to say either way, since it goes on silently otherwise.

steve1 · December 9, 2015, 11:33am

I have a memory dump of the JVM process when the issue is occurring. I used it to verify that the Sonos binding was the culprit. I also had a (very large) log file for the jupnp-based binding that showed the command that was overflowing the task queue but I deleted it since it was about to fill up my server disk. I should have noted the command but at that point I was convinced I wasn’t going to be running this binding any more or investing more time trying to fix it given it will be soon replaced by the OH2 version. I’ve also seen some indications that it’s not always the same command.

bernd_boehm · January 6, 2016, 5:16am

In the meantime an upgrade of the Java VM to the most recent 1.8 fixed this issue (for now)

steve1 · January 6, 2016, 11:09am

Can you give more details? I don’t know how a JVM upgrade can fix logic problems in the code.

DieterL · January 6, 2016, 11:38am

Even if I can’t really think off it as a solution, it might be possible…

As long as we don’t know the root cause (why the threads were created, or have i missed something) it could also be a problem within the network stack of the jvm.

I still did not get the problem with my installation but I’ve had a (strange) problem last week: I didn’t get status updates from the Sonos to my item, especially the Play/Pause switch. Interestingly for me the switch itself worked and could control the device, only if I changed the status manually on the device or via Sonos App, there were no events in OH. But after rebooting the server (due to kernel update) everything works again.

Regards
Dieter

steve1 · January 6, 2016, 12:32pm

Sure, it’s not impossible, but I think it’s very unlikely. In any case, it would be good to know the context of this claim (operating system, specific version of JVM, which JVM bug fix is believed to fix the bindign problem, etc.). On the other hand, it wouldn’t hurt to upgrade the JVM and see if openHAB continues crashing or not. I’m also wondering about the “for now” qualification. What behavior was he seeing before (maybe it was a different issue, like yours) and how did he verify it’s been fixed? It sometimes took 3-5 days before the Sonos binding crashed the server.

bernd_boehm · January 7, 2016, 9:20am

So my 2 cents about this: The problem is apparently a memory leak in the uPnP library. Since JVM 1.8 there were some changes in the garbage collection and the general memory management. It may be that these changes can compensate for the memory hole.
My weak statement: “for now” only means: I used the binding previously without any problems - even though it had the bug ever since. Then at some point I got the OoM issues - and I can’t tell the reason. →

Now I upgraded the JVM from 1.7 to 1.8_65 and the issue didn’t occur for 4 days now. This is a lot better than before where I could only run OH for about 12h without restart. Even more - I could not use the binding at all. Even if it didn’t crash yet - the sonos remained silent.

The specific setup is: Raspi 2 on Raspbian (most recent kernel, can’t check the exact version currently) OH 1.7.1, JVM 1.8_65 (Oracle)

steve1 · January 7, 2016, 11:36am

Thanks for extra information. The problem I analyzed was not a memory leak (where a “leak” is memory usage that should have been reclaimed, but wasn’t). The OOM condition occurred because 5000+ active threads were created to handle thousands of mostly identical tasks being created by the uPnP library in a short period of time. Each of these threads consumes native memory for the thread stack and this causes the crash. The native memory usage is not subject to garbage collection and since the threads were active, even the native memory could not be released through other memory management. The native memory usage is why we saw that the Java heaps were not excessively large at the time of the OOM condition.

My setup is RPi 2 (Raspbian/wheezy) , Java 8, OH 1.7.1 + 1.8.0 patches for several bindings.

As described earlier in this thread, I also have a modified version of the Sonos binding that uses the openhab fork of the jupnp UPnP library. However, the excessive task creation issue is still there. It just fills up the disk with log messages instead of running out of memory (because the task queue is bounded and writes to the log when it overflows).

At this point, based on the log file information about the type of tasks being initiated (mostly “SendingRenewal” tasks), it appears that the issue is in the UPnP library and not the binding layer. Unfortunately, if this is true it probably still exists in the ESH Sonos binding. Since the OH1 Sonos binding is no longer actively maintained, I’m planning to do some testing with the ESH Sonos implementation to determine if the problem is solved there or not.

bernd_boehm · January 9, 2016, 5:29am

Update: I guess it was wishful thinking… the OoM exception is back again. So you were right steve, my issue is the same. Myriads of threads…

steve1 · January 9, 2016, 1:24pm

@bernd_boehm I’m sorry to hear that. I was hoping for the best since I’d like to use this binding. I’m still planning to try it with OH2/ESH but I don’t know any reason why the UPnP layer will behave differently given I tried the same UPnP layer with the OH1 binding and had similar problems.

Zvika · July 14, 2016, 3:08pm

Hi,

My Sonos is on its way…
I assume this problem is still relevant? (at least, the Wiki says so)

If so, maybe it’s worthy to consider fixing it in OH1, since it seems that it will take time until OH2 will really replace OH1.

I’ve seen a possible workaround, mentioned at https://community.openhab.org/t/sonos-tts-and-say/7661 -

@Lolodomo

On my side, I continue using the Sonos Vera plugin through the MiOS binding. And so I can use Microsoft translator.

Does it mean that we can use the Sonos Vera plugin instead of the buggy default Sonos plugin?
If I’m correct, could someone modify the Wiki to recommend this (and explain how, for newbies ) ?

Thanks

Marty56 · July 14, 2016, 3:52pm

The songs binding 1.0 is not maintained and I have given up the hope that it will ever be fixed.
Whether the version 2.0 is more stable is not clear for me.

The Sonos Vera plugin would require a dedicated Mimos bridge.
This seems to be an very expensive workaround.

I am using PHP scripts which are started via http post.
This solution is not perfect but at least openhab is not crashing.