openHAB crashing with out of memory exception (Sonos, 1.7.1)

This seems similar to an earlier report (by @Bernd_Pfrommer):

However, at this point I have no reason to believe it’s related to Insteon or JSR223. I’m starting a new topic in case others are having this problem.

For some reason, in the last 48 hours my openHAB server has crashed several times after running out of memory. I have not seen this issue previously and cannot correlate it to any recent server changes. The memory consumption is shown in the following chart, which covers the last week. The two smaller spikes after the big one also killed the server.

The only hint is that the org.teleal.cling.DefaultUpnpServiceConfiguration class (used by the Sonos binding) is the first to start complaining about being out of memory in each incident. Of course, this doesn’t mean it’s the cause of the excessive usage.

I’ve added support for JMX memory and thread monitoring to my openHAB server so I can see what’s happening before the crashes and support alert generation when memory usage is unusual. If needed, I can have a rule do a memory dump when a spike is detected. I’ve read that the Eclipse Memory Analyzer (MAT) can determine the “retained size” of memory for each thread from a memory dump. I may try doing this analysis if the problem continues.
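As a rough sketch of the kind of JMX polling involved (this is illustrative, not my actual monitoring code), the standard `java.lang.management` beans expose the heap and thread numbers directly:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmStats {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Heap usage as reported over JMX -- the same numbers a remote
        // monitoring client (e.g. jconsole) would see.
        MemoryUsage heap = mem.getHeapMemoryUsage();
        long usedMb = heap.getUsed() / (1024 * 1024);

        // Live and peak thread counts; a sudden jump here is exactly the
        // kind of anomaly worth alerting on.
        int live = threads.getThreadCount();
        int peak = threads.getPeakThreadCount();

        System.out.printf("heap used: %d MB, threads: %d (peak %d)%n",
                usedMb, live, peak);
    }
}
```

The same beans are reachable remotely once the JVM is started with the usual `com.sun.management.jmxremote` options.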

Ditching a complicated xtext based rule made the out-of-memory problem go away. Do you have any xtext based rules running?

No, only JSR223 rules. I’m watching it closely, but it seems to be behaving normally for now.

Then I’m wondering whether it has anything to do with rules or bindings at all. Could it be something with the GUI (or other) clients connecting, and some caching mechanism leaking memory? What makes it run out of memory in one situation but not in others?

I couldn’t find any classes gobbling up large amounts of memory with jmap / jhat. Did you try jmap/jhat?

I haven’t tried that yet. If it continues to happen, I’m going to configure the JVM to do a memory dump when the out of memory exception is raised and then do some analysis with the Eclipse Memory Analyzer.

The interesting thing is that this was not a slowly increasing consumption of memory. The virtual memory size jumped by > 500 MB in about a minute.

Hi
As you mentioned the Sonos binding: at one point it did cause a memory leak in the upnp library (see the wiki). I’ve also found this thread about the problem: http://knx-user-forum.de/forum/supportforen/openhab/37306-upnp-verursacht-outofmemory-exception (German). There is a way to check for the problem (reply #7).

As all the information I’ve found about it was quite old (nothing after Jan. 2015), I’m not sure whether this is still a problem. I hope not, as I’m also using the binding, though only for a few days now…

Regards
Dieter

Thanks @DieterL, that was an interesting discussion. At this point I’m becoming more convinced that the problem is with the Sonos binding. This is what I saw in the log immediately before the OutOfMemoryError:

2015-11-15 23:02:33.072 [ERROR] [c.protocol.sync.ReceivingEvent] - Invalid subscription ID, no active subscription: (IncomingEventRequestMessage) NOTIFY /dev/RINCON_B8E937E4C19201400/svc/upnp-org/ZoneGroupTopology/event/cb.xml SEQUENCE: 0
2015-11-15 23:31:42.319 [ERROR] [c.protocol.sync.ReceivingEvent] - Invalid subscription ID, no active subscription: (IncomingEventRequestMessage) NOTIFY /dev/RINCON_B8E937E4C19201400/svc/upnp-org/ZoneGroupTopology/event/cb.xml SEQUENCE: 0
2015-11-15 23:31:42.486 [ERROR] [c.protocol.sync.ReceivingEvent] - Invalid subscription ID, no active subscription: (IncomingEventRequestMessage) NOTIFY /dev/RINCON_B8E937E4C19201400/svc/upnp-org/ZoneGroupTopology/event/cb.xml SEQUENCE: 0
2015-11-15 23:35:51.386 [WARN ] [efaultUpnpServiceConfiguration] - Thread terminated org.teleal.cling.registry.RegistryMaintainer@15a5352 abruptly with exception: java.lang.OutOfMemoryError: unable to create new native thread
2015-11-15 23:35:51.406 [WARN ] [efaultUpnpServiceConfiguration] - Root cause: java.lang.OutOfMemoryError: unable to create new native thread

The virtual memory spiked to 1.6 GB and dropped immediately, although not back to the level it was at before the spike. Neither the Java JMX heap nor the non-heap memory usage metrics showed a significant increase at the time.

Real memory usage also spiked. Surprisingly, openHAB didn’t crash and has continued running.

Given the warnings around each memory incident about upnp being unable to create native threads, I think the following chart of the number of active threads is interesting. Normally about 100 threads are running, but the count spiked to about 3000 right at the time of the OutOfMemoryError and then quickly dropped back to normal.

The 3000 threads could easily account for the spike in the virtual memory usage. I’m going to continue to monitor the situation, but I’m increasingly thinking that the Sonos binding is the culprit.
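For anyone who wants to see which pool the extra threads belong to without taking a full heap dump, a quick census grouped by thread-name prefix can be done with the standard `ThreadMXBean` (again just a sketch, not part of my monitoring setup):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

public class ThreadCensus {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Integer> byPrefix = new TreeMap<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) continue; // thread exited between the two calls
            // Group "cling-1", "cling-2", ... under one key by stripping
            // the trailing numeric suffix from the thread name.
            String prefix = info.getThreadName().replaceAll("\\d+$", "");
            byPrefix.merge(prefix, 1, Integer::sum);
        }
        byPrefix.forEach((name, count) ->
                System.out.println(count + "\t" + name));
    }
}
```

If a few thousand threads share one prefix (e.g. something Cling-related), that points straight at the offending pool.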

I just had a thought given the other Sonos discussion. They mentioned the problem happened when the Sonos is offline. I replaced a secondary wifi AP over the weekend. I’m wondering if the Sonos went offline during that switch. The timing seems about right and we hadn’t used the Sonos for the last 48 hours. I’ve reconnected the Sonos and I’ll see if this memory/thread problem happens again.

Hi Steve
I hope that solved the problem; keep me up to date. I’m not really interested in running into such a problem myself :wink:

Regards
Dieter

I’m about 98% sure that was the problem. With the exception of a small spike earlier today, the virtual memory usage is flat for the last 24+ hours. I suspect the Sonos as the cause of the small spike because the thread count also spiked, but nothing like when the Sonos wasn’t reachable.

Well, my server crashed again this morning. After doing more searching, it appears this is a known and relatively common problem with the Sonos binding and the old version of the Cling library that it uses. I’m going to start investigating how to fix this, starting with upgrading the upnp library to the latest version. In the meantime, I may write some rules to stop the bundle during the night when we aren’t using the Sonos. In the worst case, I’ll just remove the Sonos binding until I or someone else finds a fix.

As an aside, it’s slightly impressive that a RPi 2 Java process can be running 5000+ active threads before it dies. :slight_smile:

Here are some links to related discussions. I was surprised that I could find no open issues related to this problem. If someone knows an existing issue number, please let me know. Otherwise, I’ll create one in the next day or two.

Potential Memory leak? (Sonos?)
https://groups.google.com/forum/#!topic/openhab/8xIgCqBTHqQ

Hi @steve1,

Great to hear you are working on this issue. I have the same problem here and have removed the binding.

I personally unplug some of my Sonos speakers when I am away to avoid unnecessary standby power consumption. It would be great if the binding could handle offline speakers. :smile:
If I can help testing, please tell me.

Best regards
Ollie

Hi @steve1

That’s not what I wanted to hear from you :frowning:
Until now, I didn’t have this problem but after your posting, I’m waiting for it…

For now, I’m interested in how you manage to deactivate the binding from rules. Is it a real deactivation of the binding, or just a deactivation of the rules? I’m also thinking about unplugging the power of the Sonos when I’m away, but didn’t do it because of this problem. If that were possible, I’d also have other bindings to deactivate (Onkyo AV), one of which writes a log message every 3 seconds when the receiver has no power. It doesn’t cause any real problem, but…

Regards
Dieter

@oliver_kuhl,

I probably will need some help testing.

I have found the cause of the symptoms I’m seeing. The Cling upnp library uses a thread pool with unbounded size. Based on information I’ve seen in their forums, it was intentionally designed to be able to create threads until the machine or process runs out of resources. The rationale for this decision isn’t clear, but it’s obviously dangerous in an openHAB context. It is possible for the binding code to provide its own bounded thread pool. That would fix the current symptom, but it doesn’t fix the root cause of why so many upnp tasks are being created in the first place. We’d also need to decide what to do when the thread pool hits its maximum capacity.
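To illustrate what a bounded replacement could look like (a generic sketch, not the binding’s or Cling’s actual code), a `ThreadPoolExecutor` with a finite maximum size, a finite queue, and an explicit saturation policy degrades gracefully instead of exhausting native threads:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // At most 16 worker threads and 256 queued tasks. When both are
        // full, CallerRunsPolicy runs the task on the submitting thread,
        // which throttles producers instead of creating threads without
        // limit (or silently dropping work).
        ExecutorService pool = new ThreadPoolExecutor(
                4, 16, 30, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(256),
                new ThreadPoolExecutor.CallerRunsPolicy());

        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 10_000; i++) {
            pool.execute(completed::incrementAndGet); // stand-in for a upnp task
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println("completed: " + completed.get()); // completed: 10000
    }
}
```

The choice of saturation policy is the open question mentioned above: CallerRunsPolicy applies back-pressure, while an AbortPolicy would make the flood of tasks visible as exceptions instead.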

My server crashed again last night due to this problem. I’m going to add -XX:+HeapDumpOnOutOfMemoryError and related arguments to my openHAB JVM startup script. Hopefully, this will produce a heap dump when the error occurs. If so, I’ll be able to use the Eclipse Memory Analyzer to look at the thread stack traces and see which code is generating all the upnp requests. If we can identify the cause, then we’ll have a better idea whether or not a Cling upgrade will fix the problem.

Hi,

I like the general idea of using upnp for improved usability. However, the binding must work flawlessly with Sonos speakers that are temporarily unavailable. If the thread issue cannot be solved right now, I would prefer a working binding without upnp, using a static IP configuration instead.

I personally may not have enough knowledge about threads, and especially not about upnp, so unfortunately I’m not much help with the coding…

If you have a first version ready for testing, please send me a jar or a link to your git repo and I’ll compile it myself.

AFAICT from the Sonos forums, upnp is the only choice for interacting with the Sonos devices. For now, I’m hoping that once we identify the source of excessive threadpool usage we’ll be able to fix the Cling library behavior.

You may be right. I just had the simple php-script here in mind, which works without upnp, but seems to be limited: http://homematic-forum.de/forum/viewtopic.php?f=26&t=4693

So let’s try to fix upnp. :wink:

@steve1, maybe this could be an alternative to Cling?
http://bitletorg.github.io/weupnp/

In openHAB 2.0 / ESH, the SONOS binding uses jUPnP, which is a fork of cling 2.0.
It might be worth trying to refactor the Sonos 1.x binding to use jUPnP as well.

Best
Hans-Jörg

That’s a good idea. I can investigate that possibility.

I’m still waiting for the issue to happen again so I can hopefully get a thread dump and determine the root cause of the excessive thread creation.