What I can see is that the Java process grows to 4 GB, and at some point the web interface stops responding. At that point, even the SSH connection to the Karaf console goes down.
Yesterday, I made a memory dump and a jstack thread dump on the instance before it completely locked up.
What I found is many references to javax.jmdns.impl.ServiceEventImpl (see the capture below).
One more detail: I also experience more trouble if I enable the Chromecast binding.
In that case, the instance can lock up after 1 or 2 hours of uptime.
I don’t know if it’s related to the previous issue.
I also forgot to mention that I’m currently on a 5.1 snapshot build from September 13, but I already experienced this issue on 5.0.
Are you saying that you didn’t have these problems with 4.x, or just that it took longer to consume all memory?
From your thread dump, it seems clear that the Chromecast binding was enabled and is at least part of the problem. We have a deadlock:
"OH-safeCall-1" #555 [2819525] prio=5 os_prio=0 cpu=4445,48ms elapsed=21369,30s tid=0x00007fba70014a30 nid=2819525 waiting for monitor entry [0x00007fbb8d6fa000]
java.lang.Thread.State: BLOCKED (on object monitor)
at su.litvak.chromecast.api.v2.ChromeCast.disconnect(ChromeCast.java:170)
- waiting to lock <0x00000000996d14d8> (a su.litvak.chromecast.api.v2.ChromeCast)
at org.openhab.binding.chromecast.internal.handler.ChromecastHandler$Coordinator.destroy(ChromecastHandler.java:264)
The Chromecast binding also uses jmdns internally, so it could potentially be the cause of all the jmdns objects. I haven’t read through the whole stack trace yet; I’ll give more comments if I find more “issues”. But it would be very helpful if you could capture something similar without the Chromecast binding running, if you still experience problems without it.
The Chromecast binding is a known problem, and @lsiepel and I are currently trying to go through it and see if we can figure out what’s wrong with it. But it has been like this for a long time; I concluded several years ago that I had to stop using it, because it made my OH unstable.
We recently did a similar job on the Network binding, which was merged September 7, so it should be in your snapshot version. I’m thus hoping that the Network binding isn’t in use by you, or at least isn’t part of your problem.
I just found another deadlock, this one is between jmdns and the discovery service:
"JmDNS pool-27-thread-1" #871 [2819883] prio=5 os_prio=0 cpu=2478,52ms elapsed=21361,56s tid=0x00007fbb35624bf0 nid=2819883 waiting for monitor entry [0x00007fba205fe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl.thingDiscovered(DiscoveryServiceRegistryImpl.java:257)
- waiting to lock <0x00000000814f9128> (a org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl)
at org.openhab.core.config.discovery.AbstractDiscoveryService.thingDiscovered(AbstractDiscoveryService.java:318)
at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.createDiscoveryResult(MDNSDiscoveryService.java:227)
at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.considerService(MDNSDiscoveryService.java:214)
at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.serviceResolved(MDNSDiscoveryService.java:207)
at javax.jmdns.impl.ListenerStatus$ServiceListenerStatus.serviceResolved(ListenerStatus.java:117)
- locked <0x00000000911c3db0> (a javax.jmdns.impl.ListenerStatus$ServiceListenerStatus)
at javax.jmdns.impl.JmDNSImpl$1.run(JmDNSImpl.java:923)
Sadly, it’s not obvious which binding, if any, is involved in this deadlock. I can try to investigate the locks in question further and see if I can make sense of anything, but I think I already see a potential problem here: the jmdns threads (executing the listener code) are also used to execute the discovery itself. Since the listeners are called while jmdns holds internal locks, we should probably let the listener return ASAP and use other threads to perform the discovery.
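To illustrate what I mean, here’s a minimal sketch (all names hypothetical, not the actual openHAB or JmDNS code): the callback that JmDNS invokes just hands the event off to a separate executor and returns immediately, so no discovery work ever runs while JmDNS holds its internal locks.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: decouple discovery processing from the JmDNS
// callback thread by queueing events onto a dedicated executor.
public class AsyncDiscoveryListener {
    private final ExecutorService discoveryExecutor =
            Executors.newSingleThreadExecutor();
    private final Consumer<String> handler;

    public AsyncDiscoveryListener(Consumer<String> handler) {
        this.handler = handler;
    }

    // Called on a JmDNS thread while JmDNS holds internal locks:
    // just enqueue the work and return right away.
    public void serviceResolved(String serviceName) {
        discoveryExecutor.submit(() -> handler.accept(serviceName));
    }

    // Drain and stop the executor; returns true if it stopped cleanly.
    public boolean shutdown() throws InterruptedException {
        discoveryExecutor.shutdown();
        return discoveryExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

With this shape, even if the discovery registry is momentarily blocked, the JmDNS thread is never the one waiting on it, which breaks one side of the lock cycle above.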
I found another deadlock, this time with the freeboxos binding:
"OH-thingHandler-1" #1119 [2820187] daemon prio=5 os_prio=0 cpu=1209,68ms elapsed=21352,48s tid=0x00007fbb0c19df80 nid=2820187 waiting for monitor entry [0x00007fba2b8fe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.openhab.binding.freeboxos.internal.api.rest.FreeboxOsSession.getManager(FreeboxOsSession.java:162)
- waiting to lock <0x0000000085e328b0> (a org.openhab.binding.freeboxos.internal.api.rest.FreeboxOsSession)
at org.openhab.binding.freeboxos.internal.handler.FreeboxOsHandler.getManager(FreeboxOsHandler.java:109)
at org.openhab.binding.freeboxos.internal.handler.ApiConsumerHandler.getManager(ApiConsumerHandler.java:151)
at org.openhab.binding.freeboxos.internal.handler.ServerHandler.fetchSystemConfig(ServerHandler.java:168)
at org.openhab.binding.freeboxos.internal.handler.ServerHandler.internalPoll(ServerHandler.java:131)
at org.openhab.binding.freeboxos.internal.handler.ApiConsumerHandler.lambda$2(ApiConsumerHandler.java:236)
at org.openhab.binding.freeboxos.internal.handler.ApiConsumerHandler$$Lambda/0x0000000102e01618.run(Unknown Source)
I know nothing of this binding, but this seems to be unrelated to the other deadlocks, as far as I can tell. Since code is supposed to be written so that deadlocks are impossible, I’d say that somebody should take a look at this binding as well.
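For reference, one common way to make this class of deadlock survivable is to take the lock with a timeout instead of blocking forever, so a lock cycle degrades into a retryable failure instead of a permanent hang. A minimal sketch (hypothetical names, not the binding’s actual code):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: guard a shared session with tryLock(timeout)
// rather than a synchronized block that can wait indefinitely.
public class SessionGuard {
    private final ReentrantLock lock = new ReentrantLock();

    // Returns false if the lock cannot be acquired within the timeout,
    // letting the caller log and retry instead of deadlocking.
    public boolean withSession(Runnable action) throws InterruptedException {
        if (!lock.tryLock(5, TimeUnit.SECONDS)) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

This doesn’t remove the underlying ordering bug, but it stops one stuck thread from freezing everything behind it, and the timeout failures show up in the logs where they can be diagnosed.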
My oh my, here is another deadlock involving yet another binding, Teleinfo. This could potentially be the one causing the discovery service (and thus jmdns) to hang:
"OH-binding-teleinfo:serialcontroller:teleinfoserial" #1165 [2820248] daemon prio=5 os_prio=0 cpu=9998,65ms elapsed=21349,28s tid=0x00007fbb0c114430 nid=2820248 waiting for monitor entry [0x00007fba1b8fe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl.thingDiscovered(DiscoveryServiceRegistryImpl.java:257)
- waiting to lock <0x00000000814f9128> (a org.openhab.core.config.discovery.internal.DiscoveryServiceRegistryImpl)
at org.openhab.core.config.discovery.AbstractDiscoveryService.thingDiscovered(AbstractDiscoveryService.java:318)
at org.openhab.binding.teleinfo.internal.TeleinfoDiscoveryService.detectNewElectricityMeterFromReceivedFrame(TeleinfoDiscoveryService.java:132)
at org.openhab.binding.teleinfo.internal.TeleinfoDiscoveryService.onFrameReceived(TeleinfoDiscoveryService.java:111)
at org.openhab.binding.teleinfo.internal.handler.TeleinfoAbstractControllerHandler.lambda$0(TeleinfoAbstractControllerHandler.java:49)
at org.openhab.binding.teleinfo.internal.handler.TeleinfoAbstractControllerHandler$$Lambda/0x0000000101341188.accept(Unknown Source)
I’m not sure about 4.x; I’ve been on 5.x for a long time, since I always use snapshot versions.
I also suspect that some aspects of my setup may be triggering the issue faster: I’ve got a lot of UPnP / DLNA / Chromecast devices at home, and also some Squeezebox devices and a Lyrion Media Server with bridges between UPnP / Chromecast / AirPlay / etc. So it’s a pretty complex setup.
Yes, I made the dump before uninstalling the Chromecast add-on.
But the issue also occurs without it.
I restarted my instance this morning, and I just found it deadlocked.
I will try to make another snapshot.
As far as I remember, no, I don’t use the Network binding.
But I will check that after restarting, because I can’t verify it right now!
I found multiple threads being deadlocked in the same places as the ones I’ve already posted, so all the issues seem to have happened repeatedly. That itself isn’t surprising, though.
A thread dump without the Chromecast binding would be useful, but I suspect that we will still find the Teleinfo and FreeboxOS bindings deadlocked. It would be interesting if you could run OH without these as well and see if the problem goes away. I realize that you might “need” them, but if you can do without them at all, it would help pin down the problems.
I found 21 deadlocks in your thread dump in total, and while that is a lot, it doesn’t explain why you run out of memory. I found 424 threads in all, which isn’t that high and should only consume some… 400 MB of memory. But what probably happens is that these deadlocked threads starve some of the thread pools, so those pools stop performing tasks and their queues grow indefinitely.
Those queues will, in the end, consume all available memory. Thus, exactly which objects exist in the highest numbers might not be that important; they probably exist because they are stuck in a non-moving queue. This probably isn’t technically a memory leak (which is very difficult to achieve in Java), but the effect is much the same.
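The mechanism is easy to demonstrate in isolation (a standalone demo, not openHAB code): a pool whose single worker is stuck accumulates every later task in its unbounded queue, and those queued tasks keep everything they reference alive.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Demo: a single-threaded pool whose worker is blocked never drains its
// queue, so submitted tasks pile up -- the "not technically a leak" effect.
public class StarvedPoolDemo {
    public static int queuedAfterBlocking(int submissions) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch block = new CountDownLatch(1);
        // One worker thread, unbounded work queue -- the same shape as
        // many default executor configurations.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.execute(() -> {
            started.countDown();
            try {
                block.await();               // simulates a deadlocked worker
            } catch (InterruptedException ignored) {
            }
        });
        started.await();                     // the lone worker is now stuck
        for (int i = 0; i < submissions; i++) {
            pool.execute(() -> { });         // these tasks just pile up
        }
        int queued = pool.getQueue().size(); // everything is still queued
        block.countDown();                   // release the worker again
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return queued;
    }
}
```

With a few event sources feeding such a pool for hours, the queued tasks (and the ServiceEventImpl-style payloads they hold) can easily account for gigabytes.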
…I realized that the Teleinfo binding might not be part of the problem, but merely a victim. So maybe disabling the FreeboxOS binding (together with the Chromecast binding) is all that is needed.
I’m looking at the FreeboxOS binding, and I’ve found some “synchronization errors” there that can lead to deadlocks. Whether or not they actually happen might depend on local factors, but I’m in no doubt that there are problems. I’m currently trying to make a fix without having to get to know the binding intimately; I don’t know if that’s possible, but I’ll try.
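As an example of the kind of fix I have in mind (hypothetical names, not the binding’s actual code): keep the manager cache in a ConcurrentMap and construct the manager without holding any monitor, so a slow or blocking creation can’t stall every other caller the way a fully synchronized getManager() can.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: lock-free caching of lazily created managers.
public class Session {
    private final ConcurrentMap<String, Object> managers = new ConcurrentHashMap<>();

    public Object getManager(String key) {
        Object manager = managers.get(key);
        if (manager == null) {
            // Construction happens outside any monitor, so even a slow
            // (or blocking) creation cannot stall other callers.
            Object created = createManager(key);
            manager = managers.putIfAbsent(key, created);
            if (manager == null) {
                manager = created; // we won the race; our instance is cached
            }
            // If we lost the race, our duplicate is simply discarded.
        }
        return manager;
    }

    protected Object createManager(String key) {
        return new Object(); // placeholder for the real manager construction
    }
}
```

The trade-off is that two racing threads may both build a manager and one copy gets thrown away; that’s usually acceptable, and it removes the session-wide monitor that threads were blocked on in the dump above.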