OH4 runs out of memory

I think this Stack Exchange discussion may provide more clues relevant to the behaviour you are seeing, and to why some devices are resolving to an IPv6 address: when IPv6 is enabled, resolution is tried for it first by default (Debian seems to have enabled it for both by default). At least in my clean Debian 12 build, the relevant line in nsswitch.conf is:

hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
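Since the resolver order configured in nsswitch.conf flows through to applications, a quick way to see which address families come back, and in what order, is to ask the JVM directly. This is only an illustrative sketch (the results for `localhost` and their ordering are environment-dependent); note also that the `java.net.preferIPv4Stack` and `java.net.preferIPv6Addresses` system properties influence which family the JVM prefers.

```java
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;

public class ResolveOrder {
    public static void main(String[] args) throws Exception {
        // Ordering reflects the platform resolver plus JVM preferences
        // (-Djava.net.preferIPv6Addresses=true moves IPv6 results first).
        for (InetAddress addr : InetAddress.getAllByName("localhost")) {
            String family = (addr instanceof Inet4Address) ? "IPv4"
                    : (addr instanceof Inet6Address) ? "IPv6" : "other";
            System.out.println(family + ": " + addr.getHostAddress());
        }
    }
}
```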

And of course, this Wikipedia article also adds to the understanding of the behaviours being seen.

This link discusses Unix socket TIME_WAIT behaviour as well.

http://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html#time_wait

All of which is true.

The main point here is that I didn't deliberately enable IPv6, and had no expectation that my devices would communicate over IPv6. They all have IPv4 addresses. For whatever reason the discovery routines are finding them on IPv6, not noticing they're an existing device that we already know on IPv4, attempting to communicate with them over IPv6, failing, then leaking the socket.

We can probably agree that it'd be better in my case if it wasn't trying to use IPv6. I can control that myself, but for other people that may be beyond their experience or what they want to do. And ultimately, irrespective of the cause, when discovery fails for whatever reason it should tidy up after itself.

Interestingly, I turned off IPv6 on my Raspberry Pi. And I'm still getting the leak, but for a different reason:

2023-10-14 16:43:31.071 [DEBUG] [discovery.ShellyDiscoveryParticipant] - shellyplus1-083af2029730: Shelly device discovered: IP-Adress=[fe80:0:0:0:a3a:f2ff:fe02:9730], type=shellyplus1
2023-10-14 16:43:31.086 [DEBUG] [g.shelly.internal.api2.Shelly2ApiRpc] - shellyplus1-083af2029730: Connect Rpc Socket (discovery = true)
2023-10-14 16:43:31.087 [DEBUG] [helly.internal.api2.Shelly2RpcSocket] - shellyplus1-083af2029730: Connect WebSocket, URI=ws://[fe80:0:0:0:a3a:f2ff:fe02:9730]/rpc
2023-10-14 16:43:31.103 [TRACE] [shelly.internal.api.ShellyHttpClient] - shellyplus1-083af2029730: HTTP POST for http://[fe80:0:0:0:a3a:f2ff:fe02:9730]/rpc {"id":787696111,"src":"shellyplus1-083af2029730","method
":"Shelly.GetConfig"}
Accept-Encoding: gzip
User-Agent: Jetty/9.4.50.v20221201
Content-Type: application/json; charset=UTF-8


2023-10-14 16:43:31.104 [DEBUG] [g.shelly.internal.api2.Shelly2ApiRpc] - shellyplus1-083af2029730: WebSocket error
2023-10-14 16:43:31.105 [TRACE] [shelly.internal.api.ShellyHttpClient] - shellyplus1-083af2029730: API call returned exception
org.openhab.binding.shelly.internal.api.ShellyApiException: java.util.concurrent.ExecutionException: java.nio.channels.UnsupportedAddressTypeException
        at org.openhab.binding.shelly.internal.api.ShellyHttpClient.innerRequest(ShellyHttpClient.java:213) ~[?:?]
        at org.openhab.binding.shelly.internal.api.ShellyHttpClient.httpPost(ShellyHttpClient.java:144) ~[?:?]
        at org.openhab.binding.shelly.internal.api2.Shelly2ApiRpc.rpcPost(Shelly2ApiRpc.java:1209) ~[?:?]
        at org.openhab.binding.shelly.internal.api2.Shelly2ApiRpc.apiRequest(Shelly2ApiRpc.java:1153) ~[?:?]
        at org.openhab.binding.shelly.internal.api2.Shelly2ApiRpc.getDeviceProfile(Shelly2ApiRpc.java:169) ~[?:?]
        at org.openhab.binding.shelly.internal.discovery.ShellyDiscoveryParticipant.createResult(ShellyDiscoveryParticipant.java:149) ~[?:?]
        at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.createDiscoveryResult(MDNSDiscoveryService.java:222) ~[?:?]
        at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.considerService(MDNSDiscoveryService.java:214) ~[?:?]
        at org.openhab.core.config.discovery.mdns.internal.MDNSDiscoveryService.serviceResolved(MDNSDiscoveryService.java:207) ~[?:?]
        at javax.jmdns.impl.ListenerStatus$ServiceListenerStatus.serviceResolved(ListenerStatus.java:106) ~[?:?]
        at javax.jmdns.impl.JmDNSImpl$1.run(JmDNSImpl.java:911) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: java.nio.channels.UnsupportedAddressTypeException
        at org.eclipse.jetty.client.util.FutureResponseListener.getResult(FutureResponseListener.java:118) ~[?:?]
        at org.eclipse.jetty.client.util.FutureResponseListener.get(FutureResponseListener.java:101) ~[?:?]
        at org.eclipse.jetty.client.HttpRequest.send(HttpRequest.java:732) ~[?:?]
        at org.openhab.binding.shelly.internal.api.ShellyHttpClient.innerRequest(ShellyHttpClient.java:180) ~[?:?]
        ... 15 more
Caused by: java.nio.channels.UnsupportedAddressTypeException
        at sun.nio.ch.Net.checkAddress(Net.java:161) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.checkRemote(SocketChannelImpl.java:816) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:839) ~[?:?]
        at org.eclipse.jetty.client.AbstractConnectorHttpClientTransport.connect(AbstractConnectorHttpClientTransport.java:100) ~[?:?]
        at org.eclipse.jetty.client.HttpClient$1.connect(HttpClient.java:665) ~[?:?]
        at org.eclipse.jetty.client.HttpClient$1.succeeded(HttpClient.java:642) ~[?:?]
        at org.eclipse.jetty.client.HttpClient$1.succeeded(HttpClient.java:634) ~[?:?]
        at org.eclipse.jetty.util.SocketAddressResolver$Async.lambda$resolve$1(SocketAddressResolver.java:186) ~[?:?]
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[?:?]
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[?:?]
        ... 1 more
2023-10-14 16:43:31.121 [DEBUG] [discovery.ShellyDiscoveryParticipant] - shellyplus1-083af2029730: Device discovery of device  with address [fe80:0:0:0:a3a:f2ff:fe02:9730] failed: java.nio.channels.UnsupportedAddressTypeException(class java.nio.channels.UnsupportedAddressTypeException)

It's still finding the device on IPv6, but now the network stack throws an exception and refuses to attempt the connection. And the WebSocket still appears to be leaked.
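For context on why this particular address fails: fe80::/10 is the link-local IPv6 prefix, and such an address is only usable together with a scope/zone ID (e.g. %eth0), which the discovery result doesn't carry; and with IPv6 disabled on the host, the JVM's channel layer can reject the target outright with UnsupportedAddressTypeException. A small sketch (my own illustration, not binding code) showing how such addresses can be recognised before any connect attempt:

```java
import java.net.InetAddress;

public class LinkLocalCheck {
    public static void main(String[] args) throws Exception {
        // Parsing a literal IPv6 address does not perform a DNS lookup.
        InetAddress addr = InetAddress.getByName("fe80:0:0:0:a3a:f2ff:fe02:9730");
        // fe80::/10 is link-local: unusable without a zone ID such as %eth0
        System.out.println("link-local: " + addr.isLinkLocalAddress()); // prints "link-local: true"
    }
}
```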

Yes, I agree the binding's behaviour has changed. Without knowing what changed (if anything) in your specific environment, we can only guess why it suddenly started leaving WebSockets open. I have been trying to reproduce this condition since the original post from the other user, and I have been reviewing his heap dump and other data; as an additional data point, his issues also seemed to point back to discovery-related problems.
I do not use Shelly devices in my production home automation solution; I use Zigbee and Z-Wave.
That is why I previously mentioned that discovery did not work for me in the 4.x version but worked fine in 3.4.x. Since I had never had Shelly devices in my environment, I found it even more interesting that they were auto-discovered fine on 3.4.x and not at all on any 4.0 version using the same hardware. That is another data point to look at, since the transition from 3.4.x to 4.0.x included an application change as well as a Java version change.
However, the issue was interesting enough that I went to Amazon, bought two Shelly devices, and added them to my test environment just to see if I could reproduce the problem. I still suspect this is a larger issue than just Shelly devices, but until I can reproduce it I cannot say for sure whether that is true. The point I am making is that the timed wait on the sockets is a known issue that occurs in other situations and can have more than one cause. As the last link I posted discusses, network packet drops, such as those induced by weak Wi-Fi, network saturation, or interference from other wireless transmitters, are all possible causes of WebSockets hanging around for extended periods. Improved error handling in this binding is definitely needed, and the binding maintainer has a few updated versions in the pipeline based on what I read, so perhaps this WebSocket clean-up is another fix that will get implemented.

Agree in general.

I've reverted my IPv6 disabling, as it makes my MQTT unhappy (it's trying to bind to an IPv6 address for some reason, which I could probably also disable, but since disabling IPv6 doesn't fix the leak anyway there's not much point).

I think the key point, which started this whole thread, is to make sure all exception handlers clean up the WebSockets; otherwise we get leaks. And I think this particular one is traceable to an exception path that didn't clean up the WebSocket.

I have no real ability to build and test, and Java isn't a language I've spent any time with. I could probably have a go at a pull request, but in reality I'd just be chucking stuff at a wall that someone who knows what they're doing could do far better.

All of which is fine. I can put in a daily overnight restart of openHAB, and that will probably work adequately for the interim.

Did one of you test the most recent binding change, the one about WebSocket closing? It was merged 4 days ago.

OK, working on that. It triggers a dependency on Gson, so I'm working through how to resolve that.

Resolved Gson. But now I have an unresolved dependency on slf4j. It looks to me like slf4j should always be included already, so I suspect I'm breaking my installation. I can't find a plain org.slf4j.jar, only things like org.slf4j-api.jar.

I think I'm out of my depth here, unfortunately. I'm not sure I want to just randomly pull in additional modules until it works.

I've also downloaded the entire snapshot build and unpacked it. There's no slf4j jar in there, so I'm dubious that it's required. Can anyone provide a hint as to what I may be missing?

OK, it looks like I can install an entire snapshot version using apt. That seems safer in terms of it working, less safe in terms of upgrading my whole installation. Ah well.

I'm completely out of my depth and amazed by your debugging attempts. :slightly_smiling_face:

The only thing I can add is that there was a change with slf4j, although if you are on a later snapshot it was reported as fixed: openHAB 4.1 Milestone discussion - Setup, Configuration and Use / News & Important Changes - openHAB Community

I will wait to see what Paul's results are as well, but for me, using a fresh snapshot install on a freshly built, fully updated Debian 12 image with the Zulu 17 JDK installed via apt, and the latest openHAB snapshot (build 3675) installed via apt with the Shelly binding added during initial setup, the console reports the binding as version 4.1.0.202310130405. I see no change in discovery; manually adding a new Shelly Thing works OK. Still unable to repro a leak; no orphan sockets.
On a separate note, using openHAB 3.4 and Java 11, discovery works fine; using openHAB 3.4 and Java 17 (yes, I know that was not a supported configuration), discovery does not find any device.
As far as I have been able to determine from a testing perspective:
- openHAB 3.4 with the recommended dependencies and a normal build seems to work and discovers Shelly devices fine.
- Any version of openHAB 4.x I have tested (using the exact set of instructions in the documentation) does not work with discovery on a freshly built clean load of openHAB (not an upgrade), regardless of the underlying OS. I saw the same behaviour on Linux as well as Windows. Same hardware for all tests.
- Heap dump review does not indicate any anomalies, and 168 threads were counted.
On a side note, I really like the new 4.1; it is sweet being able to toggle the logging via the GUI.

For me, same behaviour with the snapshot build. I have openHAB 4.1.0 (snapshot build #3675).

The Shelly binding reports being 202310130405:

openhab> list -s |grep shelly
280 x Active x  80 x 4.1.0.202310130405     x org.openhab.binding.shelly

I'm seeing thread growth in increments of 8, and the same symptom: it is associated with a failed discovery and a "WebSocket error" in the log at the time the thread count grows.

I think it makes sense to do some sort of WebSocket destroy, as I noted in the comment above: OH4 runs out of memory - #79 by PaulL1

Sorry guys, I was travelling and therefore couldn't participate in debugging, but the great community already made progress.

That's a good finding and matches the symptoms with the Shelly Wall Display, which is not active as a Thing but still causes the problem, because it sends mDNS discovery packets.

I need to place the api.close() in a finally block so it gets called when the request was processed successfully, but also when it failed.

            ShellyApiInterface api = null; // declared outside the try block so the finally clause can close it
            try {
                api = gen2 ? new Shelly2ApiRpc(name, config, httpClient)
                        : new Shelly1HttpApi(name, config, httpClient);
                api.initialize();
                profile = api.getDeviceProfile(thingType);
                logger.debug("{}: Shelly settings : {}", name, profile.settingsJson);
                deviceName = profile.name;
                model = profile.deviceType;
                mode = profile.mode;
                properties = ShellyBaseHandler.fillDeviceProperties(profile);
                logger.trace("{}: thingType={}, deviceType={}, mode={}, symbolic name={}", name, thingType,
                        profile.deviceType, mode.isEmpty() ? "<standard>" : mode, deviceName);

                // get thing type from device name
                thingUID = ShellyThingCreator.getThingUID(name, model, mode, false);
            } catch (ShellyApiException e) {
                ShellyApiResult result = e.getApiResult();
                if (result.isHttpAccessUnauthorized()) {
                    logger.info("{}: {}", name, messages.get("discovery.protected", address));

                    // create shellyunknown thing - will be changed during thing initialization with valid credentials
                    thingUID = ShellyThingCreator.getThingUID(name, model, mode, true);
                } else {
                    logger.debug("{}: {}", name, messages.get("discovery.failed", address, e.toString()));
                }
            } catch (IllegalArgumentException e) { // maybe some format description was buggy
                logger.debug("{}: Discovery failed!", name, e);
            } finally {
                // always release the API (and its WebSocket), on success as well as on failure
                if (api != null) {
                    api.close();
                }
            }

And this is also a good catch:

The binding does not currently support IPv6. Therefore I need to check the address family on discovery requests and refuse IPv6 addresses.
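A minimal sketch of the kind of family check described here (my own illustration with a hypothetical isSupported helper, not the binding's actual code): accept IPv4 discovery results only and refuse everything else.

```java
import java.net.Inet4Address;
import java.net.InetAddress;

public class AddressFilter {
    // Accept only IPv4 targets; IPv6 (and anything unparsable) is refused.
    static boolean isSupported(String host) {
        try {
            // For address literals this parses without a DNS lookup.
            return InetAddress.getByName(host) instanceof Inet4Address;
        } catch (Exception e) {
            return false; // unparsable address -> refuse
        }
    }

    public static void main(String[] args) {
        System.out.println(isSupported("192.168.1.20"));                  // true
        System.out.println(isSupported("fe80:0:0:0:a3a:f2ff:fe02:9730")); // false
    }
}
```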

I'll look into that at the weekend.


That's awesome, thanks @markus7017. Let me know when it's built through into a snapshot for testing and I can try it out.

I'll try to work on this at the weekend.


Please try this build. It's based on the latest merge fixing one issue, plus the 2 changes mentioned above, but it doesn't include my pending PR changes. If the fix works, I'll merge it with the latest changes. So use this build as a temporary trial (save the old jar, then remove it from the addons folder and check with bundle:list).
https://github.com/markus7017/myfiles/blob/master/shelly/org.openhab.binding.shelly-4.1.0-SNAPSHOToom.jar?raw=true

The IPv6 issue should be fixed, but note that IPv6 is not supported by the binding.

Thanks, I've installed it and am testing.

Sometime in the last couple of days the thread count stopped increasing. It wasn't associated with a config or install change; it just stopped, which is very unusual. Which is another way of saying that a test may not be proof, if the count can stop increasing on its own without a code change. Having said that, it also looks like the "WebSocket error" message stopped occurring in the log. So presumably, if I see that message without the thread count changing, then the fix works. I'll see today whether that message is still occurring.

For those who may also want to test, the process here is:

  1. Assuming you're using apt to install: change your sources.list to include unstable. In my case that's in /etc/apt/sources.list.d/openhab.list, and the line looks like:
deb [signed-by=/usr/share/keyrings/openhab.gpg] https://openhab.jfrog.io/artifactory/openhab-linuxpkg unstable main
  2. Upgrade, which should give you the snapshot version.

  3. Go to the openHAB console: ssh -p 8101 openhab@localhost
    (password is habopen)

  4. Turn on trace logging for the binding:

log:set TRACE org.openhab.binding.shelly
  5. Upgrade the binding to the version provided by Markus above:
openhab> bundle:list |grep Shelly
280 x Active x  80 x 4.1.0.202310211629     x openHAB Add-ons :: Bundles :: Shelly Binding Gen1+2
openhab> bundle:update 280 https://github.com/markus7017/myfiles/blob/master/shelly/org.openhab.binding.shelly-4.1.0-SNAPSHOToom.jar?raw=true
  6. I always restart my openHAB so I can start clean: sudo service openhab restart

  7. Then you need to get the process id of your openHAB instance:

ps -ef |grep openhab
  8. Use the process id to get a count of the threads that refer to "WebSocket":
while true; do date; sudo jstack 9868 | grep WebSocket | wc -l; sleep 60; done >> threadcount2.txt &  tail -f threadcount2.txt
  9. Check the openHAB log to see if there are any "WebSocket error" messages turning up:
fgrep "WebSocket error" /var/log/openhab/openhab.log

That's about it.
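As an aside, if you'd rather count from inside the JVM than shell out to jstack, a sketch like the following (my own hypothetical helper; it would need to run inside the openHAB process, e.g. via the script console) counts live threads whose name mentions "WebSocket":

```java
public class ThreadCount {
    // Count live JVM threads whose name mentions "WebSocket",
    // roughly what `jstack <pid> | grep WebSocket | wc -l` reports.
    static long countWebSocketThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().contains("WebSocket"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        // Demo only: start a short-lived thread with a WebSocket-style name
        Thread demo = new Thread(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) {}
        }, "WebSocketClient-demo");
        demo.start();
        System.out.println(countWebSocketThreads());
        demo.join();
    }
}
```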

Did you install the new build?

I did. No errors or additional threads so far. Would there be anything different in the log that would tell me I'm definitely running the right version?

That's the expected behaviour :heart_eyes:
Check whether bundle:list shows you the build timestamp.

OK, bundle timestamp: 4.1.0.202310211629

So it's the new one. Would the changes have removed the WebSocket error, or just stopped it leaking when the error occurs? I haven't seen any discovery messages at all since the upgrade - which fixes it, I guess, just not what I expected.

The 2 mentioned above, plus another one on disconnect handling, which was already merged.