openHAB 4.0.1: system unstable, warnings from "class org.openhab.core.internal.items.ItemUpdater"

Scriptwriter · August 10, 2023, 8:00am

Hi all,
since upgrading to openHAB 4.0.1 at the end of July, openHAB becomes unreliable about every 2 or 3 days. The log shows about 6 messages like the first line I copied below, then about 6 like the second, and so on for several thousand messages. From then on openHAB no longer does what it should.

2023-08-10 03:09:24.440 [WARN ] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@6a092748' takes more than 5000ms.
2023-08-10 03:09:24.550 [WARN ] [ab.core.internal.events.EventHandler] - The queue for a subscriber of type 'class org.openhab.core.internal.items.ItemUpdater' exceeds 5000 elements. System may be unstable.

My workaround is then to restart openHAB.
Could this come from a rule, or is there a configuration setting I can change to make it more stable?

I have seen that another person reported the same issue, https://community.openhab.org/t/can-i-downgrade-from-4-to-3-4-x/148124, but I don't want to downgrade.

Best regards
Andreas

rlkoshak · August 10, 2023, 1:42pm

What is the memory and CPU situation when these messages appear?

Scriptwriter · August 11, 2023, 5:40am

Hi Rich,
yesterday at 19:00 it stopped again with the same error messages.
I'm guessing the time from the missing values in all the graphs:

And here is a screenshot of CPU and memory. The CPU spikes one hour later, at 20:00, come from the daily backup.

The graphs don't show anything special.
When this happens, openHAB writes heavily to openhab.log, and to the rotated files too. That makes it hard to see what came before, because the errors I quoted overwrite everything, including the rotated files.
I also have the dashboard that was provided for openHAB metrics.

From my point of view, only the JVM memory shows something that could give a hint, not the machine itself, because openHAB is running in a Docker container.

rlkoshak · August 11, 2023, 12:09pm

This is not a true statement. A service running in a Docker container is most definitely running on the machine where it is deployed. And if there is some sort of problem on that machine and its resources, it will impact the containers running on that machine. A container is not a VM.

What about the system load?

All one can surmise from the error is that something posted an update event to an Item and it took more than 5 seconds for the Item to receive and process that update, resulting in the update being thrown away.

Scriptwriter · August 11, 2023, 12:56pm

Yeah, you are right. I worded that a little bit strangely.
The 5-minute load is here:

Here I also have a screenshot of IO. My SSD manages about 500 MB/s (or less, depending on the kind of writes and whether the cache has rolled over).

If something were overloaded, I would expect to see a flat plateau somewhere in the graphs, showing that something is saturated.

I would also expect that, if something were overloaded, the situation would clear itself rather than writing the same error messages all night. That is why I thought that “something”, maybe a rule or a heavy Item change, triggers the situation. I hope you have an idea how I can get closer to the cause, so that I can build a workaround that stops the trigger.

That's all I have for now.

rlkoshak · August 11, 2023, 1:31pm

On Linux, the system load is a measure of how many processes/threads are stuck waiting around for access to a shared resource (e.g. file IO). Anything over 2 I consider to be unacceptably high. You appear to be averaging above 5 (the graph is really hard to read) and then it jumps to an average around 15. That’s a huge load and likely the problem.

For comparison, my openHAB machine’s five minute load average is 0.58 right now with 653 threads. A load of 20, in my case, would mean that around 3% of all my running processes are stuck waiting around for access to something (usually files). That’s a huge impact on system performance.
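To put numbers like these into context, here is a minimal Python sketch. It rests on a few assumptions: the threshold of 2 is only the rule of thumb stated above, the thread count has to be supplied by hand, and `load_report` is a hypothetical helper name. Note that the Linux load average counts runnable tasks as well as tasks in uninterruptible waits, so the "share of threads" figure is only a rough indication.

```python
import os


def load_report(load: float, cpu_count: int, thread_count: int) -> dict:
    """Relate a load average to the CPU count and the number of threads."""
    return {
        # Load divided by cores: values well above 1 mean tasks are queuing.
        "load_per_cpu": load / cpu_count,
        # Rough share of all threads that are running or waiting, in percent.
        "share_of_threads_pct": 100.0 * load / thread_count,
        # Rule of thumb from this thread, not a hard limit.
        "above_threshold": load > 2.0,
    }


if __name__ == "__main__":
    try:
        _one, five, _fifteen = os.getloadavg()  # 1-, 5-, 15-minute averages
    except OSError:
        print("load average is not available on this platform")
    else:
        cpus = os.cpu_count() or 1
        # 653 is the thread count quoted above; replace it with your own.
        print(load_report(five, cpus, thread_count=653))
```

With the figures quoted above, `load_report(20, 4, 653)` gives a thread share of roughly 3%, which matches the estimate in the post (the core count of 4 is an assumption for a Raspberry Pi 4).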

Based on that graph, you’ll have to figure out what’s causing that load. It may be OH but most likely it’s something else.

All that openHAB knows is it tried to send an event to an Item. That attempt took longer than 5 seconds which is longer than OH allows which results in a warning. You want these warnings rather than OH just throwing away these Item state updates silently. Every warning log message you see is an Item state update that was lost.
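To see when the warnings start and how many pile up, the log can be summarized per minute. This is a minimal sketch that assumes the log lines have exactly the layout of the two warnings quoted at the top of this thread; `count_eventhandler_warnings` is a hypothetical helper name, and the location of openhab.log depends on the installation.

```python
import re
from collections import Counter

# Matches lines such as:
# 2023-08-10 03:09:24.440 [WARN ] [ab.core.internal.events.EventHandler] - ...
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3} "
    r"\[WARN \] \[[^\]]*EventHandler\] - (.*)$"
)


def count_eventhandler_warnings(lines):
    """Count EventHandler slow-dispatch and queue warnings per minute."""
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        message = match.group(2)
        if "takes more than" in message or "exceeds" in message:
            per_minute[match.group(1)] += 1
    return per_minute
```

Passing an open file works, for example `count_eventhandler_warnings(open(path))` with `path` pointing at your own openhab.log. The earliest minute in the result is a reasonable starting point for looking at what else happened at that time.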

Scriptwriter · August 11, 2023, 2:30pm

The first thing I tried was to see the impact of shutting down openHAB (at 15:55):

The load drops to 1.8, so most of the load on the host seems to come from openHAB. I guess my load is not as low as yours because I have to use zram to extend my RAM.

rlkoshak · August 11, 2023, 3:00pm

I don’t think zram should affect load or, if it does affect load it’s going to make the load less rather than more because RAM is way faster than disk IO. But load isn’t specific to file systems. Load can be raised based on other things like network, serial devices, or really any resource that only one process/thread is allowed to access at a time.

I don’t have an RPi running OH right now but even a load of 1.8, particularly when not running OH, seems pretty high. What else do you have running on this machine? If it’s pure openHABian the load should be well under 1 with OH not running.

Scriptwriter · August 11, 2023, 3:10pm

It's Debian on an RPi 4 (8 GB) with a Docker install. No openHABian. The most-used containers running on Docker are openHAB, InfluxDB, an Nginx reverse proxy, Pi-hole, UniFi, Grafana with Mimir and so on, FreePBX, and Gitea.

openHAB reads the smart meter from a serial port; I guess it sends some meter codes to openHAB about every 30 seconds. KNX is connected via a network router, and the most frequent senders on KNX are current measurements (about 15, each sent to the bus on a 5% change), plus humidities and temperatures (about 10). The rest of the KNX states don't change very often.

I just found a DNS problem in the Docker daemon and fixed it, but it seemed to have no effect on the load.

rlkoshak · August 11, 2023, 6:00pm

There’s more!

Even on an RPi 4 that’s probably too much for one machine. Unless and until you can get that load under control I would not expect this problem to be fixed. There is nothing that can be done from the OH side of things.

My first suspect as a problem causer is freepbx which, if it’s anything like Asterisk could consume a number of resources at a high priority.

If you want OH to run properly, you need to move stuff off this machine or move OH to another machine.

Scriptwriter · August 11, 2023, 7:12pm

OK, thanks a lot for your help. I will try to get the load down or buy another RPi 4 (8 GB), hopefully it is available again somewhere.

alexkarageorgis · August 14, 2023, 9:49am

Hi Andreas, did you manage to figure it out? I have a similar problem: a clean install of openHAB 4.0 from the SD card image provided by openHAB,
updated to the 4.0.1 release, with my previous install restored.

Release = Raspbian GNU/Linux 11 (bullseye)
Kernel = Linux 6.1.21-v8+
Platform = Raspberry Pi 4 Model B Rev 1.4/8gb

after about a day it freezes (will check the memory question when it crashes next)

Scriptwriter · August 14, 2023, 11:18am

Hi @alexkarageorgis,
we have exactly the same kernel and release running, so I'm not alone with my problems.
We figured out that there seems to be a lot of load, which could be causing the problems.
What I have found in the meantime is this: https://community.openhab.org/t/openhab-4-0-1-and-network-pingdevice-on-docker-install-results-in-massive-threads/148658
I'm pinging my network devices to find out whether they are online. For each ping profile, openHAB ends up with about 20 arping threads, because it wants to ping the device from every bridge network that exists on the host. For now I have no way to make it ping only from the main network. You can see this if you open an interactive shell in the container and run “ps aux” a few times.
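Repeating “ps aux” by hand can be replaced by a small counter. The following Python sketch assumes the usual 11-column `ps aux` layout with the command in the last column; `count_processes` and `sample_counts` are hypothetical helper names, and Python has to be available wherever `ps` is run (which may not be the case inside the openHAB container).

```python
import subprocess
import time


def count_processes(ps_output: str, name: str = "arping") -> int:
    """Count processes in `ps aux` output whose executable is `name`."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the 11th field is the full command
        if len(fields) < 11:
            continue
        # First word of the command, without any leading directory.
        executable = fields[10].split()[0].rsplit("/", 1)[-1]
        if executable == name:
            count += 1
    return count


def sample_counts(samples: int = 5, interval_s: float = 2.0) -> list:
    """Run `ps aux` several times and return the arping count per sample."""
    counts = []
    for i in range(samples):
        result = subprocess.run(
            ["ps", "aux"], capture_output=True, text=True, check=True
        )
        counts.append(count_processes(result.stdout))
        if i < samples - 1:
            time.sleep(interval_s)
    return counts
```

Calling `sample_counts()` a few times while the ping profiles are active should show whether the number of arping processes comes anywhere near the roughly 20 per profile described above.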

There may be other factors too, but I have not found any more.

So please let me know what you find out.

alexkarageorgis · August 14, 2023, 11:25am

While I get the general idea of what you are saying, I am not using a Docker image. The only thing I can think of that causes network traffic is the IP camera binding. I will let it crash one more time (I am losing about 1% of memory every hour or so) and then disable it and see if that fixes things.

Thanks Andreas, I will keep an eye on your post.

A

Scriptwriter · August 14, 2023, 11:29am

I'm using the IP camera binding too (with 2 cams).

On my system, load seems to be more of a problem than memory. So we possibly have different problems, though maybe some are the same. Time will tell what we learn…

alexkarageorgis · August 14, 2023, 1:01pm

The IP camera binding always increased my system load a fair bit. In any case, I disabled it and will wait and see if the issue comes back.

Alex

Scriptwriter · August 15, 2023, 12:06pm

Today I got the load down:

The only thing I did was disable the roughly 10 ping-check profiles of the network binding. The load is now the same as when openHAB is not running, even though openHAB is still running.
I have opened another thread for the network binding: Openhab 4.0.1 and Network Pingdevice on Docker-Install results in massive threads

rlkoshak · August 15, 2023, 2:51pm

Now that the load is down, have the EventHandler warnings gone away?

Scriptwriter · August 15, 2023, 3:56pm

Hi Rich,
In the last few days before today's change (disabling the network binding), I also did not have such an “explosion” with thousands of these messages.
It could also be that some conversions from older ECMAScript rules to current JavaScript contained errors that caused them. Possibly. I think all my rules are now updated so that they don't raise errors, and I guess no rules should be running any more on the Nashorn engine, which I installed directly after the upgrade to keep most rules running.

Another person told me about the impact of the IP camera binding, which I'm also using. But disabling it gave me no hint of any impact.

That's why I logged into the container interactively, looked at the threads that were running, and thought about them.

The very large number of arping threads gave me the hint to disable the network binding. That lowered the load completely, as I showed in the graph. So what we have seen, and what someone apparently also found some years ago (I linked the other cases), is that pinging via the network binding is not useful if you run openHAB in a Docker container. It makes absolutely no sense that it starts ping threads from every network that exists in Docker when openHAB is configured to use only network XXX.
I guess a small change would be necessary to avoid the unnecessary pings.
For me, if I need it in future, this means writing a little shell script and running that instead of the network binding.
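Such a replacement could look roughly like the sketch below, written in Python rather than shell. It assumes the Linux `ping` options `-c` (count) and `-W` (timeout in seconds); the function names and any host addresses are hypothetical. It sends one plain ping over the default route only, so it is not a drop-in replacement for the binding's arping checks, and how the result gets back into openHAB is left open.

```python
import subprocess


def ping_command(host: str, timeout_s: int = 1) -> list:
    """Build a single-packet ping command (Linux iputils options assumed)."""
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def is_online(host: str, timeout_s: int = 1, runner=subprocess.run) -> bool:
    """Return True if one ping to `host` is answered within the timeout."""
    result = runner(
        ping_command(host, timeout_s),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0  # ping exits with 0 when a reply arrived


def check_hosts(hosts, runner=subprocess.run) -> dict:
    """Map each host to "ONLINE" or "OFFLINE" using one ping per host."""
    return {
        host: "ONLINE" if is_online(host, runner=runner) else "OFFLINE"
        for host in hosts
    }
```

Run from cron or a systemd timer on the host, for example with `check_hosts(["192.168.1.10"])` (a made-up address), this starts exactly one process per device and check, instead of one per Docker bridge network.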

So, in conclusion: the load can be caused by openHAB itself, even if someone says it isn't.

rlkoshak · August 15, 2023, 4:00pm

I believe you already filed an issue but if not, make sure you do so. The network binding should not do that.

But I also recommend using the best tool for the job. OH is not a very good IT system monitoring system. If you need to monitor a bunch of devices and services via ping, you might be better off deploying a system like Zabbix, ELK stack, Prometheus, etc.