Large system - suddenly OpenHAB has become unreliable

kevin · August 9, 2022, 4:28am

My large OH system running on Debian has started to crash several times a week. To try and alleviate this I’ve updated OH, the OS and the JVM, which has, if anything, made it worse. The underlying operating system survives the crash, and the messages file says something along the lines of “the main thread has exited, restarting.” As there was a recent power cut I’ve booted up another copy of the OS and checked the disks, but this hasn’t fixed the problem. The last time it died was in the early hours of the morning, so user interaction can be ruled out.
What information is needed to help with debugging this? I have a suspicion I’ve hit some sort of resource problem as it seems to have worsened with the last couple of rules I’ve enabled. The crash information taken from syslog is:

Aug  9 03:41:36 MySystem karaf[17630]: # A fatal error has been detected by the Java Runtime Environment:
Aug  9 03:41:36 MySystem  karaf[17630]: #
Aug  9 03:41:36 MySystem  karaf[17630]: #  SIGSEGV (0xb) at pc=0x00007f0381cf34e3, pid=17630, tid=17762
Aug  9 03:41:36 MySystem  karaf[17630]: #
Aug  9 03:41:36 MySystem  karaf[17630]: # JRE version: OpenJDK Runtime Environment Zulu11.58+15-CA (11.0.16+8) (build 11.0.16+8-LTS)
Aug  9 03:41:36 MySystem  karaf[17630]: # Java VM: OpenJDK 64-Bit Server VM Zulu11.58+15-CA (11.0.16+8-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
Aug  9 03:41:36 MySystem  karaf[17630]: # Problematic frame:
Aug  9 03:41:36 MySystem  karaf[17630]: # V  [libjvm.so+0xa5a4e3]  LoaderConstraintTable::purge_loader_constraints()+0xc3
Aug  9 03:41:36 MySystem  karaf[17630]: #
Aug  9 03:41:36 MySystem  karaf[17630]: # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
Aug  9 03:41:36 MySystem karaf[17630]: #
Aug  9 03:41:36 MySystem karaf[17630]: # An error report file with more information is saved as:
Aug  9 03:41:36 MySystem  karaf[17630]: # /home/pi/openhab/var/lib/openhab/hs_err_pid17630.log
Aug  9 03:41:36 MySystem  karaf[17630]: #
Aug  9 03:41:36 MySystem  karaf[17630]: # If you would like to submit a bug report, please visit:
Aug  9 03:41:36 MySystem  karaf[17630]: #   http://www.azul.com/support/
Aug  9 03:41:36 MySystem  karaf[17630]: #
Aug  9 03:41:36 MySystem  systemd[1]: openhab.service: Main process exited, code=killed, status=6/ABRT
Aug  9 03:41:36 MySystem systemd[1]: openhab.service: Failed with result 'signal'.

System details:
X86 Intel Based Fanless PC with M2 SSD and 8GB RAM running:
Linux 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64
OH 3.3 release build, 177 things, 1320 items, 344 rules.
Problem was present on one of the recent milestones, unfortunately didn’t note it before I updated. JVM seems to be the latest Java 11 version.
Released Bindings loaded:
Astro, Automower, Daikin, Exec, HTTP, Hue, ipcamera, Modbus, MQTT, Network, OpenWeatherMap, Pushover, RFXCOM, TR-064, Wemo, Z-Wave
Marketplace bindings:
Samsung TV Beta
Manually loaded Bindings:
iRobot, LG Thinq

Saved file from the crash is attached.
I have only serial port settings in /etc/default/openhab and have added nothing to /usr/lib/systemd/system/openhab.service. The system has 8GB RAM and currently, after 12 1/2 hours uptime, top shows this:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
26643 openhab   20   0 6657672   2.0g  28952 S  28.4  26.3 178:24.51 java

hs_err_pid17630.log (238.2 KB)

Hopefully someone can point me in the right direction to fixing this.

Thanks in advance!

In the meantime I’ve removed the marketplace Samsung TV binding to see if anything changes, as it’s the last one I added.

JimT · August 9, 2022, 7:17am

Just giving some ideas: Have you tried running RAM / disk diagnostics? Have you also monitored the CPU temperature?

This is mine, running inside docker (under ubuntu) - I haven’t noticed any instabilities.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7133 openhab   20   0   25.0g   5.4g  12220 S   2.0   8.6 442:41.62 java

openhab 3.4-snapshot.

Lolodomo · August 9, 2022, 7:35am

You can try removing the Samsung TV binding and check if that helps.
And maybe also the 2 manually loaded bindings.

rlkoshak · August 9, 2022, 2:21pm

That error is a segfault. It’s unlikely to be something in your config or something in openHAB itself. Those would generate exception stack traces in openhab.log. This is the underlying JVM attempting to do something illegal (e.g. access memory at an invalid memory address) and the kernel killing it.

This usually points to a bad JDK install. I would try moving to a different version of Java, either go back to the version of Zulu that worked without problem and skip the current update, or switch to AdoptOpenJDK instead of Zulu and see if that helps.

If you still get a segfault with a different JDK then my theory is wrong and something else is going on.
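
For anyone who wants to pull the native crashes out of syslog rather than eyeballing them, here is a minimal Python sketch. It assumes the classic syslog line format shown in the first post and an English locale for the month names. The function names and the `year` argument are my own (hypothetical); classic syslog timestamps carry no year, so the caller has to supply one.

```python
import re
from datetime import datetime

# Signal line of the JVM fatal-error banner as it appears in syslog, e.g.
# "Aug  9 03:41:36 MySystem  karaf[17630]: #  SIGSEGV (0xb) at pc=..., pid=17630, tid=17762"
BANNER_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*#\s+(SIG[A-Z]+) \(0x[0-9a-f]+\) at pc=\S+ pid=(\d+)"
)


def native_crashes(lines, year):
    """Return (timestamp, signal, pid) for each JVM native crash banner.

    `year` is supplied by the caller because syslog timestamps omit it.
    """
    found = []
    for line in lines:
        m = BANNER_RE.search(line)
        if m:
            stamp = " ".join(m.group(1).split())  # collapse "Aug  9" to "Aug 9"
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            found.append((when, m.group(2), int(m.group(3))))
    return found


def gaps_between(crashes):
    """Time between consecutive crashes, in the order they were logged."""
    times = [when for when, _, _ in crashes]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of the syslog file (the path depends on your distribution) gives one entry per crash. A signal name such as SIGSEGV here, combined with nothing in openhab.log, is what separates a JVM-level crash from an ordinary Java exception.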

kevin · August 9, 2022, 7:23pm

Thanks Rich, I wondered why there was nothing in openhab.log. I’ll switch to another JVM and update this in a couple of days, or next time it crashes. Historically I’ve just auto rebooted it every couple of weeks and it’s been flawless.

kevin · August 10, 2022, 5:49am

I changed to AdoptOpenJDK, which seems to be called Temurin.

JRE version: OpenJDK Runtime Environment Temurin-11.0.16+8 (11.0.16+8) (build 11.0.16+8)

I then added back the Samsung TV Binding from the marketplace and it crashed after 2 hours and 11 minutes. This is the shortest time between crashes.

Reinstalled Zulu 11 and the thing won’t start.

So I’ve restored the disk image and I’m back on released version 3.2.0. Will update this thread the next time it crashes - which was after a couple of hours or so!

What’s the best way of testing memory on Linux?

Wolfgang_S · August 10, 2022, 8:56am

Doesn’t the BIOS of your x86 PC come with a built-in memory self-test that can be started via the BIOS?

kevin · August 10, 2022, 7:04pm

@Wolfgang_S It doesn’t have this, but Memtester, which managed to test 6GB of free memory out of 8GB total after a reboot, didn’t find any faults.

@rlkoshak After the recent restore back to late June it crashed 3 times overnight. Can you get any idea from these crash files whether it’s the same fault each time?
hs_err_pid8094.log (306.8 KB)
hs_err_pid12383.log (247.3 KB)
hs_err_pid14748.log (330.0 KB)
Perhaps I’ve hit some limit somewhere that I need to change?
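
One rough way to check whether several crash files show the same fault is to compare the line that follows "# Problematic frame:" in each report, as seen in the banner in the first post. The Python sketch below does that. The function names are hypothetical, and it assumes each hs_err file contains that banner line in the same form.

```python
from collections import defaultdict
from pathlib import Path


def problematic_frame(report_text):
    """Return the line after '# Problematic frame:' from an hs_err report, or None."""
    lines = report_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "# Problematic frame:" and i + 1 < len(lines):
            # Drop the leading '#' and surrounding whitespace.
            return lines[i + 1].lstrip("#").strip()
    return None


def group_by_frame(reports):
    """Map each problematic frame to the names of the reports that share it.

    `reports` is a dict of {name: report text}.
    """
    groups = defaultdict(list)
    for name, text in reports.items():
        groups[problematic_frame(text)].append(name)
    return dict(groups)


def load_reports(directory):
    """Read every hs_err_pid*.log found directly in `directory`."""
    return {
        p.name: p.read_text(errors="replace")
        for p in Path(directory).glob("hs_err_pid*.log")
    }
```

Calling `group_by_frame(load_reports(...))` on the directory holding the reports gives one group per distinct frame. The frame text includes library offsets, so reports from different JVM builds may land in different groups even when the failing function is the same.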

rlkoshak · August 10, 2022, 9:39pm

It has been decades since I last looked through one of those types of error logs. The SIGSEGV error code indicates a memory access violation.

Since in a Java program you don’t really have access to memory in a way that you can generate a SIGSEGV from your program, that indicates that it’s the underlying JRE that is doing the bad stuff with memory. However, you’ve run it with two separate implementations of Java, so either there is something wrong with the Java spec and both include the same bug, or there is something wrong with your hardware. @Wolfgang_S’s idea is a great one to look into. If you’ve a chunk of memory space that’s not performing properly, that could definitely be the issue.

An Intel PC with 8 GB of RAM is almost certainly going to have plenty of RAM for OH to work comfortably in, so unless you have lots and lots of other stuff running on this machine, and have disabled swap, lack of RAM is almost certainly not the problem. Running top or htop while the system is up should show whether you are near the limits of your RAM. But I think it’s a different error if you ask for more RAM and there is none left to give.
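
If you would rather log this than watch top, the same numbers can be read from /proc/meminfo on Linux. The following is only a sketch: the function names and the 10% warning threshold are arbitrary choices of mine, not anything openHAB or the kernel defines.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: number}.

    Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(meminfo, warn_below=0.10):
    """Return (available fraction, swap configured?, warning flag).

    `warn_below` is an arbitrary threshold chosen for illustration.
    """
    available = meminfo["MemAvailable"] / meminfo["MemTotal"]
    has_swap = meminfo.get("SwapTotal", 0) > 0
    return available, has_swap, available < warn_below
```

Usage would be something like `memory_pressure(parse_meminfo(open("/proc/meminfo").read()))`, run periodically while OH is up.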

Beyond that I can’t offer any more help. Given that it happens with old and new versions of the software and across multiple JREs, I suspect hardware is the problem.

JimT · August 10, 2022, 9:40pm

Have you tried reinstalling openHAB? Or extract openHAB to a separate dir and do a file-by-file comparison, e.g. using an rsync dry run or diff.
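
If rsync or diff is not to hand, a similar file-by-file comparison can be sketched in Python with the standard `filecmp` module. This is a substitute for the tools mentioned above, not the same thing, and the function name and result keys are hypothetical.

```python
import filecmp
from pathlib import Path


def differing_files(installed, reference):
    """List files that differ or exist on only one side, relative to the roots."""
    report = {"changed": [], "only_installed": [], "only_reference": []}

    def walk(cmp, prefix):
        # dircmp's own diff_files relies on a shallow (stat-based) comparison,
        # so re-check the common files byte by byte with shallow=False.
        _, mismatch, errors = filecmp.cmpfiles(
            cmp.left, cmp.right, cmp.common_files, shallow=False
        )
        report["changed"] += [str(prefix / n) for n in mismatch + errors]
        report["only_installed"] += [str(prefix / n) for n in cmp.left_only]
        report["only_reference"] += [str(prefix / n) for n in cmp.right_only]
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix / name)

    walk(filecmp.dircmp(installed, reference), Path("."))
    return report
```

Point `installed` at the live install and `reference` at the freshly extracted copy. Config and userdata will naturally differ, so the interesting entries are changed files under the runtime directories.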

kevin · August 10, 2022, 11:58pm

@rlkoshak thanks, I too suspect hardware, especially as the problem has got more frequent over time. I’ve just replaced the Power Supply and re-seated RAM and SSD. Just the reseat alone has cured temperamental PCs in the past, so who knows. I’ll see what the next few hours bring. I have an even older backup that doesn’t seem to have any error files on it, so if it crashes again I will see if I can roll back to that one. I’ll also try another memory test on all free memory with nothing running; maybe that will show the culprit if it’s RAM. Fortunately I’ve been busy on non-technical things for a while so there are limited changes.

Wolfgang_S · August 11, 2022, 6:00am

In case your BIOS does not support it, you may have a look at
https://www.system-rescue.org/
You can boot directly into running a memtest from bootable USB/CD/DVD if you haven’t done that yet.

kevin · August 11, 2022, 6:47pm

Thanks, I will download a copy and run the test the next time it crashes. It’s been up for 16 hours now, which is better, so it could well be a hardware problem that I may have fixed. Time will tell.
Thanks everyone!

jgreene · August 12, 2022, 6:04am

The same can happen if your JVM tries to overrun its configured max memory. Find your JVM/Java opts and increase the max heap (-Xmx4096m) [assuming my memory is correct], as it is usually something like 1024 … a large install with several mods may need that extra JVM allocated memory … remember Java is emulated everywhere … it is a Java virtual machine …

rlkoshak · August 12, 2022, 1:28pm

Usually that will throw an OutOfMemoryError and Java will kill the program itself, not a SIGSEGV from the kernel. But increasing the heap allocated to Java couldn’t hurt.
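
To see what heap limit is actually configured before changing it, you can look for the standard `-Xmx` option in whatever string holds your Java options. The sketch below only parses such a string. Where that string lives on a given install is an assumption you need to check, and the function name is hypothetical.

```python
import re

# Size suffixes accepted after the number; no suffix means bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def max_heap_bytes(java_opts):
    """Return the -Xmx value in bytes from a JVM option string, or None if unset.

    If -Xmx appears more than once, the last occurrence is returned.
    """
    matches = re.findall(r"(?:^|\s)-Xmx(\d+)([kKmMgGtT]?)(?=\s|$)", java_opts)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]
```

For example, `max_heap_bytes("-Xms400m -Xmx4096m")` gives the 4096 MB figure in bytes, and `None` means no explicit limit is set and the JVM picks its own default.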

kevin · August 16, 2022, 10:40pm

Since changing the Power Supply and reseating the Memory and SSD, OH and the JVM have not crashed for 6 days. It looks like this is what fixed it. For those following on, the original fault was the Java process restarting every few hours without the OS or anything else running on the PC failing.

I’ve also just replaced a 5V phone charger that suddenly struggled to power 1 of the 2 ESP8266s connected to it. Perhaps the recent power cut included a massive spike that passed through the surge protectors when the power was restored and caused these intermittent annoyances.