[smartmeter] Openhabian karaf[...]: * stack smashing detected *: terminated

jamie_townsend · November 18, 2023, 10:17pm

I recently decided to delve into the world of ZigBee. Rather than going to something a little old, I decided to upgrade my R-Pi3 to OH4 as a good start. After resolving the memory issues by going 32-bit (Memory usage with Openhabian OH4, Raspberry Pi 3 Model B), I almost got the ZigBee binding working, but not completely ([zigbee] Sonoff Zigbee 3.0 USB Dongle "ZBDongle-E" not working) and was working on implementing zigbee2mqtt when I noticed this really cool message in the daemon.log file:


 openhabian karaf[...]: *** stack smashing detected ***: terminated

… is a 3, 4 or 5, digit number. I’m assuming it’s probably the karaf process id. This happens 10-15 minutes or so.

Every time this message occurs, the java process restarts. The context around this message in the daemon.log looks like this:
Nov 18 11:49:21 openhabian karaf[685]: DEBUG: Sending {“request message”: {“device address”: “”}} Nov 18 11:49:22 openhabian karaf[685]: DEBUG: Received {“identification message”: {“manufacturer ID”: “LGZ”, “protocol mode”: “C”, “baud rate”: 4800, “meter ID”: “ZMF100AC.M29”, “enhanced ID/capability”: “”}} Nov 18 11:49:22 openhabian karaf[685]: DEBUG: Sending {“acknowledge message”: {“protocol control character”: “NORMAL”, “baud rate”: 4800, “acknowledge mode”: “DATA_READOUT”}} Nov 18 11:49:22 openhabian karaf[685]: DEBUG: Sleeping for : 250ms before changing the baud rate Nov 18 11:49:23 openhabian karaf[685]: DEBUG: Changing baud rate from 300 to 4800 Nov 18 11:49:24 openhabian karaf[685]: get_java_var: invalid file descriptor Nov 18 11:49:24 openhabian karaf[685]: *** stack smashing detected ***: terminated Nov 18 11:49:25 openhabian systemd[1]: openhab.service: Main process exited, code=killed, status=6/ABRT Nov 18 11:49:25 openhabian systemd[1]: openhab.service: Failed with result ‘signal’. Nov 18 11:49:25 openhabian systemd[1]: openhab.service: Consumed 8min 45.733s CPU time. Nov 18 11:49:30 openhabian systemd[1]: openhab.service: Scheduled restart job, restart counter is at 1. Nov 18 11:49:30 openhabian systemd[1]: Stopping Frontail openHAB instance, reachable at http://openhabian:9001… Nov 18 11:49:30 openhabian systemd[1]: frontail.service: Succeeded.

Thinking that this may be caused by the new Sonoff ZigBee USB device and/or binding fighting with the long running smartmeter binding, I both removed the Sonoff device and removed the ZigBee binding from OH, but the smashing continues.

I discovered a bunch of hs_err_…log files in the /srv/openhab-userdata directory, like this one:
hs_err_pid18052.log (144.0 KB)

The contents of this file points me towards org.openmuc:j62056, version 2.2.0 according to the smartmeter/pom.xml. Despite its name, this artefact (no longer?) appears to be open source as the website https://www.openmuc.org/ points to GitHub, but this project is not open.

I found this reference Crashes caused by libNRJavaSerial · Issue #136 · NeuronRobotics/nrjavaserial · GitHub to the error message “get_java_var invalid file description” and the (same?) issue was know about in 2020 and fixed in version 3.19.0, around the same time as the code was duplicated into the OH smartmeter binding by @Kai.

Not having access to any of the openmuc source, I can’t see what the fix was or if it may have been applied already. Any help would be most welcome

splatch · November 19, 2023, 4:21pm

I can confirm that issue remains in the way jrxtx/librxtx handles things. I observed the same issue with unofficial wmbus binding made by Kugu Home back in 2018. Restarting of binding (reopening of serial port) caused JVM to crash.
Recently I made an attempt to bind openmuc’s jrxtx with nrjavaserial native library to avoid that trouble. I haven’t tested it in a wild yet, but at least communication with serial port works fine.

Edit - following up nrjavaserial fix - applies to native part. So if smartmeter uses jrxtx within nrjavaserial it should be fine.

jamie_townsend · December 5, 2023, 10:31am

@splatch just wondering how you’re getting along with your nrjavaserial fix. Might it work in my case?

splatch · December 5, 2023, 10:45am

I think it could work, cause jrxtx is just a lower level part. It sits between binding and serial port with some native code to tinker with serial port itself.
Fixes were tested by @JensH in unofficial wm-bus binding. Jens, can you spin a test instance once more and try to restart wm-bus binding after it goes online. Just to see if it will crash openHAB as it happens for smartmeter binding.

JensH · December 5, 2023, 11:44am

So only try to restart binding? With active thing I believe, right?

Could try at the weekend hopefully.

splatch · December 5, 2023, 11:56am

Yes, restart should be sufficient. This operation always causes JVM crash with Kugu binding.

jamie_townsend · December 5, 2023, 3:22pm

Just to be clear about a couple of things:

the smartmeter binding runs happily on my RPi-3+ (32-bit OH) for anything between 5-45 minutes before it crashes. It’s usually (but not always) after 15-20 minutes.
when it crashes (“stack smashes”), the JVM dies and the whole of openhab is gone. It automatically restarts, but needs a little while to do so

Interestingly, while testing various things, I noticed that after setting the logging level for org.openhab.binding.smartmeter to debug (“openhab-cli console” → “log:set debug org.openhab.binding.smartmeter”), the problem disappears - at least during the last couple of hours.

I also noticed that the “stack smashing” always occurs 2 seconds after the message about changing the baud rate (which it does from 4800 to 300 and back to 4800 again every cycle = 10 seconds), so perhaps there’s a link there. I’ve just tried increasing the “Delay of Baudrate Change” from 250ms (default) to 1000ms and changing the logging level back to “info” to see if that helps, so I’ll let you know how that goes.

jamie_townsend · December 5, 2023, 3:56pm

Further interesting observation, with Delay of Baudrate Change set to 250ms & debug logging on org.openhab.binding.smartmeter, the following logs are created:

In the openhab-cli console log:tail, the following messages appear at intervals that could be the same as when stack smashing happens:

16:49:10.627 [WARN ] [l.iec62056.Iec62056_21SerialConnector] - 1. reading attempt failed: null. Retrying /dev/ttyUSB0...
16:49:10.632 [WARN ] [l.iec62056.Iec62056_21SerialConnector] - Delaying retry by PT10S

Around the same time in the /var/log/daemon.log

Dec  5 16:48:51 openhabian karaf[6608]: DEBUG: Sleeping for : 250ms before changing the baud rate
Dec  5 16:48:51 openhabian karaf[6608]: DEBUG: Changing baud rate from 300 to 4800
Dec  5 16:48:53 openhabian karaf[6608]: DEBUG: Received data message.
Dec  5 16:48:53 openhabian karaf[6608]: DEBUG: Changing baud rate from 4800 to 300
Dec  5 16:49:04 openhabian karaf[6608]: DEBUG: Sending {"request message": {"device address": ""}}
Dec  5 16:49:05 openhabian karaf[6608]: DEBUG: Received {"identification message": {"manufacturer ID": "LGZ", "protocol mode": "C", "baud rate": 4800, "meter ID": "ZMF100AC.M29", "enhanced ID/capability": ""}}
Dec  5 16:49:05 openhabian karaf[6608]: DEBUG: Sending {"acknowledge message": {"protocol control character": "NORMAL", "baud rate": 4800, "acknowledge mode": "DATA_READOUT"}}
Dec  5 16:49:05 openhabian karaf[6608]: DEBUG: Sleeping for : 250ms before changing the baud rate
Dec  5 16:49:05 openhabian karaf[6608]: DEBUG: Changing baud rate from 300 to 4800
Dec  5 16:49:30 openhabian karaf[6608]: DEBUG: Changing baud rate from 4800 to 300
Dec  5 16:49:30 openhabian karaf[6608]: DEBUG: Sending {"request message": {"device address": ""}}
Dec  5 16:49:31 openhabian karaf[6608]: DEBUG: Received {"identification message": {"manufacturer ID": "LGZ", "protocol mode": "C", "baud rate": 4800, "meter ID": "ZMF100AC.M29", "enhanced ID/capability": ""}}
Dec  5 16:49:31 openhabian karaf[6608]: DEBUG: Sending {"acknowledge message": {"protocol control character": "NORMAL", "baud rate": 4800, "acknowledge mode": "DATA_READOUT"}}
Dec  5 16:49:31 openhabian karaf[6608]: DEBUG: Sleeping for : 250ms before changing the baud rate
Dec  5 16:49:32 openhabian karaf[6608]: DEBUG: Changing baud rate from 300 to 4800
Dec  5 16:49:34 openhabian karaf[6608]: get_java_var: invalid file descriptor
Dec  5 16:49:46 openhabian karaf[6608]: DEBUG: Sending {"request message": {"device address": ""}}
Dec  5 16:49:47 openhabian karaf[6608]: DEBUG: Received {"identification message": {"manufacturer ID": "LGZ", "protocol mode": "C", "baud rate": 4800, "meter ID": "ZMF100AC.M29", "enhanced ID/capability": ""}}
Dec  5 16:49:47 openhabian karaf[6608]: DEBUG: Sending {"acknowledge message": {"protocol control character": "NORMAL", "baud rate": 4800, "acknowledge mode": "DATA_READOUT"}}
Dec  5 16:49:47 openhabian karaf[6608]: DEBUG: Sleeping for : 250ms before changing the baud rate
Dec  5 16:49:47 openhabian karaf[6608]: DEBUG: Changing baud rate from 300 to 4800
Dec  5 16:49:50 openhabian karaf[6608]: DEBUG: Received data message.
Dec  5 16:49:50 openhabian karaf[6608]: DEBUG: Changing baud rate from 4800 to 300
Dec  5 16:50:00 openhabian karaf[6608]: DEBUG: Sending {"request message": {"device address": ""}}
Dec  5 16:50:01 openhabian karaf[6608]: DEBUG: Received {"identification message": {"manufacturer ID": "LGZ", "protocol mode": "C", "baud rate": 4800, "meter ID": "ZMF100AC.M29", "enhanced ID/capability": ""}}
Dec  5 16:50:01 openhabian karaf[6608]: DEBUG: Sending {"acknowledge message": {"protocol control character": "NORMAL", "baud rate": 4800, "acknowledge mode": "DATA_READOUT"}}
Dec  5 16:50:01 openhabian karaf[6608]: DEBUG: Sleeping for : 250ms before changing the baud rate
Dec  5 16:50:01 openhabian karaf[6608]: DEBUG: Changing baud rate from 300 to 4800
Dec  5 16:50:04 openhabian karaf[6608]: DEBUG: Received data message.
Dec  5 16:50:04 openhabian karaf[6608]: DEBUG: Changing baud rate from 4800 to 300
Dec  5 16:50:14 openhabian karaf[6608]: DEBUG: Sending {"request message": {"device address": ""}}

There’s our get_java_var, but it doesn’t seem to stack smash when the debug logging is enabled.

JensH · December 8, 2023, 1:46pm

bundle:restart org.connectorio.addons.io.transport.serial.jrxtx

works without seen any issues the next 5 Minutes. So I believe it works without problems.

[smartmeter] Openhabian karaf[...]: *** stack smashing detected ***: terminated

[smartmeter] Openhabian karaf[...]: * stack smashing detected *: terminated