3 times lost all but 2 commands in openHAB when bindings failing since 2.5.0 update

Bruce_Osborne · January 24, 2020, 5:49pm

A possible solution would be DHCP reservations then.

Saracen · January 24, 2020, 7:25pm

Interesting. I had virtually the same thing happen yesterday. I lost internet access in the morning and power cycled the router a few hours later when I got back from work. The internet came back (I’m not sure whether the cause was external or perhaps a power cut for a few seconds).

Anyway, my openHAB has been behaving very strangely since then - MQTT not working, RFXCOM binding stopped working, Home Connect, Nest and Aqara Mii bindings not working too. A restart of openHAB fixed all this.

It’s a bit of a worry though.

I’m running on a RPi 3 with an SSD (no SD card) and a USB power pack.

I suppose I ought to remove the router again and see whether there are any repeats of this behaviour or whether my USB power pack is not doing the job and there was indeed a power cut.

Andrew_Rowe · January 24, 2020, 7:36pm

My system has a backup battery and the internet router is plugged into the battery backup as well. It seems like a reasonable expense to ensure a short power outage (typical here in south Florida, USA) doesn’t cause openHAB or the router to restart unexpectedly.
I’d also look at setting up DHCP reservations for wifi devices as Bruce suggests.

Saracen · January 24, 2020, 7:47pm

All mine have static IP addresses set on the router

rossko57 · January 24, 2020, 7:55pm

So the devices rely on the router feeding them an address. What happens if a device boots up quicker than the router after a power outage? (I think that’s going to be terribly device dependent) What happens to devices that stay active while the router is out?

rlkoshak · January 24, 2020, 8:20pm

But that shouldn’t make the Things just disappear. They’d just be marked as offline forever.

My main concern is the disappearance of the configs. I can’t imagine any scenario where that would be expected to happen given the above events.

I can only guess that OH is trying to start a flush of its current state to the JSONDB but something is preventing it from finishing, resulting in partially written or empty JSONDB files. Perhaps all the errors from the loss of the network are impacting this, but if so OH should be able to handle this more gracefully.

It has for me in the past when I’ve had to restart my pfSense box. All the networking is lost but I don’t get bombarded with errors, I don’t lose any of my configs, and everything connects back up nicely when networking is restored.

Saracen · January 24, 2020, 8:34pm

Up to now it has never been a problem - fairly recently we’ve had a lot of outages as they’ve been digging up the road to put in fibre.
Everything has come up fine - every time.

Maybe I should do some tests at the weekend.

rossko57 · January 24, 2020, 8:35pm

True enough. I tend to take a large pinch of salt with fault descriptions.

Info only, I have a few IP devices that boot up, look for DHCP … in case of none, default to a known IP. Because that’s handy for configuring on the workbench.
That’s all very well until you have two of the same model.
These get real fixed IPs from me.

marcelser · January 27, 2020, 6:13am

To answer some questions that come up in this thread:

I’m not running off an SD card but an eMMC (it’s not a Raspberry Pi but an Odroid XU4), which is highly unlikely to fail.
I didn’t have power losses at the Odroid XU4, only on the network it’s connected to.
Java processes on the XU4 have been pinned to specific CPU cores to ensure stable Java operation.
I already use DHCP reservations so that the devices get the same IP once they come back online.

So in case it happens again I will look at jsondb and jsondb/backups to check what’s going on and I will also check if just things & items are gone and bindings are still there. If I have enough time to play around I can forcefully pull the plug and see what happens.

I have now also upgraded to 2.5.1-2, maybe that fixes something but I don’t know

rossko57 · January 27, 2020, 11:45am

If the router is online first. If it isn’t online at that moment, behaviour is up to each device. Different things have different strategies to deal with missing DHCP.

rlkoshak · January 27, 2020, 3:38pm

That’s what Tesla thought too. UPDATE: Tesla Cars Have A Memory Problem That May Cost You A Lot To Repair

eMMC is still flash and it can still wear out. Loss of power can corrupt any flash-based memory.

marcelser · March 7, 2020, 8:35pm

Ok. It happened again today: I lost everything, and again there was a network outage, this time because I exchanged one of the switches. So it’s very unlikely that it has something to do with the eMMC. It’s just too closely tied to network outages: every time it occurred, it only happened after the network was gone. Also, this time everything was 100% ok until the network was gone. I made quite a lot of screenshots now. Hopefully somebody has an idea what’s going on here.

So here’s what it looks like in Paper UI. All but 2 items of type (command) are gone. Interestingly it’s always these 2 commands that survive when OH2 goes crazy. You can also see a lot of things now lurking in the inbox. And yes, the bindings are still installed. It seems to repeat whenever there’s a network outage. However, I absolutely cannot understand what is happening there, as the machine itself keeps running perfectly. No reboots, nothing; it’s just OH2 that goes bye bye. I’ve been working with Linux for many years and I have never encountered an issue like this, and I have no clue what to look at other than OH2.

And here are the contents of the jsondb & backup (partially) directories, which look quite ok, but I think jsondb is screwed up:

Any help would be greatly appreciated because I really can’t think of how a network outage could have this effect.

marcelser · March 7, 2020, 8:58pm

Wow, the logs probably show what’s messed up.
I have 3 screenshots of event logs. First I see over 23000 (!) lines of the TCP binding saying it’s connected to some channel. Then Java errors out with out of memory. Then I see a lot of errors (see the 2nd screenshot for some of them), including post errors, socket errors, memory errors, timeouts etc. And then after some time (3rd screenshot) OH2 just restarts about 3 times and ends up in its current state. So here are the log excerpts.

UPDATE:
When taking a closer look, it turns out that the things file org.eclipse.smarthome.core.thing.Thing.json has 0 bytes and was modified today. Copying that file back from the backup dir actually restores OH2’s state. The question is still what’s going so wrong in OH2 that it writes 0 bytes to the things file. It’s also strange that it starts updating the things file at all: why should it, when there’s no change in things? But I guess everything is connected somehow.
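The manual check and restore described above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the backup subdirectory name (`backup` here) and the rule that a backup's file name ends with the original file name are guesses about the directory layout and should be checked against your own install, and openHAB should probably be stopped before anything is copied back.

```python
import shutil
from pathlib import Path


def find_empty_json(jsondb_dir):
    """Return the zero-byte *.json files directly inside jsondb_dir."""
    return sorted(p for p in Path(jsondb_dir).glob("*.json") if p.stat().st_size == 0)


def newest_backup(backup_dir, name):
    """Return the most recently modified non-empty file whose name ends with `name`."""
    # Assumption: backups keep the original file name as a suffix.
    candidates = [
        p for p in Path(backup_dir).iterdir()
        if p.is_file() and p.name.endswith(name) and p.stat().st_size > 0
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def restore_empty_files(jsondb_dir, backup_subdir="backup", dry_run=True):
    """Copy the newest usable backup over each empty JSONDB file.

    Returns a list of (target, source) pairs. With dry_run=True nothing is
    written, so the plan can be reviewed first.
    """
    jsondb = Path(jsondb_dir)
    restored = []
    for target in find_empty_json(jsondb):
        source = newest_backup(jsondb / backup_subdir, target.name)
        if source is None:
            continue  # no usable backup for this file, leave it alone
        if not dry_run:
            shutil.copy2(source, target)
        restored.append((target, source))
    return restored
```

Calling `restore_empty_files(path_to_jsondb)` with the default `dry_run=True` only reports which backup would be used for which empty file; passing `dry_run=False` performs the copy.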

rossko57 · March 7, 2020, 8:59pm

I am deeply suspicious of your DHCP reservation method, when you are severing pathways to the router etc. It still feels like your openHAB loses existing device connections and then discovers a “new” network (populated with similar devices over a “new” pathway).

At least give openHAB a fixed IP…

marcelser · March 7, 2020, 9:07pm

@rossko57, I appreciate that you’re trying to help but look at the pics of the logs, it’s definitely nothing to do with DHCP and dynamic IP.

And I have probably chosen a bad term by saying I use “DHCP reservations”. My router just assigns IPs by MAC address, so each IP is mapped to a MAC address. That’s about the most normal thing to do and every consumer router offers this feature. I go into the router, look for the MAC address and assign a fixed IP. And what other network should OH2 discover if I pull the network plug? There’s nothing to discover anymore. It’s disconnected. I could give it a fixed IP and check if it happens again, but if that’s the case then something would be really screwed up in OH2’s network stack and I actually doubt that. I think the TCP binding just goes berserk and messes up everything.

H102 · March 8, 2020, 2:40pm

Are you using an Ethernet cord and have Wi-Fi also enabled on the Pi?

rlkoshak · March 8, 2020, 6:16pm

I’ve seen this myself this week and it has nothing to do with the network.

What appears to be happening is that OH runs out of memory right when it does one of its periodic writes to the JSONDB. When it runs out of memory it can’t do anything, so it fails to write and OH crashes.
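To illustrate why an interrupted write can leave a 0-byte file, and how a writer can guard against it in general, here is a minimal sketch of the common "write to a temporary file, then rename" pattern. It is not openHAB's actual storage code; the function name `atomic_write_json` is hypothetical and the example only shows the general technique.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Serialize `data` to `path` without ever truncating the existing file.

    The new content goes to a temporary file in the same directory first.
    Only after it is fully written and flushed is it renamed over the
    target, so a failure mid-write leaves the old file untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # replaces the target in one step
    except BaseException:
        # Clean up the partial temporary file and re-raise the error.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

By contrast, opening the target directly for writing truncates it to 0 bytes immediately, so a failure between the open and the write would leave exactly the kind of empty file described in this thread.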

I have no idea why OH is suddenly running out of memory but this is new since OH 2.5.2. Since core didn’t change it has to be one of the bindings or add-ons. I’ve not had time to investigate which it might be.

Please don’t post screenshots of logs. That’s all but useless to us. Please post the text and use code fences.

rossko57 · March 8, 2020, 7:37pm

To be fair, a network disaster (or recovery from) could result in lots of activity, contributing to OOM

Observation from reports; there is something new in 2.5 involving unexpected Thing updates. Not status updates, property updates - which should trigger JSONDB writes of course.

marcelser · March 8, 2020, 8:00pm

I think it runs out of memory because the TCP binding goes crazy.

As I already wrote above, I have >23000 lines of this log entry: “The channel for /192.168.10.43:4998 is connected” (see the yellow line number in front of the logfile; it starts with this message right away on line 1), and this happens within a few minutes until it finally crashes. So there were probably even more messages (as they start on line 1) but the log got rotated.

So there seems to be a bad loop when the network goes down and then recovers, which sends the TCP binding into an infinite loop that slowly eats up memory until OH finally runs out of it. I don’t know whether losing the network or recovering from it triggers the loop, but I can also upload the log file, which I have backed up, in case anyone is interested in it or if it helps fixing stuff.
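A quick way to confirm how often a message like the one above repeats in a text log, rather than reading line numbers off a screenshot, is to tally the message part of each line. The sketch below assumes each log line has the form "prefix - message" with " - " as the separator; that format and the function name `count_messages` are assumptions for illustration, so adjust the separator to whatever your log actually uses.

```python
from collections import Counter


def count_messages(lines, separator=" - "):
    """Count how often each message text occurs in an iterable of log lines.

    Everything up to the first `separator` (timestamp, level, logger) is
    dropped, so identical messages logged at different times are grouped.
    Lines without the separator, such as stack trace lines, are skipped.
    """
    counts = Counter()
    for line in lines:
        _, found, message = line.rstrip("\n").partition(separator)
        if found:
            counts[message] += 1
    return counts
```

Passing an open log file to `count_messages` and then calling `.most_common(5)` on the result lists the five most frequent messages with their counts, which can be pasted into a post as text.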

rlkoshak · March 8, 2020, 10:05pm

Or the TCP problems are just the first symptom of some other problem. I don’t use the TCP binding yet I’m having an out of memory error and occasionally lose my Things as a result.

As I said above, screenshots of logs are useless. I’m on my phone, I can’t read them. And even if I weren’t, I’m not going to go through the effort to try and read them.