Lost everything except 2 commands in openHAB 3 times when bindings were failing, since the 2.5.0 update

marcelser · January 24, 2020, 12:12pm

Hi,

I’m beginning to question my current 2.5.0 installation of OpenHAB2. It has now happened 3 times that the logs were full of errors (mainly because we had power losses on the upper floor) and lots of connected devices were unreachable for some minutes, including the router for the whole network, so DHCP was down as well. The host where OpenHAB2 runs is in the basement and kept running throughout. After power was restored I noticed that there were still tons of binding errors in the logs (reconnecting to channels, etc.), the devices were not really responding, and the OpenHAB dashboard and PaperUI were completely down (not reachable).

So I decided to restart OpenHAB2, which took maybe 2-3 minutes, and in the end it started loading rules, panels, etc. But then I noticed that most things were still not working. I checked the contents of OpenHAB2 and, except for 2 things of type “COMMAND” created from scripts, all the other things were gone, as well as the items!

Luckily I take openhab-cli backups weekly, so I just restored the last backup, which restored OpenHAB2 functionality. And I thought, well, it could happen once.

2 weeks ago we again had a power loss on some devices (although not the same ones as last time, because of a faulty power plug). The same big problems again: OpenHAB2 full of binding and channel errors, not recovering, and panels/PaperUI down again. And guess what, after restarting, everything was gone again.

The last time was yesterday. I deliberately powered down most devices on the upper floor because I installed 2 x Shelly1 relays underneath some push buttons. This time the router was gone again too, and guess what? OpenHAB2 went berserk again and after the restart everything was gone.

Before the upgrade I was running 2.4.0 and never had a loss of the jsondb. So what changed in 2.5.0 that makes it seem so brittle, killing the jsondb when lots of errors occur? Or does it kill it while restarting? I want to make clear that in all 3 cases I didn’t kill anything manually. I just used “systemctl restart openhab2” and waited patiently until it restarted, which usually took quite long, but in the end it came back to life.

Although I can always recover from the backups, I’d like to know if this can be investigated and solved. What tells me that OpenHAB2 will not go berserk again at some point in the future without me noticing it?

I really don’t trust OpenHAB2 2.5.0 at the moment because of losing everything 3 times, especially as 2.4.0 was working fine for months. Also, with previous versions I never had the issue that OpenHAB2 loses everything.

Does anyone have ideas on how to track down the issue or how to debug it at least?

H102 · January 24, 2020, 3:35pm

If you’re running OH on a Raspberry Pi with an SD card and having power losses, you may have a corrupted SD card. Even if you run OH from USB or an SSD, power losses will eventually corrupt the system. It’s recommended to use a UPS to prevent sudden shutdowns so OH does not get corrupted.

rlkoshak · January 24, 2020, 4:40pm

Just to be clear, the OH machine isn’t losing power, it’s other devices on your network. Correct?

If this happens again, look in the files at /var/lib/openhab2/jsondb and see if the data is there at that time or see if they are already gone when OH is going crazy.

Is it just the stuff you’ve done in PaperUI that gets lost? Just Things and Items, or do you have to reinstall and reconfigure bindings? Is anything lost from any other folders?

Backups of the JSONDB files should be made automatically in /var/lib/openhab2/jsondb/backups. When OH goes crazy, do a diff between those backups and the “live” versions of the files. Does anything look troubling?
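The diff suggested above could be scripted roughly like this. It is only a sketch under assumptions: the function names `newest_backup` and `report` are made up for illustration, both directories are passed in rather than hard-coded, and a backup is matched to a live file on the assumption that the backup's file name ends with the live file's name.

```python
import difflib
from pathlib import Path


def newest_backup(live_file, backup_dir):
    """Return the most recently modified backup of live_file, or None.

    Assumption: a backup's file name ends with the live file's name.
    """
    candidates = [
        p for p in Path(backup_dir).iterdir()
        if p.is_file() and p.name.endswith(Path(live_file).name)
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def report(jsondb_dir, backup_dir):
    """Return one summary line per live *.json file in jsondb_dir."""
    lines = []
    for live in sorted(Path(jsondb_dir).glob("*.json")):
        size = live.stat().st_size
        backup = newest_backup(live, backup_dir)
        if backup is None:
            lines.append(f"{live.name}: {size} bytes, no backup found")
            continue
        diff = list(difflib.unified_diff(
            backup.read_text(encoding="utf-8", errors="replace").splitlines(),
            live.read_text(encoding="utf-8", errors="replace").splitlines(),
            fromfile=backup.name, tofile=live.name, lineterm=""))
        # The first two diff lines are the file headers; count only +/- lines after them.
        changed = sum(1 for d in diff[2:] if d.startswith(("+", "-")))
        flag = "  <-- EMPTY" if size == 0 else ""
        lines.append(
            f"{live.name}: {size} bytes, {changed} changed lines vs {backup.name}{flag}")
    return lines


# Example call (paths as mentioned in this thread; adjust to your install):
# print("\n".join(report("/var/lib/openhab2/jsondb",
#                        "/var/lib/openhab2/jsondb/backups")))
```

A live file that shows 0 bytes, or a very large number of changed lines against its newest backup, would be the first thing to look at.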

Like H102, my first thought is that there is something odd going on with the SD card, but we can’t rule out issues with OH 2.5 itself. This is the first report of anything like this, though, so I don’t really have anything corrective I can offer.

rossko57 · January 24, 2020, 5:12pm

I’m inclined to think the router restart causes enough changes at your target devices (e.g. DHCP) that your existing Things are unreachable.

Bruce_Osborne · January 24, 2020, 5:49pm

A possible solution would be DHCP reservations then.

Saracen · January 24, 2020, 7:25pm

Interesting. I’ve had virtually the same thing yesterday. I lost internet access yesterday morning and power cycled the router after a few hours when I got back from work. The internet came back (not sure whether it was external or perhaps a power cut for a few seconds).

Anyway, my openHAB has been behaving very strangely since then - MQTT not working, RFXCOM binding stopped working, Home Connect, Nest and Aqara Mii bindings not working too. A restart of openHAB fixed all this.

It’s a bit of a worry though.

I’m running on a RPi 3 with an SSD card (no SD card) with a USB power pack.

I suppose I ought to remove the router again and see whether there are any repeats of this behaviour or whether my USB power pack is not doing the job and there was indeed a power cut.

Andrew_Rowe · January 24, 2020, 7:36pm

My system has a backup battery, and the internet router is plugged into the battery backup as well. It seems like a reasonable expense to ensure a short power outage (typical here in south Florida, USA) doesn’t cause OpenHAB or the router to restart unexpectedly.
I’d also look at setting up DHCP reservations for wifi devices, as Bruce suggests.

Saracen · January 24, 2020, 7:47pm

All mine have static IP addresses set on the router

rossko57 · January 24, 2020, 7:55pm

So the devices rely on the router feeding them an address. What happens if a device boots up quicker than the router after a power outage? (I think that’s going to be terribly device dependent) What happens to devices that stay active while the router is out?

rlkoshak · January 24, 2020, 8:20pm

But that shouldn’t make the Things just disappear. They’d just be marked as offline forever.

My main concern is the disappearance of the configs. I can’t imagine any scenario where that would be expected to happen given the above events.

I can only guess that OH is trying to flush its current state to the JSONDB but something is preventing it from finishing, resulting in partially written or empty JSONDB files. Perhaps all the errors from the loss of the network are impacting this, but if so, OH should be able to handle this more gracefully.
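To illustrate the kind of failure being guessed at here: a common general technique for avoiding partially written or empty files is to write to a temporary file and only swap it into place once the write has completed. The sketch below shows that pattern in Python. It is not taken from openHAB's code and says nothing about how the JSONDB is actually written; `atomic_write_json` is a made-up name.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data as JSON to path without ever leaving a half-written file there."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # replaces the old file in a single step
    except BaseException:
        # The write failed part-way: drop the temp file, the old file is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, an error in the middle of a flush (for example running out of memory) leaves the previous version of the file in place instead of a truncated one.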

It has for me in the past when I’ve had to restart my pfSense box. All the networking is lost but I don’t get bombarded with errors, I don’t lose any of my configs, and everything connects back up nicely when networking is restored.

Saracen · January 24, 2020, 8:34pm

Up to now it has never been a problem - fairly recently we’ve had a lot of outages as they’ve been digging up the road to put in fibre.
Everything has come up fine - every time.

Maybe I should do some tests at the weekend.

rossko57 · January 24, 2020, 8:35pm

True enough. I tend to take a large pinch of salt with fault descriptions.

Info only, I have a few IP devices that boot up, look for DHCP … in case of none, default to a known IP. Because that’s handy for configuring on the workbench.
That’s all very well until you have two of the same model.
These get real fixed IPs from me.

marcelser · January 27, 2020, 6:13am

To answer some questions that came up in this thread:

I’m not running off an SD card but an eMMC (it’s not a Raspberry Pi but an Odroid XU4), which is highly unlikely to fail.
I didn’t have power losses at the Odroid XU4, only on the network it’s connected to.
Java processes on the XU4 have been pinned to specific CPU cores to ensure stable Java operation.
I already use DHCP reservations so that the devices get the same IP once they come back online.

So in case it happens again I will look at jsondb and jsondb/backups to check what’s going on and I will also check if just things & items are gone and bindings are still there. If I have enough time to play around I can forcefully pull the plug and see what happens.

I have now also upgraded to 2.5.1-2, maybe that fixes something but I don’t know

rossko57 · January 27, 2020, 11:45am

If the router is online first. If it isn’t online at that moment, behaviour is up to each device. Different things have different strategies to deal with missing DHCP.

rlkoshak · January 27, 2020, 3:38pm

That’s what Tesla thought too. UPDATE: Tesla Cars Have A Memory Problem That May Cost You A Lot To Repair

eMMC is still flash and it still can wear out. Loss of power can corrupt any flash based memory.

marcelser · March 7, 2020, 8:35pm

OK. It happened again today that I lost everything, and again there was a network outage, this time because I exchanged one of the switches. So it’s very unlikely that it has something to do with the eMMC. It’s just too closely tied to network outages: every time it occurred, it was only after the network was gone. This time, too, everything was 100% OK until the network was gone. I made quite a lot of screenshots now. Hopefully somebody has an idea what’s going on here.

So here’s what it looks like in PaperUI. All but 2 items of type (command) are gone. Interestingly, it’s always these 2 commands that survive when OH2 goes crazy. You can also see a lot of things now lurking in the inbox. And yes, the bindings are still installed. It seems to repeat whenever there’s a network outage. However, I absolutely cannot understand what is happening there, as the machine itself keeps running perfectly. No reboots, no nothing; it’s just OH2 that goes bye-bye. I’ve been working with Linux for many years and never encountered an issue like this, and I have no clue what to look at other than OH2.

And here are the contents of the jsondb and (partially) the backup directory, which look quite OK, but I think the jsondb is screwed up:

Any help would be greatly appreciated, because I really can’t think of how a network outage could have this effect.

marcelser · March 7, 2020, 8:58pm

Wow, the logs probably show what’s messed up.
I have 3 screenshots of the event logs. First I see over 23000(!) lines of the TCP binding saying it’s connected to some channel. Then Java errors out with out of memory. Then I see a lot of errors (see the 2nd screenshot for some of them), including POST errors, socket errors, memory errors, timeouts, etc. And then after some time (3rd screenshot) OH2 just reboots ~3 times and ends up in its current state. So here are the log excerpts.
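For anyone wanting to quantify this kind of log flood rather than scroll through screenshots, a simple keyword tally is one option. This is just a sketch: `count_keywords` is a made-up helper, the default keywords are guesses based on the errors described in this post, and the log path in the usage comment assumes a typical package install.

```python
from collections import Counter


def count_keywords(lines, keywords=("OutOfMemoryError", "SocketException",
                                    "timeout", "connected")):
    """Count how many lines contain each keyword (case-insensitive substring match)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Example usage (path is an assumption, adjust to your install):
# with open("/var/log/openhab2/openhab.log", errors="replace") as log:
#     for keyword, n in count_keywords(log).most_common():
#         print(f"{n:8d}  {keyword}")
```

Because this is a plain substring match, "connected" also counts lines that say "disconnected", so pick the keywords to suit the log messages actually seen.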

UPDATE:
On closer inspection it turns out that org.eclipse.smarthome.core.thing.Thing.json has 0 bytes and was modified today. Copying that file back from the backup dir actually restores OH2’s state. The question is still what goes so wrong in OH2 that it writes 0 bytes to the things file. It’s strange that it starts updating the things file at all; why should it, when there’s no change in the things? But I guess everything is connected somehow.
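The manual fix described here (copying the file back from the backup dir) could be scripted along these lines. This is a sketch with assumptions: `restore_empty_files` is a made-up name, both directories are passed in, backups are matched by assuming their file name ends with the live file's name, and openHAB should presumably be stopped before any file is actually replaced.

```python
import shutil
from pathlib import Path


def restore_empty_files(jsondb_dir, backup_dir, dry_run=True):
    """Find 0-byte *.json files and restore each from its newest non-empty backup.

    Returns a list of (live_name, backup_name) pairs. With dry_run=True
    (the default) nothing is copied; the pairs only show what would happen.
    """
    restored = []
    for live in sorted(Path(jsondb_dir).glob("*.json")):
        if live.stat().st_size > 0:
            continue  # file has content, leave it alone
        candidates = [
            p for p in Path(backup_dir).iterdir()
            if p.is_file() and p.name.endswith(live.name) and p.stat().st_size > 0
        ]
        if not candidates:
            continue  # nothing usable to restore from
        source = max(candidates, key=lambda p: p.stat().st_mtime)
        if not dry_run:
            shutil.copy2(source, live)
        restored.append((live.name, source.name))
    return restored
```

Running it first with the default `dry_run=True` shows which backup would be used for each empty file before anything is overwritten.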

rossko57 · March 7, 2020, 8:59pm

I am deeply suspicious of your DHCP reservation method when you are severing pathways to the router etc., because it still feels like your openHAB loses existing device connections and then discovers a “new” network (populated with similar devices over a “new” pathway).

At least give openHAB a fixed IP…

marcelser · March 7, 2020, 9:07pm

@rossko57, I appreciate that you’re trying to help, but look at the pics of the logs; it’s definitely nothing to do with DHCP and dynamic IPs.

And I probably chose a bad term by saying I use “DHCP reservations”. My router just assigns IPs by MAC address, so each IP is mapped to a MAC address. That’s about the most normal thing to do, every consumer router offers this feature, and that’s just what I do: I go into the router, look for the MAC address and assign a fixed IP. And what other network should OH2 discover if I pull the network plug? There’s nothing to discover anymore. It’s disconnected. I could give it a fixed IP and check if it happens again, but if that turned out to be the cause, then something would be really screwed up in OH2’s network stack, and I actually doubt that. I think the TCP binding just goes berserk and messes up everything.

H102 · March 8, 2020, 2:40pm

Are you using an Ethernet cord and have Wi-Fi also enabled on the Pi?