Strange partial failure - rules, persistence, logging outage

On this dark November morning we could not turn on our living room lights. It turned out that the OpenHAB system had partly gone down. I could still log in via WinSCP, and a few sitemaps were still available (including some item values).

But rules, persistence (InfluxDB) and logging stopped working last night around 1:00 am. I could not log in via PuTTY. In the end the only thing that helped was unplugging my Raspberry Pi 3 and plugging it in again (it runs a fresh, 8-week-old OpenHAB 2.2.0 Build #1037).

Since I cannot find anything in the logs, I really want to know what went wrong, so that my wife and kids don't end up standing in the dark again.

Thanks for your help!

we can only tell that it’s October from your post… not much more :stuck_out_tongue:

Thank you for your reply! It really feels like November today. :upside_down_face:

So if nobody has had a similar experience, I’ll have to hope that it just won’t occur again. Or does anyone have experience with partial system failures (and their causes)? I would really like to find out and get back to a reliable system.

without any logs… it’s extremely hard to identify what could have gone wrong…
are you sure that there is nothing in the logs that can help you identify the possible issues?

I’ve seen similar behavior with out-of-memory problems, but you should be able to see out-of-memory exceptions in the openHAB logs when the system comes back up.
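
In case it helps with the search, here is a minimal Python sketch that scans a log file for Java out-of-memory errors. The path `/var/log/openhab2/openhab.log` is an assumption (the usual location for a package-based openHAB 2 install) and the helper name `find_oom_lines` is made up for this example; adjust both to your setup.

```python
from pathlib import Path

# Assumed log location for a package-based openHAB 2 install; adjust as needed.
LOG_PATH = Path("/var/log/openhab2/openhab.log")


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines mentioning OutOfMemoryError."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if "OutOfMemoryError" in line
    ]


if __name__ == "__main__":
    if LOG_PATH.exists():
        # errors="replace" avoids crashing on odd bytes in the log
        with LOG_PATH.open(errors="replace") as handle:
            for number, text in find_oom_lines(handle):
                print(f"{number}: {text}")
    else:
        print(f"No log file at {LOG_PATH}")
```

If this prints nothing for the hours before the freeze, that does not rule out a memory problem: the process may have stalled before it could write the exception.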

@kisseler I think Rich is right. I have a similar setup and a similar issue. In my case it seems to be due to InfluxDB slowly taking more memory. Every few hours InfluxDB runs a compaction process that spikes memory and CPU usage. Usage should drop back to a lower level afterwards, but it looks like not all of the memory is released each time. The freeze typically happens during one of these compactions. I can still connect to frontail and see when the log stopped, but cannot do anything else (including logging in through SSH). So I don’t see a way to restart the Raspberry Pi gracefully.
To diagnose this, I started logging memory consumption, swap and CPU usage to InfluxDB using the systeminfo binding. That is what revealed the pattern above.
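
If you want a similar measurement without going through the systeminfo binding, a rough alternative is to read `/proc/meminfo` directly, for example from a cron job that appends to a file. This is only a sketch and not what I actually run; the function names `parse_meminfo` and `memory_summary` are made up for the example, and it assumes a Linux kernel recent enough to report `MemAvailable`.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(values):
    """Return used memory and used swap as percentages (0 to 100)."""
    mem_total = values["MemTotal"]
    mem_available = values["MemAvailable"]
    swap_total = values.get("SwapTotal", 0)
    swap_free = values.get("SwapFree", 0)
    mem_used_pct = 100.0 * (mem_total - mem_available) / mem_total
    # Guard against systems with no swap configured
    swap_used_pct = 100.0 * (swap_total - swap_free) / swap_total if swap_total else 0.0
    return {"mem_used_pct": mem_used_pct, "swap_used_pct": swap_used_pct}


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        summary = memory_summary(parse_meminfo(handle.read()))
    print(time.strftime("%Y-%m-%d %H:%M:%S"), summary)
```

Writing the samples to a plain file rather than to InfluxDB has the side benefit that the last entries survive even if the database is what is struggling.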
I tried setting up a watchdog to restart the Pi if it freezes, but have not been able to make this work so far. I only see one proper solution: get InfluxDB off the Pi and run it on other hardware. A second Pi just for InfluxDB might do the trick.
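
For reference, the general idea behind a Linux hardware watchdog is that something has to keep writing to `/dev/watchdog`; once the writes stop, the hardware resets the board. Below is a minimal, untested-on-hardware sketch of that idea in Python. The function name `feed_watchdog`, the health check and the 10 second interval are all assumptions for illustration, and whether `/dev/watchdog` is available depends on your image and kernel configuration. Be careful: with some drivers the timer keeps running after the device is closed, so stopping the script can reboot the Pi.

```python
import time


def feed_watchdog(device, is_healthy, interval_s=10.0, max_ticks=None, sleep=time.sleep):
    """Write keep-alives to a watchdog device while is_healthy() is true.

    Stops writing as soon as is_healthy() returns False, so that the
    hardware timer can expire and reset the board.
    Returns the number of keep-alive writes performed.
    """
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        if not is_healthy():
            break
        device.write(b"\0")  # any write counts as a keep-alive
        device.flush()
        ticks += 1
        sleep(interval_s)  # must be shorter than the watchdog timeout
    return ticks


# Hypothetical usage on the Pi (needs root; not executed here):
#
#   with open("/dev/watchdog", "wb", buffering=0) as dev:
#       feed_watchdog(dev, is_healthy=lambda: True)
#
# A real is_healthy() would check something meaningful, for example that
# available memory is above a threshold or that openHAB still responds.
```

The health check is the important part: as described above, parts of the system can stay alive during the freeze, so a watchdog that is fed unconditionally might never trigger.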
