OHv2: Machine running OHv2 hangs w/o warning after 4 to 6 weeks

It depends on the symptoms when it stops. Is just OH unresponsive or is the whole machine unresponsive? That will determine if you can put the monitor on the same machine or not.

Do calls to the web server time out or get refused? Is there unusually high CPU usage? We need to find something we can check to determine when OH has become unresponsive. Once we have that, we can write a script triggered by cron that restarts OH, or reboots the machine if necessary, when it sees that OH is unresponsive.
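To make the cron idea concrete, here is a rough sketch of such a watchdog in Python. It assumes the symptom turns out to be "the web server stops answering", which we have not confirmed yet. The URL (port 8080, `/rest/` path), the `openhab2.service` name, and the ten second timeout are my assumptions, not something taken from your setup, so adjust them.

```python
#!/usr/bin/env python3
"""Watchdog sketch: probe openHAB's web server, restart the service if it is silent."""
import http.client
import subprocess
import sys
import urllib.error
import urllib.request

# Assumptions, not taken from this thread: adjust for your installation.
OH_URL = "http://localhost:8080/rest/"                      # assumed port and path
RESTART_CMD = ["systemctl", "restart", "openhab2.service"]  # assumed service name
TIMEOUT_S = 10


def is_responsive(url, timeout=TIMEOUT_S):
    """Return True if the server sends back any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status, so it is not hung.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timed out, or otherwise unreachable.
        return False


def check_and_restart(url=OH_URL, restart_cmd=RESTART_CMD, runner=subprocess.run):
    """Restart the service when the probe fails; return True if a restart was issued."""
    if is_responsive(url):
        return False
    runner(restart_cmd, check=False)
    return True


if __name__ == "__main__":
    if "--restart" in sys.argv[1:]:
        check_and_restart()
    else:
        # Default is a dry run that only reports the probe result.
        print("responsive" if is_responsive(OH_URL) else "unresponsive")
```

Run it by hand without arguments first to see what it reports. Once the check matches what you actually observe during a hang, a root crontab entry along the lines of `*/5 * * * * /usr/bin/python3 /usr/local/bin/oh_watchdog.py --restart` (hypothetical path) would run it every five minutes. Note that this only helps if the machine itself is still alive enough to run cron.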

Are the spikes in the graph from before or after restarting OH? It looks like it gradually uses more and more memory until it becomes unresponsive and you restart it. Correct?
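If you want a record that is easy to line up against the time of the next hang, something like the sketch below could log system memory use from cron. It reads `/proc/meminfo` and relies on the `MemAvailable` field, which older kernels do not provide. The output path is passed on the command line, and the script name is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: append one memory-usage sample per run to a CSV file."""
import csv
import sys
import time

MEMINFO_PATH = "/proc/meminfo"


def parse_meminfo(text):
    """Turn /proc/meminfo text into a dict of field name -> numeric value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_percent(fields):
    """Percent of RAM in use, treating reclaimable cache as free."""
    total = fields["MemTotal"]
    return 100.0 * (total - fields["MemAvailable"]) / total


def log_sample(meminfo_path, log_path):
    """Append a (unix_time, used_percent) row to the CSV at log_path."""
    with open(meminfo_path) as fh:
        fields = parse_meminfo(fh.read())
    with open(log_path, "a", newline="") as out:
        csv.writer(out).writerow([int(time.time()), round(used_percent(fields), 1)])


if __name__ == "__main__":
    if len(sys.argv) == 2:
        log_sample(MEMINFO_PATH, sys.argv[1])
    else:
        print("usage: mem_log.py OUTPUT.csv")
```

Write the CSV to persistent storage, otherwise the samples leading up to the hang are lost on reboot.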

IIRC you moved to OH 2 because your OH 1 instance had the same problem. This points to that pesky memory leak still existing. What OH1 bindings are you using? It’s gotta be one of those causing the problem. If we can identify it perhaps we can get this fixed for good. But, given your graph, you are nowhere near using up enough RAM to cause a system-wide failure, so I think the memory growth is probably a red herring.

It looks like peak memory consumption is 58%, so you should have room to increase the amount of RAM that Java allows itself to consume. This thread hopefully can help you get that working. If not it should give you the terms to search for. But this would only be a band-aid and would really only give you an extra week or two before OH crashes. And it would only work if it were just OH that was running out of memory. You are experiencing a near complete system failure.

This is critical information. The problem is much larger than just OH; the whole machine is freezing or at least becoming degraded.

Are you writing your syslog to persistent storage? If an error is being reported, it’s going to be logged there. Focusing solely on OH is not going to reveal the problem. This is a bigger, system-wide problem.
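Once the syslog survives a reboot, a quick way to skim it after the next hang is a small filter like this sketch. The pattern list is only my guess at the kind of kernel messages worth a look (out of memory, kernel panic, I/O errors, under-voltage warnings), not a complete or authoritative set, and the script name is hypothetical.

```python
#!/usr/bin/env python3
"""Sketch: list syslog lines that hint at a system-level failure."""
import re
import sys

# Illustrative patterns only; extend them for your own system.
SUSPECT = re.compile(
    r"out of memory|oom-killer|kernel panic|i/o error|under-voltage",
    re.IGNORECASE,
)


def suspicious_lines(lines):
    """Yield (line_number, text) for each line matching a suspect pattern."""
    for number, line in enumerate(lines, start=1):
        if SUSPECT.search(line):
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with open(sys.argv[1], errors="replace") as log:
            for number, text in suspicious_lines(log):
                print(f"{number}: {text}")
    else:
        print("usage: scan_syslog.py LOGFILE")
```

Assuming the log lives at `/var/log/syslog`, running `python3 scan_syslog.py /var/log/syslog` prints the matching lines with their line numbers, so you can open the file at that spot and read what happened just before the freeze.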

Are you running Grafana by chance?

The thing is if you install openHABian then we will have a basis of comparison. openHABian will give you the same configuration from the operating system on up that many hundreds or thousands of other OH users are using.

And there is nothing restrictive about using openHABian. It’s just a stock Raspbian Lite with a bunch of scripts to install and configure OH, Mosquitto, and some other third-party applications.

Because the problem is affecting the entire system, I don’t think the problem is caused by openHAB itself. It is probably caused by something else outside of openHAB, and we don’t know the full configuration of this machine or the changes you’ve made to it. If you use openHABian, we will know because it’s all scripted and standard. It’s also very well tested given the number of users.

I should note that if you want a more hands-on installation of openHABian, you can download a Raspbian Lite SD card image and then follow the manual installation instructions for openHABian. That too will give you a near standard configuration.

And once openHABian is installed, you can add in any other software or packages you may need. Though, with the information we have now, I’d guess that it’s one of those that might be causing the problem.

This is a good one. If the RPi is plugged into a screen then there might be something written to the screen (e.g. Kernel panic!) that might not end up in the logs.

The syslog and the other general Linux logs are the places to monitor. The problem isn’t that openHAB failed, it’s that your entire machine has failed or at least is degraded. That means the problem is outside of OH. Since you are not running openHABian, we do not and cannot have enough information about your configuration to guess at the cause.

No, it’s embedded. As is mapdb.