OHv2: Machine running OHv2 hangs w/o warning after 4 to 6 weeks

[rPi3B, Stretch Lite, Zulu Java (zulu8.31.1.122-jdk1.8.0_181-linux_aarch32hf), mosquitto, samba,
OH2.4.0-1 (apt-get), text-based config,
binding = expire1,fritzboxtr0641,mqtt1,weather1,astro,exec,network,ntp,systeminfo,logreader
ui = paper,basic,classic
persistence = rrd4j,mapdb
action = mail,mqtt
transformation = map,javascript,xslt,scale,jsonpath]

I migrated from OHv1 to OHv2 on Jan 1st, 2019, by installing a new machine and copying the items, rules, etc. files to the new 2.4.0 install.
Since then OHv2.4.0-1 sh!t itself twice = stopped working entirely.

[peaks in the graph reflect the required reboots]

01/01/2019 Started OHv2
28/01/2019 stopped working
14/04/2019 stopped working

Both times, the system was running all by itself, meaning no console, SSH, update, or config activity of any sort. I cannot find any hints in syslog or the OH logs as to why OH died.

Is there a way I can monitor OH (other than through OH) to send a message somewhere to have the system rebooted?

Well, the real questions are (based on what I read on the forum, this seems to be an occurrence at a level not experienced on OHv1): is the root cause known? Will it be fixed?

I could go back to OHv1, because I still have the complete decommissioned system. Not what I want. But six weeks for an rPi running OH and MQTT is not a stellar uptime.

Any hints appreciated.

Have a look at this command in the karaf console, it will give some memory and garbage collection info which over time may show a trend…


It may be a binding causing it so first step would be to remove half the bindings and test, then if it still happens remove the other half until you narrow down the cause.

Perhaps read this post, it may be the same…

Here it is:

openhab> shell:info
  Karaf version               4.2.1
  Karaf home                  /usr/share/openhab2/runtime
  Karaf base                  /var/lib/openhab2
  OSGi Framework              org.eclipse.osgi-3.12.100.v20180210-1608

  Java Virtual Machine        OpenJDK Client VM version 25.181-b122
  Version                     1.8.0_181
  Vendor                      Azul Systems, Inc.
  Pid                         421
  Uptime                      21 days 3 hours
  Process CPU time            2 days 13 hours
  Process CPU load            0.03
  System CPU load             0.04
  Open file descriptors       229
  Max file descriptors        102,642
  Total compile time          9 hours 3 minutes
  Live threads                140
  Daemon threads              82
  Peak                        159
  Total started               429340
  Current heap size           135,983 kbytes
  Maximum heap size           243,520 kbytes
  Committed heap size         173,936 kbytes
  Pending objects             0
  Garbage collector           Name = 'Copy', Collections = 50243, Time = 35 minutes
  Garbage collector           Name = 'MarkSweepCompact', Collections = 241, Time = 4 minutes
  Current classes loaded      19,343
  Total classes loaded        222,309
  Total classes unloaded      202,966
Operating system
  Name                        Linux version 4.14.79-v7+
  Architecture                arm
  Processors                  4

Can these values / stats be collected automatically and piped to a file or similar?
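One way to collect comparable numbers from outside openHAB is to sample the Java process from /proc on a schedule. The following is a minimal sketch, not something posted in this thread: the function name sample_proc and the CSV layout are made up for illustration, and it assumes a Linux /proc filesystem. It records resident memory, thread count and open file descriptors of one process (counting another user's descriptors needs root), plus the system-wide MemAvailable figure.

```shell
#!/bin/sh
# sample_proc PID (hypothetical helper): print one CSV line with
#   UTC timestamp, pid, VmRSS in kB, thread count, open fds, MemAvailable in kB
# Returns 1 if the process does not exist or cannot be read.
sample_proc() {
    pid="$1"
    [ -r "/proc/$pid/status" ] || return 1

    # Resident set size and thread count come from /proc/PID/status.
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")

    # One entry per open file descriptor; tr strips any padding from wc.
    fds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')

    # System-wide estimate of memory available without swapping.
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

    printf '%s,%s,%s,%s,%s,%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$pid" "$rss" "$threads" "$fds" "$avail"
}
```

Called from cron every few minutes with the PID of the openHAB Java process (for example the output of pgrep -o java, assuming openHAB is the only Java process on the machine) and appended to a file, this gives a trend that is still readable after a hang. It does not replace the heap and GC figures from shell:info, which come from inside the JVM.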

Thanks for the other post… since my system is a production system, it would “disturb” a lot of things we rely on to be working… and removing any binding would fail the whole system, as rules (though in separate files for separate functionality) have a lot of interdependencies. E.g. one sensor might be an input for many other things (ambient light, battery state of charge, other things setting proxy switches).

To monitor your system over a long time you should use Telegraf as a monitoring tool, InfluxDB to store the data and Grafana to visualize all the stuff. Telegraf is based on plugins and there is a plugin called “procstat” which allows monitoring specific processes like Java or OH. Please do not run InfluxDB on your Pi because of the I/O load on your SD card (the I/O isn’t really heavy, but your SD card won’t survive it for a long time :laughing:).

The peak of memory usage in your graph is pretty interesting but not really critical for the system. Looking at my own monitoring data, I guess a memory usage of 60-70% is pretty usual for a Pi3.

Of course you don’t want to read something like this, but you should ask yourself what would be easier: to find the root cause for the memory peak which leads to the faulty OH, or to do a clean install of OH2?

I will update my post to indicate that these are the dates the system died, and that the graph reflects the free memory after a reboot.

As for the Pi, it has a 250GB SSD connected to it :slight_smile: so no SD card issues. (In fact, all my rPis have an SSD attached.)

Ah, your graph shows the free memory and not the memory in use. So the memory in use grows over several weeks. There are one or more processes which do evil stuff and lead OH2 to freeze. But it will be hard to find such processes because of the long time span. It’s a classical race condition. So you should follow @matt1’s advice: Run OH in a kind of “safe mode”, which means disable everything (rules, bindings, etc.) and enable them one by one while looking at the memory consumption.

Ok, but do you still use the internal SD for booting ?

If you did search the forum you would get to notice that there’s no general problem like that known.
So your problem is specific to your setup, and most likely it’s not OH but the underlying OS or HW.
Sorry but asking for help like that without giving details of the problem first is a little naive.

What do you mean by “OH sh!t itself”? Did the java process still run (hang around)? Was OH restarted? Is your system set up to restart it…?
Do you use openHABian? How are the java -Xmx and -Xms parameters set?

Raise the log level to debug to possibly get more log data on the next occurrence. Use something like
sudo strace -fp <java pid> to see if java is still doing anything.

And upgrade to 2.5M1 while you’re at it.

I see no evidence that this is related to memory. Heap sizes look great after 21 days up time and the main graph probably accounts for cache which Linux does not free back up unless more ram is needed. As said I see no evidence so looking at ram is most likely a waste of time.

Agree with Markus that the milestone build is worth it as a next step.

What you need to do is find something that either makes the issue worse or makes it go away.

no; boots off SSD :slight_smile:

OH sh!t itself = OH stopped working… in fact the whole machine failed; no SSH, ping replied (I have to write this down in the future, as I can’t remember exactly).

Set-up to restart it? No, wouldn’t know how to do that.

java -Xmx and -Xms parameters? = whatever the default is…
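To see which heap flags the running JVM actually got, its command line can be inspected. A minimal sketch; heap_flags is a hypothetical helper name, and it only recognizes flags of the form -Xms<size> and -Xmx<size>.

```shell
#!/bin/sh
# heap_flags (hypothetical helper): read a command line on stdin and print
# any -Xms / -Xmx flags, one per line. Prints a note if none are present.
heap_flags() {
    # Split the command line into one argument per line, then keep heap flags.
    tr -s ' ' '\n' | grep -E '^-Xm[sx][0-9]+[kKmMgG]?$' \
        || echo "no -Xms/-Xmx flag found (JVM defaults apply)"
}
```

Usage would be something like ps -o args= -p <java pid> | heap_flags. On apt installations the extra JVM options are, as far as I know, set through EXTRA_JAVA_OPTS in /etc/default/openhab2; treat that location as an assumption and check your own install before editing anything.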

“did search the forum you would get to notice that there’s no general problem like that known”
Did read up on it in January but had no time to dig deeper; a Google search returns over 100k hits on openhab stops working, which led me to believe there is a problem.

My java version is:

java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (Zulu (build 1.8.0_181-b122)
OpenJDK Client VM (Zulu (build 25.181-b122, mixed mode, Evaluation)

sudo strace -fp … no idea how to read the output… :frowning:

If only I knew how… I installed per instructions, nothing extra other than Raspbian Lite, Zulu Java, mosquitto, OH2.4 …

What am I expected to do when OH no longer works?

In your first post you wrote “I migrated from OH1 to OH2”. Did you do a completely fresh install or did you use the upgrade function from “openhabian-config” (if you use openHABian)?

… with a fresh install on a new machine, and copied the items, rules, etc.

Are you using openHABian?

Just looked up my installation notes…

This is what I did:

181110-1240	downloaded 2018-10-09-raspbian-stretch-lite

# add the openHAB 2 Bintray repository key to your package manager and allow Apt to use the HTTPS Protocol:
wget -qO - 'https://bintray.com/user/downloadSubjectPublicKey?username=openhab' | sudo apt-key add -
sudo apt-get install apt-transport-https

# Add the openHAB 2 Stable Repository to your systems apt sources list:
echo 'deb https://dl.bintray.com/openhab/apt-repo2 stable main' | sudo tee /etc/apt/sources.list.d/openhab2.list

# resynchronize the package index:
sudo apt-get update

# install openHAB with
sudo apt-get install openhab2

# When you choose to install an add-on, openHAB will download it from the internet on request.
# If you plan on disconnecting your machine from the internet, then you will want to also
# install the add-ons package.
# sudo apt-get install openhab2-addons

# Systems based on systemd (e.g. Debian 8, Ubuntu 15.x, Raspbian Jessie and newer):
sudo systemctl start openhab2.service
sudo systemctl status openhab2.service
sudo systemctl daemon-reload
sudo systemctl enable openhab2.service

# The first start may take up to 15 minutes, this is a good time to reward yourself
# with hot coffee or a freshly brewed tea!

# You should be able to reach the openHAB 2 Dashboard at http://openhab-device:8080 at this point.

If you are using a Pi, I highly recommend that you use openHABian and then try the migration of your configuration again.

My understanding is this was a valid installation option at the time.
I had installed openHABian before but it seemed too restrictive; I can’t remember what the issues were that I had with it, which made me decide against it.
I think it wasn’t the lite version, hence a lot of things I do not need; something with the username/password that had to be used (I can’t recall).

Just checked, this is what I used: https://www.openhab.org/docs/installation/linux.html

It is still a valid installation option, but you have to do the whole setup with its dependencies manually, which is of course error-prone. I don’t know if the manual installation and an accidental misconfiguration are the root cause of your problem. But I also struggled with the manual installation and then decided to give openHABian a chance. Since then I have never experienced problems; everything just works flawlessly. And you can use and configure it like a normal Raspbian. It is just a very clever script which saves a lot of manual effort and failures :wink:

If you’re posting for help, you at least need to provide a comprehensive and detailed description of WHAT does not work, and provide all the details that might be of relevance.

All we can know from your description (on 2nd attempt) is that your machine is broken (or “hangs”) at that point in time and that you deliberately chose a custom (non-openHABian, SSD) OS+HW setup.
A hanging machine does not have anything to do with openHAB in the first place.
To answer your question, if I was harsh I’d say I expect you to google for help outside of this forum …

Well, as said, it’s not OH that stopped working but your machine. And we’re in May now.

I am sorry, but - still willing to help - I don’t think anyone can with that little information - it’s digging in the dark.
I’d check system logs again for hints. And I’d connect a console to be prepared for logging into the system next time this happens to view/analyse the state it is in.
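Since the machine had to be rebooted each time, the last lines written to syslog before each boot are the ones worth reading. A minimal sketch, assuming rsyslog writes to /var/log/syslog (the Raspbian default as far as I know) and that the ARM kernel logs its usual "Booting Linux on physical CPU" banner at start-up; lines_before_boot is a hypothetical helper name.

```shell
#!/bin/sh
# lines_before_boot FILE [N] (hypothetical helper): print the N lines
# (default 20) logged just before each kernel boot banner in FILE,
# followed by the banner line itself.
lines_before_boot() {
    grep -B "${2:-20}" 'Booting Linux on physical CPU' "$1"
}
```

For example, lines_before_boot /var/log/syslog.1 30 would print the 30 lines preceding each boot in the rotated log; compressed rotations need to be unpacked first. If nothing unusual precedes the banner, that may point to a hard freeze rather than an orderly shutdown.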

That’s the thing and reason why I’d suggest moving to openHABian as well.
You can but don’t have to use the RPi image. You can also install it on top of your (custom) OS as that’s Debian derived, too.

Thank you all for your help.

My question in essence was: OH died twice this year; something I did not experience in OHv1. The logs do not give me any pointers… where should I look, or what should I monitor / install to get more info when it happens next time.
I have searched and read the forum; googled the issue as well.

What I am hearing is: use openHABian for better support…

Thanks… I may come back if it happens again or if I have more information.

What bindings do you run?

binding = expire1,fritzboxtr0641,mqtt1,weather1,astro,exec,network,ntp,systeminfo,logreader
ui = paper,basic,classic
persistence = rrd4j,mapdb
action = mail,mqtt
transformation = map,javascript,xslt,scale,jsonpath

Happy to update any bindings if required or recommended.