SSH timeouts on high CPU load / How to change the nice level of openHAB?

Hi,

Often, when openHAB is causing high CPU load, I get kicked out of my SSH session. Is there any way to increase the nice level of openHAB to reduce its process priority? I installed openHAB via apt and have since tried several things to change the nice level, but all my changes are ignored; it still runs with the default nice level (0).

Platform information:

  • Hardware: Raspberry Pi 3
  • OS: Raspbian Stretch
  • Java Runtime Environment: Java™ SE Runtime Environment (build 1.8.0_65-b17), Java HotSpot™ Client VM (build 25.65-b01, mixed mode)
  • openHAB version: 2.3

Don’t bother messing with that. I can’t offer a remedy right away, but I’m certain you’re making the false assumption that SSH (or rather Karaf as the SSH “server” part) kicks you out because of high load (I assume you’re talking about the Karaf console and not OS-level SSH logins). Karaf does not do that because of load. To the best of my knowledge there is no such “functionality” in Karaf, SSH clients, or SSH servers. High loads happen a lot with OH on RPis (there are a couple of reasons for that in turn, bugs being among them), so it is much more probable that your SSH timeouts (or failures) merely coincide in time with the load spikes. But that’s what this essentially is: coincidence, not causation.
So if you get kicked out, you have to accept that and/or look for the true reason why it happens.
It can depend on the client config (how do you access the console?).
I also regularly get kicked out of my console session. The default SSH session timeout is pretty short, and all the attempts to change it that I know of have failed so far (sshIdleTimeout is said to default to 30 minutes, and I have even disabled it, i.e. set it to 0). But there have also been a number of bugs with this in the past, and I don’t know if they’re all fixed by now.
Try enabling debug level logging for org.apache.karaf for a start, and play with the settings in /var/lib/openhab2/etc/org.apache.karaf.shell.cfg (sketch below). Maybe @splatch Lukasz, do you have another hint to share?
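
A minimal sketch of both suggestions (the property name is from Karaf’s shell config; the values are just examples):

# from the Karaf console, raise the Karaf log level:
openhab> log:set DEBUG org.apache.karaf

# in /var/lib/openhab2/etc/org.apache.karaf.shell.cfg, the idle timeout
# is given in milliseconds; 0 disables it (1800000 = 30 min is the default):
sshIdleTimeout = 0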

In fact I meant SSH at the OS level. Otherwise it would not make sense to reduce the process priority of openHAB. :wink:

So you log into the machine (not the Karaf console) and get kicked out by sshd after some time?
That definitely does not happen due to load; there is no such sshd or OS functionality. It’s probably another timeout thing, just that you have to look for the reason on your ssh client or sshd server side.
Better g**gle for a solution. I’d suggest running another sshd in standalone debug mode on some non-22 port (e.g. sshd -d -p 2222), grabbing the output, and then analysing what it says around the time you get kicked out.
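
For example (a sketch; port and user are arbitrary):

# on the Pi, run a second sshd instance in the foreground with debug output:
sudo /usr/sbin/sshd -d -p 2222

# from your client, connect to that instance with maximum verbosity:
ssh -vvv -p 2222 pi@galactica

Both sides will then log exactly what happens at the moment the connection dies.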
It still does not make sense to fight symptoms by trying to reprioritize things … if you don’t believe me, just set your sshd’s nice level to -20 (the highest scheduling priority short of realtime) using renice --priority -20 --pid <sshd-pid> and see that it won’t make any substantial difference.
Unless you’re on a Pi 1 you will have multiple cores available, and there will always be plenty of CPU power left for sshd, because under normal operating conditions openHAB never uses more than 50% CPU (= 100% of one core for rule recompilation plus at most another 100% of one core of regular load, out of four cores).

It’s quite common for ssh to time out when CPU usage is very high. At least it happens to me occasionally.

Try setting the nice level through the systemd service file:

/lib/systemd/system/openhab2.service

Under [Service] (note the capital S; section names are case-sensitive):

Nice=5
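
For the change to take effect, reload the unit definitions and restart (a sketch; note that edits under /lib can be overwritten by package upgrades, so a drop-in created with systemctl edit openhab2 is the more durable route):

sudo systemctl daemon-reload
sudo systemctl restart openhab2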

Not at all. A lot of hosting providers would be in a lot of trouble if that were true in general.
What makes you think there’s a causal relation and it’s not just coincidence?

Indeed it may be just a coincidence, and I’d agree that there’s likely a larger issue at play here.

But it can’t hurt to answer pfink’s question, at least to confirm or disprove his suspicions. I am interested to see if it has any effect :slight_smile:.

Well, that’s what my suggestion above should do as well (and it’ll even prioritize sshd over non-OH processes).
But this is unprofessional trial-and-error digging. Better to go about it systematically: the first thing would be to run sshd and ssh in verbose debugging mode to get a proper hint as to what’s causing the connection to be closed.
The first suspect would rather be a timeout. Check, and if applicable adjust, the parameters explained in the link below. There are also known misconfigurations that can cause DNS lookups to fail, host keys to be constantly regenerated, and more, but I’d start here:
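
The link isn’t reproduced here, but the usual OpenSSH keepalive knobs look like this (a sketch; the values are examples, not recommendations):

# client side, in ~/.ssh/config: send a keepalive probe every 60 s and
# give up only after 3 unanswered probes
Host *
    ServerAliveInterval 60
    ServerAliveCountMax 3

# server side, in /etc/ssh/sshd_config: the mirror-image settings
ClientAliveInterval 60
ClientAliveCountMax 3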

There’s no such functionality of course, and the connection is not closed by sshd. You just get kicked out by the client due to a timeout, because the server-side CPU is busy with openHAB instead of taking care of the other services like sshd. And it’s definitely no coincidence; it’s reproducible. I just need to restart openHAB and open PaperUI, and I get an SSH timeout with 80% probability, while otherwise I have no issues for hours. Btw, this applies to servers as well, of course. I work with > 100 servers at work and it’s normal to get timeouts when you have high CPU or memory overload. Decreasing just the nice level of sshd won’t work because there are many other processes involved (network, crypto, DNS, bash, …). I need a solution to increase the nice level of openHAB in the apt setup (no systemd, but an init.d script). I already tried to alter the init.d script by adding -N 10 to the start-stop-daemon call, but that did not work; it still runs with nice level 0.
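
For reference, -N is start-stop-daemon’s short form of --nicelevel, so the attempt itself looks right; this is how one can verify whether a change took effect:

# the NI column shows the effective nice level of the openHAB Java process:
ps -o pid,ni,args -C java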

… and I’ve spent years as a sysadmin.
Your statement is not right. It may happen on massively overloaded servers at times, but then because something else is badly wrong there (massive delays due to paging, network problems, a file server gone, …). Or it is in fact a timeout or some other cause unrelated to load that you believe, by coincidence, to be the cause, just like you still seem to do here.
But that definitely is not normal behavior, and it is never caused by CPU shortage.
Assuming there’s no CPU shortage on your server (and there will not be if you only run OH!), any properly set up SSH server process will get enough CPU share to respond to heartbeats, and the client will not kick you. There are quite a number of different timeouts and other parameters that can cause you to be kicked, but on a properly configured system this will NOT happen, so find out why it does on yours (and have a chat with your system administrator at work once you have :wink: )

Again: re-prioritizing is just fighting symptoms; it is not a proper solution.
Find the root cause instead; start with the hints I gave (run sshd in debug mode).
And tweak the SSH timeout values (see link); the defaults indeed are too short.

You’re right; normally the scheduler is clever enough that no single process can overwhelm all the others. I too have only seen this issue under extremely heavy and constant overload. But I thought that on the Raspberry (which has much weaker hardware than our servers) such issues might occur more often.

As always with such issues, they’re gone as soon as you try to debug them. Now I can’t reproduce it anymore at all :frowning: Anyhow, thanks for your support so far! I’ll run sshd in debug mode if the issue starts occurring again.

If I were to guess, it’s due to paging, or you’re already seeing the first symptoms of a worn-out SD card.
Check out this post.
Get another storage medium and move all the write-heavy stuff there as explained in the post (including swap!).

…or maybe power adapter? :thinking:

Nope, that’s more of a myth; it’s not a problem unless you have attached some really power-hungry peripherals (or if your power supply is far below spec, of course).
If that were the case you should be seeing messages in the kernel log complaining about under-voltage, which you can check as sketched below.
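
Two quick ways to check (a sketch; vcgencmd ships with Raspbian’s firmware tools):

# search the kernel log for under-voltage complaints:
dmesg | grep -i voltage

# or query the firmware's throttling flags directly
# (bit 0 = under-voltage right now, bit 16 = under-voltage since boot):
vcgencmd get_throttled

A result of throttled=0x0 means no under-voltage has been seen since boot.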

Ah! I’ve now searched the syslogs for the term voltage and found 16 messages from yesterday like these:

Oct 29 13:06:12 galactica kernel: [49335.602826] Under-voltage detected! (0x00050005)
Oct 29 13:06:18 galactica kernel: [49341.842758] Voltage normalised (0x00000000)

Today there are no such messages… Not sure why I didn’t see that earlier :confused:

Well, under-voltage usually won’t cause these types of problems; I have even seen those messages on my own system.
But of course it can’t hurt to invest in a proper power supply (the Raspberry Pi Foundation recommends 2.5 A to be on the safe side).

News:

  • First, I found out that the openHAB init.d script is not used at all. Instead, /usr/lib/systemd/system/openhab.service is used.
  • Anyhow, the issue is not related to CPU overload, and it’s not related to SSH. Instead, it seems to be the whole network that goes down.

Knowing that, I activated systemd-networkd debugging according to this guide and afterwards was able to catch two occurrences.
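
(The guide isn’t reproduced here; a common way to enable this, which may or may not match the guide exactly, is a drop-in that raises systemd-networkd’s log level:

# /etc/systemd/system/systemd-networkd.service.d/10-debug.conf
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

# then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart systemd-networkd
)

The first occurrence looked like this: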

Nov 27 01:49:02 galactica systemd[1]: Stopped openHAB 2 - empowering the smart home.
Nov 27 01:49:02 galactica systemd[1]: Started openHAB 2 - empowering the smart home.
Nov 27 01:49:03 galactica systemd-networkd[3403]: NDISC: Received Router Advertisement: flags OTHER preference medium lifetime 1800 sec
Nov 27 01:49:03 galactica systemd-networkd[3403]: NDISC: Invoking callback for 'r'.
Nov 27 01:49:03 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Started in Information request mode
Nov 27 01:49:03 galactica systemd-networkd[3403]: wlan0: Acquiring DHCPv6 lease on NDisc request
Nov 27 01:49:03 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Sent INFORMATION-REQUEST
Nov 27 01:49:03 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Next retransmission in 1s
Nov 27 01:49:03 galactica systemd-networkd[3403]: NDISC: Received Router Advertisement: flags OTHER preference medium lifetime 1800 sec
Nov 27 01:49:03 galactica systemd-networkd[3403]: NDISC: Invoking callback for 'r'.
Nov 27 01:49:03 galactica systemd-networkd[3403]: wlan0.milight1: Acquiring DHCPv6 lease on NDisc request
Nov 27 01:49:03 galactica systemd-networkd[3403]: wlan0: Updating address: 2a02:8106:1:700:ba27:ebff:fe20:d3c6/64 (valid for 1h 30min)
Nov 27 01:49:03 galactica systemd-networkd[3403]: wlan0.milight1: Updating address: 2a02:8106:1:700:211:bff:fe05:74e9/64 (valid for 1h 30min)
Nov 27 01:49:03 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Recv REPLY
Nov 27 01:49:34 galactica kernel: Under-voltage detected! (0x00050005)
Nov 27 01:49:40 galactica kernel: Voltage normalised (0x00000000)
Nov 27 01:50:43 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Sent INFORMATION-REQUEST
Nov 27 01:50:43 galactica systemd-networkd[3403]: DHCPv6 CLIENT: Next retransmission in 1min 49s

The second occurrence looked quite similar, but that time there was no kernel message about under-voltage, just the router advertisement messages (on the second occurrence, openHAB was NOT restarted). I could not verify yet whether the network is always down when the router advertisement messages appear. Anyhow, I noticed the issue twice within 10 minutes, and in both cases this log sequence started right before the network downtime.

So it’s definitely not openHAB-related. But if somebody has an idea, I’d be happy about any input :slight_smile:

I’m getting closer… By disabling IPv6 I could get rid of the log messages above. The problem still persists, but now it seems to have another cause: wlan0: WPA: Group rekeying completed with xx:xx:xx:xx:x:xx [GTK=CCMP]

I’ll try the steps from the link below and see if that helps…
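
A common workaround discussed for this symptom (an assumption; the linked steps may differ) is to disable wifi power management, a frequently reported cause of flaky wlan connections on the Pi 3:

# check the current state, then disable power saving until the next reboot:
iw dev wlan0 get power_save
sudo iw dev wlan0 set power_save off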

I don’t see any signs of a network outage in your log. The output is not comprehensive enough to tell for sure, but the “NDISC” lines indicate that an interface of your system possibly obtained a new IPv6 address, maybe because your internet address changed on your router. Maybe some OH component cannot handle that condition.

I’d suggest you disable IPv6 on your box if you don’t need it, to see if that helps
(there are various ways to do that, see https://www.raspberrypi.org/forums/viewtopic.php?t=138899 or https://www.leowkahman.com/2016/03/19/disable-ipv6-raspberry-raspbian/ )
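
One of the common methods from those pages (a sketch via sysctl; the linked pages describe this and other methods):

# persistently disable IPv6 on all interfaces:
echo 'net.ipv6.conf.all.disable_ipv6 = 1' | sudo tee /etc/sysctl.d/99-disable-ipv6.conf
sudo sysctl --system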

PS: one occurrence of under-voltage per boot seems to be normal on the Pi, so ignore that.