OH3 - Network gone - no way to recover

mstormi · May 26, 2021, 8:33pm

There’s no plan or special activity dedicated to “un-beta” the 64-bit version, but why would you want to use it anyway? 64-bit has nothing but drawbacks. AFAIK the 32-bit version should work on an RPi4/8G, too.

NCO · May 27, 2021, 5:20am

alright, thanks!

NCO · June 9, 2021, 5:28pm

Normally:
The red LED is solid ON.
The green LED is blinking 4 times quickly

According to RPi forums the 4 flashes mean:
4 flashes: start.elf cannot launch, so it’s probably corrupted. Alternatively, the card is not correctly inserted, or card slot is not working.

However, usually with this the RPi works normally (I currently don’t have any SD-card as I am booting from SSD).

After a while OH still seems to work (I get Telegram notifications) and Alexa can still control my zwave lights.
But neither MainUI nor Habpanel are available.
ssh does not work either.

This time I have a display and KB / MS connected, but the screen is black. Even Mouse or Keyboard can’t wake it up.
The green LED is solid ON now.

I don’t see any other option than to do a hard reset again.

After that I will check the logs and report back. But it’s really frustrating.

rlkoshak · June 9, 2021, 5:32pm

That indicates that it is in fact on the network. It can’t send a Telegram message if it can’t connect to the network. So it’s having problems accepting connections.

Are you certain that its IP address hasn’t changed? Was the monitor plugged in from boot time? I’ve had cases where plugging in a monitor after boot doesn’t work and the monitor never receives a signal.

NCO · June 9, 2021, 5:45pm

Yes, it must be on the network. And it is - according to my router - still on the same IP address which I have fixed for the raspberry pi.

Unfortunately I only connected the display after it locked up again. So this might be the same effect you have seen.
I will try to login blindly and shut it down.

Thank you, Rich.

NCO · June 9, 2021, 7:08pm

the syslog shows:

Jun  9 18:48:08 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:08 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 78007044-c942-11eb-a118-000000000000 6134
Jun  9 18:48:08 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:08 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 78115171-c942-11eb-a119-000000000000 5913
Jun  9 18:48:08 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:08 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 7822a188-c942-11eb-a11a-000000000000 5059
Jun  9 18:48:09 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:09 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 78fb0e54-c942-11eb-a11b-000000000000 5796
Jun  9 18:48:20 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:20 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 7f02e35c-c942-11eb-a11c-000000000000 9785
Jun  9 18:48:38 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:38 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 89f671a2-c942-11eb-a11d-000000000000 5834
Jun  9 18:48:41 homer influxd[709]: ts=2021-06-09T16:48:40.998671Z lvl=info msg="Executing query" log_id=0UabpdN0000 service=query query="SELECT value FROM openhab_db.autogen.Presence WHERE time >= '2021-06-09T16:38:40.994947Z' ORDER BY time ASC"
Jun  9 18:48:41 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:40 +0200] "GET /query?db=openhab_db&epoch=ms&q=SELECT+value+FROM+autogen.Presence+WHERE+time+%3E%3D+%272021-06-09T16%3A38%3A40.994947Z%27+ORDER+BY+time+ASC%3B HTTP/1.1" 200 123 "-"$
Jun  9 18:48:41 homer influxd[709]: ts=2021-06-09T16:48:41.006620Z lvl=info msg="Executing query" log_id=0UabpdN0000 service=query query="SELECT value FROM openhab_db.autogen.Presence WHERE time <= '2021-06-09T16:38:40.994947Z' ORDER BY time DESC LIMIT 1"
Jun  9 18:48:41 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:41 +0200] "GET /query?db=openhab_db&epoch=ms&q=SELECT+value+FROM+autogen.Presence+WHERE+time+%3C%3D+%272021-06-09T16%3A38%3A40.994947Z%27+ORDER+BY+time+DESC+LIMIT+1%3B HTTP/1.1" 20$
Jun  9 18:48:48 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:48 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 8fefb536-c942-11eb-a120-000000000000 6183
Jun  9 18:48:54 homer python3[5214]: [2021-06-09 18:48:54] Failed to retrieve data from Mi Flora sensor "FlowerCare" (80:EA:CA:89:66:A7), success rate: 79%
Jun  9 18:48:55 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:48:55 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 9466d00c-c942-11eb-a121-000000000000 22726
Jun  9 18:49:00 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:49:00 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 971a92b6-c942-11eb-a122-000000000000 4381
Jun  9 18:49:08 homer influxd[709]: [httpd] 127.0.0.1 - openhab [09/Jun/2021:18:49:08 +0200] "POST /write?db=openhab_db&rp=autogen&precision=n&consistency=one HTTP/1.1" 204 0 "-" "okhttp/3.14.4" 9bc58a8f-c942-11eb-a123-000000000000 6402
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
Jun  9 18:48:00 homer systemd-fsck[140]: e2fsck 1.44.5 (15-Dec-2018)
Jun  9 18:48:00 homer systemd-fsck[140]: rootfs: clean, 82745/1911168 files, 2599168/7748608 blocks
Jun  9 18:48:00 homer systemd[1]: Started File System Check on Root Device.
Jun  9 18:48:00 homer systemd[1]: Starting Remount Root and Kernel File Systems...
Jun  9 18:48:00 homer systemd[1]: Started Set the console keyboard layout.
Jun  9 18:48:00 homer systemd[1]: Started udev Coldplug all Devices.
Jun  9 18:48:00 homer systemd[1]: Started Remount Root and Kernel File Systems.
Jun  9 18:48:00 homer systemd[1]: Starting Flush Journal to Persistent Storage...
Jun  9 18:48:00 homer systemd[1]: Starting Load/Save Random Seed...
Jun  9 18:48:00 homer systemd[1]: Starting Create System Users...

I have now two questions:

Could it be that the InfluxDB database is corrupt? That is, would it help to delete it and create it anew?
How can I recognize this issue and trigger a reboot?

rlkoshak · June 9, 2021, 7:37pm

I doubt that InfluxDB could cause the whole machine to stop accepting connections. It’s going to show up a lot in the logs because it does a ton of writing (and reading) and apparently logs each one to the syslog.

The fact that you’ve been pulling the plug could mean that it’s corrupted, but in that case, again, it shouldn’t cause the machine to stop receiving incoming connections.

I think that may just be what it happened to be doing when the machine was rebooted. I’m assuming the ^@ line was caused by yanking the plug, which seems to be what happened since it went through a file system check.

But you need to be looking at the logs around the time that it fell off the network, not when you pulled the plug. That is where, hopefully, there is some error or something illuminating.
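
As a rough sketch of that (not something taken from this thread): a small helper that pulls only the syslog lines from a given time window. The function name `syslog_window` is made up for illustration, and the default path assumes a Debian-style `/var/log/syslog`. `grep -a` forces text mode, which matters here because a run of `^@` (NUL bytes) like the one above can otherwise make grep treat the whole file as binary.

```shell
# Print the syslog lines whose timestamp starts with the given prefix.
# Hypothetical helper; the default path is an assumption (Debian-style syslog).
syslog_window() {
    prefix="$1"                      # e.g. 'Jun  9 15:0' for 15:00 to 15:09
    logfile="${2:-/var/log/syslog}"
    # -a: treat the file as text even if it contains NUL bytes
    grep -a -e "^$prefix" "$logfile"
}
```

For example, `syslog_window 'Jun  9 15:0'` would list everything logged between 15:00 and 15:09. Note the two spaces before a single-digit day in the syslog timestamp.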

You might need to bring in some heavy monitoring of the machine if you can’t find anything in the logs. The SNMP plugin might be sufficient or even the systeminfo binding.

I’ve been using Zabbix, which is super heavyweight for something like this, but I have learned a ton about my servers and network since setting it up. In fact I discovered that I had a Raspberry Pi rebooting itself about once an hour that I didn’t even know was having a problem. It was just silently rebooting periodically, and it boots so fast that I never happened to ask it to do something while it was rebooting, so I never noticed. Once I installed Zabbix I started getting emails about it; oh no, we can’t have that! A new power supply fixed that problem. Looking for the solid red light is fresh in my memory.

Another machine was sending way too much out to swap causing the whole machine to slow down. Again, I got emails from Zabbix and had to fix that. So I adjusted the swap policy to favor keeping stuff in memory rather than swapping stuff out and I’ve had no more problems (or emails).

Now I’m seeing some periodic long pings and such which is showing me I’ve got some networking problems.

tl;dr, it’s super heavyweight and a lot of work, but I found it well worthwhile to install and configure Zabbix. But I’m also running a whole lot more than just home automation.

NCO · June 9, 2021, 8:20pm

This scrutiny of my system is far above my head, I guess.

About pulling the cord:
I just plugged it back in after I pulled it, and this entry lets me assume that the system completely crashed while the ^@ were written:
(otherwise after time server sync this would have been much closer, and not 2 hrs later):

Jun 9 18:48:06 homer dhcpcd[641]: eth0: pid -577552738 deleted default route via 192.168.68.1
Jun 9 20:51:06 homer systemd-timesyncd[639]: Synchronized to time server for the first time 136.243.7.20:123 (0.debian.pool.ntp.org).

I know pretty well when I had problems connecting. This was around 3:03 pm (15:03 in the logs).
So it obviously happened while I was accessing HABPanel remotely using OpenVPN. But there is nothing obvious except this warning:

WARNING: 'link-mtu' is used inconsistently, local='link-mtu 1549', remote='link-mtu 1521'

So, as for low voltage as the potential culprit: it’s probably NOT the issue, because the red LED is still solid in the lockup situation.

rlkoshak · June 9, 2021, 8:29pm

But you don’t have an OpenVPN client running on this machine, right? It couldn’t possibly be OpenVPN connecting to the VPN and therefore locking out access from your LAN?

NCO · June 9, 2021, 8:35pm

Interesting point, but I have been using OpenVPN for quite a while.
I start the server whenever I leave home:

sudo systemctl start openvpn@server

I have only set up the server, never a (local) client.
The only client is my iPhone.

NCO · June 10, 2021, 7:03am

This network crash issue (incoming connections are not working - sometimes mainUI is still ok - see above) is really annoying and I have followed some other threads about similar issues.
There are a few rough ideas which might be related to this issue:

Corrupt file system: Originally it happened on my SD card, but it’s now also happening with the SSD.
Low voltage: Slow zwave (USB) and SSD (USB) cause the SSD to disconnect (?)
(network disconnect was also mentioned for RPi1-3, but the RPi 4 Ethernet is not connected via USB bus). I am using the official RPi4 Power supply. Furthermore the red LED is solid ON.
side note: The SSD was connected as backup, even when running from sd-card.
One thread mentioned the zwave stick not being connected properly to a RPi4 (via USB2), but this does not seem to be the case on my setup (it connects properly to /dev/ttyACM0).

The next step is to start from scratch (which I have already done a few times) and just hope.
I bought a brand new SSD and will go for the ZRAM option with UPS.
Usually I am an optimistic guy, but this keeps me awake at night

Any more ideas are still greatly appreciated!
Thanks for all your help guys!

An important reference is this by the way (thank you for this @mstormi)

NCO · June 10, 2021, 8:33am

By the way: What about IPv6?
It’s obviously active on my RPi, but I don’t use it.
Could this be the culprit?

I checked and it is active, so I deactivated it using this guide:

Unfortunately IPv6 came back after reboot.
Running this removed the IPv6 addresses again:

sudo sysctl -p

Adding this to /boot/cmdline.txt did not help either:

 ipv6.disable=1

I created a workaround in my rules to run sudo sysctl -p after startup.
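
For anyone following along, here is a hedged sketch of how to check whether the setting actually took effect after a reboot. The guide is not quoted here, so the sysctl keys in the comment are only an assumption about what it puts into /etc/sysctl.conf, and `ipv6_state` is a hypothetical helper name.

```shell
# Assumed (not confirmed by this thread) entries in /etc/sysctl.conf:
#   net.ipv6.conf.all.disable_ipv6 = 1
#   net.ipv6.conf.default.disable_ipv6 = 1

# Report the kernel's current IPv6 state for all interfaces.
# The optional argument only exists so the helper can be tried on a test file.
ipv6_state() {
    flag_file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
    if [ ! -r "$flag_file" ]; then
        # The file is missing when the IPv6 stack is not loaded at all
        echo "unavailable"
    elif [ "$(cat "$flag_file")" = "1" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}
```

Running `ipv6_state` right after boot and again after `sudo sysctl -p` shows whether the value is being applied at startup. `cat /proc/cmdline` shows whether `ipv6.disable=1` really reached the kernel; the contents of /boot/cmdline.txt have to stay on a single line for that.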

mstormi · June 10, 2021, 8:35am

There are 2 possibilities: a HW or a SW reason.
If SW, you should find corresponding error messages in the openHAB logs if only you turn the log level up enough; that should give you a direction for further investigation.
If HW, you might have a physical network problem on your RPi, the remote switch, or the cable in between. Check system logs and network stats for errors (g**gle…); if you have a manageable switch, also check the server port on that switch.
I’d exchange the RPi with your spare box and use a new, short cat 6 cable just to exclude that as a possibility.
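
A minimal sketch of the "network stats" part, assuming the wired interface is called `eth0` (as in the dhcpcd log line earlier in the thread). The helper name `nic_errors` is made up; it just reads the kernel's per-interface counters from sysfs, which are the same numbers `ip -s link show eth0` reports.

```shell
# Print the error and drop counters the kernel keeps for one interface.
# The second argument only exists so the helper can be pointed at a test tree.
nic_errors() {
    iface="${1:-eth0}"
    base="${2:-/sys/class/net}"
    for counter in rx_errors tx_errors rx_dropped tx_dropped rx_crc_errors; do
        file="$base/$iface/statistics/$counter"
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$file")"
        fi
    done
    return 0
}
```

Counters that keep climbing between two calls of `nic_errors eth0` would point towards the cable, the switch port, or the NIC rather than openHAB.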

NCO · June 10, 2021, 2:13pm

Talking about a UPS:
Due to a (kind of) bad experience with USB, I wonder if NUT works better over Ethernet than over USB?
I have found an Eaton, which I like for its size, but it is controllable via USB only:
https://www.amazon.de/dp/B0054WIYXY/?coliid=IS0HSO3HNWXE1&colid=SHVS8CK2G46I&ref_=lv_ov_lig_dp_it&th=1

mstormi · June 10, 2021, 2:16pm

This is unrelated. Read the HW section please.

NCO · June 16, 2021, 3:51pm

It happened again and yes, the red Power LED is still ON (solid).

However, USB does not seem to have power:

Cannot connect via ssh
USB SSD indicator does not flash anymore
Connecting a USB mouse does not power up the mouse.

So maybe it IS power related and after the system switched OFF USB (to save power), the power is sufficient again (solid red LED)?
Does this make sense?

mstormi · June 16, 2021, 5:59pm

It’s guessing that leads nowhere. Grep /var/log/syslog for ‘voltage’ (IIRC, or google for the right string).
Ever tried another RPi? Yours may be broken, too.
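
A hedged sketch to go with the grep: the Pi firmware also keeps a throttling flag word that can be read with `vcgencmd get_throttled`. The helper `decode_throttled` below is made up for illustration, and the bit meanings are as I recall them from the Raspberry Pi documentation, so double-check them there.

```shell
# Decode the flag word printed by `vcgencmd get_throttled` (e.g. throttled=0x50005).
# Bit positions are quoted from memory of the Raspberry Pi docs; verify before relying on them.
decode_throttled() {
    value=$(( $1 ))   # accepts a hex string such as 0x50005
    if [ $(( value & 0x1 )) -ne 0 ]; then
        echo "under-voltage right now"
    fi
    if [ $(( value & 0x4 )) -ne 0 ]; then
        echo "throttled right now"
    fi
    if [ $(( value & 0x10000 )) -ne 0 ]; then
        echo "under-voltage has occurred since boot"
    fi
    if [ $(( value & 0x40000 )) -ne 0 ]; then
        echo "throttling has occurred since boot"
    fi
    return 0
}
```

On the Pi this could be called as `decode_throttled "$(vcgencmd get_throttled | cut -d= -f2)"`. For the log side, `grep -a -i voltage /var/log/syslog` is a reasonable starting point (`-a` because of the NUL bytes in the file).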

NCO · June 16, 2021, 6:24pm

That’s guessing, too, right?
SICNR

Thank you, I will order another Raspberry for a crosscheck.

Voltage…
That’s the thing.
I did not find anything about voltage in syslog.

mstormi · June 16, 2021, 6:44pm

I’d call it systematic. Everyone should be having a 2nd box at hand as a spare anyway so it would be as easy as to boot from that to exclude this possibility.

NCO · June 17, 2021, 5:58am

Fair enough

That totally makes sense.
Again, thanks for your help…