Degraded Pi Some Services Fail to Load

BigGeorgeTx · August 7, 2022, 7:39pm

My OpenHAB installation is running more or less. But it appears some other services are not. I had noticed that logs were not being rotated. I have tried rebooting the Pi, but that didn’t make any difference.

I would appreciate some direction on how to troubleshoot this. All that I know about Linux comes from playing around with OpenHAB, so I have limited knowledge.

Thanks

Platform information:
- Hardware: Pi 4B 4 GB
- OS: openHABian

Release = Raspbian GNU/Linux 10 (buster)

Kernel = Linux 5.10.103-v7l+

Platform = Raspberry Pi 4 Model B Rev 1.2

Java Runtime Environment: which java platform is used and what version
openHAB version: 3.4.0 SNAPSHOT 3033
Issue of the topic: I noticed that systemctl status showed that openhab was degraded. systemctl -a --failed shows the following:


openhabian@openhab:~ $ systemctl -a --failed
  UNIT                      LOAD   ACTIVE SUB    DESCRIPTION                                             
● certbot.service           loaded failed failed Certbot                                                 
● hciuart.service           loaded failed failed Configure Bluetooth Modems connected by UART            
● logrotate.service         loaded failed failed Rotate log files                                        
● nginx.service             loaded failed failed A high performance web server and a reverse proxy server
● systemd-timesyncd.service loaded failed failed Network Time Synchronization

Looking on the openHAB community pages and GitHub, the closest I found was "degraded systemd units" (Issue #1406, openhab/openhabian).

Searching the Internet, I wonder if I’m seeing SD card wear, but I’m not sure. The next step I’m thinking of trying is to back up OpenHAB and start with a fresh openHABian load on a new SD card, to which I would restore the backup. (I have mirroring running, but that is to a bigger card.)

yfaway · August 7, 2022, 8:24pm

It could also be running out of space.

BigGeorgeTx · August 7, 2022, 9:27pm

I think this means I’m ok on disk space, but, as I said, I am not a Linux expert.


openhabian@openhab:~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G   16G   12G  59% /
devtmpfs        1.8G     0  1.8G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G  1.5M  1.9G   1% /run
tmpfs           5.0M  8.0K  5.0M   1% /run/lock
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/mmcblk0p1  253M   49M  204M  20% /boot
/dev/sda3        29G   45M   28G   1% /storage
/dev/zram1      870M  318M  492M  40% /opt/zram/zram1
overlay1        870M  318M  492M  40% /var/lib/openhab/persistence
/dev/zram2      575M  119M  414M  23% /opt/zram/zram2
overlay2        575M  119M  414M  23% /var/log
tmpfs           389M     0  389M   0% /run/user/1000
openhabian@openhab:~ $
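
For anyone who would rather not read the table by eye, here is a minimal sketch of the same check. The function name `full_filesystems` and the 90% default are illustrative choices, not anything openHABian provides. It expects `df -P` output, because the POSIX format keeps each filesystem on one line.

```shell
#!/bin/sh
# Print "mountpoint usage" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin; the default threshold is 90 (percent).
# Note: mount points containing spaces are not handled by this sketch.
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%$/, "", use)         # "59%" -> "59"
            if (use + 0 >= limit + 0) print $6 " " $5
        }'
}

# Typical use on the Pi itself:
#   df -P | full_filesystems 80
```

No output means nothing is at or above the threshold, which matches the listing above, where the fullest filesystem is at 59%.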

Wolfgang_S · August 8, 2022, 7:26am

Have a look at what the command

sudo journalctl

returns.

That could be long output, but it should give more information about why the different services failed to load. The cause could be directory permission problems, missing directories, or something else; the log should say which.
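
Since the full journal can be very long, one way to narrow it down is to ask for the recent entries of each failed unit separately. The sketch below is only an illustration: `failed_units` is a made-up helper name, and the loop in the comment assumes the standard `systemctl` and `journalctl` options `--plain`, `--no-legend`, `-u`, `-b`, `--no-pager` and `-n`.

```shell
#!/bin/sh
# Extract unit names from `systemctl --failed` output read on stdin.
# It picks the first field on each line that looks like a unit name, so it
# works with or without the header line and the leading bullet character.
failed_units() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\.(service|socket|timer|mount)$/) { print $i; break }
        }
    }'
}

# On the affected machine, show the last 20 journal lines per failed unit:
#   systemctl --failed --plain --no-legend | failed_units |
#   while read -r unit; do
#       echo "=== $unit ==="
#       sudo journalctl -u "$unit" -b --no-pager -n 20
#   done
```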

BigGeorgeTx · August 8, 2022, 7:19pm

I tried restarting one of the services that was failing, logrotate. Here is what I found:

openhabian@openhab:~ $ systemctl restart logrotate
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to restart 'logrotate.service'.
Authenticating as: ,,, (openhabian)
Password: 
==== AUTHENTICATION COMPLETE ===
Job for logrotate.service failed because the control process exited with error code.
See "systemctl status logrotate.service" and "journalctl -xe" for details.
openhabian@openhab:~ $ journalctl | grep logrotate
Aug 08 14:09:26 openhab polkitd(authority=local)[7568]: Operator of unix-process:4998:15715719 FAILED to authenticate to gain authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.282 [systemctl restart logrotate] (owned by unix-user:openhabian)
Aug 08 14:13:46 openhab polkitd(authority=local)[7568]: Operator of unix-process:5269:15741918 successfully authenticated as unix-user:openhabian to gain ONE-SHOT authorization for action org.freedesktop.systemd1.manage-units for system-bus-name::1.285 [systemctl restart logrotate] (owned by unix-user:openhabian)
Aug 08 14:13:46 openhab logrotate[5285]: error: skipping "/var/log/exim4/mainlog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
Aug 08 14:13:46 openhab logrotate[5285]: error: skipping "/var/log/exim4/rejectlog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
Aug 08 14:13:46 openhab logrotate[5285]: error: skipping "/var/log/exim4/paniclog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
Aug 08 14:13:46 openhab systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 14:13:46 openhab systemd[1]: logrotate.service: Failed with result 'exit-code'.

It seems like you are right about permissions being an issue, but I’m not sure what to do about it.

Wolfgang_S · August 8, 2022, 8:46pm

You need to run

sudo systemctl restart logrotate

as the command needs to be executed with root privileges.

If you still get the error then check permissions of e.g. exim4 directory with:
ls -ld /var/log/exim4
Which user/group does the directory belong to?

If it is owned by openhabian, which should not be the case, then I would try to:

stop zram
re-install exim4
start zram again
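
A rough sketch of those three steps follows. Two things in it are assumptions rather than facts from this thread: the zram unit is taken to be `zram-config.service`, and the package is reinstalled with `apt-get install --reinstall`. By default the helper only prints the commands; nothing is executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reinstall a package while zram is stopped, then start zram again.
# ASSUMPTION: the zram unit is named zram-config.service on this system.
reinstall_without_zram() {
    pkg=$1
    run sudo systemctl stop zram-config.service || return 1
    run sudo apt-get install --reinstall "$pkg"
    status=$?
    # Start zram again even if the reinstall failed.
    run sudo systemctl start zram-config.service
    return "$status"
}

# Dry run (prints the three commands, changes nothing):
#   reinstall_without_zram exim4
```

Check the unit name first with `systemctl list-units | grep -i zram` before running it for real.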

BigGeorgeTx · August 8, 2022, 11:24pm

Thanks for the help. I appreciate the specific instructions.

Adding sudo didn’t help:

openhabian@openhab:~ $ systemctl status logrotate
● logrotate.service - Rotate log files
   Loaded: loaded (/lib/systemd/system/logrotate.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2022-08-08 18:11:58 CDT; 15s ago
     Docs: man:logrotate(8)
           man:logrotate.conf(5)
  Process: 21672 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
 Main PID: 21672 (code=exited, status=1/FAILURE)

Aug 08 18:11:58 openhab systemd[1]: Starting Rotate log files...
Aug 08 18:11:58 openhab logrotate[21672]: error: skipping "/var/log/exim4/mainlog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to te
Aug 08 18:11:58 openhab logrotate[21672]: error: skipping "/var/log/exim4/rejectlog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to
Aug 08 18:11:58 openhab logrotate[21672]: error: skipping "/var/log/exim4/paniclog" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to t
Aug 08 18:11:58 openhab systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 18:11:58 openhab systemd[1]: logrotate.service: Failed with result 'exit-code'.
Aug 08 18:11:58 openhab systemd[1]: Failed to start Rotate log files.

This is the result of the ls command:

openhabian@openhab:~ $ ls -ld /var/log/exim4
drwxrws--- 1 Debian-exim adm 4096 Jan  4  2022 /var/log/exim4

I’m not sure how to tell which user/group the directory belongs to, but since I don’t see openhabian in the output, I think I don’t have that issue.
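
For reference, in `ls -ld` output the third column is the owning user and the fourth is the owning group, so this directory belongs to user `Debian-exim` and group `adm`. In the permission string `drwxrws---`, the middle three characters (`rws`) are the group's rights: the `w` means the group may write, and the `s` is the setgid bit combined with execute. A small sketch that prints the same three facts with GNU `stat` (the helper name `owner_group_mode` is just for illustration):

```shell
#!/bin/sh
# Print "owner group permissions" for a path,
# for example "Debian-exim adm drwxrws---".
# Uses GNU stat: %U = owner name, %G = group name,
# %A = permission string in the same form that ls -l prints.
owner_group_mode() {
    stat -c '%U %G %A' "$1"
}

# Example:
#   owner_group_mode /var/log/exim4
```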

yfaway · August 9, 2022, 12:06am

You can run this command to remove the group write permission:

chmod -R g-w exim4
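
Note that the path above is relative, so the command only applies as written when run from /var/log, and it will most likely need sudo. To see whether a directory would still trip the condition named in the logrotate error (world writable, or writable by a group that is not root), something like the sketch below could be used. It only mirrors the wording of the error message; it is not taken from logrotate itself, and the function name is made up.

```shell
#!/bin/sh
# Return 0 (true) if the directory is world writable, or writable by a
# group other than root (gid 0). Return 1 otherwise, 2 if stat fails.
insecure_for_logrotate() {
    gid=$(stat -c '%g' "$1") || return 2
    perms=$(stat -c '%A' "$1") || return 2
    # In a string like drwxrws---, character 6 is the group write bit
    # and character 9 is the write bit for everyone else.
    group_w=$(printf '%s' "$perms" | cut -c6)
    other_w=$(printf '%s' "$perms" | cut -c9)
    [ "$other_w" = "w" ] && return 0
    [ "$group_w" = "w" ] && [ "$gid" != "0" ] && return 0
    return 1
}

# Example:
#   if insecure_for_logrotate /var/log/exim4; then
#       echo "logrotate will skip the logs in this directory"
#   fi
```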

BigGeorgeTx · August 9, 2022, 12:43am

Thanks.

I ended up running
sudo chmod -R g-w /var/log/exim4
which ran with no errors.

I then restarted the logrotate service. It is running. Now there are only 4 failed services. I will look and see if they have similar problems.

I really appreciate your help

BigGeorgeTx · August 9, 2022, 12:49am

When I tried restarting systemd-timesyncd.service, the warning was that the unit file had changed on disk. I ran the systemctl daemon-reload command, and then it successfully loaded.

openhabian@openhab:~ $ systemctl restart systemd-timesyncd.service
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to restart 'systemd-timesyncd.service'.
Authenticating as: ,,, (openhabian)
Password: 
==== AUTHENTICATION COMPLETE ===
Warning: The unit file, source configuration file or drop-ins of systemd-timesyncd.service changed on disk. Run 'systemctl daemon-reload' to reload units.
openhabian@openhab:~ $ systemctl daemon-reload
==== AUTHENTICATING FOR org.freedesktop.systemd1.reload-daemon ===
Authentication is required to reload the systemd state.
Authenticating as: ,,, (openhabian)
Password: 
==== AUTHENTICATION COMPLETE ===
openhabian@openhab:~ $ systemctl restart systemd-timesyncd.service
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to restart 'systemd-timesyncd.service'.
Authenticating as: ,,, (openhabian)
Password: 
==== AUTHENTICATION COMPLETE ===
openhabian@openhab:~ $ systemctl status systemd-timesyncd.service
● systemd-timesyncd.service - Network Time Synchronization
   Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendor preset: enabled)
  Drop-In: /lib/systemd/system/systemd-timesyncd.service.d
           └─disable-with-time-daemon.conf
   Active: active (running) since Mon 2022-08-08 19:46:24 CDT; 23s ago
     Docs: man:systemd-timesyncd.service(8)
 Main PID: 28211 (systemd-timesyn)
   Status: "Synchronized to time server for the first time 192.241.146.233:123 (0.debian.pool.ntp.org)."
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/systemd-timesyncd.service
           └─28211 /lib/systemd/systemd-timesyncd

Aug 08 19:46:24 openhab systemd[1]: Starting Network Time Synchronization...
Aug 08 19:46:24 openhab systemd[1]: Started Network Time Synchronization.
Aug 08 19:46:24 openhab systemd-timesyncd[28211]: Synchronized to time server for the first time 192.241.146.233:123 (0.debian.pool.ntp.org).
openhabian@openhab:~ $

2 down, 3 to go
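
The warning in the transcript corresponds to a unit property called `NeedDaemonReload`, which can be read with `systemctl show`. As a small sketch (the helper name `restart_plan` is hypothetical), this prints the commands needed for a given unit depending on that value, without running anything:

```shell
#!/bin/sh
# Print the commands needed to restart a unit, given the value of its
# NeedDaemonReload property ("yes" or "no"). Nothing is executed.
restart_plan() {
    unit=$1
    need_reload=$2
    if [ "$need_reload" = "yes" ]; then
        echo "sudo systemctl daemon-reload"
    fi
    echo "sudo systemctl restart $unit"
}

# On the Pi, the property value can be read like this:
#   need=$(systemctl show -p NeedDaemonReload --value systemd-timesyncd.service)
#   restart_plan systemd-timesyncd.service "$need"
```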

BigGeorgeTx · August 9, 2022, 12:53am

Moving on to nginx, I found:

openhabian@openhab:~ $ sudo systemctl restart nginx.service
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
openhabian@openhab:~ $ systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
   Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2022-08-08 19:51:59 CDT; 14s ago
     Docs: man:nginx(8)
  Process: 28569 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=1/FAILURE)

Aug 08 19:51:59 openhab systemd[1]: Starting A high performance web server and a reverse proxy server...
Aug 08 19:51:59 openhab nginx[28569]: nginx: [alert] could not open error log file: open() "/var/log/nginx/error.log" failed (2: No such file or directory)
Aug 08 19:51:59 openhab nginx[28569]: 2022/08/08 19:51:59 [emerg] 28569#28569: open() "/var/log/nginx/access.log" failed (2: No such file or directory)
Aug 08 19:51:59 openhab nginx[28569]: nginx: configuration file /etc/nginx/nginx.conf test failed
Aug 08 19:51:59 openhab systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Aug 08 19:51:59 openhab systemd[1]: nginx.service: Failed with result 'exit-code'.
Aug 08 19:51:59 openhab systemd[1]: Failed to start A high performance web server and a reverse proxy server.
openhabian@openhab:~ $

A while ago I tried using the reverse proxy for remote access. I ended up going another route, so I’m not aware of anything I need it for at this time, but I’m not sure.

BigGeorgeTx · August 9, 2022, 12:55am

The hciuart service restarted without error.

I entered sudo systemctl restart certbot
It took a long time (several minutes), but eventually:

openhabian@openhab:~ $ systemctl status certbot
● certbot.service - Certbot
   Loaded: loaded (/lib/systemd/system/certbot.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2022-08-08 20:00:05 CDT; 34s ago
     Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
           https://letsencrypt.readthedocs.io/en/latest/
  Process: 28915 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
 Main PID: 28915 (code=exited, status=1/FAILURE)

Aug 08 19:56:02 openhab systemd[1]: Starting Certbot...
Aug 08 20:00:05 openhab certbot[28915]: Attempting to renew cert (openhabspring.ddns.net) from /etc/letsencrypt/renewal/openhabspring.ddns.net.conf produced an unexpected error: Failed authorization procedure. openhabspring.ddns.net (h
Aug 08 20:00:05 openhab certbot[28915]: All renewal attempts failed. The following certs could not be renewed:
Aug 08 20:00:05 openhab certbot[28915]:   /etc/letsencrypt/live/openhabspring.ddns.net/fullchain.pem (failure)
Aug 08 20:00:05 openhab certbot[28915]: 1 renew failure(s), 0 parse failure(s)
Aug 08 20:00:05 openhab systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 20:00:05 openhab systemd[1]: certbot.service: Failed with result 'exit-code'.
Aug 08 20:00:05 openhab systemd[1]: Failed to start Certbot.
lines 1-16/16 (END)

From what I found on the internet, this could be related to the nginx issue above

Wolfgang_S · August 9, 2022, 8:27am

Right. For certbot you need nginx.
Does the directory /var/log/nginx/ exist? What are the permissions?

ls -ld /var/log/nginx/

BigGeorgeTx · August 9, 2022, 1:33pm

Apparently it does not exist:

openhabian@openhab:~ $ ls -ld /var/log/nginx/
ls: cannot access '/var/log/nginx/': No such file or directory
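
That matches the nginx errors earlier in the thread, where both access.log and error.log failed with "No such file or directory". A minimal sketch for listing the log directories an nginx configuration expects is below. It only looks at `access_log` and `error_log` lines in the text it is given, so files pulled in through `include` are not followed, and the function name is made up for this example.

```shell
#!/bin/sh
# List the directories that nginx log directives point at.
# Reads nginx configuration text on stdin. Only absolute paths are
# reported, so "access_log off;" and similar targets are skipped.
nginx_log_dirs() {
    awk '$1 == "access_log" || $1 == "error_log" {
            path = $2
            sub(/;$/, "", path)        # drop a trailing semicolon
            if (path ~ /^\//) print path
         }' |
    while read -r path; do dirname "$path"; done | sort -u
}

# Report which of those directories are missing:
#   nginx_log_dirs < /etc/nginx/nginx.conf |
#   while read -r dir; do [ -d "$dir" ] || echo "missing: $dir"; done
```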

Wolfgang_S · August 9, 2022, 5:41pm

Re-installing nginx should fix it.
If the directory is missing again after a reboot, I would temporarily stop ZRAM, re-install nginx, and then enable ZRAM again.

BigGeorgeTx · August 23, 2022, 2:11am

Now that I have Tailscale working well, I don’t use the Nginx to reverse proxy any more, so I plan to remove it following this procedure at Openhabian: How to remove Nginx?

I would think I should also remove the Certbot service because I think it was installed with nginx, but I’m not sure.

Wolfgang_S · August 23, 2022, 5:22am

Yes, certbot is part of the nginx_setup() routine in openhabian-config.
As the standard setup of certbot (there are other methods as well) uses a reverse proxy to set up and renew certificates, it will fail once the reverse proxy is removed.
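
If certbot is kept installed but should stop trying to renew, one possible approach is to disable whatever schedules it. The sketch below only prints commands for review. It assumes the Debian certbot package, where `certbot.service` is started by `certbot.timer`; that timer name does not appear anywhere in this thread, so check it with the first command before running the others.

```shell
#!/bin/sh
# Print (not run) commands that would stop scheduled certificate renewals.
# ASSUMPTION: certbot.service is triggered by a unit named certbot.timer.
certbot_disable_plan() {
    # 1. Confirm that the timer exists and see when it fires next.
    echo "systemctl list-timers certbot.timer"
    # 2. Stop the timer now and keep it from starting at boot.
    echo "sudo systemctl disable --now certbot.timer"
    # 3. Clear the failed state so the unit no longer counts as failed.
    echo "sudo systemctl reset-failed certbot.service"
}

# Show the plan:
#   certbot_disable_plan
```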