OH3, SerialUPB, HomeKit & a reboot (now, NoResponse)

kristiandg · September 28, 2021, 5:26pm

Good afternoon. Up until now, our UPB usage on OH has been through an OmniPro controller (by basically “HomeKit-enabling the Omni using OH”). However, in our office, we don’t have an Omni and I wanted to add our lighting to HomeKit, figuring OH would be a great way to do that (on a Pi 3B+ with PoE). This also uses the OH3 UPB Serial binding.

After getting the initial config built and adding all the lights and appropriate HomeKit Metadata, I linked OH to HomeKit and moved all the lights to their appropriate rooms.

The first thing I noticed was, I can control the lights, but OH doesn’t track their status when changed from the switches themselves. The switches are indeed set to report their level when controlled locally, and you can see the updates within the UPB Config software (UPStart) whenever a switch is manually controlled; it’s just that the status change isn’t being reflected in OH.

Second, after letting OpenHABian run for a week to ensure stability, I then power-cycled (to simulate it losing power unexpectedly). Expecting it to come back and be accessible in HomeKit again, when I went to access the Home app, I discovered everything exposed to HomeKit in OH was now reporting as “Updating” eventually becoming “No Response” after a very long timeout period. What confused me about this is this was literally just a reboot, and if I go into OH, I see the status of all the UPB items as ONLINE, so whatever this is, it appears to just be with getting that info over to HomeKit. On the off-chance the reboot may have wiped the HomeKit binding for some reason, I even tried restoring the backup I took right after I finished configuring OH. It restored, and again all items came back as ONLINE, but HomeKit still shows them as Updating/NoResponse.

Also, a graceful reboot didn’t seem to help (as a last resort - and yes, I hate the “Microsoft Fix”, but after manually restarting the services with no positive effect, I thought it couldn’t hurt to try).

I also confirmed the IP is still the same as it was previously (and that the OH IP is matched in the HomeKit config).

After all of this, I then ran updates for the latest OH3 to see if that corrected - it did not.

If I remove the HomeKit addon, the NoResponse is almost instant. Additionally, if I do something blatantly obvious to fail out the HomeKit config (like add /24 to the binding IP), the NoResponse is instant. This suggests it’s talking with HomeKit but the conversation isn’t going well.

Does anyone have any thoughts why this seems to have lost its connectivity to HomeKit with just a power-cycle? Or, where I might look in order to troubleshoot this further, since this is making no sense to me currently?

Thanks…

rlkoshak · September 28, 2021, 8:02pm

So did you run sudo reboot or did you literally just yank the power? If the latter, you may have corrupted your file system. It is not safe to suddenly power off any computer that is running off of flash storage. The way flash memory works, you risk losing not only the file that is being written but any other file that is in the same sector. And since it uses wear leveling, that means you could lose any file, part of the kernel, a hardware driver, etc.

If the machine needs to be resilient to power outages and uses flash memory, a UPS must be employed.

This implies that your problem was not caused by a loss of power then. That’s good as it means you don’t have to rebuild. And it eliminates one potential cause of the problem.

Beyond that I don’t use HomeKit and don’t even know what a UPB is so I can’t help with the rest.

kristiandg · September 28, 2021, 8:29pm

This was an intentional hard power cycle (as, even on UPS, at some point it would exhaust). Most systems have some tolerance for that type of thing (load off drive into a RAM drive and run from that, or something). And, really, it was somewhat of a test for that very problem before going into production.

I suppose it’s possible some part of it ate itself. My only thing is, considering everything else seems to function (UPB is a type of lighting control, and from within OH, I still have control over those devices), it seems a bit odd that would be the case and only impact this one thing. I even removed and re-added the HomeKit Addon, so in theory it got re-installed. I guess it’s possible the config file for it got trashed and I may need to remove and let it rebuild…

yfre · September 28, 2021, 9:55pm

probably homekit.json got corrupted.
the homekit binding stores the encryption keys of the paired devices in homekit.json. without these keys the binding will reject connections from iOS devices.
check whether you have “user” entries in homekit.json
if not:

- try to find a backup of homekit.json with user entries
- remove openhab bridge / home from home app and re-add/re-pair again
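
The check for “user” entries can be scripted. The following is a minimal Python sketch, not part of the binding: it assumes homekit.json is a flat JSON object and that pairing entries are top-level keys whose names contain “user”. The path shown is also an assumption about where an openHABian install keeps the file, so adjust both to what you actually see.

```python
import json
from pathlib import Path

# Assumed location of the JSON storage file; adjust for your install.
HOMEKIT_JSON = Path("/var/lib/openhab/jsondb/homekit.json")


def find_user_entries(storage: dict) -> list[str]:
    """Return the top-level keys that look like pairing ("user") entries.

    Assumption: pairing entries are keys whose name contains "user".
    """
    return sorted(key for key in storage if "user" in key.lower())


def check_file(path: Path = HOMEKIT_JSON) -> list[str]:
    """Load the storage file and list its user entries (empty list if none)."""
    with path.open(encoding="utf-8") as handle:
        return find_user_entries(json.load(handle))
```

Calling `check_file()` on the openHAB machine returns the matching keys. An empty list would point to the lost-pairing case described above, where restoring a backup or re-pairing is needed.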

rlkoshak · September 28, 2021, 11:08pm

A properly configured UPS is more than just a battery. It reports to the machine when it’s running off of battery and when the battery level drops below a certain level the machine shuts down normally with no risk of loss of data.

There are all sorts of HATs and USB mini-UPS systems that work with RPi and other SBCs. NUT is usually the software that monitors and shuts down the machine (and there is a NUT OH addon too).

No system that suddenly loses power in the middle of a write to flash memory has any tolerance for this. SSDs often have some capacitors to add a little bit of tolerance but even that isn’t foolproof. SD cards and USB thumb drives have absolutely no tolerance whatsoever. If it’s in the middle of a write when power is lost, there will be some file system corruption. It may be benign (e.g. parts of a log file), or the machine may not even be able to boot (e.g. part of the kernel).

As I said though, you restored from backup from before the power loss so a corrupt file system is unlikely to be the problem this time around. The power loss causes loss of data, it doesn’t physically destroy the medium. As long as it’s a full backup, restoring should recover any files that became corrupted too. If it’s not a full backup though, you could have corrupted some important random file somewhere that only affects Homekit integration. Though in OH’s case that would have been included as part of the backup so it’s still unlikely to have caused this problem.

But loss of power could cause problems in the future.

marcusb · September 29, 2021, 12:33am

Please grab a debug log (with the org.openhab.binding.upb logger set to DEBUG) while you are manipulating the switches. Either post it here along with your UPB thing and item config files, or open an issue with this information. That should let me see what is going on.

kristiandg · September 30, 2021, 7:26pm

You gave me an idea, so I did some digging. I forgot to mention, in addition to doing a little power cycle test, I was also moving it to its fixed IP. So, when it came back up, it was on a different address than when initially bound.

Now, past experience had told me I just needed to go into the HomeKit config section of the GUI and tell it the new address (which I did, but this time it didn’t seem to work). So I did look at homekit.json. Indeed, it had the same file date as the restored file - several days before the power cycle (so in theory it wasn’t corrupt, as it had been restored during diagnosing and was the known “good” file).

But this made me go ahead and unbind/rebind to HomeKit, and indeed, everything came back. So, at this point, I’m guessing the IP change was the culprit. That confuses me a bit, as HomeKit (I thought) worked on mDNS broadcasts to locate, and in the past I’ve been able to make such a change w/o issue. What I can’t confirm this time is if my IP change to HomeKit (in the GUI) maybe didn’t update the IP properly in another file somewhere, thus having the old IP broadcast with nothing to answer. This is theory, since I don’t have a strong understanding of how HomeKit “talks” to other devices or how it adapts to network changes. But, it’s back up and running for now, so I’m guessing it was the IP change.

@rlkoshak I agree. A more robust UPS would have a network maintenance card in it allowing it to broadcast its state to network agents. Being in IT, I worry about that “sweet spot” of the battery reaching the point of telling the devices to gracefully shut down, then the power coming back on before actual power is lost (preventing auto boot from kicking in). I do agree with you - anytime you can gracefully shut down, that’s always preferred. And indeed, I could see a read/write action being interrupted causing the issue, I just thought the odds of that occurring at that time were pretty slim. I don’t fully understand it, but I thought zram was designed to severely reduce read/write activity to the SD. I assumed it did this by booting into a ramdisk and basically running from there, only syncing every hour or so to hard storage to keep the (for lack of a better term) “boot image” true to current. Had that assumption been correct, then my thought was, sure you’d lose some state data or any config changes you made since the last sync, but that it could always boot from the last “known” working config upon power loss. That was my bad.

@marcusb I’ll get this to you in a little while. I have to head out, but I’ll grab what you’re wanting and do manual switch activity so you can hopefully see what’s going on there. I don’t know if this is the case or not, but my experience with UPB is from the HAI/Leviton Omni controlling it. One of the things they did was put a little buffer in to queue messages up before they’re sent to the UPB PIM (preventing a signal from getting lost). On the devices themselves, there’s an option to retry a command up to x times (3, I think), if the sending device didn’t receive confirmation of the sent command. I didn’t know if your binding had these two items, but I’m hoping so, as I did notice that when sending 3-4 switch commands back to back from HomeKit, occasionally one or two of the UPB loads “misses” the command and remains in its current state. With that previous Omni experience, the other issue we would routinely see is that things would get out of sync (what the Omni thought was on, was really off at the switch). We “resolved” this by building Omni rules to check status every x interval (literally, ask each device for its status every so often). It’s possible your binding has all these capabilities, and they might not be working for me because of the status oddity I mention above, but if not, it would be great if somehow those items might be able to be added.

Thanks all!

rlkoshak · September 30, 2021, 7:39pm

That is indeed the intent of ZRAM but an RPi 2/3 doesn’t have enough RAM to run everything out of ZRAM. It runs OH and InfluxDB out of ZRAM but there is a whole lot of other stuff running and writing outside of ZRAM. It may not be as likely to cause problems when losing power but it’s still a matter of chance.

And it does not flush to the SD card except at a clean shutdown. I think that’s a limitation in ZRAM that hasn’t been fixed. So if you are up and running for a week, you’ll lose a week’s worth of data.

You don’t need anything quite so involved as that. A small consumer level UPS will have a USB connection that NUT can use to get all the info it needs. And NUT acts as a service so if you have more than one machine that needs to know that the UPS is running on battery they will get that broadcast. The RPi UPS hats sometimes do the same things through GPIO pins instead of USB.

I have an old RPi attached to my UPS running NUT server and I have NUT running on my VMs so my VMs know to shut down nicely too when the time comes.

kristiandg · October 1, 2021, 1:14pm

Ahhh, gotcha. I had no idea it was dynamic like that (that’s pretty cool). That also makes total sense to me. Keeping that in mind, it looks like (currently) the Pi 4B can be ordered with 2/4/8 GB of RAM. So, if I were to get a 4 or 8 GB version, would that make it run more of its foundational components in ZRAM?

Well, that explains a LOT. In the past I had found issues where the box seemed to “revert” to a previous state. Remember, for me, it was just to HomeKit-enable something that natively wasn’t HomeKit supported, so I’d see it in the form of device names changing. I just assumed it was some HomeKit bug. So, if I schedule nightly graceful reboots, that should then correct for this, I would expect (does that sound right)? I haven’t quite figured out how to run a scheduled CLI command from OH3’s interface though, as I wanted to schedule weekly backups (then a purge of old ones). Admittedly, my Linux (and OH) skills are very limited in comparison to other folks on here.

I may do that. We used to do home tech - like big installs, so we’d install APC UPS’s with multiple expandable batteries, etc (which took that extra network card), so that hardware was kind of already there. But yes, in my own home, it’s just a 1500 VA APC with the typical USB connection, so what you’re saying would likely work just fine. Being in IT, I’ve always worried about that stupid “sweet spot” I mentioned, where the device(s) decide it’s time to shut down, but then power comes back and the UPS never actually powers off. I wish there was a way that once a device halted, it would restart in an hour if it didn’t power off, but of course there’s no services running to do such work. Just the geek in me, I guess.

Thanks.

kristiandg · October 1, 2021, 1:39pm

I believe this is what you’re looking for.

2021-10-01 09:26:30.234 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 5055303830343634303030333836323045370D
2021-10-01 09:26:30.243 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=0 cmd=DEVICE_STATE
2021-10-01 09:26:30.550 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 5055303830353634303030333836323045360D
2021-10-01 09:26:30.558 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=0 cmd=DEVICE_STATE
2021-10-01 09:26:33.534 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 5055303830343634303030333836303030370D
2021-10-01 09:26:33.542 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=0 cmd=DEVICE_STATE
2021-10-01 09:26:33.851 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 5055303830353634303030333836303030360D
2021-10-01 09:26:33.858 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=0 cmd=DEVICE_STATE
2021-10-01 09:26:36.889 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303436343044303332304646464645310D
2021-10-01 09:26:36.896 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=ACTIVATE
2021-10-01 09:26:37.239 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303536343044303332304646464645300D
2021-10-01 09:26:37.246 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=ACTIVATE
2021-10-01 09:26:40.306 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303436343044303332314646464645300D
2021-10-01 09:26:40.313 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=DEACTIVATE
2021-10-01 09:26:40.655 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303536343044303332314646464644460D
2021-10-01 09:26:40.663 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=DEACTIVATE
2021-10-01 09:26:44.705 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303436343044303332304646464645310D
2021-10-01 09:26:44.713 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=ACTIVATE
2021-10-01 09:26:45.056 [DEBUG] [.upb.internal.handler.SerialIoThread] - UPB Message: 50553839303536343044303332304646464645300D
2021-10-01 09:26:45.064 [DEBUG] [b.binding.upb.internal.UPBController] - received message, network=100 src=3 dst=13 cmd=ACTIVATE

This was me turning on/off the fan in my office, as well as me triggering a scene that turns on/off 3 lights. I haven’t built the scene in OH yet, so that I expect to be fully ignored (since these UPB devices don’t report their status after a scene is triggered). But the fan on/off should do something since that is the load itself. Log seems to parse it correctly (net ID 100 and Device ID 3).
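
For anyone wanting to read the raw “UPB Message” values, here is a small Python sketch that decodes them. The layout is an assumption inferred from the log itself rather than taken from the binding’s code: the value looks like hex-encoded ASCII of the form PU + packet in hex + carriage return, and the packet appears to hold two control bytes, then network, destination, source, command, argument bytes and a checksum. The function name is made up for illustration.

```python
def decode_pim_report(hex_ascii: str) -> dict:
    """Decode one 'UPB Message:' value from the debug log.

    Assumed layout (inferred from the log, not from the binding's source):
    hex-encoded ASCII "PU" + packet-as-hex + CR, where the packet is
    2 control bytes, network, destination, source, command, args, checksum.
    """
    text = bytes.fromhex(hex_ascii).decode("ascii").rstrip("\r")
    if not text.startswith("PU"):
        raise ValueError(f"not a PIM message report: {text!r}")
    packet = bytes.fromhex(text[2:])
    return {
        "link": bool(packet[0] & 0x80),  # top bit set when dst is a link (scene)
        "network": packet[2],
        "dst": packet[3],
        "src": packet[4],
        "command": packet[5],
        "args": packet[6:-1].hex(),
        # All bytes, checksum included, should sum to 0 modulo 256.
        "checksum_ok": sum(packet) % 256 == 0,
    }
```

Run on the first logged value, this gives network 100, source 3 and destination 0, which agrees with the binding’s own parsed line, and the checksum verifies. The argument byte is 20 in the 09:26:30 messages and 00 in the 09:26:33 ones, which would fit the fan’s on and off reports if that byte carries the level.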

Just in case it was an issue making its way to HomeKit, I also checked OH’s status for the device; it remained OFF the whole time (even reloaded the page). So, from what I can tell, UPB isn’t telling OH the updated status, which is what’s leading to HomeKit having no clue either.
org.openhab.core.items.Item.json (4.8 KB)
org.openhab.core.thing.Thing.json (26.6 KB)

Please let me know if you need anything else. Thanks.

rlkoshak · October 1, 2021, 2:36pm

If you change the default configuration to do so. It’s not really dynamic in that it adjusts based on the amount of RAM available. openHABian is configured to support running openHAB and InfluxDB to support the widest variety of RPi hardware.

I think so but I’m not that much of an expert on ZRAM and the openHABian configuration. I mainly run in VMs so have little practical experience.

Though a nightly reboot can be disruptive in other ways so be careful. I don’t know that a reboot of the whole machine is necessarily required. You can probably get away with shutting down the services running out of ZRAM first and then cycling ZRAM itself.

For something like this you are probably better off using OS tools rather than using OH for this. Google “cron” which is used in Linux to schedule commands to be run periodically. For backups openHABian comes with Amanda. It might be less effort to just use that instead of rolling your own.

But to run commands from OH look at the Exec binding and/or executeCommandLine Action in rules. But note that the commands will be running as the user openHAB is running as, which is usually openhab and this user is a limited rights user.
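
If you do roll your own instead of using Amanda, the “purge old backups” step could be a small script that cron runs. This is only a sketch: the backup folder, the `*.zip` pattern and the 28-day retention are placeholder assumptions, not openHABian defaults I can vouch for, and the function names are made up for illustration.

```python
import time
from pathlib import Path

# Hypothetical backup folder and retention period; adjust both.
BACKUP_DIR = Path("/var/lib/openhab/backups")
KEEP_DAYS = 28


def expired(paths, keep_days=KEEP_DAYS, now=None):
    """Return the paths whose modification time is older than keep_days."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    return [p for p in paths if p.stat().st_mtime < cutoff]


def purge(directory=BACKUP_DIR, pattern="*.zip"):
    """Delete expired backup archives in directory and return what was removed."""
    removed = expired(sorted(directory.glob(pattern)))
    for path in removed:
        path.unlink()
    return removed
```

Whatever user cron runs this as needs write permission on the backup folder, which ties back to the limited-rights point above if it is run from OH instead.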

One could probably make something like this work if the RPi supported WOL. Unfortunately it does not. But the overall concept would be that a watchdog running on some other machine would monitor the UPS status and the RPi’s status. If a power outage was detected and the RPi was shut down, this watchdog would wait for the UPS to report power is back on and issue a WOL command to the RPi to wake it back up.

But since RPis don’t support WOL, the best you could do is to do this with smart plugs. But then your system that monitors the automation system starts to get as complicated as the automation system itself. At some point it’s just not worth mitigating the problem.

marcusb · October 4, 2021, 1:35am

The binding was supposed to send a message up to 3 times, until it is ACKed by the receiving device. However I just re-read the code and the UPB spec, and it turns out that didn’t work. I made a patch, and you can find a jar file here in case you want to give it a try.
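
To illustrate the intended behaviour only (this is not the binding’s code, and the callables are hypothetical stand-ins for the real serial I/O), a send-until-ACK loop with a cap of 3 attempts can be sketched in Python like this:

```python
def send_with_retries(send, wait_for_ack, max_attempts=3):
    """Call send() until wait_for_ack() returns True, at most max_attempts times.

    Returns the number of attempts used, or 0 if no attempt was acknowledged.
    `send` and `wait_for_ack` are placeholders supplied by the caller.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        if wait_for_ack():
            return attempt
    return 0
```

A return value of 0 is the case where the load “misses” the command entirely, so the caller could log it or mark the item state as uncertain.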

Thanks for sending the debug log, I will take a look at that next.

marcusb · October 6, 2021, 1:15pm

@kristiandg I have a pull request that should fix the device state tracking. If you would like to test it, there is a jar file linked in the PR (once it builds).

kristiandg · October 14, 2021, 1:15pm

Sorry, I just got back in the office. I’ve not loaded a module directly before, but by chance has this made it into the mainstream updates by now, and if so, would just an apt-get update; apt-get upgrade do the trick now, or is there another method for keeping bindings up to date?

Thanks.

marcusb · October 14, 2021, 1:57pm

This second PR has not been merged, it would be great if you can confirm if it works before we do so.

If you drop the linked jar file in /usr/share/openhab/addons, it should take effect immediately. (And maybe uninstall the bundled version of the binding, for good measure).

kristiandg · October 14, 2021, 2:16pm

Yeah, I saw that in the comments at that link, so I tried to install the way I found… Removed the one from the addons screen, then dropped the downloaded JAR file in /usr/share/openhab/addons/.

I even changed the permissions on that file to be owned by “openhab” (instead of openhabian). I expected I’d see it in the addons list, with the new version “3.2.0-SNAPSHOT” but I did not. Though with the binding not appearing, I can still control the UPB network, so it must be active. But, I still don’t see status updates when controlled by the local UPB switch.

I did reboot the OH box for good measure, but it didn’t seem to change - I can still control but cannot see status changes.

marcusb · October 15, 2021, 3:41am

You did it correctly but I think the jar file got overwritten by a later build. I have kicked off a new one now, but it’s hard to tell when it will finish and how long it will stay up, so I’ve put a jar file on my Dropbox with the changes.

It will be most helpful if you capture debug logs when testing, in case something doesn’t work.

kristiandg · October 20, 2021, 8:00pm

Just got to test this. IT WORKS! It’s tracking changes to individual devices. Do you want me to capture a log anyway, or since it works, we’re good?

Any thoughts on some sort of “ask current status” automation routine, to ensure things triggered by links can stay in sync?

Thanks.

marcusb · October 21, 2021, 12:39am

Great, thanks for confirming! No need for logs then.

As for the status refresh on scene change, you can do the following:

1. Create a virtual device as described in the documentation.
2. Create a group gScene containing the items that are affected by the scene.
3. Create a rule as below:

rule "Refresh on scene change"
when
  Item UPB_Active_Scene received update
then
  gScene.sendCommand(REFRESH)
end

I’m not sure if we can make the binding support this scenario directly, I guess we could refresh all devices automatically, but this workaround should do the trick.

kristiandg · November 1, 2021, 9:29pm

Thanks. If you could do it automatically, you’d want to set the resync interval somewhere in the binding. In larger installs (200-ish devices), all those status messages would saturate the network with traffic, so it’s best to have that only run once or twice a day for all loads to stay in sync (and at a time when there’d be low traffic, like noon and 3am type of thing).
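
As a rough sketch of what I mean (Python, purely illustrative; the function name and the 10-minute window are made up), the status requests for a full resync could be spread evenly over a quiet window instead of being fired back to back:

```python
def refresh_schedule(device_ids, window_seconds):
    """Spread one status request per device evenly across window_seconds.

    Returns (offset_seconds, device_id) pairs in request order.
    """
    devices = list(device_ids)
    if not devices:
        return []
    gap = window_seconds / len(devices)
    return [(round(i * gap, 1), dev) for i, dev in enumerate(devices)]
```

With 200 devices and a 600-second window that works out to one request every 3 seconds, which leaves room for normal traffic in between.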

This rule looks perfect, as it will only scan the involved devices when that particular scene is executed, limiting traffic but keeping current on device status. So I think this will be ideal. I need to get a handle on the group portion, as I’m not familiar with how to create that. But this looks good.

Thanks.