Aeotec Z-Wave Gen5+ Stick stays offline after container restart / Raspi reboot

Just to note that the binding uses whatever serial library openHAB core provides (through org.openhab.core.io.transport.serial). It’s not something the binding can change unless we move away from the OH-provided services and link a serial library directly (which was frowned upon in the past).

Personally I’ve stopped using nrjavaserial for other projects as it’s just too much hassle and causes too many problems.

Thanks @rpwong, very true points!

So there appear to be two good hardware alternatives (GoControl and Zooz), which have proven reliable for @Andrew_Rowe and @rpwong respectively, in case @Ardanedh‘s workaround for users of the GEN5+ doesn’t work (which it should) or @Andrew_Rowe doesn’t succeed in raising enough attention for the NRJavaserial fix.

I have to say I love this project and this forum! Thanks a lot to all of you! :slight_smile:

Well, from my side, I can say that before some early version 3.x (I have forgotten exactly when this nrjavaserial stuff changed), I never had issues with my serial ports for zwave. I guess it is some kind of race condition.
I would be surprised if the Aeotec Gen5 is part of the problem here. Like @rpwong says, it is probably the most common stick. But what do I know - it could of course be a combination of the computer and the stick.

Removing the lock file is needed if one is left behind, but that is (at least in my case) not the full solution, at least not if I just restart oH. Maybe it would be if I restarted the full machine. But since I run oH on a server doing lots of other stuff, I prefer to just restart oH a gazillion times until the serial port is working again. In the worst case it can literally take hours of trying. Maybe this alone is a reason to start using Docker with oH.

This issue is the only real issue that I have with oH, but since I don’t have enough knowledge or skill to fix it, I accept it and will not complain - apart from it, oH and the zwave binding are the best.
If someone would have a go at a direct binding solution, or a change of the provided serial library into something else, I would be happy to put 100% effort into testing the alternative.

I could be wrong, but I don’t think this should be a problem in the zwave binding. The binding doesn’t (directly) use nrjavaserial - as above, it’s managed through the OH core proxy.

I’ve seen some issues where (IIRC) nrjavaserial was opening ports to check if they exist, and depending on the timing, this can then cause the application (ie binding) to fail as the port was not available. I’m not sure if that problem still exists or not though.

I would also agree that it’s unlikely that the Aeotec is the problem - again - always keeping an open mind, but it has worked well for a long time.

Oh, no, I am not suggesting that! I also think (guess) that the issue is in the oH-supplied service. I always thought this was changed somehow in early version 3 - either changed to the nrjavaserial stuff from something else, or nrjavaserial was upgraded, etc.

I also only have the zwave stick attached to this box, no other serial sticks etc to compare with.

That is not my understanding.
Full disclosure: I am not a java programmer. I can only give my interpretation of the forum posts I’ve read.

The issue with nrjavaserial is that it leaks lock files. This shows up as the problem Ardanedh and Cplant are having: when they restart openHAB, their USB zwave dongles cannot use the USB port because it is blocked by a lock file. When nrjavaserial assigns a device to a particular USB port, it creates a lock file. When it shuts down, or a device is unplugged and no longer using the port, the lock file is supposed to be deleted, freeing up that port for further use. Because of the bug in the software, these lock files are not deleted, and when a new device attempts to use the port, it cannot because of the lock file. Over time, many lock files often accumulate.
The script Ardanedh has placed in his init file deletes all the lock files before the container is started. Stopping the host that the container is running on will often delete the files as well. Again, this is just my understanding from reading the forum posts.
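A minimal sketch of such a pre-start cleanup, not Ardanedh's actual script: the function name `remove_serial_locks` is made up for illustration, and the default directory and the `LCK..tty*` pattern are assumptions based on the lock file paths mentioned in this thread.

```shell
#!/bin/sh
# Sketch: remove leftover serial lock files before openHAB starts.
# Assumptions: locks live in /var/run/lock and are named LCK..tty*.

remove_serial_locks() {
    lock_dir="${1:-/var/run/lock}"
    for f in "$lock_dir"/LCK..tty*; do
        # The glob stays literal when nothing matches, so check existence.
        [ -e "$f" ] || continue
        rm -f -- "$f" && echo "removed $f"
    done
}
```

Called as `remove_serial_locks` (or with another directory as the first argument) right before the container is started, it only touches serial lock files and leaves everything else in the directory alone.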
The one thread I linked above has posts from one of the developers of nrjavaserial in which he states how the lock files are created, how they are supposed to be destroyed and the portion of the code which does not seem to be working as it is supposed to. He goes on to explain what steps he has taken to fix the issue unsuccessfully and what steps he thinks may have to be taken to fix it.

Edit:
I dug around on git and found some commits to core concerning nrjavaserial. One by wouter recently here on Apr 8 which was merged. Should be in 3M
This is a fix for this Modbus issue which includes a very long discussion

OK, maybe I should rephrase: the issue I am seeing (I should really only speak about my own issues) is probably some kind of race condition. Yes, in my setup too, lock files are created and sometimes not deleted as they should be. But that is only part of the issue. I delete many lock files on my not-so-lucky days of repeatedly trying to start oH. In my case, it is not as easy as making sure there are no lock files before starting oH.
For me, stopping, starting or unplugging/replugging/changing the port in the UI makes it start after a while. But also, like I said, sometimes it just starts on the first go. I have not been able to see a pattern in when it starts and when it does not.

The things @chris mentioned (if still around in nrjava) (the serial port testing by opening), could be part of the issue.

OK Micael, it actually sounds like the same issue is the possible problem for you as well. I have been digging thru git and found one very recent commit which was by Wouter.

Most importantly this fixes a file descriptor leak when checking lock dir permissions.

Please see my edit in my above post for links. Apparently core is running a patched version of nrjavaserial. As of Apr 8, there should be a fix. What versions is everyone running?

So to summarize, this is not a Zwave binding issue. Nor is it an Aeotec Gen5 stick issue. Modbus binding users are having problems as well. Please see this post by ssalonen from Mar 2021, which includes links to other discussions concerning nrjavaserial:

Please be aware of known issues with serial devices with openhab3 (regression in serial library used by openhab) Serial ports getting blocked after some re-connecting · Issue #1842 · openhab/openhab-core · GitHub . See also discussion in Modbus Binding not working on OH3 . I am not aware that this would be any better in 3.0.1 unfortunately

Edit:
OK I finally found the thread about running an alternate serial library. Wouter has written a patch and some users have used it to cure the problems they were having
https://community.openhab.org/t/oh3-x-alternative-java-serial-provider/128462

Thanks for digging into this Andrew!
Any kind of poking in nrjavaserial raises my hopes for the new 3.3 version. :slight_smile: I am still on 3.1, but have decided to upgrade to 3.3 as soon as time permits.

Just for the sake of completeness (or more like: as documentation for myself when I bump into the problem next time, until I get around to doing @Ardanedh’s init-file fix):

  1. Log into the openHAB docker container:
    docker exec -t -i openhab /bin/bash

  2. Access the respective folder with the lock file that shouldn’t be there
    cd /var/run/lock

  3. Delete the respective file that shouldn’t be there:
    rm -f LCK..ttyACM0

  4. Restart the openHAB container (in my case via Portainer)

  5. Done.

Worked 100% of the time for me, and is at least a bit more elegant than re-creating the entire container.
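The manual steps above could also be run from the host in one go. This is only a sketch: the function name `unlock_and_restart` and the `DOCKER` override are made up for illustration, and the container name `openhab`, the port `ttyACM0` and the lock directory are assumed to match the setup described in the steps.

```shell
# Sketch: the manual steps above as one function.
# Assumptions: container name "openhab", port ttyACM0, lock dir /var/run/lock.
# DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
unlock_and_restart() {
    container="${1:-openhab}"
    port="${2:-ttyACM0}"
    docker_cmd="${DOCKER:-docker}"
    # Delete the leftover lock file inside the container, then restart it.
    "$docker_cmd" exec "$container" rm -f "/var/run/lock/LCK..$port" &&
        "$docker_cmd" restart "$container"
}
```

Running `DOCKER=echo unlock_and_restart openhab ttyACM0` first prints the two docker commands without executing them, which is a cheap way to check the container name and path before doing it for real.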

I also recall there were permission issues surrounding this file in past posts about Docker. I’m on openhabian on a Pi4 and the file is set up so openhab can read and write. Don’t know if that is relevant, but I have seen notes in the log that the stale file was removed by openhab on a restart.

Bob
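To tell a stale lock file from one that is still held, something like the following sketch might help. It assumes the lock file starts with the owning process ID as ASCII text (the usual UUCP-style format), which is not confirmed anywhere in this thread; the function name `lock_status` is made up for illustration.

```shell
# Sketch: classify a serial lock file as "in use", "stale", or "unknown".
# Assumption: the file starts with the owning process ID as ASCII text
# (UUCP-style lock format); this is not confirmed in the thread.
lock_status() {
    lock_file="$1"
    pid=""
    # read strips leading blanks; "|| true" covers files without a newline.
    read -r pid rest < "$lock_file" || true
    case "$pid" in
        ''|*[!0-9]*) echo "unknown"; return ;;
    esac
    # kill -0 sends no signal; the /proc check covers processes owned by
    # other users, where kill -0 is refused.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "in use"
    else
        echo "stale"
    fi
}
```

For example, `lock_status /var/lock/LCK..ttyACM0` (path assumed). Inside a container, process IDs are reused quickly after a restart, so "in use" can be a false positive there.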

Quick question on this, since, from what I can see, the nrjavaserial bug appears to be still open: any chance of setting incentives (e.g. on Bountysource) to get this fixed? Or is this naive?

I am still experiencing this exact same issue on 3.4 Build #3029 … I logged the bug report here:

I would love if this issue could get some attention :slight_smile:

Maybe let’s start something on bountysource on this. :wink: Being able to reboot my openHAB Raspi without the need to manually fiddle around with my Aeotec Z-Wave stick would be something of actual value for me. :wink:

FYI … not sure if you have the same experience, but with mine it seems that if I don’t turn off the Pi and just do a Stop / Start, it comes up reliably every time (not enough testing to say this for sure) … also, if I turn it off, it seems to always have this issue unless I remove the ethernet cable and let the unit sit for a while before rebooting it. (If that is true, I suspect some HW bit is getting stuck in a state.) But again, there is very little data to make any conclusive statement. The only thing conclusive is that it’s a total pain in the ass to go through several boots to get it to come up (I’m on the dual-usb zwave/zigbee), and until zigbee is all Online, zwave is stuck.

Very repeatable in my case.

You can also try to disable the modem manager service with:

sudo systemctl disable ModemManager.service

It seems to help according to this HA issue. :slight_smile:
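Before and after disabling it, a quick check like this sketch shows whether ModemManager is actually running. It assumes a systemd host where the unit is called `ModemManager.service`; the function name `modemmanager_report` and the `SYSTEMCTL` override are made up for illustration.

```shell
# Sketch: report whether ModemManager is currently running.
# Assumption: a systemd host; SYSTEMCTL can be overridden for testing.
modemmanager_report() {
    # is-active --quiet prints nothing and exits 0 only if the unit is active.
    if "${SYSTEMCTL:-systemctl}" is-active --quiet ModemManager.service; then
        echo "ModemManager is running and may still probe serial ports"
    else
        echo "ModemManager is not running"
    fi
}
```

Note that `systemctl disable` on its own only affects the next boot; `sudo systemctl disable --now ModemManager.service` also stops an instance that is already running.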

If you use Docker and can fix it every time by manually removing those lock files… we could also automatically remove the lock files when the container starts. Normally they would also have been removed by the OS on startup.

Interesting. I also don‘t have that much data (maybe n=15 or so over the past 6 months), but in my case it is exactly the opposite, since I never turn off the Raspi. So most likely no connection to that. :smile:

This one here helps me every time btw. Quite annoying, but at least it brings the stick back online.

With PR #406, the 3.4.0 snapshot builds will now always clean up those lock files when the Docker container starts.

Thanks a lot in advance! :slight_smile:

@vespaman, if this turns out to be working, it’ll mean “serious happiness and joy for the rest of the day” every day. :smiley:

Upgraded to 3.4 and (so far) have not had the error described above anymore. So thanks a lot to everyone involved in the fix. :pray: :slight_smile:
