Stability -- openhab knows nothing but crapping itself

jimduchek · July 1, 2019, 11:18pm

Is there a project out there or anybody to talk to about stability??? I literally run a script that checks every 5 minutes for OpenHAB being broken (events just… stop happening… the ZWave binding just claims port {0} doesn’t exist… whatever). I’m tired of this shit. Does anyone care about having their OpenHAB installation run for more than 10 minutes without failing?

dastrix80 · July 1, 2019, 11:30pm

For me stability is everything and far more important than features. My system has been quite stable in recent months on 2.4.0. We can try to help you if you give more detail about what’s going on.

Bruce_Osborne · July 1, 2019, 11:47pm

In my experience, OpenHAB nightly snapshots are more stable than Home Assistant “released” code. My Z-wave network is running much better with OpenHAB than with the HA openzwave cr*p.
You must have something wrong with your Z-Wave network. Perhaps @chris can give some troubleshooting pointers, unless you would rather just vent.
If you are that tired and not motivated to try & fix things, feel free to move on to a different solution. It seems most all the major players are going through some reorganization pains right now. This project looks the most stable to me, even though I am biased against Java.

mhilbush · July 2, 2019, 12:15am

It’s fair to say that we all care about stability. Personally, I run two installations that have been running reliably for many months. I only recently upgraded these two installations to the latest snapshot release, and they’re still running very reliably.

Both installations run zwave networks (in addition to a variety of other bindings), one of which has about 100 nodes.

I’m not sure what version you’re running, and what bindings you’ve installed. A stable, reliable installation definitely is possible.

rlkoshak · July 2, 2019, 12:20am

openHAB is pretty stable for most users which is frankly amazing given that nearly every binding has its own set of developers.

And for those who do run into trouble there are a lot of people on this forum who will jump to help where we can.

Since you’ve made a grand total of 3 postings, none of which is an actual proper request for help, I’m not surprised you are frustrated. But if you want to spend months flailing about in frustration on your own I won’t judge. But then I wouldn’t expect you to cast blame about either.

The problem is you’ve not asked for help. If there is a stability problem, you’ve not told us about it before now and even now have given us no details we can use to help or fix the software if there are bugs.

To answer your question, yes, we do care about stability. Why don’t you care enough to tell us about the problems you are experiencing except to bitch about it?

MDAR · July 2, 2019, 12:20am

Just to throw my hat in the ring…

I would say that the core code is absolutely robust.

The ONLY time openHAB2 falls over is when strange things happen in some bindings.

I extensively use

Velbus binding
DMX binding

and I would go as far as to say that my openHAB2 instance that’s running on an ODroid XU4 never falls over.

I hope that my gratitude for the efforts that have been put in to bring this combination together is clear.

In conclusion…

It’s all in the bindings.

Please find out what is causing the crash, rather than blame everything.

rak · July 2, 2019, 4:43am

I second that. Some years ago I designed my system (mainly KNX in those days) so that no critical function was built in OH.

For more than 12 months I have been running OH in Docker on Hypriot with NO reboot so far. Extremely stable. The total list of bindings I use is

binding = amazonechocontrol,astro,exec,ntp,sonos,systeminfo,netatmo,denon1,knx1,weather1,fritzboxtr0641,hue,robonect

It’s all about bindings.

From an end user view it’s hard to distinguish between OH and its bindings. I understand that view.

Do yourself a favour and ask some questions, provide some logs and it will work out …

Regards
Ralf

chris · July 2, 2019, 7:22am

Maybe it’s worth trying to work through the issues with your installation. If the binding is saying that the port doesn’t exist, then probably there is a problem accessing the port and you should try and work out what it is. Serial ports under Java, and Linux especially, can be problematic and are probably one of the most common issues people have, so I can appreciate it can be annoying.

However, if you want to resolve the issues, then checking the logs, and working through the issues systematically is likely to be more fruitful than a general rant. The rant might make you feel better, but it won’t in any way help solve your problems, or make your system more stable, which is what you really want.

Gordon_Geist · July 3, 2019, 12:42pm

If on a Raspberry Pi - sounds like a power supply inadequate to provide enough amps. Official Pi power supply is a bargain - especially the pi 4 supply at $8. All my issues on Pi’s disappeared years ago when I ensured they had the correct amp feed and that their writes to the sd-card were minimum or 0 (external boot/log-db drive).

If you have a bench DC power supply - hook it up and check the amp draw. If it’s anywhere near the rating of your supply then transients will cause the hiccups you’re describing. In the pi log you should see it complain about power issues.
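For anyone who wants to check for power problems without a bench supply: the Pi firmware also exposes its power and throttling state through `vcgencmd get_throttled`, which prints a line such as `throttled=0x50005`. Below is a minimal sketch that decodes that value. The bit meanings follow the Raspberry Pi documentation as I understand it, and the function name `decode_throttled` is just something made up for this example.

```python
# Bit positions reported by `vcgencmd get_throttled` (per the Raspberry Pi docs,
# as far as I know; double-check against the documentation for your firmware).
THROTTLE_FLAGS = {
    0: "under-voltage detected now",
    1: "ARM frequency capped now",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output):
    """Turn a line such as 'throttled=0x50005' into a list of readable flags."""
    # Take the hex value after '=' and test each known bit in ascending order.
    value = int(output.strip().split("=", 1)[1], 16)
    return [text for bit, text in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]
```

On the Pi itself you would feed it the command output, for example `decode_throttled(subprocess.check_output(["vcgencmd", "get_throttled"], text=True))`. An empty list means the firmware has not seen any power or thermal problem since boot.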

My last hang was after two months of running and the watchdogs didn’t catch it (that’s a whole topic in and of itself) but I run on the latest image whenever I update the OS.

jimduchek · July 3, 2019, 4:58pm

Sorry for my tone guys. I was exhausted already from a long day and had just come home to a 90 degree (no kidding) house, and could not get my A/C to come on because OpenHAB had frozen in some previously unseen way that my watchdog script wasn’t monitoring, and could not get a clean restart because of the new broken serial code (This is a known startup-order issue, and may be fixed now – I haven’t updated in a couple weeks – the fact that OpenHAB does not have a deterministic startup order is insane).
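One failure mode mentioned in this thread is "events just stop happening". A rough way for a watchdog to catch that is to check how long ago the event log was last written. This is only a sketch and not the script described above: the function name and the threshold are made up, and the log path in the usage note is an assumption based on a typical openHAB 2 package install.

```python
import os
import time


def log_is_stale(path, max_age_seconds, now=None):
    """Return True if the file is missing or has not been written recently."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable log is treated as a failure
    return (now - last_write) > max_age_seconds
```

A cron job could call something like `log_is_stale("/var/log/openhab2/events.log", 900)` and restart the service or send an alert when it returns True. Pick the threshold with care, since a quiet house may legitimately log nothing for a while.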

I feel like debugging OH is frustrating, as the code is so scattered between ESH and OH core. Slowly moving my rules over into Jython has helped some with stability. I will probably quit using snapshots at the next major release, which is probably part of my problem too, though it also means submitting bug reports and patches becomes pointless, so I’m kinda loath to go that route, but at this point, I need a stable house.

rlkoshak · July 3, 2019, 5:51pm

Not any more. ESH has been merged with OH now. That doesn’t mean the code isn’t scattered. There are a ton of repos still. But at least they are all now under the openhab project and there isn’t an unnatural split (e.g. some bindings over at ESH while others in openhab2-addons). It’s one of the reasons there hasn’t been a new milestone release in the past six months. Merging the code back into OH and changing the build system at the same time was a little too ambitious it would seem.

If you will only be running one instance of openHAB then yes, you should be using the release or maybe the milestone releases. Those have received some testing and have no known major bugs. Trying to chase the snapshots means you are constantly chasing after a moving target. You can get a whole new version every day with new bug fixes and new bugs. If you are willing to file issues and PRs then this is the only release you can run. But it means you can kiss stability goodbye. It’s impossible to have stability under these circumstances. You can’t be a tester and have stability at the same time.

Many users who use the snapshots run multiple instances of openHAB. A frequently changing test instance where new releases are installed and given a first set of tests. Then a less frequently changing production instance which only gets upgraded after the release has been tested as a test instance.

These users also/instead have a backup/restore/rollback procedure defined and tested so when they do run into a problem with a release, they can roll back to a working state. openhab-cli backup is awesome for this but there are other approaches of course. I personally use git and run in Docker which makes switching versions and restoring to a known good state super easy.
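To make the "defined and tested rollback" idea concrete, here is a minimal sketch of a snapshot/restore pair for a configuration folder. It is a generic stand-in, not what `openhab-cli backup` does internally, and the function names, archive naming and any paths you pass in are assumptions for illustration only.

```python
import tarfile
import time
from pathlib import Path


def snapshot(config_dir, backup_dir, label=None):
    """Write the contents of config_dir into a labelled .tar.gz and return its path."""
    config_dir, backup_dir = Path(config_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    label = label or time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"config-{label}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store each top-level entry under its own name so paths stay relative.
        for child in sorted(config_dir.iterdir()):
            tar.add(str(child), arcname=child.name)
    return archive


def restore(archive, target_dir):
    """Unpack a snapshot into target_dir (ideally an empty or disposable folder)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(str(target_dir), **kwargs)
```

The important part is the habit rather than the tool: take a snapshot before every upgrade, and actually rehearse the restore once so you know it works before you need it.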

Now, to address some of the problems you saw. I think this is a good example of why the end devices should be as smart as feasible and OH should influence their behavior. For example, the thermostat should control the AC whether or not OH is running. It may not have the ideal target temp set but you should at least be able to walk up to a physical thermostat and turn the dial up or down.

This doesn’t help you at all and I realize that. But it is food for thought for future readers who might consider centralizing everything in openHAB. Pushing some of the smarts and controls out to the edge will result in a more robust system overall as each individual device will be able to operate in a reasonable manner independently.

mstormi · July 3, 2019, 6:44pm

AFAIK this didn’t (and can’t) change inside OH but openHABian comes with a workaround based on this that at least makes rules start last (which is the problem most of the time).

David_Graeff · July 3, 2019, 7:05pm

It should also be noted that openHAB and all addons and rules run in one single process. The core doesn’t do anything unusual, but a binding can cause high cpu or memory usage. I once added a cache to the discovery part of the network binding for example and forgot to clear it. Memory usage went through the roof until openHAB crashed. Because of one line of code missing.
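To illustrate the kind of bug described here, a cache that only ever grows, this is a small sketch of a cache with a hard size limit. It is written in Python purely for illustration (bindings are Java) and is not the network binding's actual code; the class name `BoundedCache` is made up.

```python
from collections import OrderedDict


class BoundedCache:
    """A small LRU cache that evicts the least recently used entry when full."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._items = OrderedDict()

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)          # mark as most recently used
        while len(self._items) > self._max_entries:
            self._items.popitem(last=False)   # drop the least recently used

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)          # reading also counts as use
        return self._items[key]

    def __len__(self):
        return len(self._items)
```

With a limit in place the worst case is a few extra cache misses instead of a process that slowly eats all available memory.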

So carefully watch what bindings you are using. Also use jython and javascript instead of DSL rules whenever possible.

Cheers, David

Bruce_Osborne · July 3, 2019, 7:23pm

Somebody said Jython is Python 2 which is completely EOL in January. Shouldn’t we use the Python 3 module instead?

Kim_Andersen · July 3, 2019, 10:05pm

Do those users have the same set of devices on each instance (i.e. the same bindings, etc.)?
I have a slight feeling this is not the case… So testing on multiple instances is a bit pointless.

I agree that running snapshots is taking a risk. But I guess the worst thing that can happen is that people stop testing snapshots. You’ll never reach the next milestone then, or at least not without tons of errors.

Sometimes snapshots are needed. Especially for Zigbee and even more so Zwave (maybe other devices/things too), you run the risk that your device or a function is not supported unless you get the latest snapshot. So recommending the latest stable version isn’t always the best answer.

matt1 · July 4, 2019, 12:51am

Often there is nothing stopping you from running 2.4 stable and upgrading the bindings to the latest 2.5 to gain new features. Unless the binding requires something new in the snapshot core framework often it can run on older cores just fine, although at the moment it is a little bit different with the recent build system changes.

I have been doing this a lot with bindings that are actively being changed and I am running 2.4 Openhab.

I would love it if Openhab separated the two and used a Marketplace for all bindings allowing upgrading to newer bindings that would give an error in the logs if you were running them on a core version that does not have all the required framework. Release notes also warn of what may not work on older Openhab cores.

rlkoshak · July 4, 2019, 1:09am

People use different strategies based on what works for them. Most who have serial devices either clone the devices (e.g. back up and restore the zwave dongle so they have two) or they only run one instance at a time.

The latter is what I do. I stop my production container and start up my testing container, test things out. If all is well I’ll upgrade the production container with the test container. If not I’m just two commands away from restoring back to a known good place.
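The "two commands" swap can be wrapped in a tiny helper so it is hard to get the order wrong. This is only a sketch: the container names in the usage note are hypothetical, and it assumes both containers already exist and that the `docker` CLI is on the path.

```python
import subprocess


def switch_commands(stop_name, start_name):
    """Return the two docker commands that swap which container is running."""
    return [["docker", "stop", stop_name], ["docker", "start", start_name]]


def switch(stop_name, start_name, run=subprocess.run):
    """Stop one container, then start the other; raise if either command fails."""
    for command in switch_commands(stop_name, start_name):
        run(command, check=True)
```

For example, `switch("openhab-prod", "openhab-test")` moves over to the test instance and `switch("openhab-test", "openhab-prod")` goes back. Stopping first matters, because the two instances must never hold the same serial device at the same time.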

You can do the same with an apt installed production instance and a manually installed test instance. Just don’t run the two at the same time.

The people who run the snapshots and report errors, file issues, and submit PRs contribute every bit as much to OH as the maintainers and binding developers. But it comes at the cost of a less stable OH, especially if running the snapshots as production without a backup.

Only an individual user can assess what is the best option in their case. But running the snapshot to get that new feature or supported device will come with the risk of running untested software.

It has been discussed. We can hope it goes farther than just discussion.

Kim_Andersen · July 5, 2019, 10:16am

I believe this is where the “testing” phase is quite challenged…
Question is, how long do you test? What kind of testing?
In my opinion, this is often a short term testing period. But how do you know if there is a need of a longer test period?
A smart home system is not just a question of whether things can start… It has to be stable over a long period as well…
For bindings, I guess a short term test could be enough. But for the core, I would assume a longer term is needed.

At least that’s my opinion.

rlkoshak · July 5, 2019, 4:10pm

This comes down to personal preference and appetite for risk I think. I believe some users will run the testing version for days or a week, particularly if they are looking to see if a rare or accumulative bug (e.g. memory leak) has been fixed or not.

I usually run it long enough to see if there are changes needed to my configs or any immediate errors. If it’s a big upgrade (e.g. from M1 to M2) I’ll let it run for a day to make sure stuff like Astro is still working as expected. Then I’ll switch over if all worked well. If not I’ll revert back and start watching the issues and merges to see if my problem was reported (if not I’ll file an issue) and when the problem is fixed. In my case pretty much all the Rules will run through most of their execution paths in a typical day.

The specific testing I do is to just let OH run with my config. Others have set up a whole unit testing framework and extensive testing Rules. It all depends on how much down time you can afford on your production server if you do run into problems.

But remember, part of what makes this work is the ease of going back to the previous known good version. So if you do some quick tests at first and decide it’s safe to run as production and then later, maybe a week later, discover a memory leak or something, it’s still easy for you to go back. But please file an issue on the testing version and keep it around to help debug and discover the cause of the problem.

Actually, the core is probably the least of our problems. If there is a problem in the core it will make itself known almost immediately. It’s the bindings where most of the subtle problems seem to appear.

But, because this is software that isn’t developed under some CMMI Level 5 development process like is used by NASA to write code upon which lives depend, every day that we are running OH is a test. Just by running OH to drive your home automation you are testing the software. That’s the nature of software. And to be honest, this is why I cringe every time I see someone using openHAB (or any other home automation system) to drive a commercial or industrial process. That kind of rigor in development and testing is simply impossible in an open source project and unaffordable for the commercial offerings.

matt1 · July 5, 2019, 11:44pm

I run tests on the weekend whilst I am home and able to look at any issues that crop up. Then come Monday I put the known working system back in place and head back to work so I don’t get an angry frustrated wife calling. This is key if you are married and want $1000+ to buy some new automation toy. The testing that I do is usually targeted: I look at the change list (release notes), then attack the areas that have changed and that I need to be stable, and look for issues just by using them.

Docker, Proxmox or make a copy of your uSD /SSD to a second drive, whatever works at giving you the ability to flick to a test setup and back to your known good setup quickly and pain free.