Stability -- openhab knows nothing but crapping itself

I second that. Some years ago I designed my system (mainly KNX in those days) so that no critical function was built into OH.

For more than 12 months I have been running OH in Docker on Hypriot with NO reboot so far. Extremely stable. The total list of bindings I use is:

binding = amazonechocontrol,astro,exec,ntp,sonos,systeminfo,netatmo,denon1,knx1,weather1,fritzboxtr0641,hue,robonect

It’s all about bindings.

From an end user’s view it’s hard to distinguish between OH and its bindings. I understand that view.

Do yourself a favour: ask some questions, provide some logs, and it will work out …

Regards
Ralf

Maybe it’s worth trying to work through the issues with your installation. If the binding is saying that the port doesn’t exist, then probably there is a problem accessing the port and you should try and work out what it is. Serial ports under Java, and Linux especially, can be problematic and are probably one of the most common issues people have, so I can appreciate it can be annoying.

However, if you want to resolve the issues, then checking the logs, and working through the issues systematically is likely to be more fruitful than a general rant. The rant might make you feel better, but it won’t in any way help solve your problems, or make your system more stable which is what you really want :wink: .

If on a Raspberry Pi - sounds like a power supply inadequate to provide enough amps. The official Pi power supply is a bargain - especially the Pi 4 supply at $8. All my issues on Pis disappeared years ago when I ensured they had the correct amp feed and that their writes to the SD card were minimal or zero (external boot/log-db drive).

If you have a bench DC power supply - hook it up and check the amp draw. If it’s anywhere near the rating of your supply then transients will cause the hiccups you’re describing. In the Pi log you should see it complain about power issues.
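
If you would rather not eyeball the log, here is a minimal Python sketch that filters log lines for power warnings. It assumes the kernel writes a line containing "Under-voltage detected", which is what I would expect on a Pi, but check your own log and adjust the pattern. The function name is made up for illustration.

```python
import re

# Assumed pattern: Raspberry Pi kernels typically log a line containing
# "Under-voltage detected" when the supply sags. Adjust if your log differs.
UNDER_VOLTAGE = re.compile(r"under-?voltage detected", re.IGNORECASE)


def find_power_warnings(log_lines):
    """Return the log lines that look like under-voltage warnings."""
    return [line.rstrip("\n") for line in log_lines if UNDER_VOLTAGE.search(line)]
```

Feed it the lines of your kernel log (for example the output of `dmesg` saved to a file). Any hits mean the supply is worth a closer look.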

My last hang was after two months of running and the watchdogs didn’t catch it (that’s a whole topic in and of itself) but I run on the latest image whenever I update the OS.

Sorry for my tone guys. I was exhausted already from a long day and had just come home to a 90 degree (no kidding) house, and could not get my A/C to come on because OpenHAB had frozen in some previously unseen way that my watchdog script wasn’t monitoring, and could not get a clean restart because of the new broken serial code (This is a known startup-order issue, and may be fixed now – I haven’t updated in a couple weeks – the fact that OpenHAB does not have a deterministic startup order is insane).

I feel like debugging OH is frustrating, as the code is so scattered between ESH and OH core. Slowly moving my rules over into Jython has helped some with stability. I will probably quit using snapshots at the next major release; running them is probably part of my problem too. That also means submitting bug reports and patches becomes pointless, so I’m kinda loath to go that route, but at this point, I need a stable house.

Not any more. ESH has been merged with OH now. That doesn’t mean the code isn’t scattered. There are a ton of repos still. But at least they are all now under the openhab project and there isn’t an unnatural split (e.g. some bindings over at ESH while others in openhab2-addons). It’s one of the reasons there hasn’t been a new milestone release in the past six months. Merging the code back into OH and changing the build system at the same time was a little too ambitious it would seem.

If you will only be running one instance of openHAB then yes, you should be using the release or maybe the milestone releases. Those have received some testing and have no known major bugs. Running the snapshots means you are constantly chasing after a moving target. You can get a whole new version every day with new bug fixes and new bugs. If you want to file issues and PRs then the snapshot is the only version you can run. But it means you can kiss stability goodbye. It’s impossible to have stability under these circumstances. You can’t be a tester and have stability at the same time.

Many users who use the snapshots run multiple instances of openHAB: a frequently changing test instance where new releases are installed and given a first set of tests, and a less frequently changing production instance which only gets upgraded after the release has been tested on the test instance.

These users also/instead have a backup/restore/rollback procedure defined and tested so when they do run into a problem with a release, they can roll back to a working state. openhab-cli backup is awesome for this but there are other approaches of course. I personally use git and run in Docker which makes switching versions and restoring to a known good state super easy.
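
As one illustration of the "other approaches", here is a minimal Python sketch that archives a set of folders into a timestamped tarball. It is not what `openhab-cli backup` does internally, just a generic stand-in. The function name and the example paths are assumptions; point it at wherever your conf and userdata folders actually live.

```python
import tarfile
import time
from pathlib import Path


def backup_dirs(sources, dest_dir, prefix="openhab-backup"):
    """Archive each existing source directory into one timestamped .tar.gz."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{prefix}-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        for src in map(Path, sources):
            if src.is_dir():
                # arcname keeps only the folder name, not the absolute path.
                # Sources with the same folder name would collide in the archive.
                tar.add(str(src), arcname=src.name)
    return archive
```

For example, `backup_dirs(["/etc/openhab2", "/var/lib/openhab2"], "/mnt/backups")` would cover both the conf and the userdata folders on a typical apt install (those two paths are my assumption, so check your own layout). Whatever tool you use, do a test restore before you rely on it.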

Now, to address some of the problems you saw. I think this is a good example of why the end devices should be as smart as feasible and OH should influence their behavior. For example, the thermostat should control the AC whether or not OH is running. It may not have the ideal target temp set but you should at least be able to walk up to a physical thermostat and turn the dial up or down.

This doesn’t help you at all and I realize that. But it is food for thought for future readers who might consider centralizing everything in openHAB. Pushing some of the smarts and controls out to the edge will result in a more robust system overall as each individual device will be able to operate in a reasonable manner independently.

AFAIK this didn’t (and can’t) change inside OH but openHABian comes with a workaround based on this that at least makes rules start last (which is the problem most of the time).

It should also be noted that openHAB and all addons and rules run in one single process. The core doesn’t do anything unusual, but a binding can cause high CPU or memory usage. I once added a cache to the discovery part of the network binding for example and forgot to clear it. Memory usage went through the roof until openHAB crashed. Because of one line of code missing.

So carefully watch what bindings you are using. Also use Jython and JavaScript instead of DSL rules whenever possible.
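
To make the failure mode concrete, here is a small Python sketch of a cache with an upper bound. This is not the network binding's code (that is Java), and the class name and size limit are made up. It only shows the kind of one-line eviction step whose absence lets a cache grow forever.

```python
from collections import OrderedDict


class BoundedCache:
    """A tiny cache that evicts its oldest entry once max_size is exceeded."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, key, value):
        if key in self._data:
            # Re-inserting an existing key makes it the newest entry.
            self._data.move_to_end(key)
        self._data[key] = value
        while len(self._data) > self.max_size:
            # Without this eviction the dict grows for as long as
            # new keys keep arriving.
            self._data.popitem(last=False)

    def get(self, key, default=None):
        return self._data.get(key, default)

    def __len__(self):
        return len(self._data)
```

With `max_size=2`, putting three different keys leaves only the two newest in the cache.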

Cheers, David

Somebody said Jython is Python 2 which is completely EOL in January. Shouldn’t we use the Python 3 module instead?

Do those users have the same set of devices as well (i.e. use the same bindings etc.)?
I have a slight feeling this is not the case… So testing multiple instances like this is a bit pointless.

I agree that running snapshots means taking a risk. But I guess the worst thing that can happen is that people stop testing snapshots. You’ll never reach the next milestone then, or at least not without tons of errors.

Sometimes snapshots are needed. Especially for Zigbee and even more for Z-Wave (maybe other devices/things), you run the risk that your device or a function is not supported unless you get the latest snapshot. So recommending the latest stable version isn’t always the best answer.

Often there is nothing stopping you from running 2.4 stable and upgrading the bindings to the latest 2.5 to gain new features. Unless the binding requires something new in the snapshot core framework, it can often run on older cores just fine, although at the moment it is a little bit different with the recent build system changes.

I have been doing this a lot with bindings that are actively being changed, and I am running openHAB 2.4.

I would love it if openHAB separated the two and used a Marketplace for all bindings. That would allow upgrading to newer bindings, which would give an error in the logs if you were running them on a core version that does not have all the required framework. Release notes could also warn of what may not work on older openHAB cores.

People use different strategies based on what works for them. Most who have serial devices either clone the devices (e.g. back up and restore the Z-Wave dongle so they have two) or they only run one instance at a time.

The latter is what I do. I stop my production container and start up my testing container, then test things out. If all is well I’ll upgrade the production container with the test container. If not I’m just two commands away from restoring back to a known good place.
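
For illustration, those two commands could be wrapped in a small Python helper like the sketch below. The container names are hypothetical, and it assumes both containers already exist so that `docker stop` and `docker start` are all that is needed.

```python
import subprocess

# Hypothetical container names; use whatever you called yours.
PRODUCTION = "openhab-prod"
TESTING = "openhab-test"


def switch_commands(stop, start):
    """Return the docker commands that stop one container, then start the other."""
    return [["docker", "stop", stop], ["docker", "start", start]]


def switch(stop, start, dry_run=True):
    """Run the commands in order; with dry_run=True only print them."""
    for cmd in switch_commands(stop, start):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts before starting the second container
            # if stopping the first one failed.
            subprocess.run(cmd, check=True)
```

`switch(PRODUCTION, TESTING)` prints the commands to move over to the test container, and `switch(TESTING, PRODUCTION)` prints the ones to go back. Stopping first and starting second matters when both containers want the same serial device.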

You can do the same with an apt installed production instance and a manually installed test instance. Just don’t run the two at the same time.

The people who run the snapshots and report errors, file issues, and submit PRs contribute every bit as much to OH as the maintainers and binding developers. But it comes with the drawback of a less stable OH, especially if running the snapshots as production without a backup.

Only an individual user can assess what is the best option in their case. But running the snapshot to get that new feature or supported device will come with the risk of running untested software.

It has been discussed. We can hope it goes farther than just discussion.

I believe this is where the “testing” phase is quite challenged…
The question is, how long do you test? What kind of testing?
In my opinion, this is often a short testing period. But how do you know if there is a need for a longer test period?
A smart home system is not just a question of whether things can start… A smart home system has to be stable over a long period as well…
For bindings, I guess a short test could be enough. But for the core, I would assume a longer one is needed.

At least that’s my opinion.

This comes down to personal preference and appetite for risk I think. I believe some users will run the testing version for days or a week, particularly if they are looking to see if a rare or cumulative bug (e.g. memory leak) has been fixed or not.
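
As a rough idea of what watching for a cumulative bug can look like, here is a Python sketch that fits a straight line through memory readings taken over time and reports the growth rate. How you collect the readings is up to you. The function name and the sample numbers are made up for illustration.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den
```

Readings of 100, 110 and 120 MB at hours 0, 1 and 2 give a slope of 10 MB per hour. A slope that stays clearly positive over several days is a hint to keep the test instance running longer before switching over.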

I usually run it long enough to see if there are changes needed to my configs or any immediate errors. If it’s a big upgrade (e.g. from M1 to M2) I’ll let it run for a day to make sure stuff like Astro is still working as expected. Then I’ll switch over if all worked well. If not I’ll revert back and start watching the issues and merges to see if my problem was reported (if not I’ll file an issue) and when the problem is fixed. In my case pretty much all the Rules will run through most of their execution paths in a typical day.

The specific testing I do is to just let OH run with my config. Others have set up a whole unit testing framework and extensive testing Rules. It all depends on how much down time you can afford on your production server if you do run into problems.

But remember, part of what makes this work is the ease of going back to the previous known good version. So if you do some quick tests at first and decide it’s safe to run as production and then later, maybe a week later, discover a memory leak or something, it’s still easy for you to go back. But please file an issue on the testing version and keep it around to help debug and discover the cause of the problem.

Actually, the core is probably the least of our problems. If there is a problem in the core it will make itself known almost immediately. It’s the bindings where most of the subtle problems seem to appear.

But, because this is software that isn’t developed under some CMMI Level 5 development process like is used by NASA to write code upon which lives depend, every day that we are running OH is a test. Just by running OH to drive your home automation you are testing the software. That’s the nature of software. And to be honest, this is why I cringe every time I see someone using openHAB (or any other home automation system) to drive a commercial or industrial process. That kind of rigor in development and testing is simply impossible in an open source project and unaffordable for the commercial offerings.

I run tests on the weekend whilst I am home and able to look at any issues that crop up. Then come Monday I put the known working system back in place and head back to work so I don’t get an angry frustrated wife calling. This is key if you are married and want $1000+ to buy some new automation toy :wink: Testing that I do usually is targeted, I look at the changed list (release notes) and then attack those areas that have changed in areas that I need to be stable and look for an issue by just using it.

Docker, Proxmox, or a copy of your uSD/SSD on a second drive: whatever works at giving you the ability to flick to a test setup and back to your known good setup quickly and pain-free.

I think you’re doing that wrong. I run the test version, then when it breaks I tell the better half that I need to spend $1,000 to fix it. Asking for money to change something that’s working is a fool’s errand. :slight_smile:

(Of course, not true at all, the better half is quite indulgent)

I guess it’s a matter of balance between the time without the production setup and the test setup. But since this is a smart home system, that down time is often very limited.
Going back to the stable version is of course an option when you have a decent backup procedure. But going back only makes sense if you found some problems and are able to file an issue from it… This is where things get tricky… If you found a problem and you file an issue, the developer would need someone to test fixes etc… And you might end up going forward and back many times, and each time it means yet another down period for the production setup (i.e. your system, which perhaps drives all the lights in your home).

I know, these are the rules of the game for things like this… It can hardly be anything else… But it’s always a problem with developing like this.

I wish I could agree on this, but I don’t…
Think about the problematic serial driver issues which suddenly appeared in one of the 2.5 M1 builds. If you’re “lucky” this error would have shown up the first time you booted the test instance. But not always. Then it would maybe show the next time you boot.
Most new builds are for the core, I guess. The rest are the bindings. Bindings are in my opinion a lot easier to test, except if you start up with a binding which doesn’t work at all, of course.

But isn’t that the goal of openHAB as well?
And isn’t this the reason why it’s important to have a stable version, which doesn’t need fixes?
If openHAB is for testing purposes only, a lot of this doesn’t really make any sense to me.
Every time a piece of software reaches a stable stage, developers should have cleared all (known) bugs and troubles, and start focusing on developing new features only.
At that stage, the software should be ready for any kind of production use, and all documentation should have been updated. But every time something fails which requires the user to update the core system to a later version, the development has failed!

It depends on the nature of the problem. If it’s a problem not involving a shared resource like a Zwave dongle, then you can always run them both at the same time. Just use different networking ports. In all likelihood, once you’ve identified a problem well enough to file an issue, you don’t need a fully functional configuration in your test instance to further debug the found problem.

If down time is really that important to you, then you need at least two completely separate yet fully functional instances of OH which can become quite a challenge. So that becomes a place where you should probably stick with the release versions and only rarely update. You can’t be a tester and have a perfectly stable OH instance with limited down time at the same time. The one precludes the other.

There is one prominent OH user (who has spoken at several of the openHAB days in Germany) whose mantra is “never ever update.” If you don’t want downtime, get it to work and never touch it again.

Another way to mitigate this is to avoid putting too much into the sole control of OH itself. This is why I constantly advocate for putting/keeping as much of the smarts in the end devices as possible. Even if OH goes down, I can operate the light switches manually, turn on/off the heat (and the heat will still automatically maintain the last commanded target temp), and so on. Down time isn’t such a big deal when the house itself is still operable even when OH goes offline. It just might be a little less convenient.

Well, that problem isn’t exactly something new. It’s been known since OH 2.0 two years ago that the startup order is undefined. The problem here occurs because the serial bundle doesn’t start before some other bundles that depend upon it. There is no further testing or debug that needs to be done for this. It’s a long term known problem so all you have to do as a user is watch the issues for a fix.

There is a new build of everything every day. The snapshot includes all the latest merged changes for both the core and the bindings.

It has “home” in the name. I’ve not seen any maintainer advocate for its use in a commercial or industrial setting. As is probably clear, I’d advocate against that.

There is almost no such thing in software. Even mission critical systems upon which human lives depend have instabilities and need fixes. An open source project rarely has the manpower or the expertise to achieve anything like the rigor and quality control of those systems. If you think development of OH is slow consider this. A ten line function written for something like a NASA rocket launch control will require about six months from the time it is written until it gets merged. In that six months it has been independently reviewed by at least a dozen different people including peers, testers, integration and test engineers, quality control, and an engineering review board. It will have undergone hundreds of hours of dedicated testing on special built hardware and test suites which usually take longer to build and involve orders of magnitude more code and engineering effort than the functional code does.

In short, what commercial program, let alone an open source project, is going to dedicate 3/4 of all of their effort towards testing and review of the production code instead of writing said production code?

I’ve worked in the defense industry my entire career. I’ve seen what it takes to write stable code like this. It could never happen in an open source project.

Then there will never be a release of OH, ever.

In a commercial project you can enforce something like this. In an open source project like OH, we depend on volunteer efforts. Volunteers are going to only donate their time to work on what they want to work on. Consequently some known bugs never get fixed and some important features never get implemented because no one volunteers to fix them. If we say “no, you can’t work on that new binding or new feature until someone fixes Issue X” then we will shed volunteers like an oak tree in the autumn.

A successful open source project without commercial backing is as much about managing the community and keeping the volunteers motivated to continue to contribute as it is about the code. Consequently, there will never be a time where there are no known bugs. There will never be a time where the docs are complete, clear, and accurate. This is because developers will always be more passionate and motivated to work on something new than on fixing bugs in someone else’s (or sometimes even their own) code.

Even if someone is willing to work on it, sometimes the developers cannot reproduce the reported behavior. If they can’t reproduce it they can’t fix it.

And there are always tons of latent bugs lurking in the software that are not known. And this is why I say that all users of OH, or any software, open source or commercial, are all testers.

If you want stable and don’t want to be a “tester” you need to go out and purchase an industrial control software with a service contract. If you use a commercial or open source home automation software, you will never get to 99% up time. It’s simply never going to happen.
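
To put that figure in perspective, here is the arithmetic as a tiny Python sketch (the function name is made up). 99% up time still allows about 87.6 hours, or roughly 3.65 days, of down time per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours per year a system may be down at the given up time percentage."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100
```

Each extra "nine" cuts the allowance by a factor of ten: 99.9% leaves about 8.76 hours per year.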

The shit comes from java and linux…would be nice to have a real platform supported by real computer pros.

pros = paid for. What did you have in mind?

I’ve been a loyal OH user for many years. My skills are totally homegrown, IT is not my profession but is definitely my obsessive hobby. I have a core automation/alarm panel (Omnipro) that I try to use for any heavy lifting/must always work type stuff and then use OH to bridge enhanced functionality my board won’t support, like controlling my mix-mash of multimedia devices, z-wave (thank you @chris) and provide push notifications to our cell phones for various happenings around my house.

I will say that I have only had a handful of “melt-down” moments using snapshots and I just had one happen July 3rd when I updated to snapshot #1630. Suddenly, at 10 pm, during what I routinely do many times during the week, my OH system decided to crap the bad news out via the logs. Something wasn’t working and no real (obvious) clue as to what was now broke. I backed up my conf folder (only :frowning: ) and proceeded to install, reinstall, purge, revert to previous snapshots, all without success. This lasted until well after midnight and (after many bad OH dreams) I awoke to continue working to get my system up and running again.

Under normal circumstances, I know the routine and what to save, however, time pressures mounted because my wife and I were hosting the neighborhood 4th of July party with 50+ guests expected to show midday the next day. I wanted my remotely located HDMI switch, Onkyo receiver, and 6 zone audio amp to work with my Harmony remote and LG televisions as well as show off my fancy colored Milight LED’s (thank you @matt1) and my guests to be in awe of my fancy house.

I did eventually get up and running about 30 minutes before my guests arrived by reverting back to 2.5M1 but (because I didn’t do a proper back up) I had to rebuild all my channels and Paper configurations manually. The big loss for me, however, was I neglected to back up my Habpanel.config which, for some frustratingly unknown reason, resides in the bilges of the OH user settings and not the conf folder with everything else. I’m willing to do all the other set up for a fresh install but that one file…it’s always that one file…

I share this with all of you because I do think it would be helpful to have a snapshot status page on the forum for the developers to report (known) breaking changes and let those of us who don’t live in the IT world by trade have some obvious place to go look for issues before we make a plunge. I do scour Github daily to try to see what changed, but breaking changes are at best cryptic in a hidden Github issue or pull request and not easy to determine until actually making the plunge to upgrade.

I think it would also be helpful if someone could share a layman’s version of how to set up a Git repository and/or some other backup strategies that are tested and proven. Obviously, backing up via the OH CLI would have been the smart thing to do, but I’d be interested in figuring out a strategy for having a test and production environment set up and how others manage devices like z-wave sticks with the two environments.

End of the day, I am super appreciative of the OH developers and community as I am constantly learning (and plagiarizing) something off this forum (thank you @rlkoshak, et al). I will say, however, I think we are missing basic information for snapshots. This is the first place I go when I have problems like this and this time I appeared to only have two other users reporting an issue with 1630 but no real idea as to what changed in 1630 that I should have been cautious of before upgrading.