openHAB 5.0.1 locking up every month + my temporary remote OH4 solution

My remote cabin OH 5 server is down, and I had to spin up a backup OH4 server at home. Now I am taking notes on what happened and asking why OH5 locks up after about a month. So what is the solution?

  • Platform information:
    • Hardware: n150 / 8g / 240g SP ssd
    • OS: Peppermint 12, Debian 12 based
    • Java Runtime Environment: Probably 21.x? Installed via openHABian on August 27th, 2025
    • openHAB version: 5.0.1
  • Issue of the topic: The server runs fine for about 3 weeks to a month with a normal CPU load of about 0.8, then it suddenly bogs down with openHAB consuming loads of CPU (CPU load around 68). RAM is normal at about 40%, when I am able to SSH in at all under such a high load.
  • Please post configurations (if applicable):
    • Items - about 400; I have MQTT-bound items only, and a Mosquitto server
    • Sitemap - 1; it seems to work fine, no errors
    • Rules - about 50 classic Rules DSL rules; some use rrd4j to gather stats, some control devices and some ingest strings
    • Services - openHAB Cloud, JSONPath, rtl433
  • If logs were generated please post these here using code fences:
  • I may fetch logs when I get access to the machine. It is located in a remote cabin and it is not responding right now; it’s silent as the dead and we have some snow in the way.

OH5 install

So I installed openHAB 5.0.1 on August 27th, 2025; before that I had the same config running on OH 4.1 for at least a year. After migrating I fixed some errors and left it running until it started acting up after a month. I restarted it and it seemed fine, like new; after 3 weeks there was again no access, and the PC was hot to the touch.

I have seen OH 3 and OH4 crash on high RAM usage, but now OH5 seems to be CPU overload related, triggered by some type of internal error. I did look at the logs in autumn, but nothing obvious stood out, such as a rule causing it.

Anyway, I don’t know that 5.0 beast yet, but it seems to have some ghost inside; others have posted that it won’t terminate some scripts when they hang. I suppose it is missing some type of watchdog, as some say it won’t start after restarting in certain situations.

My OH4 Experience

I do like openHAB, and I am used to it running all the time, as I have 3 more systems running OH4.

The only reason I thought OH5 was fine to install back then is that I just let openHABian install what was recommended.

Now I am running a backup server at home on OH4 on a chintzy Zotac box, and at least I have VPN access to the Pi 4 MQTT server I dropped in before winter.

I did suspect rtl433 of causing the drama, but it isn’t, as it did not seem to consume CPU like openHAB did. Now I only miss the data from the wireless sensors that are offline because the remote server is down.

Temporary OH4 remote server

I see that a spare backup server is the only thing to depend on. I am also looking at keeping a separate MQTT and rtl433 + VPN server apart from OH from now on.

I really plan to just roll back to OH4 and sit on it because it’s a bit more reliable, until I can see OH5 running fine on a staging setup.

Verdict

OH5 is locking up. It sure can be related to my config, but who has time to tinker at a remote cabin just because an OH5 upgrade brings in a new squirrel that eats at my system whenever it decides to.

It needs some type of watchdog.

Cheers Matej

Well, you’re throwing quite a number of components and potential reasons for your issue into the mix.
Generally speaking, openHAB 5 is no less reliable and no more prone to lockups, memory leaks or similar errors than OH4.
Forget about the idea of downgrading, that won’t get you anywhere in the long run.
You’re just shooting in the dark then because you don’t know where the target is.

Getting to the bottom of it is never easy or fast. Most of the time, in the end it’s some specific binding or hardware that few other users use, so your 433MHz stuff sounds like a prime suspect, but to find out, you have to analyze issues systematically when they occur.
Load-caused lockups are almost always related to your system setup, e.g. whenever OH consumes more memory than the HW/OS can provide at that time. Debugging that is a story of its own, but not really OH related. If you lose access, that’s a strong sign of a system-related cause, not an OH-related one. Using a standard, reliability-proven setup like an openHABian RasPi is much more reliable than using some homebrew HW+OS+system config setup. Particularly so for remote installations.
Having a watchdog doesn’t hurt, but it’s not a solution, let alone THE solution. It’s a workaround at best;
unfortunately quite a few people use this or similar procedures rather than investigating and fixing the root cause.

Long story short, get to analyze your box when it is in the lockup state.
A very useful tool for that is documented here: Runtime Commands | openHAB. It often gives a hint as to which rule or binding to look into.

Hi Markus, thanks for the insight.

The tools are essential. I remember I used some, but it used to be quite hard to track things down.

I see the thread info has been expanded as per the docs; I am looking forward to it showing some hints.

Well, I am glad that OH4 is running fine with the same config. I admit that there are some rule errors here and there, due to devices publishing JSON messages that change or go offline. I don’t have all edge cases ironed out yet as the device count grew, but OH4 copes fine with JSON and item null errors.

OH4 has currently been running for months, so it’s like the OH4 engine is protected differently and errors don’t backfire into the system in my case. :sweat_smile:

Talking about edge cases, when all goes wrong: I know that OH4 is configured to restart itself when it runs out of memory, but I suspect OH5 isn’t, right?

So how do I set that up natively via Java options, or has it changed?

I would rather have both self-healing and fixing of the error cause; like every robust server restarts on error, OH should too.

I don’t see Debian 12 with openHABian on it as too much of a homebrew. It’s debatable, sure, but I have had trouble with the Pi 4 and USB storage, so I somehow outgrew the Pi. The Pi 5 is the first with a real SSD, but I don’t have one.

Anyhow, what would be the recommended way to go? As I say, OH4 is running just fine, and I just want the new setup, the more capable N150 mini PC, to be the upgrade in performance for future expansions.

For context, I use OH to monitor garden watering, well water, outdoor lighting, electric backup heating and power shedding via smart plugs, as we have only a 20 A supply. I also built a 12 V DC UPS that I monitor. Mainly I only have my own ESP8266 modules and Tasmota smart plugs.

And I like to do lots of data analysis, so the server is running hard there: real-time power analysis for turning plugs on and off, predicting future load and such, and it is all cooking in rules 😆, so the CPU isn’t idle, while rrd4j is light compared to the InfluxDB I run at home.

And possibly I have over 50 charts for those 30 devices 😅

Oh man, it would be great to just have it working. I am glad that the stupid things I did on OH4 rarely break it, as it’s not as self-explanatory to handle data in rules as it is on Arduino, for example, where everything is defined by default.

I do enjoy building and mining the data, but chasing ghosts not so much. Seeing inside what is silently breaking in OH after weeks is a bit much without serious tools and experience. While I have used OH for 6 years, it’s still a grey box under the lid; I grasp only some of it.

For reference, my openhab.log is silent for hours, and I disabled the event log to save on disk writes.

So what steps do you recommend, with minimal fuss, so that OH5 survives my torture?

Well, much of this, like when OH crashes or hangs and whether the system restarts, is not about OH itself but about your system setup, which is homegrown, so I can’t comment on that.

My recommendation is always to go with the mainstream standard: Raspi, openHABian.
Because then you will have the very same HW+OS as everyone else who uses it, and all of them can potentially help you with debugging. That’s the benefit of going mainstream.
Meanwhile as soon as you homebrew anything, only you know what you did so (essentially) only you can effectively debug it.

Certainly the way to go for a remote unit. You can even have that mirror SD cards for resilience.

On your local main system you can use a x86 Debian + openHABian system or attach some USB SSD to a RPi and move the DB there if you want to improve data mining speed.
But I also do a lot of that sort of computation on various RPi installations of mine. It’s just a question of your programming efficiency; one really barely needs anything bigger in terms of CPU and I/O power.
And openHABian takes care of optimizing the system upfront (DB and logging on ZRAM etc).

I might add two suspects

  • heavy logging: This can cause zram partitions to run out of space. So logging should never remain on debug or trace for a long time. But that has nothing to do with versions. So my first bet for your case is:
  • mqtt: postCommand: true. This can cause trouble creating loops even if it worked in v4.x environments. In general it should almost always be false

further info here:
https://community.openhab.org/t/openhab-5-1-release-discussion/167620/278

:backhand_index_pointing_up:

I’m willing to bet this is the root cause of the CPU usage.

As for the rest. :person_shrugging: Nothing mentioned actually changed in OH 5 from OH 4. OH 5 is configured to restart when it crashes. But that’s not OH, that’s systemd that does it. I bet OH 5 isn’t crashing though, it’s just not responding.

You turned off the one thing that would tell us anything useful here. Is OH processing lots of events and that’s what’s using the CPU, or is something else using the CPU? events.log would tell us that. If you had OH 5.1, events.log would even tell you the source of all those events.

Neither does OH 4. One could be implemented through systemd or other external services if you want. But OH has never had the ability to restart itself.
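
For anyone who wants such an external service anyway, here is a minimal sketch in Python. It is a workaround, not a fix for the root cause. Everything in it is an assumption for illustration, not something openHAB ships: the REST URL `http://localhost:8080/rest/`, the unit name `openhab`, and the helper names `rest_is_alive`, `restart_service` and `watchdog_step`.

```python
import http.client
import subprocess
import urllib.request
from typing import Callable

# Assumptions (not from this thread): openHAB answers on localhost:8080 and
# the systemd unit is called "openhab". Adjust both for your installation.
REST_URL = "http://localhost:8080/rest/"
SERVICE = "openhab"


def rest_is_alive(url: str = REST_URL, timeout: float = 10.0) -> bool:
    """Return True if the REST API answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


def restart_service(service: str = SERVICE) -> None:
    """Ask systemd to restart the unit (needs root or a matching sudo rule)."""
    subprocess.run(["systemctl", "restart", service], check=False)


def watchdog_step(failures: int, probe: Callable[[], bool],
                  restart: Callable[[], None], max_failures: int = 3) -> int:
    """Run one check and return the updated count of consecutive failures.

    The restart only fires after max_failures misses in a row, so a single
    slow response does not bounce the service.
    """
    if probe():
        return 0
    failures += 1
    if failures >= max_failures:
        restart()
        return 0
    return failures
```

Call `watchdog_step(count, rest_is_alive, restart_service)` once a minute from a loop or a timer and keep the returned count between calls. Capturing a thread dump before the restart would preserve the evidence of what was hanging.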

Can you point me to these posts?

But if it’s generating tons of events…

You don’t set that up via javaopts. It’s not a Java option. It’s the script that starts OH in the first place: openhab.service either somewhere in /etc/systemd or /var/lib/systemd. But this file didn’t change between OH 4 and OH 5.

If it’s your config it probably will never get fixed. If you don’t have time to help figure out if that’s the case or if there’s a problem, perhaps a FOSS system isn’t the way to go. You need to find a platform where you can pay someone to spend the time to figure it out.

Nothing changed between OH 4 and 5 in this regard. What probably has changed is you’ve encountered a new bug in OH 5 or something else changed around the same time as the upgrade. But there are no fewer protections in OH 5 than in OH 4. Neither really has any external protections.

Me too but OH has never claimed to offer that.

A quick log:set debug org.openhab and/or the aforementioned ttop console cmd should give you an idea of what it’s doing most of its time, if not hung.
If in MQTT, I bet Rich is right.

PS log:set default org.openhab to reset

Bridge mqtt:broker:MQTT_Bridge [ host="192.168.1.100", secure=false ]
{
    Thing mqtt:topic:Vodnjak "Vodnjak" {
    Channels:
        Type number : Razdalja_Do_Vode "Razdalja Do Vode" [ stateTopic="/Vodnjak/Razdalja" ]
        Type number : Volumen_Vode "Volumen Vode" [ stateTopic="/Vodnjak/Volumen" ]
        Type number : Globina_Vode "Globina Vode" [ stateTopic="/Vodnjak/Globina" ]
        Type number : Signal "Signal" [ stateTopic="/Vodnjak/Signal" ] 
        Type number : Stevec "Stevec" [ stateTopic="/Vodnjak/Stevec" ]
    }

    Thing mqtt:topic:Sonoff4Pro_Terasa "Sonoff Terasa" {
    Channels:
        Type switch : Rele_1 "Luc Terasa" [ stateTopic="/S1_Sonoff4_Dule/Status/K1", commandTopic="/S1_Sonoff4_Dule/vklop/K1", on="1", off="0"]
        Type switch : Rele_2 "Luc Delavnica" [ stateTopic="/S1_Sonoff4_Dule/Status/K2", commandTopic="/S1_Sonoff4_Dule/vklop/K2", on="1", off="0"]
        Type switch : Rele_3 "Luc Miza" [ stateTopic="/S1_Sonoff4_Dule/Status/K3", commandTopic="/S1_Sonoff4_Dule/vklop/K3", on="1", off="0"]
        Type switch : Rele_4 "Luc Dovoz" [ stateTopic="/S1_Sonoff4_Dule/Status/K4", commandTopic="/S1_Sonoff4_Dule/vklop/K4", on="1", off="0"] 

        Type string : Status "Status" [stateTopic="/S1_Sonoff4_Dule/Status"] 
    }

// more Things below

}

My bad for not including some config. I still run legacy Thing configuration files, so I don’t see where mqtt: postCommand: true would even be configured.

I can stage the OH5 system at home, load it with the same config and let it cook, then inspect where it starts smoking.

It will be a property on an MQTT Channel. It will be there somewhere among stateTopic and commandTopic.

If it’s not on any Channel then that’s not the problem and we don’t have any other information to go on. This is the only thread I’m aware of that is having CPU problems with OH 5 that isn’t caused by an infinite loop in MQTT caused by incorrect use of postCommand.
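
With file-based Things, checking every Channel by eye gets tedious, so here is a small Python sketch that searches .things files for an enabled postCommand. It assumes the setting is written as `postCommand=true` (optionally quoted) inside the Channel brackets; the function names are made up for illustration, and Things created in the UI are not covered because they are not stored in .things files.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple

# Matches postCommand=true or postCommand="true", with optional spaces.
POST_COMMAND = re.compile(r'postCommand\s*=\s*"?true"?', re.IGNORECASE)


def find_post_command(text: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each line enabling postCommand.

    Commented-out lines are reported too; check those by eye.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if POST_COMMAND.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_things_dir(directory: str) -> Dict[str, List[Tuple[int, str]]]:
    """Scan every *.things file below directory; keys are file paths."""
    results = {}
    for path in sorted(Path(directory).rglob("*.things")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = find_post_command(text)
        if hits:
            results[str(path)] = hits
    return results
```

On a package install the Things files usually live under /etc/openhab/things (an assumption, adjust as needed), so `scan_things_dir("/etc/openhab/things")` returning an empty dict would mean no file-based Channel enables postCommand.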

Hi

1. A reference for stability: OH4 runs fine now on a fresh install

I am happy to announce that my weekend cabin OH setup is running now.

Though I was in a frenzy to rebuild it and chose OH 4.3.x, the latest stable build, I can say that it has been running stable over this month.

The staging setup I now run at home on an N100 CPU is Debian 13, Java 21, openHAB 4.3.9, and Mosquitto.

I can confirm that I have no MQTT postCommand set up, as I only use MQTT .things files and they have no such setting inside.

A real homebrew, as it looks.

I did not install OH4 with openHABian, as the openHABian repo was down that day. I suppose that means it has no zram and additional stuff, as it runs on an SSD anyway. That may be a benefit, eliminating causes of crashes like zram overload and the unknown honeypot of errors of OH5.

2. Staging OH5, and errors

I am still waiting for the staged OH5 setup to crash, as it has been running for only 5 days. Meanwhile, I have the same setup as I did on 31.10.25, when I harvested a syslog for the OH5 crash.

oh_log_31_10_25.txt (814.1 KB)

This log may or may not show the root cause of why it crashed, but it sure shows that it struggled.

As Rich said, it seems OH5 may be in a zombie state and not restarting, as systemd doesn’t see it crash.

If I find an openhab.log that has some reference I will add it later.

I think the hint about heavy logging may hold some truth for OH5 and zram.

I suppose running out of zram could crash OH5, right? So maybe there was a log storm of errors from rules, like “cannot cast null” or similar, and it may have helped drag it down.

I am glad that OH4 runs fine on the backup server for the cabin that I run at home. That Zotac also ran for many months at the cabin without crashing, but it is slow, hence the upgrade to the N150.

I just hope that OH5 on the N150 CPU will run fine with Debian 13. It seems to have a kernel that finally supports the N150 completely without backports.

I may still observe OH5 to see if it still breaks, as I am not as familiar with it as with OH4 regarding its robust setup and error handling.

Anyway, I do assume that some rule may bring OH5 to its knees, as I still have many rules that parse strings.

That seems to be a honeypot for errors when strings from the web or non-JSON device services occasionally change. Anyway, I have seen OH2 and OH4 cope with any trash without crashing.

For now I just started a stress test on OH4: I intentionally removed the null filter and the incomplete message filter for items, and it seems not to choke at all even at about 20 errors per second, which seems good.

3. Sidenote on the OH4 home setup for comparison

On the home setup, the main OH4 server, I really have a wild west of rules, and the real error debt shows up as more RAM consumed, but with no OH crash or OS meltdown, and I finally recruited some regex to trash the old parsing rules.

The home server is running hard, like a NASA control system, with over 50 Things and 2000 Items being monitored mostly every second, because I have many real-time controls like power management and many greenhouse controllers for clients that are massive subsystems by themselves. And I have about 30 GB of InfluxDB data covering 5 years.

I suppose that some OH users also have some extraordinary setups, based on what I have read.

Thanks to everyone participating.

It really helps me to understand the situation with OH5.

Cheers, Matej

Nah. Sure, OH will crash when it cannot write its logs to disk, but if properly set up, the maximum log size is limited to less than your available disk space (no matter if that’s on zram or SD or SSD or wherever). And that has nothing to do with OH v4<->v5.

Oh, well. It doesn’t make sense to gather that much sensor data that often and retain all of it in full bloom rather than to process and compress it and just keep the essentials.
I hope you don’t also persist data on every item value change?

As I already said, please stop your speculative shooting in the dark and start analysing.

Analyse first if it is CPU or I/O that is your bottleneck. But check real CPU usage, not the “load”, which is misleading. If it is CPU, you really need to look at your data processing code, but I bet it’s I/O.
Check I/O with Linux tools like iostat. Check your persistence config to see if you record every change, and if so change that to use e.g. the everyMinute strategy.
Just for a test, move from Influx to RRD4J, which has auto-compression and much less CPU + I/O overhead.
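
To put a rough number on the difference between persisting every change and persisting once a minute, here is a back-of-the-envelope calculation in Python. The inputs are illustrative assumptions only (2000 items that each change once per second), not measurements from this system.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def writes_per_day_every_change(items: int, changes_per_second: float) -> int:
    """Writes per day when every state change of every item is persisted."""
    return round(items * changes_per_second * SECONDS_PER_DAY)


def writes_per_day_every_minute(items: int) -> int:
    """Writes per day when each item is persisted once per minute."""
    return items * MINUTES_PER_DAY


# Illustrative assumption: 2000 items, each changing once per second.
every_change = writes_per_day_every_change(2000, 1.0)
every_minute = writes_per_day_every_minute(2000)
print(every_change)                  # 172800000
print(every_minute)                  # 2880000
print(every_change // every_minute)  # 60
```

Under these assumptions the once-a-minute strategy writes 60 times less often. The real ratio depends on how often each item actually changes.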

Again: Nothing substantial has changed in OH5 vs OH4. That’s not the reason.
Your issue is much more likely that you have set up your system to run OH with bad (inadequate) system parameters (like a heap size limit) or config that affects I/O (like persistence frequency does).

A potential infinite loop related to persistence (Persistence extensions: fix potential infinite loop by mherwege · Pull Request #5018 · openhab/openhab-core · GitHub) is fixed in 4.3.8 and 5.0.2. In version 5.0.1, the one mentioned here, this potential infinite loop can cause high CPU load.

Since he lists that he has 8GB RAM I guess you can say it’s true here, but one should not forget that the 32-bit to 64-bit transition almost halves your effective memory (because most things you put in memory take almost twice the space on 64-bit). So, setups that could work on 32-bit might completely choke on 64-bit. 32-bit can’t address/use more than 4GB of RAM though, so in this case 64-bit will allow utilizing the full 8GB, compensating for the less effective RAM use.

With all that persistence data, this sounds like a prime suspect to me.

There may be some can of worms in OH 5.0.1. As the code is live and evolving, it really isn’t true to say it’s the same as OH4 per se.

I would appreciate being able to measure the cause, but pushing the system into that state is like catching a mouse by hand.

There is not really anything spectacular to measure regarding CPU and I/O when all runs fine: the load average is 0.5 over 5 minutes. And when it actually stopped 2 weeks ago, the server was often inaccessible.

That said, regarding the persistence race condition, I can actually force a bit of load on persistence and see if it has issues, just to see if 5.0.1 does choke.

I may then actually upgrade to 5.1.1 to see if that fixes the situation.

I would like to add a polite note on the situation. As you all see, I am not complaining that OH broke and I have no solution; that would be the typical annoying talk of someone who didn’t have a backup. I am exploring the limitations of a silent error state where I can’t sniff out the cause, because the rig is silent both when it works and when it does not.

And I appreciate everyone who looks at it from the perspective of their experience with the specific version.

As I mentioned before, my home server with 2000 items was for comparison. I am now fixing the cabin setup, which does not have that crazy persistence, but it still has about 100 items that are stored every minute, and that may cause bursts that hang it.

Again, that version-specific error was not on my radar yet, and it is eye-opening.

Thanks, it looks promising now. I want to make 5.0.1 choke just to see how it behaves under that persistence pressure and whether the symptoms are the same as when it locked up.

And I am looking forward to 5.1.1; it may actually run smoothly, based on what I know so far.

The data IO sidenote:

Regarding the home setup, haha, it’s the wild west with 2000 items, and I agree it needs a strategy. I have 3 types of data: digital on/off states (1/0), digital ordered states like counters, and analog sensor data; only the latter can be aggregated.

The sampling rate of many items like counters can’t be low, but I will group the analog data and set it to every minute to see how it improves.

The main improvement I have seen is migrating to on-demand data analysis instead of continuous. So I have switches on the dashboard to generate reports.

The CPU load from rules on the home server is noticeable in bursts; normally the load average is about 0.5, but it spikes to 3 every hour.

Now IO isn’t too bad, as Influx just writes and never aggregates or rewrites in my setup. It uses less CPU and disk than RRD4J did; that ate my drive alive, and the IO was crazy then.

The only thing is that I don’t know how to measure which rule is loading the system and by how much, as ttop in the CLI console hides rules under handler IDs.

Thanks

Generally, when OH is locking up or being very slow/not responding properly, one of the best tools to figure out what is going wrong is making a thread dump (and post it somewhere that somebody who knows how to read it can look at it). It will reveal if there are deadlocked threads, long thread pool queues, etc.
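
As a first look before posting a dump, a short Python sketch like the one below can count thread states and flag a reported deadlock. It assumes a HotSpot-style text dump (for example from `jstack <pid>` or `jcmd <pid> Thread.print`) saved to a file; the function name is made up for illustration and this is no replacement for someone actually reading the dump.

```python
import re
from collections import Counter
from typing import Dict

# HotSpot thread dumps carry one such line per Java thread, e.g.
#    java.lang.Thread.State: BLOCKED (on object monitor)
STATE_LINE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def summarize_thread_dump(dump: str) -> Dict[str, object]:
    """Count threads per state and report whether the JVM flagged a deadlock."""
    counts = Counter()
    for line in dump.splitlines():
        match = STATE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return {
        "states": dict(counts),
        # The JVM appends this marker when it detects a deadlock itself.
        "deadlock": "Java-level deadlock" in dump,
    }
```

Many BLOCKED threads, or a large pile of WAITING threads in one pool, is a hint about where to look; taking two or three dumps a few seconds apart shows whether the same threads stay stuck.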

If you want to capture memory use over time and even more live data of what’s going on, you can create a JFR recording:

While I can’t guarantee that absolutely nothing “of interest to somebody” can be revealed in such files, both thread dumps and JFRs typically contain much less “user data” than a log file. They capture information about the JVM itself, not so much the application running in it.

You have received a polite hint to upgrade to at least version 5.0.2.

If you read the description of the potential infinite loop, you will see that whether it happens does not depend on “persistence pressure”. You can dig into those infinite loop posts, learn how to trigger the loop, and then describe the effect of the loop on your system.

And a polite hint to exchange your persistence setup for an A-B test.

The most valuable advice contained therein, however, is to tackle the situation systematically rather than to continue shooting in the dark (i.e. to wildly guess rather than to analyse).

It’s really not different from 4.3 in that regard. And that’s the problem with your conclusions. There were no changes in the code to the rules engine between 4.3 and 5.1 that could cause this, at least not directly.

Maybe I missed something but I looked through all the merged PRs in OH core between those two versions and saw nothing relevant to the rules engine and this behavior.

Persistence or some other part of OH could be an issue but I don’t think the rules engine would be the culprit.

5.0.1 is released and not even the latest release. It isn’t evolving at all. At this point only rarely will it receive point updates. It is done, stable, and not evolving. It is as static and unchanging as 4.3. Though I would move to at least the latest patch release of 5.0 if not to 5.1.

It’s important to understand what “load” means on a Linux machine. It is not a measure of the CPU. It is the average number of processes that are either running or stuck waiting for a resource over that time period. That resource could be CPU but more often it’s swap or some other I/O. A load of 3 means you have on average three processes that are running or want to run but have to wait for some resource.

It’s pretty common to see a system with a high load but almost no CPU utilization. Everything is stuck waiting for something and the CPU has nothing to do.
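
To see that difference on a running box, here is a minimal Python sketch that separates the load average from actual CPU busy time and I/O wait using the Linux files /proc/loadavg and /proc/stat. The helper names are made up for illustration; iostat or top report the same numbers with more detail.

```python
from typing import Tuple


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """Parse the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_cpu_times(stat_text: str) -> Tuple[int, int, int]:
    """Return (busy, idle, iowait) ticks from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            idle, iowait = values[3], values[4]
            # Guest time is already included in user/nice, so stop at steal.
            busy = sum(values[:8]) - idle - iowait
            return busy, idle, iowait
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before: Tuple[int, int, int],
               after: Tuple[int, int, int]) -> Tuple[float, float]:
    """Percent of time spent busy and in iowait between two samples."""
    busy = after[0] - before[0]
    idle = after[1] - before[1]
    iowait = after[2] - before[2]
    total = busy + idle + iowait
    if total <= 0:
        return 0.0, 0.0
    return 100.0 * busy / total, 100.0 * iowait / total
```

Read /proc/stat twice a few seconds apart, pass both results to `cpu_shares`, and compare with `parse_loadavg`. A high load with a low busy share and a high iowait share points at disk or swap rather than at rules burning CPU.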

Information like what you get from iostat, thread dumps, etc., which have been brought up above, will help identify what resource is becoming the bottleneck.

Hi Rich, you put a warm smile on my face no matter what the topic is. I value how you teach the process of approaching a solution.

That iostat is great. I have also used iotop and iftop before, but usually I see the broad picture, not specifically the component that is causing issues in OH. I can revisit that for sure.

The reason 5.0.1 is in the game is that it dates from September 2025, when the cabin server was installed. It started crashing in September and October, and now it is still stuck in the remote cabin, snowed in, haha.

I staged 4.3.9 at home and it is still running fine, and I will keep it on the backup machine.

I also staged 5.0.1 and it locked up again silently, with nothing to track upon failure.

That said, it looks like the stable release was pushed a bit early if it had such silent bugs. I leave my releases of greenhouse automation in testing for 6 months for that exact reason.

As a mechatronics guy, my view of OH is that one should not need to debug the core of OH.

Nobody take it personally, but for sure 4.x is different in the core from 5.0.1, otherwise there would be no error with the same config. But I digress.

The new setup:

So I will do a fresh install of 5.1 and call it the new staging unit, hook up monitoring, and take a look at it after a month, hopefully with it still running.

I suppose 5.1 is fixed; it had better be, as I am not in the mood to wait around for 5.0.1 to crash just to see the reason, since as @dpa said it is prone to ghost behaviour.

I have a weird question: when will openHAB get a proper Task Manager where we can see real-time CPU/RAM usage per rule or script by their UI names?

I just enjoy continuing to build my automated greenhouse logic on ESP8266 more than digging into all the logs of OH5.

I am glad that I even get to learn how to diagnose such stuff; I sure learned new things from this talk.

I will report back when I get some data to look at; hopefully the new server will be up all the time.

Cheers Matej

For the most part OH is under continuous testing. And as far as I know you are the only one reporting this problem. So unless you had tested this before the release how would we know there was a problem?

This is an unrealistic expectation of any software, but especially of an open source volunteer-driven project. There are simply not enough total volunteered hours to even approach reliability like that.

Absolutely there are changes in core. You can see each and every change made to core by looking at the release notes. Every merged change is listed. For OH 4.3 to OH 5.0.3 (the latest point release) you can find that list:

But your problem may not even be in core. There may be a bug in some other part (e.g. Persistence). We know there was a bug that could have caused this behavior in 5.0.1 which was fixed in 5.0.2.

5.1 is released. But just as the 5.0 release had 5.0.0, 5.0.1, 5.0.2, and 5.0.3 patch releases, 5.1 is receiving and will likely continue to receive patch releases as rarely encountered bugs and edge cases are discovered and fixed. So far there has been only one patch release, 5.1.1, and I believe there is a 5.1.2 coming soon.

But these patch releases only include bug fixes to major problems that were discovered after the initial release. There are no new features or major changes to architecture, just bug fixes.

As soon as someone volunteers to implement something like that.