Openhab 2.5 is blocking after one week of runtime

Since the update to 2.5 my Openhabian (on a Raspberry Pi 4) stops handling item commands after some time. (mostly after about one week of runtime) The UI’s continue to run, but I cannot send commands to items anymore and this also affects every binding. Also there are no clues in the log of what could cause this problem.

I searched in the forum and I guess something could be blocking the threads in the background. I started to disable some bindings but I did not find the one that is blocking the threads. I also executed “threads --monitors --locks” in the console, but I’m not sure what those locks mean, so I’ve uploaded the output to pastebin: https://pastebin.com/Y33qVmVD

Can anyone please take a look at this and give me a hint what could cause this problem in my system?

You should check your rules first. Most likely cause is some rule(s) that have a sleep, while-loop and/or lock-wait in it. Bindings are a much less likelier cause…

Thanks for the hint. I only have a few rules with a sleep or while loop. Is there a way to restart the rule system in the karaf console so I could test this the next time the problem occurs to be sure that the rule system is the cause of the problem?

When the problem occured I also tried to send a command to an item using the rest interface, but nothing happened. The item state did not change at all. Could this also be caused from the rule engine?

As rules can run in parallel, even ‘a few’ can cause trouble. You should get rid of loops and sleep in rules - use timers or expire items. :grinning: Depending on timing, use of Group members, they can quickly consume all threads and cause long queues. And the behaviour of OH has changed a bit related to threads and also to Group membership triggers and handling for instance. They are ‘subtle’ but can have side effects not seen before.

Is there anything in the Linux log files around the time the issue starts to gives a clue? And just to confirm: no entries in events.log or openhab.log at all around the time it stalls? You said ‘no clues’ but also no eventbus activity at all?

Just a wild guess: did you at update, or with the old version, change any Java memory setting?

I just found something. The syslog logfile has a size of 3.5 gb just after the problem occured. :dizzy_face:

I did find a text editor who can open such large files. There are always the same lines and they repeated until I restarted openhab. Those are the lines that repeat all the time:

Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #12" java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #13" Exception in thread "EventAdminAsyncThread #11" #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:216)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #14" java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:213)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #15" java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #16" #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:213)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)
Feb  5 00:05:32 openhab karaf[632]: java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:213)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #17" java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:213)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)
Feb  5 00:05:32 openhab karaf[632]: java.lang.NullPointerException
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.dequeue(LinkedBlockingQueue.java:213)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:444)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
Feb  5 00:05:32 openhab karaf[632]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
Feb  5 00:05:32 openhab karaf[632]: Exception in thread "EventAdminAsyncThread #18" #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Feb  5 00:05:32 openhab karaf[632]: #011at java.lang.Thread.run(Thread.java:748)

Any idea what this means?

Place this line org.eclipse.smarthome.threadpool:ruleEngine=20 in etc/openhab2/services/runtime.cfg and restart OH and see if that helps.

Ok thanks, I’ve added the line and restarted Openhab. I guess now it’s time to wait for another week to see if the problem is solved.

You may want to post any of your rules that use thread sleep, loops, and locks for review. There may be a more proficient way to write the code for OH.

I just searched through my rules. I don’t use any locks, one rule with a while loop and a sleep command and then also 4 small rules with a sleep command. Here are the rules:

This is a rule that I use to capture images from my Doorbird camera if the motion sensor was triggered. I use the while loop and sleep command to capture 5 images each one second apart:

rule "Doorbird ZWave motion sensor"
	when
	    Item SW_ZWAVE_Entrance_Motion changed to ON
	then	
		if (NUM_Entrance_ZWave_MotionEventCounterDaily.state == NULL) {
			NUM_Entrance_ZWave_MotionEventCounterDaily.postUpdate(0)
		}
		
		var int counter = (NUM_Entrance_ZWave_MotionEventCounterDaily.state as DecimalType).intValue
		counter = counter + 1
		NUM_Entrance_ZWave_MotionEventCounterDaily.postUpdate(counter)
		
		if (NUM_Entrance_ZWave_MotionEventCounterHourly.state == NULL) {
			NUM_Entrance_ZWave_MotionEventCounterHourly.postUpdate(0)
		}
		
		counter = (NUM_Entrance_ZWave_MotionEventCounterHourly.state as DecimalType).intValue
		counter = counter + 1
		NUM_Entrance_ZWave_MotionEventCounterHourly.postUpdate(counter)
		
		if (!zwaveMotionCapturing && SW_Entrance_ZWave_MotionCapturingEnabled.state == ON)
		{
			zwaveMotionCapturing = true

			//Licht einschalten wenn es zu dunkel ist.
			if (NUM_ZWAVE_Entrance_Luminance.state != NULL && NUM_ZWAVE_Entrance_Luminance.state <= 5) {
				SW_Entrance_DoorBird_Light.sendCommand(ON)
			}
			
			createTimer(now.plusSeconds(1),
			[|
				try {
					var i = 0
					var countPhotos = 5
					while((i=i+1) <= countPhotos) {
						var URL url = new URL(doorbirdImageUrl)
						var URLConnection uc = url.openConnection();
						uc.setRequestProperty("Authorization", doorbirdAuth);
						
						var InputStream in = uc.getInputStream()
						var timestamp = new Date()
						var SimpleDateFormat dfFolder = new SimpleDateFormat("YYYY_MM_dd")
						var SimpleDateFormat dfFile = new SimpleDateFormat("YY-MM-dd_HH-mm-ss-SSS")
						var String filePath = String::format("/mnt/Public/Capture/%s/motion/%s_motion.jpeg", dfFolder.format(timestamp), dfFile.format(timestamp))
				
						var File targetFile = new File(filePath)
						targetFile.mkdirs()
						
						var Path targetPath = targetFile.toPath()
						Files::copy(in, targetPath, StandardCopyOption::REPLACE_EXISTING)
						try { Thread::sleep(1000) } catch(Exception e) {}
					}
		        } catch (Exception e) {
		           logError("Doorbird ZWave", "Error in 'ZWave motion sensor' {}", e)
		        }
				
				zwaveMotionCapturing = false
			])
		}
	end

And then I have 4 rules to increase and decrease the brightness of my lamps using a ZWave button. I use a timer and the sleep command so that the INCREASE command is only sent once per second. They all look like this:

var Timer livingColorDimmerTimer = null

rule "Szene Wohnzimmer"
	when
	    Item NUM_ZWAVE_Living_Scene received update
	then
		var Number command = NUM_ZWAVE_Living_Scene.state as Number
				
		if (command == 1.2) {
			if (livingColorDimmerTimer === null) {
				livingColorDimmerTimer = createTimer(now,
					[|
						COLOR_HUE_Group_Living_All.sendCommand(INCREASE)
						Thread::sleep(1000)
						livingColorDimmerTimer = null
					])
			}
		}
	end

If you did not know, the IpCamera binding can work with Doorbird cameras and it can also create the snapshots for you including a animated GIF without using the sleep and it would make the rule needed only be around 5% of the size. Example rule is found in the readme for the binding.

If you wanted the snapshots to be named in a particular format, that needs the rule to do that but the binding tells you when the files are created and ready to use preventing the need for any sleeps. It can even allow you to create snapshots from before the motion event even occurred that triggered the conversion using the preroll config.

I’d be a bit suspicious about what happens if you get a second motion trigger while a first is part way through that process. A fight over image file access could be fun.

An alternative strategy is initiating a cascading timer (and not creating another while one is already in progress)

@rossko57 To prevent this I’ve implemented the “zwaveMotionCapturing” boolean. When the capturing starts this is set to true. When the capturing ends it is set back to false. While the variable is true the timer is not started to prevent multiple image capture timers.

By the way, my Openhab now runs over a week without any problems. I guess the problem could be solved.