Help in identifing a thread id

  • Platform information: openhabian or raspberry pi 3B

I a beginner in openhab and I notice that my rules execution starts to stop working in a period from one day to a week.

I identified the problem to some blocked threads

openhab> shell:threads --list |grep "BLOCKED"
21452 ? Thread-12472                                                                                  ? BLOCKED       ? 1779     ? 760
21455 ? WATCHDOG                                                                                      ? BLOCKED       ? 1760     ? 520
40361 ? pipe-grep "BLOCKED"                                                                           ? RUNNABLE      ? 679      ? 660
openhab>   

these blocked threads increase over time

Here is an extract of my shell:threads

"Thread-12472" Id=21452 in BLOCKED on lock=org.apache.commons.exec.Watchdog@1b193a1
     owned by WATCHDOG Id=21455
    at org.apache.commons.exec.Watchdog.stop(Watchdog.java:67)
    at org.apache.commons.exec.ExecuteWatchdog.stop(ExecuteWatchdog.java:122)
      - locked org.apache.commons.exec.ExecuteWatchdog@1780e91
    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:360)
    at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:46)
    at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:188)

"WATCHDOG" Id=21455 in BLOCKED on lock=org.apache.commons.exec.ExecuteWatchdog@1780e91
     owned by Thread-12472 Id=21452
    at org.apache.commons.exec.ExecuteWatchdog.timeoutOccured(ExecuteWatchdog.java:144)
    at org.apache.commons.exec.Watchdog.fireTimeoutOccured(Watchdog.java:55)
    at org.apache.commons.exec.Watchdog.run(Watchdog.java:81)
      - locked org.apache.commons.exec.Watchdog@1b193a1
    at java.lang.Thread.run(Thread.java:748)

my question is how can identify what is causing this problem is there a way to know for example where the Thread-12472" Id=21452 is initiated from

Thanks

Somewhere to start:

Thanks For your reply

I m very new to openhab Can you explain a bit more? I have openHAB 2.4.0-1 (Release Build) I do not have 2.5

I have notice that when these blocked threads become 4 then the rule execution slows down (maybe due to the thread limit)

here is the resent output of

openhab> shell:threads --list |grep "BLOCKED"
21452 ? Thread-12472                                                                                  ? BLOCKED       ? 1958     ? 810
21455 ? WATCHDOG                                                                                      ? BLOCKED       ? 1934     ? 590
43587 ? Thread-25116                                                                                  ? BLOCKED       ? 23       ? 20
43590 ? WATCHDOG                                                                                      ? BLOCKED       ? 23       ? 0




and these are the running rules 

openhab> shell:threads --list |grep "ruleEngine"                                                                                                                       
21423 ? ESH-ruleEngine-500                                                                            ? TIMED_WAITING ? 35414    ? 16760
43577 ? ESH-ruleEngine-1047                                                                           ? TIMED_WAITING ? 1143     ? 540
44157 ? pipe-grep "ruleEngine"  

I do not have many rules running but when the Bloked threads will become 4 then the rule execution becomes slow

Yes because you only have one left to run the whole show
You need to find out why the other threads become blocked

This can give you an idea as to where to start looking:

I have read the above post before writing here

I do not know if the BLOCKED threads 12472 and 43587 are indeed caused by a rule. I may not have any rules running but I have BLOCKED threads that remain blocked and they increase day by day . I deleted almost all of my rules and for the remaining rules I have to press a button to start them

my question is more simple. Is there a way in openhab to identify where the thread 12472 and 43587 started from ?

Read down on the thread that Vincent posted. there are various things you can do to figure out the state of the thread pools.

But based on your description you have:

  • long running Rules (calls to Thread::sleep, executeCommandLine, sendHttp*Request, etc.)
  • locked Rules (ReentrantLocks)

But ultimately, unless you know programming and can use a Java profiler, there really isn’t much you can do to identify the what started a thread based on the thread ID, or if there was it wouldn’t be all that meaningful to you anyway. It’s not going to tell you what Rule is stuck which is really what you want to know.

  • other errors preventing a Rule from returning.

I would like to thanks the community for all the support. What I did was to rewrite and streamline all of my rules. I discovered that I had a rule for a movement sensor that was starting a timer and that timer was executing a file with a return but the brief moment when the movement was retriggered that timer was canceled, therefore I had a hanging thread

Even after that, the problem remained. After making a clean install on another raspberry pi and movng all of the rules there, then the problem stopped.

I am happy that the problem is solved. I still do not know what was causing the problem. I believe that it was a combination of a bad rule with something else

@rlkoshak Thanks your posts were very helpful in learning openhab

Just to be clear, you had a call to executeCommandLine in a Timer and the Timer was cancelled while the executeCommandLine was already running? And therefore the executeCommandLine stuck around consuming a thread?

I’m just trying to understand the exact behavior you are describing. If you can, posting the Rule would be useful.

Sure I keep a history off all the changes I make


rule "Kitchen Sensor"
when
    Item Kitchen_Sensor_Occ changed     
then

	
	if (Settings_Kitchen_Sensor_Disable.state != ON && Settings_Kitchen_Sensor_Time.state !== null && Settings_Kitchen_Sensor_Lux.state !== null)  {

		if (Kitchen_Sensor_Time !== null){
				Kitchen_Sensor_Time.cancel
				Kitchen_Sensor_Time = null 
		}


		if (Kitchen_Sensor_Occ.state == ON)  {
			if (Kitchen_Sensor_Il.state < Settings_Kitchen_Sensor_Lux.state || Kitchen_Sensor_Il.state === null) {
				if (Settings_Kitchen_Sensor_Switch_On_Light1.state == ON){
					Kitchen_Light.sendCommand(ON)
					}
				if (Settings_Kitchen_Sensor_Switch_On_Light2.state == ON){
					Corridor_Spots.sendCommand(ON)
					}
			}
		}


	    	if (Kitchen_Sensor_Occ.state == OFF){
				Kitchen_Sensor_Time = createTimer(now.plusSeconds((Settings_Kitchen_Sensor_Time.state as DecimalType).intValue)) [|
				Kitchen_Light.sendCommand(OFF)
				Corridor_Spots.sendCommand(OFF)
	       	 		Kitchen_Sensor_Time = null  	
	   			 ]

		}
	}

end

and this is the Corridor_Spots.sendCommand(ON)

rule "Corridor_Spots"
when
    Item Corridor_Spots received command
then
        val result15=executeCommandLine("bash /etc/openhab2/cmd/Corridor_Spots.sh " + (receivedCommand), 2000)
	Corridor_Spots.postUpdate(result15)
end

Now changed to

rule "Kitchen Sensor"
when
    Item Kitchen_Sensor_Occ changed     
then

	
	if (Settings_Kitchen_Sensor_Disable.state != ON && Settings_Kitchen_Sensor_Time.state !== null && Settings_Kitchen_Sensor_Lux.state !== null)  {

		if (Kitchen_Sensor_Time !== null){
				Kitchen_Sensor_Time.cancel
				Kitchen_Sensor_Time = null 
		}


		if (Kitchen_Sensor_Occ.state == ON)  {
			if (Kitchen_Sensor_Il.state < Settings_Kitchen_Sensor_Lux.state || Kitchen_Sensor_Il.state === null) {
				if (Settings_Kitchen_Sensor_Switch_On_Light1.state == ON){
					executeCommandLine("bash /etc/openhab2/cmd/Kitchen_Light.sh ON")
					Kitchen_Light.postUpdate(ON)
					}
				if (Settings_Kitchen_Sensor_Switch_On_Light2.state == ON){
					executeCommandLine("bash /etc/openhab2/cmd/Corridor_Spots.sh ON")
					Corridor_Spots.postUpdate(ON)
					}
			}
		}


	    	if (Kitchen_Sensor_Occ.state == OFF){
				Kitchen_Sensor_Time = createTimer(now.plusSeconds((Settings_Kitchen_Sensor_Time.state as DecimalType).intValue)) [|
				executeCommandLine("bash /etc/openhab2/cmd/Kitchen_Light.sh OFF")
				Kitchen_Light.postUpdate(OFF)
				
				executeCommandLine("bash /etc/openhab2/cmd/Corridor_Spots.sh OFF")
				Corridor_Spots.postUpdate(OFF)

	       	 		Kitchen_Sensor_Time = null  	
	   			 ]

		}
	}

end

AND TO

rule "Corridor_Spots"
when
    Item Corridor_Spots received command
then
   createTimer(now, [ |
        val result15=executeCommandLine("bash /etc/openhab2/cmd/Corridor_Spots.sh " + (receivedCommand), 2000)
	Corridor_Spots.postUpdate(result15)
      ])
end

Also I removed all the loops I had in my rules and replace them with timers

I think the Timer is a red herring. But what you did do is you move the calls to executeCommandLine from the Rules thread pool to the Timer thread pool. I’d add some logging and perhaps some alerting so you can determine whether executeCommandLine is returning or not. There might be a bug here where the timeout is being ignored and not killing the call if the script takes too long to execute. If that is the case, you will run out of Timer threads eventually too.

HOW DO I CHECK how many timed threads are running ? or if they are failing ??

When I created the scripts I tested them extensively

The script will take commands ON,OFF, 0-100, READ (read returns the real state of the fixture).

All the other commands change the state of the fixture and they give back the real state

the script will run for exactly 1.5 seconds if withing 1.5 seconds there is no reply from the equipment the script will return the last known state of the fixture and exit.

I already said, the easiest will be to add some logging and perhaps some alerting to verify that the call to executeCommandLine actually returned.

I already did that there is always a return command. I created a rule that was calling the script every 5 minutes for 3 days

Since I do not know that the problem was caused by a bad installation or something else is there any other way to check

to find the blocked commands I was running
shell:threads --list |grep “BLOCKED”
and
shell:threads --list |grep “ruleEngine”

I imagine that this
shell:threads --list |grep “TIMED”
Will give me all the timmed events

if there is a timed event that is hung is there a way to identify it ?

Not unless you know how to use and interpret a Java Profiler. The Java Console can be used as well. There is lots of good advice and things to try in the replies to Why have my Rules stopped running? Why Thread::sleep is a bad idea.

What might causing this behavior

  1. A script that does not return a command back but exits OR
  2. A script that does not exit

if its 2) what if on linux we execute the command like this

        val result15=executeCommandLine("timeout 2s bash /etc/openhab2/cmd/Corridor_Spots.sh " + (receivedCommand), 3000)
	Corridor_Spots.postUpdate(result15)

The results of the command (what I assume you mean by “command back”) is just what the script happens to print to stdout. If the script doesn’t print anything, result15 will be the empty String.

There is a whole lot of stuff between your call to executeCommandLine and the script actually being executed. Something could go wrong at any place in that stuff that would cause the call to never return. I couldn’t narrow it down without some sort of stack trace. But honestly, I personally don’t trust executeCommandLine. I’ve seen lots of unexpected behaviors from it on this forum so it ranks pretty high on my list of suspects.

It’s worth a try for sure. We have no idea what is causing the problem so it might work. It also might not. At this point, we don’t really even know if the call to executeCommandLine is in fact a problem. Or if it is a problem, that the root cause of the problem was really something to do with your old setup which has been “fixed” when you moved to the new SD card.

Thanks for everything.

I will give it a try and post the results here. Since I do not know jprofiller, shall i revert to the old setup where I can monitor the process with

shell:threads --list |grep “BLOCKED”

or what if i use shell:threads --list |grep “TIMED” |wc -l

is the above command going to give me all the timed events. If yes then if there is a problem I will be seeing all the timed events increase day by day

I don’t know the name of the timer thread pool. It’s handled by the Quartz library I think but that’s as much as I know.

There is only 10 threads on the timer thread pool (assuming a recent OH 2.5). If you see it grow bigger than that, you are not looking at the right pool. You should have a general feel for how many Threads should be executing at the same time based on your Rules (Astro, Timers, and Time triggered Rules all run out of this pool). If you see more than that you know either what you understand about how your Rules execute is wrong or you have something getting stuck.