[SOLVED] executeCommandLine sometimes not returning and blocking rule thread

pacive · January 9, 2019, 5:57pm

Platform information:
- Hardware: Raspberry Pi 3B
- OS: Openhabian
- Java Runtime Environment: Oracle build 1.8.0_191-b12
- openHAB version: 2.4
Issue of the topic: please be detailed explaining your issue

Recently I have had problems with cron triggered rules that stop working after a while (a few days after restart). When this last happened I opened and ran the threads command, and noticed this:

"openHAB-job-scheduler_Worker-1" Id=87 in TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at org.apache.commons.exec.DefaultExecuteResultHandler.waitFor(DefaultExecuteResultHandler.java:119)
    at org.eclipse.smarthome.io.net.exec.ExecUtil.executeCommandLineAndWaitResponse(ExecUtil.java:125)
    at org.eclipse.smarthome.model.script.actions.Exec.executeCommandLine(Exec.java:63)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1086)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeOperation(XbaseInterpreter.java:1061)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._invokeFeature(XbaseInterpreter.java:1047)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.invokeFeature(XbaseInterpreter.java:992)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.invokeFeature(ScriptInterpreter.java:151)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:902)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:865)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:224)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:226)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:204)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:827)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:264)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:226)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:204)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter._doEvaluate(XbaseInterpreter.java:447)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.doEvaluate(XbaseInterpreter.java:228)
    at org.eclipse.smarthome.model.script.interpreter.ScriptInterpreter.doEvaluate(ScriptInterpreter.java:226)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.internalEvaluate(XbaseInterpreter.java:204)
    at org.eclipse.xtext.xbase.interpreter.impl.XbaseInterpreter.evaluate(XbaseInterpreter.java:190)
    at org.eclipse.smarthome.model.script.runtime.internal.engine.ScriptImpl.execute(ScriptImpl.java:82)
    at org.eclipse.smarthome.model.rule.runtime.internal.engine.ExecuteRuleJob.execute(ExecuteRuleJob.java:67)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
      - locked java.lang.Object@173adaa

"openHAB-job-scheduler_Worker-2" Id=88 in TIMED_WAITING on lock=java.lang.Object@1b7ad49
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

One of the execution threads is blocked, and when the other one becomes as well, cron rules stop running.

Since this points to the executeCommandLine being the culprit I enabled debug logging for org.eclipse.smarthome.io.net.exec And added logging statements to my rules that use it.

Today I noticed that one thread was locked again and checked the logs. This is what it normally looks like:

2019-01-09 16:20:38.031 [INFO ] [marthome.model.script.security.rules] - Checking locks
2019-01-09 16:20:38.042 [DEBUG] [lipse.smarthome.io.net.exec.ExecUtil] - executed commandLine 'verisure@@state'
2019-01-09 16:20:42.826 [DEBUG] [lipse.smarthome.io.net.exec.ExecUtil] - exit code '0', result '{"2AQU S2N3":"LOCKED","2AYG XH5E":"LOCKED"}'
2019-01-09 16:20:42.902 [INFO ] [marthome.model.script.security.rules] - Checking locks - Done

And this is when it failed:

2019-01-09 16:25:38.028 [INFO ] [marthome.model.script.security.rules] - Checking locks
2019-01-09 16:25:38.037 [DEBUG] [lipse.smarthome.io.net.exec.ExecUtil] - executed commandLine 'verisure@@state'

No error, but the method just never returns, and thus blocking the thread indefinetely.

This is the rule that faulted:

rule "Doorlocks periodic check"
	when
		Time cron "38 0/5 * * * ?" //or
		//Item Test received command
	then
		logInfo("security.rules", "Checking locks")
		val states = executeCommandLine("verisure@@state", 5000)
		if (states === null || states == "") return;
		lockLabels.forEach[label |
			val rawState = transform('JSONPATH', "$.['" + label + "']", states)
			val item = ScriptServiceUtil.getItemRegistry.getItem(transform("MAP", "verisure.map", label))
			val state = switch (rawState) {
							case 'LOCKED': ON
							case 'UNLOCKED': OFF
			}
			item.postUpdate(state)
		]
		logInfo("security.rules", "Checking locks - Done")
end

I have a timeout at 5000ms, but this script never takes that long. I have another that sometimes takes too long, but then I still get a return value:

2019-01-09 16:28:19.096 [DEBUG] [lipse.smarthome.io.net.exec.ExecUtil] - exit code '143', result ''

Meaning that openhab kills the process if it takes too long, so that shouldn’t be the cause.

Any idea how I can continue investigating? I know that other users have had problems with cron rules suddenly stops triggering, perhaps this can be a possible cause?

rlkoshak · January 9, 2019, 8:51pm

While I agree that this seems the most likely candidate, the way the Rule is written we really don’t know. If executeCommandLine were killed because it timed out states would be null and the logs would look exactly like your “when it failed” set.

Add a log statement immediately after the executeCommandLine just to verify that the command is indeed getting stuck.

If this is your problem, it’s going to be hard to solve because even you can’t get the error to happen consistently and it is working correctly elsewhere in your Rules.

There is nothing in the Rule posted above that stands out as a problem.

pacive · January 9, 2019, 9:06pm

Since the ExecUtil logs the exit-code and result before it returns I am pretty certain that it doesn’t complete. If I look at the source code, and compare it with the stack trace from karaf, it seems to randomly get stuck at resultHandler.waitFor() (line 125)

https://github.com/eclipse/smarthome/blob/master/bundles/io/org.eclipse.smarthome.io.net/src/main/java/org/eclipse/smarthome/io/net/exec/ExecUtil.java

The question is where it goes wrong…

But I can add some more logging just to be extra certain.

unparagoned · January 11, 2019, 7:34pm

I’m also having with most of my rules just stopping to run at apparently random times. I also have an executecommandline, but with a 200,000 timeout, but to be honest I have pretty much everything in my rules and to be honest am surprised I never had problems until version 2.4.
I had a look at my logs but nothing stood out. How do I figure out if I’m having the same problem as you.

executeCommandLine("/usr/bin/python /etc/openhab2/scripts/spotify.py " + receivedCommand.toString, 200000)

pacive · January 11, 2019, 9:13pm

When your rules stop working, go into the karaf console and run threads | grep -A 20 openHAB-job-scheduler_Worker. If you get something like in my first post above, it means that a rule is waiting for executeCommanLine to return.

If you have it running for up to 200 seconds it could be just a normal run though. OH has just 2 threads for Cron rules, and executeCommandLine blocks one of the threads until it returns, so I wouldn’t recommend having such a long timeout.

unparagoned · January 12, 2019, 12:44am

Here are my threads. I don’t know what they mean. But pretty much none of my rules are working now, not even non-chron or timers/

"openHAB-job-scheduler_Worker-1" Id=104 in TIMED_WAITING on lock=java.lang.Object@352a0cd4
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

"openHAB-job-scheduler_Worker-2" Id=105 in TIMED_WAITING on lock=java.lang.Object@5f11f4f9
    at java.lang.Object.wait(Native Method)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:568)

"EMF Reference Cleaner" Id=108 in WAITING on lock=java.lang.ref.ReferenceQueue$Lock@690caa59
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
    at org.eclipse.emf.common.util.CommonUtil$1ReferenceClearingQueuePollingThread.run(CommonUtil.java:70)

"SocketListener(JmDNS-/192.168.1.104)" Id=111 in RUNNABLE (running in native)
    at java.net.PlainDatagramSocketImpl.peekData(Native Method)
      - locked java.net.PlainDatagramSocketImpl@52bc60e4
    at java.net.DatagramSocket.receive(DatagramSocket.java:743)
      - locked java.net.DatagramPacket@382974bb
      - locked java.net.MulticastSocket@18dbbf9f
    at javax.jmdns.impl.SocketListener.run(SocketListener.java:41)

"SocketListener(JmDNS-/192.168.2.18)" Id=114 in RUNNABLE (running in native)
    at java.net.PlainDatagramSocketImpl.peekData(Native Method)
      - locked java.net.PlainDatagramSocketImpl@44568099

pacive · January 12, 2019, 5:49am

When it looks like that everything is just fine, they aren’t currently doing anything, just waiting for something to do. So if you have problems with your rules anyway I’m afraid you have to look elsewhere, there doesn’t seem to be a thread lock. (At least not that I can see, but I’m no expert)

pacive · January 17, 2019, 7:09am

After some research it turned out to be lazy and really bad exception handling in my script that caused it to not return. At the end of the script (written in ruby) I caught all exeptions in hoping it would just exit without outputting anything. Turns out this also caught exit-calls, and it got stuck in an endless loop, even though openhab tried to terminate the process.

If my theory is correct however, it might be wise to handle this somehow, to prevent badly written external scripts to cause openhab to malfunction.

rlkoshak · January 17, 2019, 4:08pm

It is probably worth filing an issue. If you provide a timeout parameter to executeCommandLine, one expects that if the script takes longer than that amount of time to return that the script gets killed and the call to executeCommandLine should return. Any weird state in the script called shouldn’t override this.

The Issue goes in the ESH Repo I think. Maybe openhab1-addons.

pacive · January 18, 2019, 5:22am

Issue raised: