Help with reentrant locks - goes wrong somewhere

CDriver · September 3, 2018, 7:48pm

Folks,

I’m hoping someone can point out where by hole is in the code below -

Rule:


val Functions$Function3<GenericItem, String, ReentrantLock, Boolean> getPosition = [
	GenericItem 	itemBlind,
        String          mac,
        ReentrantLock   fronRoomBlindLock
	| 
                try{
                        if (fronRoomBlindLock.getHoldCount() > 0){logWarn("Blinds - Front room", "getPosition - Waiting on other locks. Count:" + fronRoomBlindLock.getHoldCount().toString)}
                        if (!fronRoomBlindLock.tryLock(5, TimeUnit.SECONDS)){
                                logInfo("Blinds - Front room - getPosition", "Could not lock!")
                                return false;
                        }
                       // logInfo("Blinds - Front room - getPosition", "Locked")

                        val value = sendHttpGetRequest("http://rpiblinds01:8080/getposition/" + mac, 1000)
                        val int pos =  Integer.parseInt(transform("JS", "blindsExtractValue.js", value))

                        if (pos < 0){
                                logWarn("Blinds - Check Position", "Position for blind "  + itemBlind.name + " reports less than zero")
                                return false;
                        }                                          
                        
                        itemBlind.postUpdate(pos)
                }
                catch(Throwable t) { 
                        logError("Blinds - getPosition", t.getMessage())
                }
                finally {
                        fronRoomBlindLock.unlock()
                        //logInfo("Blinds - Front room - getPosition", "Unlocked")
                }        
                return true;
	]


rule "Check Blind Position"
when
        Time cron "0 0/5 * 1/1 * ? *"
then
        getPosition.apply(FrontRoomBlindsLeft_PC, "F3:95:30:3E:DD:4A", fronRoomBlindLock)
        Thread::sleep(500)
        getPosition.apply(FrontRoomBlindsRight_PC, "C1:03:74:91:05:6C", fronRoomBlindLock)
end

logs:

2018-09-03 03:30:02.459 [WARN ] [model.script.Blinds - Check Position] - Position for blind FrontRoomBlindsLeft_PC reports less than zero
2018-09-03 03:30:03.023 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:30:04.083 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-03 03:30:04.137 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-03 03:30:06.685 [INFO ] [ipt.Blinds - Front room - getBattery] - Could not lock!
2018-09-03 03:35:00.044 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:35:01.166 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-03 03:35:01.280 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-03 03:35:01.839 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:40:05.043 [INFO ] [pt.Blinds - Front room - getPosition] - Could not lock!
2018-09-03 03:40:10.597 [INFO ] [pt.Blinds - Front room - getPosition] - Could not lock!
2018-09-03 03:45:00.059 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:45:01.143 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-03 03:45:01.196 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-03 03:45:01.736 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:45:05.056 [INFO ] [ipt.Blinds - Front room - getBattery] - Could not lock!
2018-09-03 03:45:10.636 [INFO ] [ipt.Blinds - Front room - getBattery] - Could not lock!
2018-09-03 03:50:00.028 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1
2018-09-03 03:50:01.087 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-03 03:50:01.149 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-03 03:50:01.702 [WARN ] [ome.model.script.Blinds - Front room] - getPosition - Waiting on other locks. Count:1

So, what’s happening?
I have the rule which fires on cron. This calls the lambda and gets the position.
The endpoint it’s hitting is a python web server running on an rpi. Said service connects to blind controller via BT, gets the details and then returns. Sometimes, it does not connect or return. When this happens, for the getPosition, I don’t really care.

What I am seeing is that it works fine for a while then at some point it goes wonky and stops. it has a http timeout (I care not) but then the lock seems to be left locked. Subsequent calls are not made because they timeout waiting on the locks.

What I do see a lot of is the timeout exception but it handles normally -

2018-09-02 23:15:01.046 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:15:01.119 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:15:02.926 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:15:02.980 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:17:47.106 [ERROR] [twaverf.internal.LightwaveRfWifiLink] - Error converting message: *?{"trans":257440,"mac":"20:20:0C","cmd":"get_duskdawn","lat":51.73,"long":-0.21,"offset":0}
2018-09-02 23:20:01.042 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:20:01.131 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:25:01.046 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:25:01.106 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:25:02.677 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:25:02.778 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:30:01.068 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:30:01.121 [ERROR] [me.model.script.Blinds - getPosition] - null
2018-09-02 23:30:02.164 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:30:02.316 [ERROR] [ome.model.script.Blinds - getBattery] - null
2018-09-02 23:30:03.377 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
2018-09-02 23:30:03.431 [ERROR] [me.model.script.Blinds - getPosition] - null

Ok, the fact that it’s failing so much is interesting but not important to this problem.
When this happens (the timeouts) I can still access it from a browser on my laptop so it’s not python which is sick.
For some reason, the call from the OH RPi times out when talking to the blinds one.
Again, interesting but the code should handle it.

Why is lock not unlocking?
Edit: Is the catch type wrong?

Thanks.
C

CDriver · September 3, 2018, 8:12pm

So, setting up a mock version to better test with -

import java.util.concurrent.locks.ReentrantLock
import org.eclipse.xtext.xbase.lib.Functions
import java.util.concurrent.TimeUnit

val ReentrantLock ddff = new ReentrantLock

val Functions$Function3<GenericItem, String, ReentrantLock, Boolean> getPosition = [
	GenericItem 	itemBlind,
        String          mac,
        ReentrantLock   fronRoomBlindLock
	| 
                try{
                        logWarn("SSSS", "Starting")
                        if (fronRoomBlindLock.getHoldCount() > 0){logWarn("Blinds - Front room", "getPosition - Waiting on other locks. Count:" + fronRoomBlindLock.getHoldCount().toString)}
                        if (!fronRoomBlindLock.tryLock(5, TimeUnit.SECONDS)){
                                logInfo("Blinds - Front room - getPosition", "Could not lock!")
                                return false;
                        }

                        logWarn("SSSS", "Count: " + fronRoomBlindLock.getHoldCount().toString)
                        
                       // logInfo("Blinds - Front room - getPosition", "Locked")
                        logWarn("SSSS", "started")

                        val value = sendHttpGetRequest("http://XXXrpiblinds01:8080/getposition/" + mac, 1000)
                        val int pos =  Integer.parseInt(transform("JS", "blindsExtractValue.js", value))

                        if (pos < 0){
                                logWarn("Blinds - Check Position", "Position for blind "  + itemBlind.name + " reports less than zero")
                                return false;
                        }                                          
                        logWarn("SSSS", "Finishing")
                        
                        //itemBlind.postUpdate(pos)
                }
                catch(Exception t) { 
                        logWarn("SSSS", "In catch")
                        //logError("Blinds - getPosition", t.getMessage())
                }
                finally {
                        logWarn("SSSS", "In in finally")
                        fronRoomBlindLock.unlock()
                        //logInfo("Blinds - Front room - getPosition", "Unlocked")
                }        
                logWarn("SSSS", "Finished")
                logWarn("SSSS", "Count: " + fronRoomBlindLock.getHoldCount().toString)
                return true;
	]




rule "Test test" 
when         
    Item DDDHHH received update
then

    getPosition.apply(FrontRoomBlindsLeft_PC, "F3:95:30:3E:DD:4A", ddff)

end

logs:

21:09:08.827 [INFO ] [smarthome.event.ItemStateChangedEvent] - DDDHHH changed from 5 to 4
21:09:09.321 [INFO ] [smarthome.event.ItemStateChangedEvent] - LocalServer_Cpu_Load changed from 7.5 to 29.8
21:09:09.396 [INFO ] [smarthome.event.ItemStateChangedEvent] - TumbleDryer_Power changed from 1.136172 to 1.072799
21:09:09.439 [INFO ] [home.event.GroupItemStateChangedEvent] - g_Appliances_Power_Sum changed from 2.136172 to 2.072799 through TumbleDryer_Power
21:09:10.180 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - Starting
21:09:10.231 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - Count: 1
21:09:10.277 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - started
21:09:10.339 [ERROR] [e.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.ExecutionException: java.net.UnknownHostException: xxxrpiblinds01: No address associated with hostname
21:09:10.386 [INFO ] [smarthome.event.ItemStateChangedEvent] - LocalServer_Cpu_Load changed from 29.8 to 36.0
21:09:10.457 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - In catch
21:09:10.530 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - In in finally
21:09:10.575 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - Finished
21:09:10.626 [WARN ] [g.eclipse.smarthome.model.script.SSSS] - Count: 0

So there is nothing wrong with my code / logic (that I can see) and inducing a fault in the rule causes the exception handler to act correctly.

Thoughts:
It does not like the lmabda options
There is some odd bug going on with openhab.

rossko57 · September 3, 2018, 9:14pm

I think its more ESH, but I am sure there ar.e ways to ‘silently’ break lambdas, without executing catch or finally blocks. That would fit with your reentrant lock getting held.
From my own experience, it was associated with unexpected null values.

The bit in your code I would be suspicious about … what if that parseInt fails (i.e. parsing an unexpected response). Maybe code defensively and test if it
(your integer) is null before otherwise using it.

CDriver · September 4, 2018, 6:02am

I realised last night that touching the file - simply saving it - fixed in. I guess it all gets cleaned up when it’s re-parsed.
This morning it was broken again. Wife is making “tutting” noises again…

You can see from the logs the exception is raised within the http function. My try-catch should have caught it and suppressed it. The fact that it’s bubbling back up into the logs suggests it’s not being handled “normally”.
The fact that the message is null is also odd.

In C# - which I know better - this would not happen. It would be suppressed and rely on me doing something.

A valid point. The JS parse though has a check and either returns valid or -1.
Safe(ish)
I’ll change it and see if it makes a difference.

rossko57 · September 4, 2018, 9:15am

The rules file? Well yes - one of the first bits of your code will be to (re)create the offending lock variable, executed when the file is loaded.

I’m sure. But it’s the old ass-umptions thing - what if the JS transform service or file access glitches.

Have you somehow tested the “-1” condition works throughout? I’m asking as (a) I think the JS return value is passed as a string? So the Lambda assignment has to parse it? (b) I’ve fallen in holes with a simple var xx = -1 and needed to instead var xx = 0 - 1, something funny about parsing -.

We might be barking up the wrong tree. Another defensive move might be to null-test your http return before processing.

I’m sure all this is not what’s meant to happen, but it’s what we got. Lambdas are really good at hiding root causes! It’s good that your glitchy http souce is showing it up - else you might be trying to track down a once-in-three-months error.

CDriver · September 4, 2018, 1:23pm

I said “ish” after safe

I need to rethink this whole method; It’s not reliable enough at the moment.
Current architecture is OH → HTTP → RPi → Python → bluez → blinds
and reverse for response.

I sometimes issue a move_up command which responds but nothing happens. I need to poll (via the above) for position and work out if position is changing, did it respond.

I have some others on order, I will crack open one and see what the brains are. Hoping it’s something I can work with like esp32. I can then get some custom code on it and read/write to mqtt for a better experience.
It will also have some sort of positional sensor in it so I can get positive feedback.

Failing that. plan B

CDriver · September 5, 2018, 8:29pm

Found it!!!

I peppered the code with entry and exit debugs. I also put in another cron job which checks periodically if the lock is locked. If it is, I unlocked it.

I then left it for a day, trawled through the logs; find the first force-unlock and watch the flow from there. See below

2018-09-05 08:29:28.966 [WARN ] [eclipse.smarthome.model.script.Entry] - Front door was opened
2018-09-05 08:29:50.634 [WARN ] [cript.Controls - Front Door - Motion] - Dimming screen.
**1 2018-09-05 08:30:00.026 [INFO ] [pt.Blinds - Front room - getPosition] - Locked
2018-09-05 08:30:00.479 [WARN ] [eclipse.smarthome.model.script.Entry] - Front door was closed
2018-09-05 08:30:01.046 [WARN ] [eclipse.smarthome.model.script.Entry] - Front door was opened
**2 2018-09-05 08:30:01.158 [ERROR] [.smarthome.model.script.actions.HTTP] - Fatal transport error: java.util.concurrent.TimeoutException
**3 2018-09-05 08:30:01.341 [ERROR] [me.model.script.Blinds - getPosition] - null
**4 2018-09-05 08:30:01.448 [INFO ] [ipt.Blinds - Front room - getBattery] - Locked
**5 2018-09-05 08:30:01.447 [INFO ] [pt.Blinds - Front room - getPosition] - Unlocked
2018-09-05 08:30:01.708 [WARN ] [eclipse.smarthome.model.script.Entry] - Front door was closed
**6 2018-09-05 08:30:02.157 [INFO ] [pt.Blinds - Front room - getPosition] - Locked
**7 2018-09-05 08:30:02.144 [INFO ] [ipt.Blinds - Front room - getBattery] - Unlocked
**8 2018-09-05 08:30:03.314 [INFO ] [ipt.Blinds - Front room - getBattery] - Locked
**9 2018-09-05 08:30:03.314 [INFO ] [pt.Blinds - Front room - getPosition] - Unlocked
**10 2018-09-05 08:30:03.958 [INFO ] [ipt.Blinds - Front room - getBattery] - Unlocked

Get Position manages to lock it
An exception is thrown
Catch block entered and null message returned.
crucially - which it’s in catch the getBattery fires!!!
ghet battery manages to lock it.
Oddly, getposition now unlocks it.

From there on, we’re in a weird place…

So, it’s pretty clear there is a problem there reentrants and/or lambdas.

rossko57 · September 9, 2018, 2:14pm

To be fair, it’s the lambda error handling that’s screwy.
I still have an old OH1.8 system chuntering along while I completely restructure it for OH2. Extensive use of lambdas, locks, and try-catch. But I found out the hard way to code defensively, e.g. validate, test for null values. It works reliably when I intercept unexpected errors in the lambdas.
I suppose really I’ve made the whole try-catch thing redundant and unnecessary