Reliable message to actuator

gahlawat.home · October 31, 2017, 11:51am

Hi guys,

I searched the forum for this topic but could not find anything.
I am working on reliable message delivery to an actuator. My actuator is currently a MySensors kitchen light, but I want a mechanism that reliably delivers messages from the controller to any type of critical actuator, such as a boiler.

I am currently using an MQTT broker for messages to and from the actuator. I set up my item with autoupdate=“false”, so its state changes only when the actuator itself reports back, and I wrote the rule below to retry automatically.

I have two items: the UnderCounterLight switch, and UnderCounterLightRetry, a Number that counts retries so that retrying stops if, for example, the sensor is switched off.

rule "Kitchen lights switch retry"
when Item UnderCounterLight received command
then
	createTimer(now.plusSeconds(2)) [|
		if(UnderCounterLight.state != receivedCommand){
			var Number retries = UnderCounterLightRetry.state as Number;
			if(retries < 50){
				UnderCounterLight.sendCommand(receivedCommand);
			}
			retries =  retries + 1;
			UnderCounterLightRetry.postUpdate(retries);
		}
	]
	
end

The count is reset every night, so if it reaches 50 I know that some messages may have failed.
My plan is to convert this into a group rule, so that any Item needing reliable delivery just gets added to a gReliable group. I still have to think about how to handle the retry item.
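To make the retry logic concrete, here is a minimal Java sketch (openHAB itself runs on the JVM) of the same idea generalised to any item, which is what a gReliable group rule would need. It is not openHAB API: ReliableSender, Outcome and the simulated LossyActuator are hypothetical names, the check delay stands in for the 2-second timer, and, unlike the rule, the retry count travels with each command instead of living in an item. It assumes the actuator reports its state back after switching, as with autoupdate=“false”.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Sketch of the retry rule: send, wait, compare the reported state, resend (hypothetical names). */
class ReliableSender {
    static final class Outcome {
        final boolean confirmed;
        final int retries;
        Outcome(boolean confirmed, int retries) { this.confirmed = confirmed; this.retries = retries; }
    }

    // One daemon timer thread shared by every item that needs reliable delivery.
    private final ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "reliable-sender");
        t.setDaemon(true);
        return t;
    });
    private final long checkDelayMs;
    private final int maxRetries;

    ReliableSender(long checkDelayMs, int maxRetries) {
        this.checkDelayMs = checkDelayMs;
        this.maxRetries = maxRetries;
    }

    /** Transmits the command; completes once the actuator confirms it or the retries run out. */
    CompletableFuture<Outcome> send(String command, Consumer<String> transmit, Supplier<String> reportedState) {
        CompletableFuture<Outcome> result = new CompletableFuture<>();
        transmit.accept(command);
        scheduleCheck(command, transmit, reportedState, 0, result);
        return result;
    }

    private void scheduleCheck(String command, Consumer<String> transmit, Supplier<String> reportedState,
                               int retries, CompletableFuture<Outcome> result) {
        timers.schedule(() -> {
            if (command.equals(reportedState.get())) {
                result.complete(new Outcome(true, retries));   // actuator reported the new state
            } else if (retries >= maxRetries) {
                result.complete(new Outcome(false, retries));  // give up and flag the item
            } else {
                transmit.accept(command);                      // resend, then check again later
                scheduleCheck(command, transmit, reportedState, retries + 1, result);
            }
        }, checkDelayMs, TimeUnit.MILLISECONDS);
    }
}

/** Simulated actuator that silently drops its first `drops` commands (hypothetical). */
class LossyActuator {
    private int drops;
    private String state = "OFF";

    LossyActuator(int drops) { this.drops = drops; }

    synchronized void receive(String command) {
        if (drops > 0) { drops--; return; }  // lost somewhere between gateway and actuator
        state = command;                     // actuator switches and reports its new state
    }

    synchronized String state() { return state; }
}
```

With a 10 ms delay, an actuator that drops its first three commands is confirmed after 3 retries; one that drops everything ends with confirmed == false after maxRetries resends, which is the case worth alerting on.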

Now I would like to hear from the experienced folks here whether I am going in the right direction, or whether something similar already exists. This is by no means a way to ignore unreliable transmission; rather, it is a mechanism to ensure delivery and to analyse transmission health.

binderth · October 31, 2017, 2:22pm

I don’t get the purpose of this. The item “UnderCounterLight” is a MySensors light, which you address via MQTT. Why does it need constant checks on its status? MQTT is pretty reliable in my eyes; does it not come through in your environment?

gahlawat.home · October 31, 2017, 3:55pm

Hi Thomas, the setup is OpenHAB -> MQTT -> GW --> [Repeater] -> ACTUATOR. I always send a command to the actuator with autoupdate=“false”; once the actuator has changed its state, it sends a state update back, and that is what actually updates the state of the Item UnderCounterLight. As my message could be lost anywhere on the way to the actuator, I want to make sure the actuator actually turned the lights on. Hope this clarifies things.

binderth · November 1, 2017, 9:21am

Do you experience loss of information at all? Normally everything in between should work pretty reliably!
I don’t recall any lost messages in my setup (using KNX, Amazon Dash buttons, 433 MHz outlets, MQTT-driven WLAN devices…). Then again, the direction seems all right to me, if you need some double-checking.
What makes me wonder is that your construct would not work, I think, unless you configured “retries” as a global variable (but perhaps you did this already, since you said it is reset every night?).

I would make the rule logic different and head for a while-loop like this:

Thread::sleep(2000) // wait 2secs
while(UnderCounterLight.state != receivedCommand) {
     UnderCounterLight.sendCommand(receivedCommand)
     Thread::sleep(2000) // wait another 2secs
}

That way, you only have one thread for each “receivedCommand” - and don’t open up to 50 threads. You also don’t need the global retries variable.

But: if for some reason you turn the light ON and OFF within 2 secs, your rules will interact with each other. So you should make sure the while-loop gets exited somehow, like this:

var Number retries = 0  // retries within this thread (var, because it is incremented below)
var boolean reachedRetries = false  // set once the retry limit is hit

Thread::sleep(2000) // wait 2 secs
while(UnderCounterLight.state != receivedCommand && !reachedRetries) {  // exit when UnderCounterLight is updated OR the retry limit is reached
     UnderCounterLight.sendCommand(receivedCommand)
     Thread::sleep(2000) // wait another 2 secs
     retries = retries + 1  // count this retry
     if (retries >= 50) { // once retries reach 50, exit the loop
          reachedRetries = true
     }
}
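The important part of the loop is its condition: it should keep going only while the state is still wrong and retries remain (&&). With ||, the loop keeps resending even after the actuator has confirmed, and never stops if the actuator stays silent. Here is a small Java sketch of the same bounded, sleep-based loop; BlockingRetry and its parameters are hypothetical, and the calling thread is blocked for the whole duration, which is the cost discussed later in this thread.

```java
import java.util.function.BooleanSupplier;

/** Sketch of the bounded sleep-based retry loop (hypothetical names). */
final class BlockingRetry {
    private BlockingRetry() {}

    /** Returns how many resends were made; stops when confirmed OR the limit is reached. */
    static int sendUntilConfirmed(Runnable resend, BooleanSupplier confirmed, int maxRetries, long waitMs) {
        int retries = 0;
        if (!pause(waitMs)) return retries;            // first give the actuator time to report back
        // Keep going only while BOTH hold: the state is still wrong AND retries are left.
        while (!confirmed.getAsBoolean() && retries < maxRetries) {
            resend.run();
            retries++;
            if (!pause(waitMs)) break;                 // interrupted: stop retrying
        }
        return retries;
    }

    /** Sleeps on the calling thread; returns false if the thread was interrupted. */
    private static boolean pause(long ms) {
        try {
            Thread.sleep(ms);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

For example, a device that confirms after the second resend yields 2, and a device that never answers yields exactly maxRetries.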

But again: I would recommend making the communication between all devices fail-proof.

PS: please remove the semicolons at the end of lines…

gahlawat.home · November 1, 2017, 4:16pm

Thanks again for the detailed answer.

Not generally, I would say, but I have seen it happen a couple of times, especially with this actuator, which is quite far from the gateway.

Thanks for confirming. I just wanted to make sure I do not end up reinventing the wheel.

retries is intentionally kept local, as the real retry count is kept in the Number item UnderCounterLightRetry (for graphs and analysis). For some reason I could not increment the Item's value directly, hence the local variable.

I have already tested it and it works as expected, at least for a single toggle of the switch (single-threaded).

I try to avoid Thread::sleep, as from my point of view one long-lived thread is worse than 50 short-lived threads. Also, the while-loop is not needed, because UnderCounterLight.sendCommand(receivedCommand) triggers the rule again, which sets another timer.

The exit already works fine for me, but good point about someone pressing the button again. I can probably cancel the timer if it is still pending (within the 2 seconds it should be) and start another one, to make sure the actuator really is OFF.
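Cancelling the pending check when a new command arrives can be sketched with java.util.concurrent, keeping at most one pending check per item. This is only an analogue of openHAB's createTimer/cancel, not its API; PerItemCheck and the check callback are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** At most one pending confirmation check per item; a newer command replaces an older one. */
final class PerItemCheck {
    private final ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "per-item-check");
        t.setDaemon(true);   // pending checks should not keep the JVM alive
        return t;
    });
    private final ConcurrentHashMap<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();

    /** Schedules a check for the latest command and cancels any check still waiting for an older one. */
    void commandReceived(String item, String command, long delayMs, Consumer<String> check) {
        ScheduledFuture<?> next = timers.schedule(() -> check.accept(command), delayMs, TimeUnit.MILLISECONDS);
        ScheduledFuture<?> previous = pending.put(item, next);
        if (previous != null) {
            previous.cancel(false);  // harmless (returns false) if that check already ran
        }
    }
}
```

If ON and OFF arrive within the delay, only the check for OFF runs, so a stale ON check can no longer resend ON after the light was switched off.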

Indeed, and thanks about the semicolons. I am always confused in this area and made it a habit to use semicolons unless the Designer complains.

rlkoshak · November 1, 2017, 7:04pm

Are you aware of the QoS settings in OH? If you use QoS 1 or 2, the broker doesn’t consider the message delivered until the subscriber acknowledges receipt. I would say that yes, it does look like you are reinventing the wheel.

binderth · November 1, 2017, 7:19pm

I understood that the issue lies in the communication between the gateway and the actuator. I guess the gateway would acknowledge the message, but the actuator might still not get the command…? The TO would like to be 100% sure that the actuator changed its status; the gateway then gets the updated status and tells OH.

So the QoS settings will make sure the communication between OH and the gateway works, but they leave out the hop that matters… If there’s a way to ensure the communication between gateway and actuator (perhaps via a repeater), then you won’t need the rule.

rlkoshak · November 1, 2017, 7:43pm

You would need to set the QoS on both OH and the devices. A QoS of 2 will ensure that any message sent by OH is delivered exactly once to the subscribers on that topic: the broker holds on to the message until the active subscribers acknowledge it. A QoS of 1 makes sure the message is received at least once by the active subscribers on the topic.
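One detail worth keeping in mind: MQTT acknowledgements are per hop (client to broker, broker to subscriber), so QoS says nothing about a non-MQTT radio hop behind the gateway. The toy Java model below shows, for a single hop, why acknowledgements plus retransmission give at-least-once delivery (QoS 1, with duplicates possible when an ack is lost) and why a receiver that remembers packet ids gets exactly-once (the effect of QoS 2). It deliberately skips the real four-packet QoS 2 handshake and is not any client library's API; QosHopModel is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model of ONE MQTT hop, e.g. openHAB -> broker. Not a client library API and
 * not the real QoS 2 handshake (PUBLISH/PUBREC/PUBREL/PUBCOMP).
 */
final class QosHopModel {
    private QosHopModel() {}

    /**
     * Counts how often the receiving application sees one message.
     * lostPublishes / lostAcks: which transmission attempts lose the PUBLISH or its ack.
     * dedupe: true models the receiver remembering packet ids (the effect of QoS 2).
     */
    static int deliveries(Set<Integer> lostPublishes, Set<Integer> lostAcks, boolean dedupe) {
        final int packetId = 1;
        Set<Integer> seenPacketIds = new HashSet<>();
        int delivered = 0;
        for (int attempt = 0; ; attempt++) {                 // sender retransmits until acknowledged
            if (lostPublishes.contains(attempt)) {
                continue;                                    // PUBLISH lost on the way
            }
            if (!dedupe || seenPacketIds.add(packetId)) {
                delivered++;                                 // handed to the application
            }
            if (!lostAcks.contains(attempt)) {
                return delivered;                            // ack arrived: sender stops
            }
        }
    }
}
```

Losing the first acknowledgement yields two deliveries without de-duplication and one with it; lost PUBLISH packets are simply recovered by retransmission.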

To get the same behavior for messages coming back from the device, the device would need to be configured with QoS 1 or 2 as well.

Since all the comms are MQTT, guaranteed delivery and retry is already implemented by MQTT itself.

What I don’t know, because I’ve never had a message fail, is what sort of indication one gets when the maximum number of tries or the time-to-live of a message runs out. But I’m pretty sure there is one. You might need a rule to handle that case, but really, if that happens it is because the actuator is offline or otherwise in a bad state and needs to be looked at.

MQTT was developed for monitoring and controlling SCADA equipment on oil pipelines, where the network connections are slow and intermittent. This sort of guaranteed delivery has been built in from the beginning. With a QoS of 1 or 2, MQTT is exceptionally reliable.

binderth · November 1, 2017, 7:48pm

I have also never encountered a message delivery failure in my setup. But if the MySensors gateway (I don’t know this one) also uses MQTT, and the (same?) broker can be configured with higher QoS settings, that should solve the TO's initial problem, I guess! No complicated rules or workarounds for when the light button is pressed in quick succession and all!

gahlawat.home · November 2, 2017, 11:22am

Thanks, Rich, for your suggestions about MQTT. I agree with using QoS for those hops, but as Thomas already pointed out, this only guarantees delivery to my gateway. My gateway likewise has a mechanism to ensure delivery to the actuator, but the chain can be long (I can have repeaters between the gateway and actuators located far away from it). Probably you are right in the sense that, rather than retrying, I should put more effort into making the communication reliable.

Agreed

MySensors uses NRF radios on the 2.4 GHz band. Although radios can be pretty unreliable in general, I am very happy with MySensors in terms of reliability. The issue only happens when my actuator is placed at the edge of the range and there is some disturbance in between.

Thanks, you guys seem to be convincing me to get rid of this. However, I was thinking more of correcting my actuator's position for better radio reception whenever I see failures, rather than relying on retries to get the job done. As radio reception can change due to many external factors, I thought I should monitor it constantly rather than assume it will keep working once it has worked. Last night many people in the UK missed the Bake Off winners because of TV signal problems caused by high pressure.

rzylius · November 9, 2017, 9:11pm

I have an OH2 server and a serial MySensors gateway (no MQTT), and I experience undelivered messages. What I find in the logs is that I update the MySensors switch status channel and the channel status changes, but I see no MySensors activity, so the actuator never receives the message.

Could you describe your setup in more detail? “OpenHAB -> MQTT -> GW --> [Repeater] -> ACTUATOR”: do you use the MySensors MQTT gateway? I wonder how to use this MQTT QoS feature with a serial MySensors gateway setup.

gahlawat.home · November 22, 2017, 1:33pm

My setup is OpenHAB -> MQTT -> MySensors MQTT GW -> Actuator (no repeater). The QoS settings are available in the openHAB MQTT configuration file. I have not explored the MQTT gateway side, but it should be easy to do.
However, my messages are never lost at the MQTT transport level; they are lost between the GW and the actuator, which is wireless, and we cannot guarantee that link 100% (someone could be using a jammer, etc.). Hence I thought of retrying the message from openHAB until it succeeds, and taking action right there if it does not. This also removes the need for any QoS settings in MQTT.

Do you know whether your gateway receives the message or not?

rzylius · November 26, 2017, 1:41am

I see. I guess the MySensors MQTT gateway should have a QoS setting somewhere. I am having the same problem you described. On the actuators I use the signing feature, and my observation is that even more messages go undelivered with it.

I myself use the serial gateway, so with my question I was wondering whether putting MQTT somewhere into the serial gateway design could provide this level of reliability.

The MySensors ACK feature did not work satisfactorily for me, so I basically implemented the ACK in the rule. I modified the actuator code so that when the actuator receives a command and carries it out, it posts back its current state (in my example, changing the actuator state takes about 6 seconds, so the check is set to 8 seconds).

So the whole idea is that after the command is sent to the actuator, the rule checks a few seconds later whether the actuator responded with a new state, and whether that state equals the command we sent. If either condition is not met, we send the command again.
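The decision made when the check timer fires can be isolated as a pure function. A Java sketch, assuming the actuator's last report time is available as a timestamp (in the rule below that role is played by hvac_lastUpdate together with persistence and changedSince); ResendDecision.shouldResend is a hypothetical name, and states are compared as strings for simplicity.

```java
import java.time.Duration;
import java.time.Instant;

/** Resend decision taken when the check timer fires (hypothetical names). */
final class ResendDecision {
    private ResendDecision() {}

    /**
     * Resend unless the actuator reported back within the window AND its reported
     * state equals the command that was sent. A null lastReport means "never reported".
     */
    static boolean shouldResend(Instant lastReport, Instant now, Duration window,
                                String reportedState, String command) {
        boolean reportedRecently = lastReport != null && !lastReport.isBefore(now.minus(window));
        boolean stateMatches = command.equals(reportedState);
        return !reportedRecently || !stateMatches;
    }
}
```

Keeping both conditions matters: a matching state with no fresh report may just be an old value, and a fresh report with the wrong state means the command was not applied.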

I am still testing the rule, so there is some leftover debug logging.


// assumed global, not shown in the post: the logger name used by logInfo below
val String filename = "hvac"

rule "HVAC received SPEED command"
when Item hvac_Speed received command
then
	logInfo(filename, "HVAC RELIABLE DELIVERY RULE SPEED")
	logInfo(filename, " hvac_Speed.state=" + hvac_Speed.state + " :::  receivedCommand=" + receivedCommand + " ::: hvac_lastUpdate=" + hvac_lastUpdate.state)
	createTimer(now.plusSeconds(8)) [|
		// resend if the actuator has not reported back within 8 s, or reported a different state
		if ((!hvac_lastUpdate.changedSince(now.minusSeconds(8))) || (hvac_Speed.state != receivedCommand)) {
			var Number retries = hvac_Speed_retry.state as Number
			if (retries < 10) {
				hvac_Speed.sendCommand(receivedCommand)
				logInfo(filename, "HVAC RELIABLE DELIVERY. resending a command. was change in last 8 seconds? = " + hvac_lastUpdate.changedSince(now.minusSeconds(8)))
				retries = retries + 1
				hvac_Speed_retry.postUpdate(retries)
			}
		} else {
			logInfo(filename, "HVAC RELIABLE SUCCESS: hvac_Speed.state==" + hvac_Speed.state + "   :::  receivedCommand==" + receivedCommand + " ::: hvac_lastUpdate.changedSince(now.minusSeconds(8))=" + hvac_lastUpdate.changedSince(now.minusSeconds(8)))
			hvac_Speed_retry.postUpdate(0)
		}
	]
end

gahlawat.home · December 4, 2017, 2:08pm

It is essentially the same as what I described in the OP. Since then I have added a lock object so that the rule does not get confused if two commands are received.

I also keep a Number item to track retries, so that I can graph the retries done.

I would say that I am quite happy with the solution; however, the graphs make me think it is not the MySensors protocol but some radio interference that causes the lost messages. Have a look at the graph for last month.

It is almost always the case that once it starts retrying, it retries all 50 times, which suggests that messages were blocked for quite some time rather than just one message being lost. I also notice that it happens mostly on Saturdays, when I am home and probably messing with something or other, so it could be interference. I have 2 Wi-Fi routers and made sure that the channel used by MySensors is completely separate from theirs and also has very little disturbance (i.e. from neighbouring Wi-Fi networks). I will post if I notice more; any ideas/suggestions are most welcome.

RayMYP · December 4, 2017, 3:49pm

binderth:

I would make the rule logic different and head for a while-loop like this:
Thread::sleep(2000) // wait 2secs
while(UnderCounterLight.state != receivedCommand) {
     UnderCounterLight.sendCommand(receivedCommand)
     Thread::sleep(2000) // wait another 2secs
}
That way, you only have one thread for each “receivedCommand” - and don’t open up to 50 threads. You also don’t need the global retries variable.

Whilst developing my own rules I’ve considered this difference between Thread::sleep and createTimer a few times, and I usually come out preferring Timer. Just so I get a better understanding of the Rules Engine, can I question your statement “That way, you only have one thread for each “receivedCommand” - and don’t open up to 50 threads”?

I don’t know the implementation details of the Rules Engine (RE), so this is my assumption about how it likely works:

1. An event happens.
2. The RE finds a rule to run and picks an idle thread from its pool to execute the rule.
3. The RE repeats step 2 for all rules found.
4. As each rule finishes executing, it returns its thread to the RE pool.

So if you Thread::sleep, you effectively tie up the RE thread. On the second trigger of the rule, the RE gives you another thread because you haven’t released the last one, so you now have 2, and so on until you reach the same 50 threads. Except that you’re holding those threads out of the RE pool rather than using general application-level timer threads. I don’t know whether one thread pool is better than the other, but instinctively I would not want to hold up the RE pool, as it likely has defined limits for thread creation, caching etc.
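The thread-pool argument can be illustrated with plain java.util.concurrent. This is not openHAB's actual rule engine, whose pool sizes aren't given in this thread; it only shows the general effect: tasks that sleep occupy pool threads, so later work queues behind them, while delayed timer-style tasks occupy no thread until they fire. SleepVersusTimer is a hypothetical name.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Generic illustration (not openHAB's rule engine): sleeping tasks hold threads, timers do not. */
final class SleepVersusTimer {
    private SleepVersusTimer() {}

    /** Submits sleeping tasks to a fixed pool and returns how many are left waiting for a thread. */
    static int queuedBehindSleepers(int poolSize, int tasks, long sleepMs) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        Thread.sleep(sleepMs);             // like Thread::sleep inside a rule
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getQueue().size();
        } finally {
            pool.shutdownNow();
        }
    }

    /** Schedules delayed no-op checks and returns how many threads the scheduler created for them. */
    static int threadsForPendingTimers(int timers, long delayMs) {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        try {
            for (int i = 0; i < timers; i++) {
                scheduler.schedule(() -> { }, delayMs, TimeUnit.MILLISECONDS);
            }
            return scheduler.getPoolSize();             // waiting timers need no thread of their own
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

With 2 threads and 5 sleeping tasks, 3 tasks wait in the queue; 50 pending 10-second timers are served by a single scheduler thread.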

And being able to control and easily cancel timers in response to new events makes them my general preference over sleep. It always depends on the details of the problem you’re trying to solve, and there is always another way to do it, but is it reasonable to work on the basis that “createTimer is preferable to sleep in these sorts of rules”?

Would be good to see your latest complete code for reference.

So you’re thinking:
a: RF failure during those times
b: the retry code isn’t quite right

Can you tell from looking at the physical device itself whether the message got through? Does it have its own log files? You’re dealing with two-way comms, so I’m wondering whether it’s possible to rule one of the directions in or out.

One limitation of my RF device is that it locks out RX while doing TX. So although I’ve sometimes been tempted to send my message twice to give it a better chance of getting through, I’ve stopped doing that because it blocks other incoming messages. To be honest, I need a second device that just stays in RX mode, so I miss fewer messages, and then I can also position the two devices to increase the RF coverage. I’m not using NRF though, so it may not be the same for you.

rlkoshak · December 4, 2017, 5:14pm

Based on my understanding your analysis is correct. Long sleeps are discouraged exactly for this reason.

binderth · December 4, 2017, 5:25pm

Of course you can! At first I had the impression that the TO spawned a new thread with each iteration within the if-clauses, as I thought the “sendCommand” would trigger the whole rule again. But after some deeper consideration, that wasn’t the case…
So yes, timers are the better solution, at least always for timespans > 5 secs!

gahlawat.home · December 7, 2017, 4:34pm

Code as requested

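import java.util.concurrent.locks.ReentrantLock

// globals at the top of the .rules file (assumed; not shown in the original post)
val ReentrantLock lock = new ReentrantLock
var Timer underCounterLightRetryTimer = null

rule "Kitchen lights switch retry"
when Item UnderCounterLight received command
then
	lock.lock()
	try {
		// a newer command replaces any retry check still pending for an older one
		if (underCounterLightRetryTimer != null) {
			underCounterLightRetryTimer.cancel()
		}
		underCounterLightRetryTimer = createTimer(now.plusSeconds(4)) [|
			if (UnderCounterLight.state != receivedCommand) {
				logInfo("rules", "Trying again " + UnderCounterLightRetry.state)
				var Number retries = UnderCounterLightRetry.state as Number
				if (retries < 50) {
					UnderCounterLight.sendCommand(receivedCommand)  // re-triggers this rule
				}
				retries = retries + 1
				UnderCounterLightRetry.postUpdate(retries)
			}
		]
	} finally {
		lock.unlock()
	}
end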

You are right in terms of your analysis and suggestions.
Regarding the above, I would rule out b, as I can see that the retry works well: if I manually turn the actuator off, it keeps retrying, and it stops as soon as I switch the actuator back on and it acknowledges the command. So let me correct myself: “Although it usually does the maximum number of tries, I have tested and seen that sometimes it only retries a few times.”
I am currently suspecting RF failure, as the 2.4 GHz band used by NRF is very crowded (Wi-Fi, Bluetooth, some car security systems etc.) and there could be some disturbance. I live in an urban area with at least 5 Wi-Fi networks always visible.

Unfortunately my devices and gateway are Arduinos, and there is no logging/tracking available. However, I will keep an eye out for any way to find out. You are right about the two-way communication; if either direction fails, I treat that as a failure.

Regarding the RF TX/RX restriction, NRF has the same one. That could be a factor, but I don’t think it would happen 50 times in a row. No messages go to my actuator other than those from OH, so the radio should be listening freely when openHAB sends a message and should transmit immediately afterwards.

Thanks for bringing these up though!

RayMYP · December 7, 2017, 6:08pm

Firstly, to make your lock more reliable you need to re-lock it inside the timer. It may not make sense right away, but remember that createTimer runs the timer body on a new thread, and you’ll hopefully realise that this new thread does not hold the lock. I had to test this myself to be sure, but try this sample:

import java.util.concurrent.locks.ReentrantLock

val ReentrantLock testLock = new ReentrantLock

rule "Test locking rule..."
when
 	Item dbg_Debugging changed
then
	testLock.lock()
	try{
		logInfo("testrule", "1. about to create timer lockstate:" + testLock.isLocked())
		createTimer(now.plusSeconds(5)) [|
			logInfo("testrule", "2. in timer code timer lockstate:" + testLock.isLocked())
			testLock.lock()
			try{
				logInfo("testrule", "3. still testing timer lockstate:" + testLock.isLocked())
			} finally {
				testLock.unlock()
			}
		]
	} finally {
		testLock.unlock()
	}
end

and you should get this in the logs:

2017-12-07 17:07:13.258 [INFO ] [ipse.smarthome.model.script.testrule] - 1. about to create timer lockstate:true
2017-12-07 17:07:18.269 [INFO ] [ipse.smarthome.model.script.testrule] - 2. in timer code timer lockstate:false
2017-12-07 17:07:18.274 [INFO ] [ipse.smarthome.model.script.testrule] - 3. still testing timer lockstate:true

That shows that at logInfo stage 2 there is no active lock, so as soon as you call UnderCounterLight.sendCommand, another thread can run the rule as it likes while your timer thread is still finishing its work of UnderCounterLightRetry.postUpdate(retries) etc.
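The same behaviour can be reproduced in plain Java: a ReentrantLock is owned by a thread, and a timer body runs on a different thread, so it only runs under the lock if it takes the lock itself. LockAcrossTimer is a hypothetical name, and a ScheduledExecutorService stands in for createTimer.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Java analogue of the testLock rule: does the timer body run under the rule's lock? */
final class LockAcrossTimer {
    private LockAcrossTimer() {}

    /** Returns {ownedOnEntryToTimer, ownedAfterRelockInTimer}. */
    static boolean[] probe() {
        ReentrantLock lock = new ReentrantLock();
        ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor();
        boolean[] owned = new boolean[2];
        try {
            ScheduledFuture<?> timer;
            lock.lock();                                     // the "rule" thread takes the lock
            try {
                timer = timers.schedule(() -> {
                    owned[0] = lock.isHeldByCurrentThread(); // timer body runs on another thread
                    lock.lock();                             // so it must lock again itself
                    try {
                        owned[1] = lock.isHeldByCurrentThread();
                    } finally {
                        lock.unlock();
                    }
                }, 50, TimeUnit.MILLISECONDS);
            } finally {
                lock.unlock();                               // released before the timer fires
            }
            timer.get();                                     // wait for the timer body to finish
            return owned;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            timers.shutdownNow();
        }
    }
}
```

The probe reports false on entry to the timer and true after re-locking, matching stages 2 and 3 of the log above.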

To be clean I would also add the following line inside your timer:

    underCounterLightRetryTimer = null

I’m surprised it’s not complaining on subsequent calls where you call .cancel() on a potentially expired timer, but maybe the timer system handles that OK. I guess it’s still a valid Timer object in its own right, even if its thread has finished.
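For what it’s worth, java.util.concurrent timers tolerate this: cancelling a future that has already run simply returns false. Whether openHAB’s Timer.cancel() behaves the same way is not confirmed here, and setting the variable back to null inside the timer still makes sense, because then a non-null reference reliably means that a check is pending. A hypothetical Java sketch (CancelFinishedTimer is an illustrative name):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Cancelling a java.util.concurrent timer that has already fired (illustration only). */
final class CancelFinishedTimer {
    private CancelFinishedTimer() {}

    /** Lets a timer fire, then cancels it anyway; returns what cancel() reported. */
    static boolean cancelAfterItRan() {
        ScheduledExecutorService timers = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> timer = timers.schedule(() -> { }, 10, TimeUnit.MILLISECONDS);
            timer.get();                 // wait until the timer has fired
            return timer.cancel(false);  // false: already done, and no exception is thrown
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            timers.shutdownNow();
        }
    }
}
```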

None of that helps the actuator and RF issue though, just gives you confidence the rules are OK, so as you say seems most likely caused by disturbance.

Sounds semi-urban to me; I can see at least 14 WiFi networks, not counting mine.

If, as you say, there is quite a noticeable difference when you are home, you could try to find out what causes it. But even if you identify it this time, something else could do the same in six months; if you are on the edge of range, you could be chasing it forever. Is it time to cut your losses on finding the disturbance and either switch a component to a different technology, boost the signal with a repeater, or both?

Arduino has WiFi modules available as well. It depends on what your WiFi signal is like, but WiFi repeaters/extenders are a lot easier to come by and deploy than RF ones, in my mind anyway. ESP8266 boards are also suitable replacements for Arduino in some cases and have the WiFi/MQTT stack more or less built in, so they are quite strong candidates for device control like relays and actuators. I haven’t tried ESPEasy yet, but it looks promising as a fairly simple client software solution for the ESP8266 if you don’t want to roll your own.

Sorry not a here’s your fix solution.